Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* [RFC PATCH 00/31] Generating physically contiguous memory after page allocation
@ 2019-02-15 22:08 Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages Zi Yan
                   ` (31 more replies)
  0 siblings, 32 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Hi all,

This patchset produces physically contiguous memory by moving in-use pages
without allocating any new pages. It targets two scenarios that complements
khugepaged use cases: 1) avoiding page reclaim and memory compaction when the
system is under memory pressure because this patchset does not allocate any new
pages, 2) generating pages larger than 2^MAX_ORDER without changing the buddy
allocator.

To demonstrate its use, I add very basic 1GB THP support and enable promoting
512 2MB THPs to a 1GB THP in my patchset. Promoting 512 4KB pages to a 2MB
THP is also implemented.

The patches are on top of v5.0-rc5. They are posted as part of my upcoming
LSF/MM proposal.

Motivation 
---- 

The goal of this patchset is to provide alternative way of generating physically
contiguous memory and making it available as arbitrary sized large pages. This
patchset generates physically contiguous memory/arbitrary size pages after pages
are allocated by moving virtually-contiguous pages to become physically
contiguous at any size, thus it does not require changes to memory allocators.
On the other hand, it works only for moveable pages, so it also faces the same
fragmentation issues as memory compaction, i.e., if non-moveable pages spread
across the entire memory, this patchset can only generate contiguity between
any two non-moveable pages. 

Large pages and physically contiguous memory are important to devices, such as
GPUs, FPGAs, NICs and RDMA controllers, because they can often achieve better
performance when operating on large pages. The same can be said of CPU
performance, of course, but there is an important difference: GPUs and
high-throughput devices often take a more severe performance hit, in the event
of a TLB miss and subsequent page table walks, as compared to a CPU. The effect
is sufficiently large that such devices *really* want a highly reliable way to
allocate large pages to minimize the number of potential TLB misses and the time
spent on the induced page table walks. 

Vendors (like Oracle, Mellanox, IBM, NVIDIA) are interested in generating
physically contiguous memory beyond THP sizes and looking for solutions [1],[2],[3].
This patchset provides an alternative approach, compared to allocating
physically contiguous memory at page allocation time, to generating physically
contiguous memory after pages are allocated. This approach can avoid page
reclaim and memory compaction, which happen during the process of page
allocation, but still produces comparable physically contiguous memory. 

In terms of THPs, it helps, but we are interested in even larger contiguous
ranges (or page size support) to further reduce the address translation overheads.
With this patchset, we can generate pages larger than PMD-level THPs without
requiring MAX_ORDER changes in the buddy allocators. 


Patch structure 
---- 

The patchset I developed to generate physically contiguous memory/arbitrary
sized pages merely moves pages around. There are three components in this
patchset:

1) a new page migration mechanism, called exchange pages, that exchanges the
content of two in-use pages instead of performing two back-to-back page
migration. It saves on overheads and avoids page reclaim and memory compaction
in the page allocation path, although it is not strictly required if enough
free memory is available in the system.

2) a new mechanism that utilizes both page migration and exchange pages to
produce physically contiguous memory/arbitrary sized pages without allocating
any new pages, unlike what khugepaged does. It works on per-VMA basis, creating
physically contiguous memory out of each VMA, which is virtually contiguous.
A simple range tree is used to ensure no two VMAs are overlapping with each
other in the physical address space.

3) a use case of the new physically contiguous memory producing mechanism that
generates 1GB THPs by migrating and exchanging pages and promoting 512
contiguous 2MB THPs to a 1GB THP, although even larger physically contiguous
memory ranges can be generated. The 1GB THP implement is very basic, which can
handle 1GB THP faults when buddy allocator is modified to allocate 1GB pages,
support 1GB THP split to 2MB THP and in-place promotion from 2MB THP to 1GB THP,
and PMD/PTE-mapped 1GB THP. These are not fully tested.


[1] https://lwn.net/Articles/736170/ 
[2] https://lwn.net/Articles/753167/ 
[3] https://blogs.nvidia.com/blog/2018/06/08/worlds-fastest-exascale-ai-supercomputer-summit/ 

Zi Yan (31):
  mm: migrate: Add exchange_pages to exchange two lists of pages.
  mm: migrate: Add THP exchange support.
  mm: migrate: Add tmpfs exchange support.
  mm: add mem_defrag functionality.
  mem_defrag: split a THP if either src or dst is THP only.
  mm: Make MAX_ORDER configurable in Kconfig for buddy allocator.
  mm: deallocate pages with order > MAX_ORDER.
  mm: add pagechain container for storing multiple pages.
  mm: thp: 1GB anonymous page implementation.
  mm: proc: add 1GB THP kpageflag.
  mm: debug: print compound page order in dump_page().
  mm: stats: Separate PMD THP and PUD THP stats.
  mm: thp: 1GB THP copy on write implementation.
  mm: thp: handling 1GB THP reference bit.
  mm: thp: add 1GB THP split_huge_pud_page() function.
  mm: thp: check compound_mapcount of PMD-mapped PUD THPs at free time.
  mm: thp: split properly PMD-mapped PUD THP to PTE-mapped PUD THP.
  mm: page_vma_walk: teach it about PMD-mapped PUD THP.
  mm: thp: 1GB THP support in try_to_unmap().
  mm: thp: split 1GB THPs at page reclaim.
  mm: thp: 1GB zero page shrinker.
  mm: thp: 1GB THP follow_p*d_page() support.
  mm: support 1GB THP pagemap support.
  sysctl: add an option to only print the head page virtual address.
  mm: thp: add a knob to enable/disable 1GB THPs.
  mm: thp: promote PTE-mapped THP to PMD-mapped THP.
  mm: thp: promote PMD-mapped PUD pages to PUD-mapped PUD pages.
  mm: vmstats: add page promotion stats.
  mm: madvise: add madvise options to split PMD and PUD THPs.
  mm: mem_defrag: thp: PMD THP and PUD THP in-place promotion support.
  sysctl: toggle to promote PUD-mapped 1GB THP or not.

 arch/x86/Kconfig                       |   15 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 arch/x86/include/asm/pgalloc.h         |   69 +
 arch/x86/include/asm/pgtable.h         |   20 +
 arch/x86/include/asm/sparsemem.h       |    4 +-
 arch/x86/mm/pgtable.c                  |   38 +
 drivers/base/node.c                    |    3 +
 fs/exec.c                              |    4 +
 fs/proc/meminfo.c                      |    2 +
 fs/proc/page.c                         |    2 +
 fs/proc/task_mmu.c                     |   47 +-
 include/asm-generic/pgtable.h          |  110 +
 include/linux/huge_mm.h                |   78 +-
 include/linux/khugepaged.h             |    1 +
 include/linux/ksm.h                    |    5 +
 include/linux/mem_defrag.h             |   60 +
 include/linux/memcontrol.h             |    5 +
 include/linux/mm.h                     |   34 +
 include/linux/mm_types.h               |    5 +
 include/linux/mmu_notifier.h           |   13 +
 include/linux/mmzone.h                 |    1 +
 include/linux/page-flags.h             |   79 +-
 include/linux/pagechain.h              |   73 +
 include/linux/rmap.h                   |   10 +-
 include/linux/sched/coredump.h         |    4 +
 include/linux/swap.h                   |    2 +
 include/linux/syscalls.h               |    3 +
 include/linux/vm_event_item.h          |   33 +
 include/uapi/asm-generic/mman-common.h |   15 +
 include/uapi/linux/kernel-page-flags.h |    2 +
 kernel/events/uprobes.c                |    4 +-
 kernel/fork.c                          |   14 +
 kernel/sysctl.c                        |  101 +-
 mm/Makefile                            |    2 +
 mm/compaction.c                        |   17 +-
 mm/debug.c                             |    8 +-
 mm/exchange.c                          |  878 +++++++
 mm/filemap.c                           |    8 +
 mm/gup.c                               |   60 +-
 mm/huge_memory.c                       | 3360 ++++++++++++++++++++----
 mm/hugetlb.c                           |    4 +-
 mm/internal.h                          |   46 +
 mm/khugepaged.c                        |    7 +-
 mm/ksm.c                               |   39 +-
 mm/madvise.c                           |  121 +
 mm/mem_defrag.c                        | 1941 ++++++++++++++
 mm/memcontrol.c                        |   13 +
 mm/memory.c                            |   55 +-
 mm/migrate.c                           |   14 +-
 mm/mmap.c                              |   29 +
 mm/page_alloc.c                        |  108 +-
 mm/page_vma_mapped.c                   |  129 +-
 mm/pgtable-generic.c                   |   78 +-
 mm/rmap.c                              |  283 +-
 mm/swap.c                              |   38 +
 mm/swap_slots.c                        |    2 +
 mm/swapfile.c                          |    4 +-
 mm/userfaultfd.c                       |    2 +-
 mm/util.c                              |    7 +
 mm/vmscan.c                            |   55 +-
 mm/vmstat.c                            |   32 +
 61 files changed, 7452 insertions(+), 745 deletions(-)
 create mode 100644 include/linux/mem_defrag.h
 create mode 100644 include/linux/pagechain.h
 create mode 100644 mm/exchange.c
 create mode 100644 mm/mem_defrag.c

--
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-17 11:29   ` Matthew Wilcox
  2019-02-21 21:10   ` Jerome Glisse
  2019-02-15 22:08 ` [RFC PATCH 02/31] mm: migrate: Add THP exchange support Zi Yan
                   ` (30 subsequent siblings)
  31 siblings, 2 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

In stead of using two migrate_pages(), a single exchange_pages() would
be sufficient and without allocating new pages.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/ksm.h |   5 +
 mm/Makefile         |   1 +
 mm/exchange.c       | 846 ++++++++++++++++++++++++++++++++++++++++++++
 mm/internal.h       |   6 +
 mm/ksm.c            |  35 ++
 mm/migrate.c        |   4 +-
 6 files changed, 895 insertions(+), 2 deletions(-)
 create mode 100644 mm/exchange.c

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 161e8164abcf..87c5b943a73c 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -53,6 +53,7 @@ struct page *ksm_might_need_to_copy(struct page *page,
 
 void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc);
 void ksm_migrate_page(struct page *newpage, struct page *oldpage);
+void ksm_exchange_page(struct page *to_page, struct page *from_page);
 
 #else  /* !CONFIG_KSM */
 
@@ -86,6 +87,10 @@ static inline void rmap_walk_ksm(struct page *page,
 static inline void ksm_migrate_page(struct page *newpage, struct page *oldpage)
 {
 }
+static inline void ksm_exchange_page(struct page *to_page,
+				struct page *from_page)
+{
+}
 #endif /* CONFIG_MMU */
 #endif /* !CONFIG_KSM */
 
diff --git a/mm/Makefile b/mm/Makefile
index d210cc9d6f80..1574ea5743e4 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -43,6 +43,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 
 obj-y += init-mm.o
 obj-y += memblock.o
+obj-y += exchange.o
 
 ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
diff --git a/mm/exchange.c b/mm/exchange.c
new file mode 100644
index 000000000000..a607348cc6f4
--- /dev/null
+++ b/mm/exchange.c
@@ -0,0 +1,846 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2016 NVIDIA, Zi Yan <ziy@nvidia.com>
+ *
+ * Exchange two in-use pages. Page flags and page->mapping are exchanged
+ * as well. Only anonymous pages are supported.
+ */
+
+#include <linux/syscalls.h>
+#include <linux/migrate.h>
+#include <linux/security.h>
+#include <linux/cpuset.h>
+#include <linux/hugetlb.h>
+#include <linux/mm_inline.h>
+#include <linux/page_idle.h>
+#include <linux/page-flags.h>
+#include <linux/ksm.h>
+#include <linux/memcontrol.h>
+#include <linux/balloon_compaction.h>
+#include <linux/buffer_head.h>
+#include <linux/fs.h> /* buffer_migrate_page  */
+#include <linux/backing-dev.h>
+
+
+#include "internal.h"
+
+struct exchange_page_info {
+	struct page *from_page;
+	struct page *to_page;
+
+	struct anon_vma *from_anon_vma;
+	struct anon_vma *to_anon_vma;
+
+	struct list_head list;
+};
+
+struct page_flags {
+	unsigned int page_error :1;
+	unsigned int page_referenced:1;
+	unsigned int page_uptodate:1;
+	unsigned int page_active:1;
+	unsigned int page_unevictable:1;
+	unsigned int page_checked:1;
+	unsigned int page_mappedtodisk:1;
+	unsigned int page_dirty:1;
+	unsigned int page_is_young:1;
+	unsigned int page_is_idle:1;
+	unsigned int page_swapcache:1;
+	unsigned int page_writeback:1;
+	unsigned int page_private:1;
+	unsigned int __pad:3;
+};
+
+
+static void exchange_page(char *to, char *from)
+{
+	u64 tmp;
+	int i;
+
+	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
+		tmp = *((u64 *)(from + i));
+		*((u64 *)(from + i)) = *((u64 *)(to + i));
+		*((u64 *)(to + i)) = tmp;
+	}
+}
+
+static inline void exchange_highpage(struct page *to, struct page *from)
+{
+	char *vfrom, *vto;
+
+	vfrom = kmap_atomic(from);
+	vto = kmap_atomic(to);
+	exchange_page(vto, vfrom);
+	kunmap_atomic(vto);
+	kunmap_atomic(vfrom);
+}
+
+static void __exchange_gigantic_page(struct page *dst, struct page *src,
+				int nr_pages)
+{
+	int i;
+	struct page *dst_base = dst;
+	struct page *src_base = src;
+
+	for (i = 0; i < nr_pages; ) {
+		cond_resched();
+		exchange_highpage(dst, src);
+
+		i++;
+		dst = mem_map_next(dst, dst_base, i);
+		src = mem_map_next(src, src_base, i);
+	}
+}
+
+static void exchange_huge_page(struct page *dst, struct page *src)
+{
+	int i;
+	int nr_pages;
+
+	if (PageHuge(src)) {
+		/* hugetlbfs page */
+		struct hstate *h = page_hstate(src);
+
+		nr_pages = pages_per_huge_page(h);
+
+		if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
+			__exchange_gigantic_page(dst, src, nr_pages);
+			return;
+		}
+	} else {
+		/* thp page */
+		VM_BUG_ON(!PageTransHuge(src));
+		nr_pages = hpage_nr_pages(src);
+	}
+
+	for (i = 0; i < nr_pages; i++) {
+		cond_resched();
+		exchange_highpage(dst + i, src + i);
+	}
+}
+
+/*
+ * Copy the page to its new location without polluting cache
+ */
+static void exchange_page_flags(struct page *to_page, struct page *from_page)
+{
+	int from_cpupid, to_cpupid;
+	struct page_flags from_page_flags, to_page_flags;
+	struct mem_cgroup *to_memcg = page_memcg(to_page),
+					  *from_memcg = page_memcg(from_page);
+
+	from_cpupid = page_cpupid_xchg_last(from_page, -1);
+
+	from_page_flags.page_error = TestClearPageError(from_page);
+	from_page_flags.page_referenced = TestClearPageReferenced(from_page);
+	from_page_flags.page_uptodate = PageUptodate(from_page);
+	ClearPageUptodate(from_page);
+	from_page_flags.page_active = TestClearPageActive(from_page);
+	from_page_flags.page_unevictable = TestClearPageUnevictable(from_page);
+	from_page_flags.page_checked = PageChecked(from_page);
+	ClearPageChecked(from_page);
+	from_page_flags.page_mappedtodisk = PageMappedToDisk(from_page);
+	ClearPageMappedToDisk(from_page);
+	from_page_flags.page_dirty = PageDirty(from_page);
+	ClearPageDirty(from_page);
+	from_page_flags.page_is_young = test_and_clear_page_young(from_page);
+	from_page_flags.page_is_idle = page_is_idle(from_page);
+	clear_page_idle(from_page);
+	from_page_flags.page_swapcache = PageSwapCache(from_page);
+	from_page_flags.page_writeback = test_clear_page_writeback(from_page);
+
+
+	to_cpupid = page_cpupid_xchg_last(to_page, -1);
+
+	to_page_flags.page_error = TestClearPageError(to_page);
+	to_page_flags.page_referenced = TestClearPageReferenced(to_page);
+	to_page_flags.page_uptodate = PageUptodate(to_page);
+	ClearPageUptodate(to_page);
+	to_page_flags.page_active = TestClearPageActive(to_page);
+	to_page_flags.page_unevictable = TestClearPageUnevictable(to_page);
+	to_page_flags.page_checked = PageChecked(to_page);
+	ClearPageChecked(to_page);
+	to_page_flags.page_mappedtodisk = PageMappedToDisk(to_page);
+	ClearPageMappedToDisk(to_page);
+	to_page_flags.page_dirty = PageDirty(to_page);
+	ClearPageDirty(to_page);
+	to_page_flags.page_is_young = test_and_clear_page_young(to_page);
+	to_page_flags.page_is_idle = page_is_idle(to_page);
+	clear_page_idle(to_page);
+	to_page_flags.page_swapcache = PageSwapCache(to_page);
+	to_page_flags.page_writeback = test_clear_page_writeback(to_page);
+
+	/* set to_page */
+	if (from_page_flags.page_error)
+		SetPageError(to_page);
+	if (from_page_flags.page_referenced)
+		SetPageReferenced(to_page);
+	if (from_page_flags.page_uptodate)
+		SetPageUptodate(to_page);
+	if (from_page_flags.page_active) {
+		VM_BUG_ON_PAGE(from_page_flags.page_unevictable, from_page);
+		SetPageActive(to_page);
+	} else if (from_page_flags.page_unevictable)
+		SetPageUnevictable(to_page);
+	if (from_page_flags.page_checked)
+		SetPageChecked(to_page);
+	if (from_page_flags.page_mappedtodisk)
+		SetPageMappedToDisk(to_page);
+
+	/* Move dirty on pages not done by migrate_page_move_mapping() */
+	if (from_page_flags.page_dirty)
+		SetPageDirty(to_page);
+
+	if (from_page_flags.page_is_young)
+		set_page_young(to_page);
+	if (from_page_flags.page_is_idle)
+		set_page_idle(to_page);
+
+	/* set from_page */
+	if (to_page_flags.page_error)
+		SetPageError(from_page);
+	if (to_page_flags.page_referenced)
+		SetPageReferenced(from_page);
+	if (to_page_flags.page_uptodate)
+		SetPageUptodate(from_page);
+	if (to_page_flags.page_active) {
+		VM_BUG_ON_PAGE(to_page_flags.page_unevictable, from_page);
+		SetPageActive(from_page);
+	} else if (to_page_flags.page_unevictable)
+		SetPageUnevictable(from_page);
+	if (to_page_flags.page_checked)
+		SetPageChecked(from_page);
+	if (to_page_flags.page_mappedtodisk)
+		SetPageMappedToDisk(from_page);
+
+	/* Move dirty on pages not done by migrate_page_move_mapping() */
+	if (to_page_flags.page_dirty)
+		SetPageDirty(from_page);
+
+	if (to_page_flags.page_is_young)
+		set_page_young(from_page);
+	if (to_page_flags.page_is_idle)
+		set_page_idle(from_page);
+
+	/*
+	 * Copy NUMA information to the new page, to prevent over-eager
+	 * future migrations of this same page.
+	 */
+	page_cpupid_xchg_last(to_page, from_cpupid);
+	page_cpupid_xchg_last(from_page, to_cpupid);
+
+	ksm_exchange_page(to_page, from_page);
+	/*
+	 * Please do not reorder this without considering how mm/ksm.c's
+	 * get_ksm_page() depends upon ksm_migrate_page() and PageSwapCache().
+	 */
+	ClearPageSwapCache(to_page);
+	ClearPageSwapCache(from_page);
+	if (from_page_flags.page_swapcache)
+		SetPageSwapCache(to_page);
+	if (to_page_flags.page_swapcache)
+		SetPageSwapCache(from_page);
+
+
+#ifdef CONFIG_PAGE_OWNER
+	/* exchange page owner  */
+	BUILD_BUG();
+#endif
+	/* exchange mem cgroup  */
+	to_page->mem_cgroup = from_memcg;
+	from_page->mem_cgroup = to_memcg;
+
+}
+
+/*
+ * Replace the page in the mapping.
+ *
+ * The number of remaining references must be:
+ * 1 for anonymous pages without a mapping
+ * 2 for pages with a mapping
+ * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
+ */
+
+static int exchange_page_move_mapping(struct address_space *to_mapping,
+			struct address_space *from_mapping,
+			struct page *to_page, struct page *from_page,
+			struct buffer_head *to_head,
+			struct buffer_head *from_head,
+			enum migrate_mode mode,
+			int to_extra_count, int from_extra_count)
+{
+	int to_expected_count = 1 + to_extra_count,
+		from_expected_count = 1 + from_extra_count;
+	unsigned long from_page_index = from_page->index;
+	unsigned long to_page_index = to_page->index;
+	int to_swapbacked = PageSwapBacked(to_page),
+		from_swapbacked = PageSwapBacked(from_page);
+	struct address_space *to_mapping_value = to_page->mapping;
+	struct address_space *from_mapping_value = from_page->mapping;
+
+	VM_BUG_ON_PAGE(to_mapping != page_mapping(to_page), to_page);
+	VM_BUG_ON_PAGE(from_mapping != page_mapping(from_page), from_page);
+
+	if (!to_mapping) {
+		/* Anonymous page without mapping */
+		if (page_count(to_page) != to_expected_count)
+			return -EAGAIN;
+	}
+
+	if (!from_mapping) {
+		/* Anonymous page without mapping */
+		if (page_count(from_page) != from_expected_count)
+			return -EAGAIN;
+	}
+
+	/* both are anonymous pages  */
+	if (!from_mapping && !to_mapping) {
+		/* from_page  */
+		from_page->index = to_page_index;
+		from_page->mapping = to_mapping_value;
+
+		ClearPageSwapBacked(from_page);
+		if (to_swapbacked)
+			SetPageSwapBacked(from_page);
+
+
+		/* to_page  */
+		to_page->index = from_page_index;
+		to_page->mapping = from_mapping_value;
+
+		ClearPageSwapBacked(to_page);
+		if (from_swapbacked)
+			SetPageSwapBacked(to_page);
+	} else if (!from_mapping && to_mapping) {
+		/* from is anonymous, to is file-backed  */
+		struct zone *from_zone, *to_zone;
+		void **to_pslot;
+		int dirty;
+
+		from_zone = page_zone(from_page);
+		to_zone = page_zone(to_page);
+
+		xa_lock_irq(&to_mapping->i_pages);
+
+		to_pslot = radix_tree_lookup_slot(&to_mapping->i_pages,
+			page_index(to_page));
+
+		to_expected_count += 1 + page_has_private(to_page);
+		if (page_count(to_page) != to_expected_count ||
+			radix_tree_deref_slot_protected(to_pslot,
+				&to_mapping->i_pages.xa_lock) != to_page) {
+			xa_unlock_irq(&to_mapping->i_pages);
+			return -EAGAIN;
+		}
+
+		if (!page_ref_freeze(to_page, to_expected_count)) {
+			xa_unlock_irq(&to_mapping->i_pages);
+			pr_debug("cannot freeze page count\n");
+			return -EAGAIN;
+		}
+
+		if (mode == MIGRATE_ASYNC && to_head &&
+				!buffer_migrate_lock_buffers(to_head, mode)) {
+			page_ref_unfreeze(to_page, to_expected_count);
+			xa_unlock_irq(&to_mapping->i_pages);
+
+			pr_debug("cannot lock buffer head\n");
+			return -EAGAIN;
+		}
+
+		if (!page_ref_freeze(from_page, from_expected_count)) {
+			page_ref_unfreeze(to_page, to_expected_count);
+			xa_unlock_irq(&to_mapping->i_pages);
+
+			return -EAGAIN;
+		}
+		/*
+		 * Now we know that no one else is looking at the page:
+		 * no turning back from here.
+		 */
+		ClearPageSwapBacked(from_page);
+		ClearPageSwapBacked(to_page);
+
+		/* from_page  */
+		from_page->index = to_page_index;
+		from_page->mapping = to_mapping_value;
+		/* to_page  */
+		to_page->index = from_page_index;
+		to_page->mapping = from_mapping_value;
+
+		if (to_swapbacked)
+			__SetPageSwapBacked(from_page);
+		else
+			VM_BUG_ON_PAGE(PageSwapCache(to_page), to_page);
+
+		if (from_swapbacked)
+			__SetPageSwapBacked(to_page);
+		else
+			VM_BUG_ON_PAGE(PageSwapCache(from_page), from_page);
+
+		dirty = PageDirty(to_page);
+
+		radix_tree_replace_slot(&to_mapping->i_pages,
+				to_pslot, from_page);
+
+		/* move cache reference */
+		page_ref_unfreeze(to_page, to_expected_count - 1);
+		page_ref_unfreeze(from_page, from_expected_count + 1);
+
+		xa_unlock(&to_mapping->i_pages);
+
+		/*
+		 * If moved to a different zone then also account
+		 * the page for that zone. Other VM counters will be
+		 * taken care of when we establish references to the
+		 * new page and drop references to the old page.
+		 *
+		 * Note that anonymous pages are accounted for
+		 * via NR_FILE_PAGES and NR_ANON_MAPPED if they
+		 * are mapped to swap space.
+		 */
+		if (to_zone != from_zone) {
+			__dec_node_state(to_zone->zone_pgdat, NR_FILE_PAGES);
+			__inc_node_state(from_zone->zone_pgdat, NR_FILE_PAGES);
+			if (PageSwapBacked(to_page) && !PageSwapCache(to_page)) {
+				__dec_node_state(to_zone->zone_pgdat, NR_SHMEM);
+				__inc_node_state(from_zone->zone_pgdat, NR_SHMEM);
+			}
+			if (dirty && mapping_cap_account_dirty(to_mapping)) {
+				__dec_node_state(to_zone->zone_pgdat, NR_FILE_DIRTY);
+				__dec_zone_state(to_zone, NR_ZONE_WRITE_PENDING);
+				__inc_node_state(from_zone->zone_pgdat, NR_FILE_DIRTY);
+				__inc_zone_state(from_zone, NR_ZONE_WRITE_PENDING);
+			}
+		}
+		local_irq_enable();
+
+	} else {
+		/* from is file-backed to is anonymous: fold this to the case above */
+		/* both are file-backed  */
+		VM_BUG_ON(1);
+	}
+
+	return MIGRATEPAGE_SUCCESS;
+}
+
+static int exchange_from_to_pages(struct page *to_page, struct page *from_page,
+				enum migrate_mode mode)
+{
+	int rc = -EBUSY;
+	struct address_space *to_page_mapping, *from_page_mapping;
+	struct buffer_head *to_head = NULL, *to_bh = NULL;
+
+	VM_BUG_ON_PAGE(!PageLocked(from_page), from_page);
+	VM_BUG_ON_PAGE(!PageLocked(to_page), to_page);
+
+	/* copy page->mapping not use page_mapping()  */
+	to_page_mapping = page_mapping(to_page);
+	from_page_mapping = page_mapping(from_page);
+
+	/* from_page has to be anonymous page  */
+	VM_BUG_ON(from_page_mapping);
+	VM_BUG_ON(PageWriteback(from_page));
+	/* writeback has to finish */
+	BUG_ON(PageWriteback(to_page));
+
+
+	/* to_page is anonymous  */
+	if (!to_page_mapping) {
+exchange_mappings:
+		/* actual page mapping exchange */
+		rc = exchange_page_move_mapping(to_page_mapping, from_page_mapping,
+					to_page, from_page, NULL, NULL, mode, 0, 0);
+	} else {
+		if (to_page_mapping->a_ops->migratepage == buffer_migrate_page) {
+
+			if (!page_has_buffers(to_page))
+				goto exchange_mappings;
+
+			to_head = page_buffers(to_page);
+
+			rc = exchange_page_move_mapping(to_page_mapping,
+					from_page_mapping, to_page, from_page,
+					to_head, NULL, mode, 0, 0);
+
+			if (rc != MIGRATEPAGE_SUCCESS)
+				return rc;
+
+			/*
+			 * In the async case, migrate_page_move_mapping locked the buffers
+			 * with an IRQ-safe spinlock held. In the sync case, the buffers
+			 * need to be locked now
+			 */
+			if (mode != MIGRATE_ASYNC)
+				VM_BUG_ON(!buffer_migrate_lock_buffers(to_head, mode));
+
+			ClearPagePrivate(to_page);
+			set_page_private(from_page, page_private(to_page));
+			set_page_private(to_page, 0);
+			/* transfer private page count  */
+			put_page(to_page);
+			get_page(from_page);
+
+			to_bh = to_head;
+			do {
+				set_bh_page(to_bh, from_page, bh_offset(to_bh));
+				to_bh = to_bh->b_this_page;
+
+			} while (to_bh != to_head);
+
+			SetPagePrivate(from_page);
+
+			to_bh = to_head;
+		} else if (!to_page_mapping->a_ops->migratepage) {
+			/* fallback_migrate_page  */
+			if (PageDirty(to_page)) {
+				if (mode != MIGRATE_SYNC)
+					return -EBUSY;
+				return writeout(to_page_mapping, to_page);
+			}
+			if (page_has_private(to_page) &&
+				!try_to_release_page(to_page, GFP_KERNEL))
+				return -EAGAIN;
+
+			goto exchange_mappings;
+		}
+	}
+	/* actual page data exchange  */
+	if (rc != MIGRATEPAGE_SUCCESS)
+		return rc;
+
+
+	if (PageHuge(from_page) || PageTransHuge(from_page))
+		exchange_huge_page(to_page, from_page);
+	else
+		exchange_highpage(to_page, from_page);
+	rc = 0;
+
+	/*
+	 * 1. buffer_migrate_page:
+	 *   private flag should be transferred from to_page to from_page
+	 *
+	 * 2. anon<->anon, fallback_migrate_page:
+	 *   both have none private flags or to_page's is cleared.
+	 */
+	VM_BUG_ON(!((page_has_private(from_page) && !page_has_private(to_page)) ||
+				(!page_has_private(from_page) && !page_has_private(to_page))));
+
+	exchange_page_flags(to_page, from_page);
+
+	if (to_bh) {
+		VM_BUG_ON(to_bh != to_head);
+		do {
+			unlock_buffer(to_bh);
+			put_bh(to_bh);
+			to_bh = to_bh->b_this_page;
+
+		} while (to_bh != to_head);
+	}
+
+	return rc;
+}
+
+static int unmap_and_exchange(struct page *from_page,
+		struct page *to_page, enum migrate_mode mode)
+{
+	int rc = -EAGAIN;
+	struct anon_vma *from_anon_vma = NULL;
+	struct anon_vma *to_anon_vma = NULL;
+	int from_page_was_mapped = 0;
+	int to_page_was_mapped = 0;
+	int from_page_count = 0, to_page_count = 0;
+	int from_map_count = 0, to_map_count = 0;
+	unsigned long from_flags, to_flags;
+	pgoff_t from_index, to_index;
+	struct address_space *from_mapping, *to_mapping;
+
+	if (!trylock_page(from_page)) {
+		if (mode == MIGRATE_ASYNC)
+			goto out;
+		lock_page(from_page);
+	}
+
+	if (!trylock_page(to_page)) {
+		if (mode == MIGRATE_ASYNC)
+			goto out_unlock;
+		lock_page(to_page);
+	}
+
+	/* from_page is supposed to be an anonymous page */
+	VM_BUG_ON_PAGE(PageWriteback(from_page), from_page);
+
+	if (PageWriteback(to_page)) {
+		/*
+		 * Only in the case of a full synchronous migration is it
+		 * necessary to wait for PageWriteback. In the async case,
+		 * the retry loop is too short and in the sync-light case,
+		 * the overhead of stalling is too much
+		 */
+		if (mode != MIGRATE_SYNC) {
+			rc = -EBUSY;
+			goto out_unlock_both;
+		}
+		wait_on_page_writeback(to_page);
+	}
+
+	if (PageAnon(from_page) && !PageKsm(from_page))
+		from_anon_vma = page_get_anon_vma(from_page);
+
+	if (PageAnon(to_page) && !PageKsm(to_page))
+		to_anon_vma = page_get_anon_vma(to_page);
+
+	from_page_count = page_count(from_page);
+	from_map_count = page_mapcount(from_page);
+	to_page_count = page_count(to_page);
+	to_map_count = page_mapcount(to_page);
+	from_flags = from_page->flags;
+	to_flags = to_page->flags;
+	from_mapping = from_page->mapping;
+	to_mapping = to_page->mapping;
+	from_index = from_page->index;
+	to_index = to_page->index;
+
+	/*
+	 * Corner case handling:
+	 * 1. When a new swap-cache page is read into, it is added to the LRU
+	 * and treated as swapcache but it has no rmap yet.
+	 * Calling try_to_unmap() against a page->mapping==NULL page will
+	 * trigger a BUG.  So handle it here.
+	 * 2. An orphaned page (see truncate_complete_page) might have
+	 * fs-private metadata. The page can be picked up due to memory
+	 * offlining.  Everywhere else except page reclaim, the page is
+	 * invisible to the vm, so the page can not be migrated.  So try to
+	 * free the metadata, so the page can be freed.
+	 */
+	if (!from_page->mapping) {
+		VM_BUG_ON_PAGE(PageAnon(from_page), from_page);
+		if (page_has_private(from_page)) {
+			try_to_free_buffers(from_page);
+			goto out_unlock_both;
+		}
+	} else if (page_mapped(from_page)) {
+		/* Establish migration ptes */
+		VM_BUG_ON_PAGE(PageAnon(from_page) && !PageKsm(from_page) &&
+					   !from_anon_vma, from_page);
+		try_to_unmap(from_page,
+			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+		from_page_was_mapped = 1;
+	}
+
+	if (!to_page->mapping) {
+		VM_BUG_ON_PAGE(PageAnon(to_page), to_page);
+		if (page_has_private(to_page)) {
+			try_to_free_buffers(to_page);
+			goto out_unlock_both_remove_from_migration_pte;
+		}
+	} else if (page_mapped(to_page)) {
+		/* Establish migration ptes */
+		VM_BUG_ON_PAGE(PageAnon(to_page) && !PageKsm(to_page) &&
+						!to_anon_vma, to_page);
+		try_to_unmap(to_page,
+			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+		to_page_was_mapped = 1;
+	}
+
+	if (!page_mapped(from_page) && !page_mapped(to_page))
+		rc = exchange_from_to_pages(to_page, from_page, mode);
+
+
+	if (to_page_was_mapped) {
+		/* swap back to_page->index to be compatible with
+		 * remove_migration_ptes(), which assumes both from_page and to_page
+		 * below have the same index.
+		 */
+		if (rc == MIGRATEPAGE_SUCCESS)
+			swap(to_page->index, to_index);
+
+		remove_migration_ptes(to_page,
+			rc == MIGRATEPAGE_SUCCESS ? from_page : to_page, false);
+
+		if (rc == MIGRATEPAGE_SUCCESS)
+			swap(to_page->index, to_index);
+	}
+
+out_unlock_both_remove_from_migration_pte:
+	if (from_page_was_mapped) {
+		/* swap back from_page->index to be compatible with
+		 * remove_migration_ptes(), which assumes both from_page and to_page
+		 * below have the same index.
+		 */
+		if (rc == MIGRATEPAGE_SUCCESS)
+			swap(from_page->index, from_index);
+
+		remove_migration_ptes(from_page,
+			rc == MIGRATEPAGE_SUCCESS ? to_page : from_page, false);
+
+		if (rc == MIGRATEPAGE_SUCCESS)
+			swap(from_page->index, from_index);
+	}
+
+out_unlock_both:
+	if (to_anon_vma)
+		put_anon_vma(to_anon_vma);
+	unlock_page(to_page);
+out_unlock:
+	/* Drop an anon_vma reference if we took one */
+	if (from_anon_vma)
+		put_anon_vma(from_anon_vma);
+	unlock_page(from_page);
+out:
+	return rc;
+}
+
+/*
+ * Exchange pages in the exchange_list
+ *
+ * Caller should release the exchange_list resource.
+ *
+ */
+static int exchange_pages(struct list_head *exchange_list,
+			enum migrate_mode mode,
+			int reason)
+{
+	struct exchange_page_info *one_pair, *one_pair2;
+	int failed = 0;
+
+	list_for_each_entry_safe(one_pair, one_pair2, exchange_list, list) {
+		struct page *from_page = one_pair->from_page;
+		struct page *to_page = one_pair->to_page;
+		int rc;
+		int retry = 0;
+
+again:
+		if (page_count(from_page) == 1) {
+			/* page was freed from under us. So we are done  */
+			ClearPageActive(from_page);
+			ClearPageUnevictable(from_page);
+
+			mod_node_page_state(page_pgdat(from_page), NR_ISOLATED_ANON +
+					page_is_file_cache(from_page),
+					-hpage_nr_pages(from_page));
+			put_page(from_page);
+
+			if (page_count(to_page) == 1) {
+				ClearPageActive(to_page);
+				ClearPageUnevictable(to_page);
+				put_page(to_page);
+				mod_node_page_state(page_pgdat(to_page), NR_ISOLATED_ANON +
+						page_is_file_cache(to_page),
+						-hpage_nr_pages(to_page));
+			} else
+				goto putback_to_page;
+
+			continue;
+		}
+
+		if (page_count(to_page) == 1) {
+			/* page was freed from under us. So we are done  */
+			ClearPageActive(to_page);
+			ClearPageUnevictable(to_page);
+
+			mod_node_page_state(page_pgdat(to_page), NR_ISOLATED_ANON +
+					page_is_file_cache(to_page),
+					-hpage_nr_pages(to_page));
+			put_page(to_page);
+
+			mod_node_page_state(page_pgdat(from_page), NR_ISOLATED_ANON +
+					page_is_file_cache(from_page),
+					-hpage_nr_pages(from_page));
+			putback_lru_page(from_page);
+			continue;
+		}
+
+		/* TODO: compound page not supported */
+		/* to_page can be file-backed page  */
+		if (PageCompound(from_page) ||
+			page_mapping(from_page)
+			) {
+			++failed;
+			goto putback;
+		}
+
+		rc = unmap_and_exchange(from_page, to_page, mode);
+
+		if (rc == -EAGAIN && retry < 3) {
+			++retry;
+			goto again;
+		}
+
+		if (rc != MIGRATEPAGE_SUCCESS)
+			++failed;
+
+putback:
+		mod_node_page_state(page_pgdat(from_page), NR_ISOLATED_ANON +
+				page_is_file_cache(from_page),
+				-hpage_nr_pages(from_page));
+
+		putback_lru_page(from_page);
+putback_to_page:
+		mod_node_page_state(page_pgdat(to_page), NR_ISOLATED_ANON +
+				page_is_file_cache(to_page),
+				-hpage_nr_pages(to_page));
+
+		putback_lru_page(to_page);
+	}
+	return failed;
+}
+
+int exchange_two_pages(struct page *page1, struct page *page2)
+{
+	struct exchange_page_info page_info;
+	LIST_HEAD(exchange_list);
+	int err = -EFAULT;
+	int pagevec_flushed = 0;
+
+	VM_BUG_ON_PAGE(PageTail(page1), page1);
+	VM_BUG_ON_PAGE(PageTail(page2), page2);
+
+	if (!(PageLRU(page1) && PageLRU(page2)))
+		return -EBUSY;
+
+retry_isolate1:
+	if (!get_page_unless_zero(page1))
+		return -EBUSY;
+	err = isolate_lru_page(page1);
+	put_page(page1);
+	if (err) {
+		if (!pagevec_flushed) {
+			migrate_prep();
+			pagevec_flushed = 1;
+			goto retry_isolate1;
+		}
+		return err;
+	}
+	mod_node_page_state(page_pgdat(page1),
+			NR_ISOLATED_ANON + page_is_file_cache(page1),
+			hpage_nr_pages(page1));
+
+retry_isolate2:
+	if (!get_page_unless_zero(page2)) {
+		putback_lru_page(page1);
+		return -EBUSY;
+	}
+	err = isolate_lru_page(page2);
+	put_page(page2);
+	if (err) {
+		if (!pagevec_flushed) {
+			migrate_prep();
+			pagevec_flushed = 1;
+			goto retry_isolate2;
+		}
+		return err;
+	}
+	mod_node_page_state(page_pgdat(page2),
+			NR_ISOLATED_ANON + page_is_file_cache(page2),
+			hpage_nr_pages(page2));
+
+	page_info.from_page = page1;
+	page_info.to_page = page2;
+	INIT_LIST_HEAD(&page_info.list);
+	list_add(&page_info.list, &exchange_list);
+
+
+	return exchange_pages(&exchange_list, MIGRATE_SYNC, 0);
+
+}
diff --git a/mm/internal.h b/mm/internal.h
index f4a7bb02decf..77e205c423ce 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -543,4 +543,10 @@ static inline bool is_migrate_highatomic_page(struct page *page)
 
 void setup_zone_pageset(struct zone *zone);
 extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
+
+bool buffer_migrate_lock_buffers(struct buffer_head *head,
+							enum migrate_mode mode);
+int writeout(struct address_space *mapping, struct page *page);
+extern int exchange_two_pages(struct page *page1, struct page *page2);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/ksm.c b/mm/ksm.c
index 6c48ad13b4c9..dc1ec06b71a0 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2665,6 +2665,41 @@ void ksm_migrate_page(struct page *newpage, struct page *oldpage)
 		set_page_stable_node(oldpage, NULL);
 	}
 }
+
+void ksm_exchange_page(struct page *to_page, struct page *from_page)
+{
+	struct stable_node *to_stable_node, *from_stable_node;
+
+	VM_BUG_ON_PAGE(!PageLocked(to_page), to_page);
+	VM_BUG_ON_PAGE(!PageLocked(from_page), from_page);
+
+	to_stable_node = page_stable_node(to_page);
+	from_stable_node = page_stable_node(from_page);
+	if (to_stable_node) {
+		VM_BUG_ON_PAGE(to_stable_node->kpfn != page_to_pfn(from_page),
+					from_page);
+		to_stable_node->kpfn = page_to_pfn(to_page);
+		/*
+		 * newpage->mapping was set in advance; now we need smp_wmb()
+		 * to make sure that the new stable_node->kpfn is visible
+		 * to get_ksm_page() before it can see that oldpage->mapping
+		 * has gone stale (or that PageSwapCache has been cleared).
+		 */
+		smp_wmb();
+	}
+	if (from_stable_node) {
+		VM_BUG_ON_PAGE(from_stable_node->kpfn != page_to_pfn(to_page),
+					to_page);
+		from_stable_node->kpfn = page_to_pfn(from_page);
+		/*
+		 * newpage->mapping was set in advance; now we need smp_wmb()
+		 * to make sure that the new stable_node->kpfn is visible
+		 * to get_ksm_page() before it can see that oldpage->mapping
+		 * has gone stale (or that PageSwapCache has been cleared).
+		 */
+		smp_wmb();
+	}
+}
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/mm/migrate.c b/mm/migrate.c
index d4fd680be3b0..b8c79aa62134 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -701,7 +701,7 @@ EXPORT_SYMBOL(migrate_page);
 
 #ifdef CONFIG_BLOCK
 /* Returns true if all buffers are successfully locked */
-static bool buffer_migrate_lock_buffers(struct buffer_head *head,
+bool buffer_migrate_lock_buffers(struct buffer_head *head,
 							enum migrate_mode mode)
 {
 	struct buffer_head *bh = head;
@@ -849,7 +849,7 @@ int buffer_migrate_page_norefs(struct address_space *mapping,
 /*
  * Writeback a page to clean the dirty state
  */
-static int writeout(struct address_space *mapping, struct page *page)
+int writeout(struct address_space *mapping, struct page *page)
 {
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_NONE,
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 02/31] mm: migrate: Add THP exchange support.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 03/31] mm: migrate: Add tmpfs " Zi Yan
                   ` (29 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Add support for exchange between two THPs.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/exchange.c | 47 ++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 38 insertions(+), 9 deletions(-)

diff --git a/mm/exchange.c b/mm/exchange.c
index a607348cc6f4..8cf286fc0f10 100644
--- a/mm/exchange.c
+++ b/mm/exchange.c
@@ -48,7 +48,8 @@ struct page_flags {
 	unsigned int page_swapcache:1;
 	unsigned int page_writeback:1;
 	unsigned int page_private:1;
-	unsigned int __pad:3;
+	unsigned int page_doublemap:1;
+	unsigned int __pad:2;
 };
 
 
@@ -125,20 +126,23 @@ static void exchange_huge_page(struct page *dst, struct page *src)
 static void exchange_page_flags(struct page *to_page, struct page *from_page)
 {
 	int from_cpupid, to_cpupid;
-	struct page_flags from_page_flags, to_page_flags;
+	struct page_flags from_page_flags = {0}, to_page_flags = {0};
 	struct mem_cgroup *to_memcg = page_memcg(to_page),
 					  *from_memcg = page_memcg(from_page);
 
 	from_cpupid = page_cpupid_xchg_last(from_page, -1);
 
-	from_page_flags.page_error = TestClearPageError(from_page);
+	from_page_flags.page_error = PageError(from_page);
+	if (from_page_flags.page_error)
+		ClearPageError(from_page);
 	from_page_flags.page_referenced = TestClearPageReferenced(from_page);
 	from_page_flags.page_uptodate = PageUptodate(from_page);
 	ClearPageUptodate(from_page);
 	from_page_flags.page_active = TestClearPageActive(from_page);
 	from_page_flags.page_unevictable = TestClearPageUnevictable(from_page);
 	from_page_flags.page_checked = PageChecked(from_page);
-	ClearPageChecked(from_page);
+	if (from_page_flags.page_checked)
+		ClearPageChecked(from_page);
 	from_page_flags.page_mappedtodisk = PageMappedToDisk(from_page);
 	ClearPageMappedToDisk(from_page);
 	from_page_flags.page_dirty = PageDirty(from_page);
@@ -148,18 +152,22 @@ static void exchange_page_flags(struct page *to_page, struct page *from_page)
 	clear_page_idle(from_page);
 	from_page_flags.page_swapcache = PageSwapCache(from_page);
 	from_page_flags.page_writeback = test_clear_page_writeback(from_page);
+	from_page_flags.page_doublemap = PageDoubleMap(from_page);
 
 
 	to_cpupid = page_cpupid_xchg_last(to_page, -1);
 
-	to_page_flags.page_error = TestClearPageError(to_page);
+	to_page_flags.page_error = PageError(to_page);
+	if (to_page_flags.page_error)
+		ClearPageError(to_page);
 	to_page_flags.page_referenced = TestClearPageReferenced(to_page);
 	to_page_flags.page_uptodate = PageUptodate(to_page);
 	ClearPageUptodate(to_page);
 	to_page_flags.page_active = TestClearPageActive(to_page);
 	to_page_flags.page_unevictable = TestClearPageUnevictable(to_page);
 	to_page_flags.page_checked = PageChecked(to_page);
-	ClearPageChecked(to_page);
+	if (to_page_flags.page_checked)
+		ClearPageChecked(to_page);
 	to_page_flags.page_mappedtodisk = PageMappedToDisk(to_page);
 	ClearPageMappedToDisk(to_page);
 	to_page_flags.page_dirty = PageDirty(to_page);
@@ -169,6 +177,7 @@ static void exchange_page_flags(struct page *to_page, struct page *from_page)
 	clear_page_idle(to_page);
 	to_page_flags.page_swapcache = PageSwapCache(to_page);
 	to_page_flags.page_writeback = test_clear_page_writeback(to_page);
+	to_page_flags.page_doublemap = PageDoubleMap(to_page);
 
 	/* set to_page */
 	if (from_page_flags.page_error)
@@ -195,6 +204,8 @@ static void exchange_page_flags(struct page *to_page, struct page *from_page)
 		set_page_young(to_page);
 	if (from_page_flags.page_is_idle)
 		set_page_idle(to_page);
+	if (from_page_flags.page_doublemap)
+		SetPageDoubleMap(to_page);
 
 	/* set from_page */
 	if (to_page_flags.page_error)
@@ -221,6 +232,8 @@ static void exchange_page_flags(struct page *to_page, struct page *from_page)
 		set_page_young(from_page);
 	if (to_page_flags.page_is_idle)
 		set_page_idle(from_page);
+	if (to_page_flags.page_doublemap)
+		SetPageDoubleMap(from_page);
 
 	/*
 	 * Copy NUMA information to the new page, to prevent over-eager
@@ -280,6 +293,7 @@ static int exchange_page_move_mapping(struct address_space *to_mapping,
 
 	VM_BUG_ON_PAGE(to_mapping != page_mapping(to_page), to_page);
 	VM_BUG_ON_PAGE(from_mapping != page_mapping(from_page), from_page);
+	VM_BUG_ON(PageCompound(from_page) != PageCompound(to_page));
 
 	if (!to_mapping) {
 		/* Anonymous page without mapping */
@@ -600,7 +614,6 @@ static int unmap_and_exchange(struct page *from_page,
 	to_mapping = to_page->mapping;
 	from_index = from_page->index;
 	to_index = to_page->index;
-
 	/*
 	 * Corner case handling:
 	 * 1. When a new swap-cache page is read into, it is added to the LRU
@@ -691,6 +704,23 @@ static int unmap_and_exchange(struct page *from_page,
 	return rc;
 }
 
+static bool can_be_exchanged(struct page *from, struct page *to)
+{
+	if (PageCompound(from) != PageCompound(to))
+		return false;
+
+	if (PageHuge(from) != PageHuge(to))
+		return false;
+
+	if (PageHuge(from) || PageHuge(to))
+		return false;
+
+	if (compound_order(from) != compound_order(to))
+		return false;
+
+	return true;
+}
+
 /*
  * Exchange pages in the exchange_list
  *
@@ -751,9 +781,8 @@ static int exchange_pages(struct list_head *exchange_list,
 			continue;
 		}
 
-		/* TODO: compound page not supported */
 		/* to_page can be file-backed page  */
-		if (PageCompound(from_page) ||
+		if (!can_be_exchanged(from_page, to_page) ||
 			page_mapping(from_page)
 			) {
 			++failed;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 03/31] mm: migrate: Add tmpfs exchange support.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 02/31] mm: migrate: Add THP exchange support Zi Yan
@ 2019-02-15 22:08 ` " Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 04/31] mm: add mem_defrag functionality Zi Yan
                   ` (28 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

tmpfs uses the same migrate routine as anonymous pages, enabling
exchange pages for it is easy.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/exchange.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/exchange.c b/mm/exchange.c
index 8cf286fc0f10..851f1a99b48b 100644
--- a/mm/exchange.c
+++ b/mm/exchange.c
@@ -466,7 +466,10 @@ static int exchange_from_to_pages(struct page *to_page, struct page *from_page,
 		rc = exchange_page_move_mapping(to_page_mapping, from_page_mapping,
 					to_page, from_page, NULL, NULL, mode, 0, 0);
 	} else {
-		if (to_page_mapping->a_ops->migratepage == buffer_migrate_page) {
+		/* shmem */
+		if (to_page_mapping->a_ops->migratepage == migrate_page)
+			goto exchange_mappings;
+		else if (to_page_mapping->a_ops->migratepage == buffer_migrate_page) {
 
 			if (!page_has_buffers(to_page))
 				goto exchange_mappings;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 04/31] mm: add mem_defrag functionality.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (2 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 03/31] mm: migrate: Add tmpfs " Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 05/31] mem_defrag: split a THP if either src or dst is THP only Zi Yan
                   ` (27 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Create contiguous physical memory regions by migrating/exchanging pages.

1. It scans all VMAs in one process and determines an anchor pair
(VPN, PFN) for each VMA.
2. Then, it migrates/exchanges pages in each VMA to make
them aligned with the anchor pair.

At the end of day, a physically contiguous region should be created for
each VMA in the physical address range [PFN, PFN + VMA size).

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 fs/exec.c                              |    4 +
 include/linux/mem_defrag.h             |   60 +
 include/linux/mm.h                     |   12 +
 include/linux/mm_types.h               |    4 +
 include/linux/sched/coredump.h         |    3 +
 include/linux/syscalls.h               |    3 +
 include/linux/vm_event_item.h          |   23 +
 include/uapi/asm-generic/mman-common.h |    3 +
 kernel/fork.c                          |    9 +
 kernel/sysctl.c                        |   79 +-
 mm/Makefile                            |    1 +
 mm/compaction.c                        |   17 +-
 mm/huge_memory.c                       |    4 +
 mm/internal.h                          |   28 +
 mm/khugepaged.c                        |    1 +
 mm/madvise.c                           |   15 +
 mm/mem_defrag.c                        | 1782 ++++++++++++++++++++++++
 mm/memory.c                            |    7 +
 mm/mmap.c                              |   29 +
 mm/page_alloc.c                        |    4 +-
 mm/vmstat.c                            |   21 +
 22 files changed, 2096 insertions(+), 14 deletions(-)
 create mode 100644 include/linux/mem_defrag.h
 create mode 100644 mm/mem_defrag.c

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..374c11e3cf80 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,7 @@
 332	common	statx			__x64_sys_statx
 333	common	io_pgetevents		__x64_sys_io_pgetevents
 334	common	rseq			__x64_sys_rseq
+335	common	scan_process_memory		__x64_sys_scan_process_memory
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/exec.c b/fs/exec.c
index fb72d36f7823..b71b9d305d7d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1010,7 +1010,11 @@ static int exec_mmap(struct mm_struct *mm)
 {
 	struct task_struct *tsk;
 	struct mm_struct *old_mm, *active_mm;
+	int move_mem_defrag = current->mm ?
+		test_bit(MMF_VM_MEM_DEFRAG_ALL, &current->mm->flags):0;
 
+	if (move_mem_defrag && mm)
+		set_bit(MMF_VM_MEM_DEFRAG_ALL, &mm->flags);
 	/* Notify parent that we're no longer interested in the old VM */
 	tsk = current;
 	old_mm = current->mm;
diff --git a/include/linux/mem_defrag.h b/include/linux/mem_defrag.h
new file mode 100644
index 000000000000..43954a316752
--- /dev/null
+++ b/include/linux/mem_defrag.h
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * mem_defrag.h Memory defragmentation function.
+ *
+ * Copyright (C) 2019 Zi Yan <ziy@nvidia.com>
+ *
+ *
+ */
+#ifndef _LINUX_KMEM_DEFRAGD_H
+#define _LINUX_KMEM_DEFRAGD_H
+
+#include <linux/sched/coredump.h> /* MMF_VM_MEM_DEFRAG */
+
+#define MEM_DEFRAG_SCAN				0
+#define MEM_DEFRAG_MARK_SCAN_ALL	1
+#define MEM_DEFRAG_CLEAR_SCAN_ALL	2
+#define MEM_DEFRAG_DEFRAG			3
+#define MEM_DEFRAG_CONTIG_SCAN		5
+
+enum mem_defrag_action {
+	MEM_DEFRAG_FULL_STATS = 0,
+	MEM_DEFRAG_DO_DEFRAG,
+	MEM_DEFRAG_CONTIG_STATS,
+};
+
+extern int kmem_defragd_always;
+
+extern int __kmem_defragd_enter(struct mm_struct *mm);
+extern void __kmem_defragd_exit(struct mm_struct *mm);
+extern int memdefrag_madvise(struct vm_area_struct *vma,
+		     unsigned long *vm_flags, int advice);
+
+static inline int kmem_defragd_fork(struct mm_struct *mm,
+		struct mm_struct *oldmm)
+{
+	if (test_bit(MMF_VM_MEM_DEFRAG, &oldmm->flags))
+		return __kmem_defragd_enter(mm);
+	return 0;
+}
+
+static inline void kmem_defragd_exit(struct mm_struct *mm)
+{
+	if (test_bit(MMF_VM_MEM_DEFRAG, &mm->flags))
+		__kmem_defragd_exit(mm);
+}
+
+static inline int kmem_defragd_enter(struct vm_area_struct *vma,
+				   unsigned long vm_flags)
+{
+	if (!test_bit(MMF_VM_MEM_DEFRAG, &vma->vm_mm->flags))
+		if (((kmem_defragd_always ||
+		     ((vm_flags & VM_MEMDEFRAG))) &&
+		    !(vm_flags & VM_NOMEMDEFRAG)) ||
+			test_bit(MMF_VM_MEM_DEFRAG_ALL, &vma->vm_mm->flags))
+			if (__kmem_defragd_enter(vma->vm_mm))
+				return -ENOMEM;
+	return 0;
+}
+
+#endif /* _LINUX_KMEM_DEFRAGD_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..5bcc1b03372a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -251,13 +251,20 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_6	38	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
+#define VM_HIGH_ARCH_6	BIT(VM_HIGH_ARCH_BIT_6)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
+#define VM_MEMDEFRAG	VM_HIGH_ARCH_5	/* memory defrag */
+#define VM_NOMEMDEFRAG	VM_HIGH_ARCH_6	/* no memory defrag */
+
 #ifdef CONFIG_ARCH_HAS_PKEYS
 # define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
 # define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 4-bit value */
@@ -487,6 +494,9 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	vma->vm_mm = mm;
 	vma->vm_ops = &dummy_vm_ops;
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
+	vma->anchor_page_rb = RB_ROOT_CACHED;
+	vma->vma_create_jiffies = jiffies;
+	vma->vma_defrag_jiffies = 0;
 }
 
 static inline void vma_set_anonymous(struct vm_area_struct *vma)
@@ -2837,6 +2847,8 @@ static inline bool debug_guardpage_enabled(void) { return false; }
 static inline bool page_is_guard(struct page *page) { return false; }
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
+void free_anchor_pages(struct vm_area_struct *vma);
+
 #if MAX_NUMNODES > 1
 void __init setup_nr_node_ids(void);
 #else
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2c471a2c43fa..32549b255d25 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -328,6 +328,10 @@ struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+	struct rb_root_cached anchor_page_rb;
+	unsigned long vma_create_jiffies; /* life time of the vma  */
+	unsigned long vma_modify_jiffies; /* being modified time of the vma */
+	unsigned long vma_defrag_jiffies; /* being defragged time of the vma */
 } __randomize_layout;
 
 struct core_thread {
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index ecdc6542070f..52ad71db6687 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -76,5 +76,8 @@ static inline int get_dumpable(struct mm_struct *mm)
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
 				 MMF_DISABLE_THP_MASK)
+#define MMF_VM_MEM_DEFRAG	26	/* set mm is added to do mem defrag */
+#define MMF_VM_MEM_DEFRAG_ALL	27	/* all vmas in the mm will be mem defrag */
+
 
 #endif /* _LINUX_SCHED_COREDUMP_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 257cccba3062..caa6f043b29a 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -926,6 +926,8 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
 			  unsigned mask, struct statx __user *buffer);
 asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
 			 int flags, uint32_t sig);
+asmlinkage long sys_scan_process_memory(pid_t pid, char __user *out_buf,
+			int buf_len, int action);
 
 /*
  * Architecture-specific system calls
@@ -1315,4 +1317,5 @@ static inline unsigned int ksys_personality(unsigned int personality)
 	return old;
 }
 
+
 #endif
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 47a3441cf4c4..6b32c8243616 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -109,6 +109,29 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+#endif
+		/* MEM_DEFRAG STATS  */
+		MEM_DEFRAG_DEFRAG_NUM,
+		MEM_DEFRAG_SCAN_NUM,
+		MEM_DEFRAG_DST_FREE_PAGES,
+		MEM_DEFRAG_DST_ANON_PAGES,
+		MEM_DEFRAG_DST_FILE_PAGES,
+		MEM_DEFRAG_DST_NONLRU_PAGES,
+		MEM_DEFRAG_DST_FREE_PAGES_FAILED,
+		MEM_DEFRAG_DST_FREE_PAGES_OVERFLOW_FAILED,
+		MEM_DEFRAG_DST_ANON_PAGES_FAILED,
+		MEM_DEFRAG_DST_FILE_PAGES_FAILED,
+		MEM_DEFRAG_DST_NONLRU_PAGES_FAILED,
+		MEM_DEFRAG_SRC_ANON_PAGES_FAILED,
+		MEM_DEFRAG_SRC_COMP_PAGES_FAILED,
+		MEM_DEFRAG_DST_SPLIT_HUGEPAGES,
+#ifdef CONFIG_COMPACTION
+		/* memory compaction */
+		COMPACT_MIGRATE_PAGES,
+#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		/* thp collapse */
+		THP_COLLAPSE_MIGRATE_PAGES,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index e7ee32861d51..d1ec94a1970d 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -66,6 +66,9 @@
 #define MADV_WIPEONFORK 18		/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 19		/* Undo MADV_WIPEONFORK */
 
+#define MADV_MEMDEFRAG	20		/* Worth backing with hugepages */
+#define MADV_NOMEMDEFRAG	21		/* Not worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/kernel/fork.c b/kernel/fork.c
index b69248e6f0e0..dcefa978c232 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -92,6 +92,7 @@
 #include <linux/livepatch.h>
 #include <linux/thread_info.h>
 #include <linux/stackleak.h>
+#include <linux/mem_defrag.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -343,12 +344,16 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 	if (new) {
 		*new = *orig;
 		INIT_LIST_HEAD(&new->anon_vma_chain);
+		new->anchor_page_rb = RB_ROOT_CACHED;
+		new->vma_create_jiffies = jiffies;
+		new->vma_defrag_jiffies = 0;
 	}
 	return new;
 }
 
 void vm_area_free(struct vm_area_struct *vma)
 {
+	free_anchor_pages(vma);
 	kmem_cache_free(vm_area_cachep, vma);
 }
 
@@ -496,6 +501,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	if (retval)
 		goto out;
 	retval = khugepaged_fork(mm, oldmm);
+	if (retval)
+		goto out;
+	retval = kmem_defragd_fork(mm, oldmm);
 	if (retval)
 		goto out;
 
@@ -1044,6 +1052,7 @@ static inline void __mmput(struct mm_struct *mm)
 	exit_aio(mm);
 	ksm_exit(mm);
 	khugepaged_exit(mm); /* must run before exit_mmap */
+	kmem_defragd_exit(mm);
 	exit_mmap(mm);
 	mm_put_huge_zero_page(mm);
 	set_mm_exe_file(mm, NULL);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ba4d9e85feb8..6bf0be1af7e0 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -115,6 +115,13 @@ extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
 extern int sysctl_nr_trim_pages;
 #endif
 
+extern int kmem_defragd_always;
+extern int vma_scan_percentile;
+extern int vma_scan_threshold_type;
+extern int vma_no_repeat_defrag;
+extern int num_breakout_chunks;
+extern int defrag_size_threshold;
+
 /* Constants used for minimum and  maximum */
 #ifdef CONFIG_LOCKUP_DETECTOR
 static int sixty = 60;
@@ -1303,7 +1310,7 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= overcommit_kbytes_handler,
 	},
 	{
-		.procname	= "page-cluster", 
+		.procname	= "page-cluster",
 		.data		= &page_cluster,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
@@ -1691,6 +1698,58 @@ static struct ctl_table vm_table[] = {
 		.extra2		= (void *)&mmap_rnd_compat_bits_max,
 	},
 #endif
+	{
+		.procname	= "kmem_defragd_always",
+		.data		= &kmem_defragd_always,
+		.maxlen		= sizeof(kmem_defragd_always),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
+		.procname	= "vma_scan_percentile",
+		.data		= &vma_scan_percentile,
+		.maxlen		= sizeof(vma_scan_percentile),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one_hundred,
+	},
+	{
+		.procname	= "vma_scan_threshold_type",
+		.data		= &vma_scan_threshold_type,
+		.maxlen		= sizeof(vma_scan_threshold_type),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
+		.procname	= "vma_no_repeat_defrag",
+		.data		= &vma_no_repeat_defrag,
+		.maxlen		= sizeof(vma_no_repeat_defrag),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+	{
+		.procname	= "num_breakout_chunks",
+		.data		= &num_breakout_chunks,
+		.maxlen		= sizeof(num_breakout_chunks),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
+		.procname	= "defrag_size_threshold",
+		.data		= &defrag_size_threshold,
+		.maxlen		= sizeof(defrag_size_threshold),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
 	{ }
 };
 
@@ -1807,7 +1866,7 @@ static struct ctl_table fs_table[] = {
 		.mode		= 0555,
 		.child		= inotify_table,
 	},
-#endif	
+#endif
 #ifdef CONFIG_EPOLL
 	{
 		.procname	= "epoll",
@@ -2252,12 +2311,12 @@ static int __do_proc_dointvec(void *tbl_data, struct ctl_table *table,
 	int *i, vleft, first = 1, err = 0;
 	size_t left;
 	char *kbuf = NULL, *p;
-	
+
 	if (!tbl_data || !table->maxlen || !*lenp || (*ppos && !write)) {
 		*lenp = 0;
 		return 0;
 	}
-	
+
 	i = (int *) tbl_data;
 	vleft = table->maxlen / sizeof(*i);
 	left = *lenp;
@@ -2483,7 +2542,7 @@ static int do_proc_douintvec(struct ctl_table *table, int write,
  * @ppos: file position
  *
  * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
- * values from/to the user buffer, treated as an ASCII string. 
+ * values from/to the user buffer, treated as an ASCII string.
  *
  * Returns 0 on success.
  */
@@ -2974,7 +3033,7 @@ static int do_proc_dointvec_ms_jiffies_conv(bool *negp, unsigned long *lvalp,
  * @ppos: file position
  *
  * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
- * values from/to the user buffer, treated as an ASCII string. 
+ * values from/to the user buffer, treated as an ASCII string.
  * The values read are assumed to be in seconds, and are converted into
  * jiffies.
  *
@@ -2996,8 +3055,8 @@ int proc_dointvec_jiffies(struct ctl_table *table, int write,
  * @ppos: pointer to the file position
  *
  * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
- * values from/to the user buffer, treated as an ASCII string. 
- * The values read are assumed to be in 1/USER_HZ seconds, and 
+ * values from/to the user buffer, treated as an ASCII string.
+ * The values read are assumed to be in 1/USER_HZ seconds, and
  * are converted into jiffies.
  *
  * Returns 0 on success.
@@ -3019,8 +3078,8 @@ int proc_dointvec_userhz_jiffies(struct ctl_table *table, int write,
  * @ppos: the current position in the file
  *
  * Reads/writes up to table->maxlen/sizeof(unsigned int) integer
- * values from/to the user buffer, treated as an ASCII string. 
- * The values read are assumed to be in 1/1000 seconds, and 
+ * values from/to the user buffer, treated as an ASCII string.
+ * The values read are assumed to be in 1/1000 seconds, and
  * are converted into jiffies.
  *
  * Returns 0 on success.
diff --git a/mm/Makefile b/mm/Makefile
index 1574ea5743e4..925f21c717db 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -44,6 +44,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 obj-y += init-mm.o
 obj-y += memblock.o
 obj-y += exchange.o
+obj-y += mem_defrag.o
 
 ifdef CONFIG_MMU
 	obj-$(CONFIG_ADVISE_SYSCALLS)	+= madvise.o
diff --git a/mm/compaction.c b/mm/compaction.c
index ef29490b0f46..54c4bfdbdbc3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,7 +50,7 @@ static inline void count_compact_events(enum vm_event_item item, long delta)
 #define pageblock_start_pfn(pfn)	block_start_pfn(pfn, pageblock_order)
 #define pageblock_end_pfn(pfn)		block_end_pfn(pfn, pageblock_order)
 
-static unsigned long release_freepages(struct list_head *freelist)
+unsigned long release_freepages(struct list_head *freelist)
 {
 	struct page *page, *next;
 	unsigned long high_pfn = 0;
@@ -58,7 +58,10 @@ static unsigned long release_freepages(struct list_head *freelist)
 	list_for_each_entry_safe(page, next, freelist, lru) {
 		unsigned long pfn = page_to_pfn(page);
 		list_del(&page->lru);
-		__free_page(page);
+		if (PageCompound(page))
+			__free_pages(page, compound_order(page));
+		else
+			__free_page(page);
 		if (pfn > high_pfn)
 			high_pfn = pfn;
 	}
@@ -1593,6 +1596,8 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 
 	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
 		int err;
+		int num_migrated_pages = 0;
+		struct page *iter;
 
 		switch (isolate_migratepages(zone, cc)) {
 		case ISOLATE_ABORT:
@@ -1611,6 +1616,9 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 			;
 		}
 
+		list_for_each_entry(iter, &cc->migratepages, lru)
+			num_migrated_pages++;
+
 		err = migrate_pages(&cc->migratepages, compaction_alloc,
 				compaction_free, (unsigned long)cc, cc->mode,
 				MR_COMPACTION);
@@ -1618,6 +1626,11 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 		trace_mm_compaction_migratepages(cc->nr_migratepages, err,
 							&cc->migratepages);
 
+		list_for_each_entry(iter, &cc->migratepages, lru)
+			num_migrated_pages--;
+
+		count_vm_events(COMPACT_MIGRATE_PAGES, num_migrated_pages);
+
 		/* All pages were either migrated or will be released */
 		cc->nr_migratepages = 0;
 		if (err) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index faf357eaf0ce..ffcae07a87d3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -33,6 +33,7 @@
 #include <linux/page_idle.h>
 #include <linux/shmem_fs.h>
 #include <linux/oom.h>
+#include <linux/mem_defrag.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -695,6 +696,9 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		return VM_FAULT_OOM;
 	if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
 		return VM_FAULT_OOM;
+	/* Make it defrag  */
+	if (unlikely(kmem_defragd_enter(vma, vma->vm_flags)))
+		return VM_FAULT_OOM;
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm) &&
 			transparent_hugepage_use_zero_page()) {
diff --git a/mm/internal.h b/mm/internal.h
index 77e205c423ce..4fe8d1a4d7bb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -15,6 +15,7 @@
 #include <linux/mm.h>
 #include <linux/pagemap.h>
 #include <linux/tracepoint-defs.h>
+#include <linux/interval_tree.h>
 
 /*
  * The set of flags that only affect watermark checking and reclaim
@@ -549,4 +550,31 @@ bool buffer_migrate_lock_buffers(struct buffer_head *head,
 int writeout(struct address_space *mapping, struct page *page);
 extern int exchange_two_pages(struct page *page1, struct page *page2);
 
+struct anchor_page_info {
+	struct list_head list;
+	struct page *anchor_page;
+	unsigned long vaddr;
+	unsigned long start;
+	unsigned long end;
+};
+
+struct anchor_page_node {
+	struct interval_tree_node node;
+	unsigned long anchor_pfn; /* struct page can be calculated from pfn_to_page()  */
+	unsigned long anchor_vpn;
+};
+
+unsigned long release_freepages(struct list_head *freelist);
+
+void free_anchor_pages(struct vm_area_struct *vma);
+
+extern int exchange_two_pages(struct page *page1, struct page *page2);
+
+void expand(struct zone *zone, struct page *page,
+	int low, int high, struct free_area *area,
+	int migratetype);
+
+void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
+							unsigned int alloc_flags);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4f017339ddb2..aedaa9f75806 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -660,6 +660,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 		} else {
 			src_page = pte_page(pteval);
 			copy_user_highpage(page, src_page, address, vma);
+			count_vm_event(THP_COLLAPSE_MIGRATE_PAGES);
 			VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
 			release_pte_page(src_page);
 			/*
diff --git a/mm/madvise.c b/mm/madvise.c
index 21a7881a2db4..9cef96d633e8 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -24,6 +24,7 @@
 #include <linux/swapops.h>
 #include <linux/shmem_fs.h>
 #include <linux/mmu_notifier.h>
+#include <linux/mem_defrag.h>
 
 #include <asm/tlb.h>
 
@@ -616,6 +617,13 @@ static long madvise_remove(struct vm_area_struct *vma,
 	return error;
 }
 
+static long madvise_memdefrag(struct vm_area_struct *vma,
+		     struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end, int behavior)
+{
+	*prev = vma;
+	return memdefrag_madvise(vma, &vma->vm_flags, behavior);
+}
 #ifdef CONFIG_MEMORY_FAILURE
 /*
  * Error injection support for memory error handling.
@@ -697,6 +705,9 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
+	case MADV_MEMDEFRAG:
+	case MADV_NOMEMDEFRAG:
+		return madvise_memdefrag(vma, prev, start, end, behavior);
 	default:
 		return madvise_behavior(vma, prev, start, end, behavior);
 	}
@@ -731,6 +742,8 @@ madvise_behavior_valid(int behavior)
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
 #endif
+	case MADV_MEMDEFRAG:
+	case MADV_NOMEMDEFRAG:
 		return true;
 
 	default:
@@ -785,6 +798,8 @@ madvise_behavior_valid(int behavior)
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ *  MADV_MEMDEFRAG - allow mem defrag running on this region.
+ *  MADV_NOMEMDEFRAG - no mem defrag here.
  *
  * return values:
  *  zero    - success
diff --git a/mm/mem_defrag.c b/mm/mem_defrag.c
new file mode 100644
index 000000000000..414909e1c19c
--- /dev/null
+++ b/mm/mem_defrag.c
@@ -0,0 +1,1782 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Memory defragmentation.
+ *
+ * Copyright (C) 2019 Zi Yan <ziy@nvidia.com>
+ *
+ * Two lists:
+ *   1) a mm list, representing virtual address spaces
+ *   2) a anon_vma list, representing the physical address space.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/mm.h>
+#include <linux/sched/mm.h>
+#include <linux/mm_inline.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/hashtable.h>
+#include <linux/mem_defrag.h>
+#include <linux/shmem_fs.h>
+#include <linux/syscalls.h>
+#include <linux/security.h>
+#include <linux/vmalloc.h>
+#include <linux/mman.h>
+#include <linux/vmstat.h>
+#include <linux/migrate.h>
+#include <linux/page-isolation.h>
+#include <linux/sort.h>
+
+#include <asm/tlb.h>
+#include <asm/pgalloc.h>
+#include "internal.h"
+
+
+struct contig_stats {
+	int err;
+	unsigned long contig_pages;
+	unsigned long first_vaddr_in_chunk;
+	unsigned long first_paddr_in_chunk;
+};
+
+struct defrag_result_stats {
+	unsigned long aligned;
+	unsigned long migrated;
+	unsigned long src_pte_thp_failed;
+	unsigned long src_thp_dst_not_failed;
+	unsigned long src_not_present;
+	unsigned long dst_out_of_bound_failed;
+	unsigned long dst_pte_thp_failed;
+	unsigned long dst_thp_src_not_failed;
+	unsigned long dst_isolate_free_failed;
+	unsigned long dst_migrate_free_failed;
+	unsigned long dst_anon_failed;
+	unsigned long dst_file_failed;
+	unsigned long dst_non_lru_failed;
+	unsigned long dst_non_moveable_failed;
+	unsigned long not_defrag_vpn;
+};
+
+enum {
+	VMA_THRESHOLD_TYPE_TIME = 0,
+	VMA_THRESHOLD_TYPE_SIZE,
+};
+
+int num_breakout_chunks;
+int vma_scan_percentile = 100;
+int vma_scan_threshold_type = VMA_THRESHOLD_TYPE_TIME;
+int vma_no_repeat_defrag;
+int kmem_defragd_always;
+int defrag_size_threshold = 5;
+static DEFINE_SPINLOCK(kmem_defragd_mm_lock);
+
+#define MM_SLOTS_HASH_BITS 10
+static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
+
+static struct kmem_cache *mm_slot_cache __read_mostly;
+
+struct defrag_scan_control {
+	struct mm_struct *mm;
+	unsigned long scan_address;
+	char __user *out_buf;
+	int buf_len;
+	int used_len;
+	enum mem_defrag_action action;
+	bool scan_in_vma;
+	unsigned long vma_scan_threshold;
+};
+
+/**
+ * struct mm_slot - hash lookup from mm to mm_slot
+ * @hash: hash collision list
+ * @mm_node: kmem_defragd scan list headed in kmem_defragd_scan.mm_head
+ * @mm: the mm that this information is valid for
+ */
+struct mm_slot {
+	struct hlist_node hash;
+	struct list_head mm_node;
+	struct mm_struct *mm;
+};
+
+/**
+ * struct kmem_defragd_scan - cursor for scanning
+ * @mm_head: the head of the mm list to scan
+ * @mm_slot: the current mm_slot we are scanning
+ * @address: the next address inside that to be scanned
+ *
+ * There is only the one kmem_defragd_scan instance of this cursor structure.
+ */
+struct kmem_defragd_scan {
+	struct list_head mm_head;
+	struct mm_slot *mm_slot;
+	unsigned long address;
+};
+
+static struct kmem_defragd_scan kmem_defragd_scan = {
+	.mm_head = LIST_HEAD_INIT(kmem_defragd_scan.mm_head),
+};
+
+
+static inline struct mm_slot *alloc_mm_slot(void)
+{
+	if (!mm_slot_cache)	/* initialization failed */
+		return NULL;
+	return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
+}
+
+static inline void free_mm_slot(struct mm_slot *mm_slot)
+{
+	kmem_cache_free(mm_slot_cache, mm_slot);
+}
+
+static struct mm_slot *get_mm_slot(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+
+	hash_for_each_possible(mm_slots_hash, mm_slot, hash, (unsigned long)mm)
+		if (mm == mm_slot->mm)
+			return mm_slot;
+
+	return NULL;
+}
+
+static void insert_to_mm_slots_hash(struct mm_struct *mm,
+				    struct mm_slot *mm_slot)
+{
+	mm_slot->mm = mm;
+	hash_add(mm_slots_hash, &mm_slot->hash, (long)mm);
+}
+
+static inline int kmem_defragd_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
+int __kmem_defragd_enter(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+
+	mm_slot = alloc_mm_slot();
+	if (!mm_slot)
+		return -ENOMEM;
+
+	/* __kmem_defragd_exit() must not run from under us */
+	VM_BUG_ON_MM(kmem_defragd_test_exit(mm), mm);
+	if (unlikely(test_and_set_bit(MMF_VM_MEM_DEFRAG, &mm->flags))) {
+		free_mm_slot(mm_slot);
+		return 0;
+	}
+
+	spin_lock(&kmem_defragd_mm_lock);
+	insert_to_mm_slots_hash(mm, mm_slot);
+	/*
+	 * Insert just behind the scanning cursor, to let the area settle
+	 * down a little.
+	 */
+	list_add_tail(&mm_slot->mm_node, &kmem_defragd_scan.mm_head);
+	spin_unlock(&kmem_defragd_mm_lock);
+
+	atomic_inc(&mm->mm_count);
+
+	return 0;
+}
+
+void __kmem_defragd_exit(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+	int free = 0;
+
+	spin_lock(&kmem_defragd_mm_lock);
+	mm_slot = get_mm_slot(mm);
+	if (mm_slot && kmem_defragd_scan.mm_slot != mm_slot) {
+		hash_del(&mm_slot->hash);
+		list_del(&mm_slot->mm_node);
+		free = 1;
+	}
+	spin_unlock(&kmem_defragd_mm_lock);
+
+	if (free) {
+		clear_bit(MMF_VM_MEM_DEFRAG, &mm->flags);
+		free_mm_slot(mm_slot);
+		mmdrop(mm);
+	} else if (mm_slot) {
+		/*
+		 * This is required to serialize against
+		 * kmem_defragd_test_exit() (which is guaranteed to run
+		 * under mmap sem read mode). Stop here (after we
+		 * return all pagetables will be destroyed) until
+		 * kmem_defragd has finished working on the pagetables
+		 * under the mmap_sem.
+		 */
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+}
+
+static void collect_mm_slot(struct mm_slot *mm_slot)
+{
+	struct mm_struct *mm = mm_slot->mm;
+
+	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&kmem_defragd_mm_lock));
+
+	if (kmem_defragd_test_exit(mm)) {
+		/* free mm_slot */
+		hash_del(&mm_slot->hash);
+		list_del(&mm_slot->mm_node);
+
+		/*
+		 * Not strictly needed because the mm exited already.
+		 *
+		 * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
+		 */
+
+		/* kmem_defragd_mm_lock actually not necessary for the below */
+		free_mm_slot(mm_slot);
+		mmdrop(mm);
+	}
+}
+
+static bool mem_defrag_vma_check(struct vm_area_struct *vma)
+{
+	if ((!test_bit(MMF_VM_MEM_DEFRAG_ALL, &vma->vm_mm->flags) &&
+			!(vma->vm_flags & VM_MEMDEFRAG) && !kmem_defragd_always) ||
+			(vma->vm_flags & VM_NOMEMDEFRAG))
+			return false;
+	if (shmem_file(vma->vm_file)) {
+		if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
+			return false;
+		return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
+				HPAGE_PMD_NR);
+	}
+	if (is_vm_hugetlb_page(vma))
+		return true;
+	if (!vma->anon_vma || vma->vm_ops)
+		return false;
+	if (is_vma_temporary_stack(vma))
+		return false;
+	return true;
+}
+
+static int do_vma_stat(struct mm_struct *mm, struct vm_area_struct *vma,
+		char *kernel_buf, int pos, int *remain_buf_len)
+{
+	int used_len;
+	int init_remain_len = *remain_buf_len;
+
+	if (!*remain_buf_len || !kernel_buf)
+		return -1;
+
+	used_len = scnprintf(kernel_buf + pos, *remain_buf_len, "%p, 0x%lx, 0x%lx, "
+						 "0x%lx, -1\n",
+						 mm, (unsigned long)vma+vma->vma_create_jiffies,
+						 vma->vm_start, vma->vm_end);
+
+	*remain_buf_len -= used_len;
+
+	if (*remain_buf_len == 1) {
+		*remain_buf_len = init_remain_len;
+		kernel_buf[pos] = '\0';
+		return -1;
+	}
+
+	return 0;
+}
+
+static inline int get_contig_page_size(struct page *page)
+{
+	int page_size = PAGE_SIZE;
+
+	if (PageCompound(page)) {
+		struct page *head_page = compound_head(page);
+		int compound_size = PAGE_SIZE<<compound_order(head_page);
+
+		if (head_page != page) {
+			VM_BUG_ON_PAGE(!PageTail(page), page);
+			page_size = compound_size - (page - head_page) * PAGE_SIZE;
+		} else
+			page_size = compound_size;
+	}
+
+	return page_size;
+}
+
+/*
+ * write one page stats to kernel_buf.
+ *
+ * If kernel_buf is not big enough, the page information will not be recorded
+ * at all.
+ *
+ */
+static int do_page_stat(struct mm_struct *mm, struct vm_area_struct *vma,
+		struct page *page, unsigned long vaddr,
+		char *kernel_buf, int pos, int *remain_buf_len,
+		enum mem_defrag_action action,
+		struct contig_stats *contig_stats,
+		bool scan_in_vma)
+{
+	int used_len;
+	struct anon_vma *anon_vma;
+	int init_remain_len = *remain_buf_len;
+	int end_note = -1;
+	unsigned long num_pages = page?(get_contig_page_size(page)/PAGE_SIZE):1;
+
+	if (!*remain_buf_len || !kernel_buf)
+		return -1;
+
+	if (action == MEM_DEFRAG_CONTIG_STATS) {
+		long long contig_pages;
+		unsigned long paddr = page?PFN_PHYS(page_to_pfn(page)):0;
+		bool last_entry = false;
+
+		if (!contig_stats->first_vaddr_in_chunk) {
+			contig_stats->first_vaddr_in_chunk = vaddr;
+			contig_stats->first_paddr_in_chunk = paddr;
+			contig_stats->contig_pages = 0;
+		}
+
+		/* scan_in_vma is set to true if buffer runs out while scanning a
+		 * vma. A corner case happens, when buffer runs out, then vma changes,
+		 * scan_address is reset to vm_start. Then, vma info is printed twice.
+		 */
+		if (vaddr == vma->vm_start && !scan_in_vma) {
+			used_len = scnprintf(kernel_buf + pos, *remain_buf_len,
+					"%p, 0x%lx, 0x%lx, 0x%lx, ",
+					mm, (unsigned long)vma+vma->vma_create_jiffies,
+					vma->vm_start, vma->vm_end);
+
+			*remain_buf_len -= used_len;
+
+			if (*remain_buf_len == 1) {
+				contig_stats->err = 1;
+				goto out_of_buf;
+			}
+			pos += used_len;
+		}
+
+		if (page) {
+			if (contig_stats->first_paddr_in_chunk) {
+				if (((long long)vaddr - contig_stats->first_vaddr_in_chunk) ==
+					((long long)paddr - contig_stats->first_paddr_in_chunk))
+					contig_stats->contig_pages += num_pages;
+				else {
+					/* output present contig chunk */
+					contig_pages = contig_stats->contig_pages;
+					goto output_contig_info;
+				}
+			} else { /* the previous chunk is not present pages */
+				/* output non-present contig chunk */
+				contig_pages = -(long long)contig_stats->contig_pages;
+				goto output_contig_info;
+			}
+		} else {
+			/* the previous chunk is not present pages */
+			if (!contig_stats->first_paddr_in_chunk) {
+				VM_BUG_ON(contig_stats->first_vaddr_in_chunk +
+						  contig_stats->contig_pages * PAGE_SIZE !=
+						  vaddr);
+				++contig_stats->contig_pages;
+			} else {
+				/* output present contig chunk */
+				contig_pages = contig_stats->contig_pages;
+
+				goto output_contig_info;
+			}
+		}
+
+check_last_entry:
+		/* if vaddr is the last page, we need to dump stats as well  */
+		if ((vaddr + num_pages * PAGE_SIZE) >= vma->vm_end) {
+			if (contig_stats->first_paddr_in_chunk)
+				contig_pages = contig_stats->contig_pages;
+			else
+				contig_pages = -(long long)contig_stats->contig_pages;
+			last_entry = true;
+		} else
+			return 0;
+output_contig_info:
+		if (last_entry)
+			used_len = scnprintf(kernel_buf + pos, *remain_buf_len,
+					"%lld, -1\n", contig_pages);
+		else
+			used_len = scnprintf(kernel_buf + pos, *remain_buf_len,
+					"%lld, ", contig_pages);
+
+		*remain_buf_len -= used_len;
+		if (*remain_buf_len == 1) {
+			contig_stats->err = 1;
+			goto out_of_buf;
+		} else {
+			pos += used_len;
+			if (last_entry) {
+				/* clear contig_stats  */
+				contig_stats->first_vaddr_in_chunk = 0;
+				contig_stats->first_paddr_in_chunk = 0;
+				contig_stats->contig_pages = 0;
+				return 0;
+			} else {
+				/* set new contig_stats  */
+				contig_stats->first_vaddr_in_chunk = vaddr;
+				contig_stats->first_paddr_in_chunk = paddr;
+				contig_stats->contig_pages = num_pages;
+				goto check_last_entry;
+			}
+		}
+		return 0;
+	}
+
+	used_len = scnprintf(kernel_buf + pos, *remain_buf_len,
+				"%p, %p, 0x%lx, 0x%lx, 0x%lx, 0x%llx",
+				 mm, vma, vma->vm_start, vma->vm_end,
+				 vaddr, page ? PFN_PHYS(page_to_pfn(page)) : 0);
+
+	*remain_buf_len -= used_len;
+	if (*remain_buf_len == 1)
+		goto out_of_buf;
+	pos += used_len;
+
+	if (page && PageAnon(page)) {
+		/* check page order  */
+		used_len = scnprintf(kernel_buf + pos, *remain_buf_len, ", %d",
+							 compound_order(page));
+		*remain_buf_len -= used_len;
+		if (*remain_buf_len == 1)
+			goto out_of_buf;
+		pos += used_len;
+
+		anon_vma = page_anon_vma(page);
+		if (!anon_vma)
+			goto end_of_stat;
+		anon_vma_lock_read(anon_vma);
+
+		do {
+			used_len = scnprintf(kernel_buf + pos, *remain_buf_len, ", %p",
+								 anon_vma);
+			*remain_buf_len -= used_len;
+			if (*remain_buf_len == 1) {
+				anon_vma_unlock_read(anon_vma);
+				goto out_of_buf;
+			}
+			pos += used_len;
+
+			anon_vma = anon_vma->parent;
+		} while (anon_vma != anon_vma->parent);
+
+		anon_vma_unlock_read(anon_vma);
+	}
+end_of_stat:
+	/* end of one page stat  */
+	used_len = scnprintf(kernel_buf + pos, *remain_buf_len, ", %d\n", end_note);
+	*remain_buf_len -= used_len;
+	if (*remain_buf_len == 1)
+		goto out_of_buf;
+
+	return 0;
+out_of_buf: /* revert incomplete data  */
+	*remain_buf_len = init_remain_len;
+	kernel_buf[pos] = '\0';
+	return -1;
+
+}
+
+static int isolate_free_page_no_wmark(struct page *page, unsigned int order)
+{
+	struct zone *zone;
+	int mt;
+
+	VM_BUG_ON(!PageBuddy(page));
+
+	zone = page_zone(page);
+	mt = get_pageblock_migratetype(page);
+
+
+	/* Remove page from free list */
+	list_del(&page->lru);
+	zone->free_area[order].nr_free--;
+	__ClearPageBuddy(page);
+	set_page_private(page, 0);
+
+	/*
+	 * Set the pageblock if the isolated page is at least half of a
+	 * pageblock
+	 */
+	if (order >= pageblock_order - 1) {
+		struct page *endpage = page + (1 << order) - 1;
+
+		for (; page < endpage; page += pageblock_nr_pages) {
+			int mt = get_pageblock_migratetype(page);
+
+			if (!is_migrate_isolate(mt) && !is_migrate_cma(mt)
+				&& mt != MIGRATE_HIGHATOMIC)
+				set_pageblock_migratetype(page,
+							  MIGRATE_MOVABLE);
+		}
+	}
+
+	return 1UL << order;
+}
+
+struct exchange_alloc_info {
+	struct list_head list;
+	struct page *src_page;
+	struct page *dst_page;
+};
+
+struct exchange_alloc_head {
+	struct list_head exchange_list;
+	struct list_head freelist;
+	struct list_head migratepage_list;
+	unsigned long num_freepages;
+};
+
+static int create_exchange_alloc_info(struct vm_area_struct *vma,
+		unsigned long scan_address, struct page *first_in_use_page,
+		int total_free_pages,
+		struct list_head *freelist,
+		struct list_head *exchange_list,
+		struct list_head *migratepage_list)
+{
+	struct page *in_use_page;
+	struct page *freepage;
+	struct exchange_alloc_info *one_pair;
+	int err;
+	int pagevec_flushed = 0;
+
+	down_read(&vma->vm_mm->mmap_sem);
+	in_use_page = follow_page(vma, scan_address,
+							FOLL_GET|FOLL_MIGRATION | FOLL_REMOTE);
+	up_read(&vma->vm_mm->mmap_sem);
+
+	freepage = list_first_entry_or_null(freelist, struct page, lru);
+
+	if (first_in_use_page != in_use_page ||
+		!freepage ||
+		PageCompound(in_use_page) != PageCompound(freepage) ||
+		compound_order(in_use_page) != compound_order(freepage)) {
+		if (in_use_page)
+			put_page(in_use_page);
+		return -EBUSY;
+	}
+	one_pair = kmalloc(sizeof(struct exchange_alloc_info),
+		GFP_KERNEL | __GFP_ZERO);
+
+	if (!one_pair) {
+		put_page(in_use_page);
+		return -ENOMEM;
+	}
+
+retry_isolate:
+	/* isolate in_use_page */
+	err = isolate_lru_page(in_use_page);
+	if (err) {
+		if (!pagevec_flushed) {
+			migrate_prep();
+			pagevec_flushed = 1;
+			goto retry_isolate;
+		}
+		put_page(in_use_page);
+		in_use_page = NULL;
+	}
+
+	if (in_use_page) {
+		put_page(in_use_page);
+		mod_node_page_state(page_pgdat(in_use_page),
+				NR_ISOLATED_ANON + page_is_file_cache(in_use_page),
+				hpage_nr_pages(in_use_page));
+		list_add_tail(&in_use_page->lru, migratepage_list);
+	}
+	/* fill info  */
+	one_pair->src_page = in_use_page;
+	one_pair->dst_page = freepage;
+	INIT_LIST_HEAD(&one_pair->list);
+
+	list_add_tail(&one_pair->list, exchange_list);
+
+	return 0;
+}
+
+static void free_alloc_info(struct list_head *alloc_info_list)
+{
+	struct exchange_alloc_info *item, *item2;
+
+	list_for_each_entry_safe(item, item2, alloc_info_list, list) {
+		list_del(&item->list);
+		kfree(item);
+	}
+}
+
+/*
+ * migrate callback: give a specific free page when it is called to guarantee
+ * contiguity after migration.
+ */
+static struct page *exchange_alloc(struct page *migratepage,
+				unsigned long data)
+{
+	struct exchange_alloc_head *head = (struct exchange_alloc_head *)data;
+	struct page *freepage = NULL;
+	struct exchange_alloc_info *info;
+
+	list_for_each_entry(info, &head->exchange_list, list) {
+		if (migratepage == info->src_page) {
+			freepage = info->dst_page;
+			/* remove it from freelist */
+			list_del(&freepage->lru);
+			if (PageTransHuge(freepage))
+				head->num_freepages -= HPAGE_PMD_NR;
+			else
+				head->num_freepages--;
+			break;
+		}
+	}
+
+	return freepage;
+}
+
+static void exchange_free(struct page *freepage, unsigned long data)
+{
+	struct exchange_alloc_head *head = (struct exchange_alloc_head *)data;
+
+	list_add_tail(&freepage->lru, &head->freelist);
+	if (PageTransHuge(freepage))
+		head->num_freepages += HPAGE_PMD_NR;
+	else
+		head->num_freepages++;
+}
+
+int defrag_address_range(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long start_addr, unsigned long end_addr,
+		struct page *anchor_page, unsigned long page_vaddr,
+		struct defrag_result_stats *defrag_stats)
+{
+	/*unsigned long va_pa_page_offset = (unsigned long)-1;*/
+	unsigned long scan_address;
+	unsigned long page_size = PAGE_SIZE;
+	int failed = 0;
+	int not_present = 0;
+	bool src_thp = false;
+
+	for (scan_address = start_addr; scan_address < end_addr;
+		 scan_address += page_size) {
+		struct page *scan_page;
+		unsigned long scan_phys_addr;
+		long long page_dist;
+
+		cond_resched();
+
+		down_read(&vma->vm_mm->mmap_sem);
+		scan_page = follow_page(vma, scan_address, FOLL_MIGRATION | FOLL_REMOTE);
+		up_read(&vma->vm_mm->mmap_sem);
+		scan_phys_addr = PFN_PHYS(scan_page ? page_to_pfn(scan_page) : 0);
+
+		page_size = PAGE_SIZE;
+
+		if (!scan_phys_addr) {
+			not_present++;
+			failed += 1;
+			defrag_stats->src_not_present += 1;
+			continue;
+		}
+
+		page_size = get_contig_page_size(scan_page);
+
+		/* PTE-mapped THP not allowed  */
+		if ((scan_page == compound_head(scan_page)) &&
+			PageTransHuge(scan_page) && !PageHuge(scan_page))
+			src_thp = true;
+
+		/* Allow THPs  */
+		if (PageCompound(scan_page) && !src_thp) {
+			count_vm_events(MEM_DEFRAG_SRC_COMP_PAGES_FAILED, page_size/PAGE_SIZE);
+			failed += (page_size/PAGE_SIZE);
+			defrag_stats->src_pte_thp_failed += (page_size/PAGE_SIZE);
+
+			defrag_stats->not_defrag_vpn = scan_address + page_size;
+			goto quit_defrag;
+			continue;
+		}
+
+		VM_BUG_ON(!anchor_page);
+
+		page_dist = ((long long)scan_address - page_vaddr) / PAGE_SIZE;
+
+		/* already in the contiguous pos  */
+		if (page_dist == (long long)(scan_page - anchor_page)) {
+			defrag_stats->aligned += (page_size/PAGE_SIZE);
+			continue;
+		} else { /* migrate pages according to the anchor pages in the vma.  */
+			struct page *dest_page = anchor_page + page_dist;
+			int page_drained = 0;
+			bool dst_thp = false;
+			int scan_page_order = src_thp?compound_order(scan_page):0;
+
+			if (!zone_spans_pfn(page_zone(anchor_page),
+					(page_to_pfn(anchor_page) + page_dist))) {
+				failed += 1;
+				defrag_stats->dst_out_of_bound_failed += 1;
+
+				defrag_stats->not_defrag_vpn = scan_address + page_size;
+				goto quit_defrag;
+				continue;
+			}
+
+retry_defrag:
+			/* migrate */
+			if (PageBuddy(dest_page)) {
+				struct zone *zone = page_zone(dest_page);
+				spinlock_t *zone_lock = &zone->lock;
+				unsigned long zone_lock_flags;
+				unsigned long free_page_order = 0;
+				int err = 0;
+				struct exchange_alloc_head exchange_alloc_head = {0};
+				int migratetype = get_pageblock_migratetype(dest_page);
+
+				INIT_LIST_HEAD(&exchange_alloc_head.exchange_list);
+				INIT_LIST_HEAD(&exchange_alloc_head.freelist);
+				INIT_LIST_HEAD(&exchange_alloc_head.migratepage_list);
+
+				count_vm_events(MEM_DEFRAG_DST_FREE_PAGES, 1<<scan_page_order);
+
+				/* lock page_zone(dest_page)->lock  */
+				spin_lock_irqsave(zone_lock, zone_lock_flags);
+
+				if (!PageBuddy(dest_page)) {
+					err = -EINVAL;
+					goto freepage_isolate_fail;
+				}
+
+				free_page_order = page_order(dest_page);
+
+				/* fail early if not enough free pages */
+				if (free_page_order < scan_page_order) {
+					err = -ENOMEM;
+					goto freepage_isolate_fail;
+				}
+
+				/* __isolate_free_page()  */
+				err = isolate_free_page_no_wmark(dest_page, free_page_order);
+				if (!err)
+					goto freepage_isolate_fail;
+
+				expand(zone, dest_page, scan_page_order, free_page_order,
+					&(zone->free_area[free_page_order]),
+					migratetype);
+
+				if (!is_migrate_isolate(migratetype))
+					__mod_zone_freepage_state(zone, -(1UL << scan_page_order),
+							migratetype);
+
+				prep_new_page(dest_page, scan_page_order,
+					__GFP_MOVABLE|(scan_page_order?__GFP_COMP:0), 0);
+
+				if (scan_page_order) {
+					VM_BUG_ON(!src_thp);
+					VM_BUG_ON(scan_page_order != HPAGE_PMD_ORDER);
+					prep_transhuge_page(dest_page);
+				}
+
+				list_add(&dest_page->lru, &exchange_alloc_head.freelist);
+
+freepage_isolate_fail:
+				spin_unlock_irqrestore(zone_lock, zone_lock_flags);
+
+				if (err < 0) {
+					failed += (page_size/PAGE_SIZE);
+					defrag_stats->dst_isolate_free_failed += (page_size/PAGE_SIZE);
+
+					defrag_stats->not_defrag_vpn = scan_address + page_size;
+					goto quit_defrag;
+					continue;
+				}
+
+				/* gather in-use pages
+				 * create a exchange_alloc_info structure, a list of
+				 * tuples, each like:
+				 * (in_use_page, free_page)
+				 *
+				 * so in exchange_alloc, the code needs to traverse the list
+				 * and find the tuple from in_use_page. Then return the
+				 * corresponding free page.
+				 *
+				 * This can guarantee the contiguity after migration.
+				 */
+
+				err = create_exchange_alloc_info(vma, scan_address, scan_page,
+							1<<free_page_order,
+							&exchange_alloc_head.freelist,
+							&exchange_alloc_head.exchange_list,
+							&exchange_alloc_head.migratepage_list);
+
+				if (err)
+					pr_debug("create_exchange_alloc_info error: %d\n", err);
+
+				exchange_alloc_head.num_freepages = 1<<scan_page_order;
+
+				/* migrate pags  */
+				err = migrate_pages(&exchange_alloc_head.migratepage_list,
+					exchange_alloc, exchange_free,
+					(unsigned long)&exchange_alloc_head,
+					MIGRATE_SYNC, MR_COMPACTION);
+
+				/* putback not migrated in_use_pagelist */
+				putback_movable_pages(&exchange_alloc_head.migratepage_list);
+
+				/* release free pages in freelist */
+				release_freepages(&exchange_alloc_head.freelist);
+
+				/* free allocated exchange info  */
+				free_alloc_info(&exchange_alloc_head.exchange_list);
+
+				count_vm_events(MEM_DEFRAG_DST_FREE_PAGES_FAILED,
+						exchange_alloc_head.num_freepages);
+
+				if (exchange_alloc_head.num_freepages) {
+					failed += exchange_alloc_head.num_freepages;
+					defrag_stats->dst_migrate_free_failed +=
+						exchange_alloc_head.num_freepages;
+				}
+				defrag_stats->migrated += ((1UL<<scan_page_order) -
+						exchange_alloc_head.num_freepages);
+
+			} else { /* exchange  */
+				int err = -EBUSY;
+
+				/* PTE-mapped THP not allowed  */
+				if ((dest_page == compound_head(dest_page)) &&
+					PageTransHuge(dest_page) && !PageHuge(dest_page))
+					dst_thp = true;
+
+				if (PageCompound(dest_page) && !dst_thp) {
+					failed += get_contig_page_size(dest_page);
+					defrag_stats->dst_pte_thp_failed += page_size/PAGE_SIZE;
+
+					defrag_stats->not_defrag_vpn = scan_address + page_size;
+					goto quit_defrag;
+				}
+
+				if (src_thp != dst_thp) {
+					failed += get_contig_page_size(scan_page);
+					if (src_thp && !dst_thp)
+						defrag_stats->src_thp_dst_not_failed +=
+							page_size/PAGE_SIZE;
+					else /* !src_thp && dst_thp */
+						defrag_stats->dst_thp_src_not_failed +=
+							page_size/PAGE_SIZE;
+
+					defrag_stats->not_defrag_vpn = scan_address + page_size;
+					goto quit_defrag;
+					/*continue;*/
+				}
+
+				/* free page on pcplist */
+				if (page_count(dest_page) == 0) {
+					/* not managed pages  */
+					if (!dest_page->flags) {
+						failed += 1;
+						defrag_stats->dst_out_of_bound_failed += 1;
+
+						defrag_stats->not_defrag_vpn = scan_address + page_size;
+						goto quit_defrag;
+					}
+					/* spill order-0 pages to buddy allocator from pcplist */
+					if (!page_drained) {
+						drain_all_pages(NULL);
+						page_drained = 1;
+						goto retry_defrag;
+					}
+				}
+
+				if (PageAnon(dest_page)) {
+					count_vm_events(MEM_DEFRAG_DST_ANON_PAGES,
+							1<<scan_page_order);
+
+					err = exchange_two_pages(scan_page, dest_page);
+					if (err) {
+						count_vm_events(MEM_DEFRAG_DST_ANON_PAGES_FAILED,
+								1<<scan_page_order);
+						failed += 1<<scan_page_order;
+						defrag_stats->dst_anon_failed += 1<<scan_page_order;
+					}
+				} else if (page_mapping(dest_page)) {
+					count_vm_events(MEM_DEFRAG_DST_FILE_PAGES,
+							1<<scan_page_order);
+
+					err = exchange_two_pages(scan_page, dest_page);
+					if (err) {
+						count_vm_events(MEM_DEFRAG_DST_FILE_PAGES_FAILED,
+								1<<scan_page_order);
+						failed += 1<<scan_page_order;
+						defrag_stats->dst_file_failed += 1<<scan_page_order;
+					}
+				} else if (!PageLRU(dest_page) && __PageMovable(dest_page)) {
+					err = -ENODEV;
+					count_vm_events(MEM_DEFRAG_DST_NONLRU_PAGES,
+							1<<scan_page_order);
+					failed += 1<<scan_page_order;
+					defrag_stats->dst_non_lru_failed += 1<<scan_page_order;
+					count_vm_events(MEM_DEFRAG_DST_NONLRU_PAGES_FAILED,
+							1<<scan_page_order);
+				} else {
+					err = -ENODEV;
+					failed += 1<<scan_page_order;
+					/* unmovable pages  */
+					defrag_stats->dst_non_moveable_failed += 1<<scan_page_order;
+				}
+
+				if (err == -EAGAIN)
+					goto retry_defrag;
+				if (!err)
+					defrag_stats->migrated += 1<<scan_page_order;
+				else {
+
+					defrag_stats->not_defrag_vpn = scan_address + page_size;
+					goto quit_defrag;
+				}
+
+			}
+		}
+	}
+quit_defrag:
+	return failed;
+}
+
+struct anchor_page_node *get_anchor_page_node_from_vma(struct vm_area_struct *vma,
+	unsigned long address)
+{
+	struct interval_tree_node *prev_vma_node;
+
+	prev_vma_node = interval_tree_iter_first(&vma->anchor_page_rb,
+		address, address);
+
+	if (!prev_vma_node)
+		return NULL;
+
+	return container_of(prev_vma_node, struct anchor_page_node, node);
+}
+
+unsigned long get_undefragged_area(struct zone *zone, struct vm_area_struct *vma,
+	unsigned long start_addr, unsigned long end_addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct vm_area_struct *scan_vma = NULL;
+	unsigned long vma_size = end_addr - start_addr;
+	bool first_vma = true;
+
+	for (scan_vma = mm->mmap; scan_vma; scan_vma = scan_vma->vm_next)
+		if (!RB_EMPTY_ROOT(&scan_vma->anchor_page_rb.rb_root))
+			break;
+	/* no defragged area */
+	if (!scan_vma)
+		return zone->zone_start_pfn;
+
+	scan_vma = mm->mmap;
+	while (scan_vma) {
+		if (!RB_EMPTY_ROOT(&scan_vma->anchor_page_rb.rb_root)) {
+			struct interval_tree_node *node = NULL, *next_node = NULL;
+			struct anchor_page_node *anchor_node, *next_anchor_node = NULL;
+			struct vm_area_struct *next_vma = scan_vma->vm_next;
+			unsigned long end_pfn;
+			/* each vma get one anchor range */
+			node = interval_tree_iter_first(&scan_vma->anchor_page_rb,
+				 scan_vma->vm_start, scan_vma->vm_end - 1);
+			if (!node) {
+				scan_vma = scan_vma->vm_next;
+				continue;
+			}
+
+			anchor_node = container_of(node, struct anchor_page_node, node);
+			end_pfn = (anchor_node->anchor_pfn +
+					((scan_vma->vm_end - scan_vma->vm_start)>>PAGE_SHIFT));
+
+			/* check space before first vma */
+			if (first_vma) {
+				first_vma = false;
+				if (zone->zone_start_pfn + vma_size < anchor_node->anchor_pfn)
+					return zone->zone_start_pfn;
+				/* remove existing anchor if new vma is much larger */
+				if (vma_size > (scan_vma->vm_end - scan_vma->vm_start)*2) {
+					first_vma = true;
+					interval_tree_remove(node, &scan_vma->anchor_page_rb);
+					kfree(anchor_node);
+					scan_vma = scan_vma->vm_next;
+					continue;
+				}
+			}
+
+			/* find next vma with anchor range */
+			for (next_vma = scan_vma->vm_next;
+				next_vma && RB_EMPTY_ROOT(&next_vma->anchor_page_rb.rb_root);
+				next_vma = next_vma->vm_next);
+
+			if (!next_vma)
+				return end_pfn;
+			else {
+				next_node = interval_tree_iter_first(&next_vma->anchor_page_rb,
+					 next_vma->vm_start, next_vma->vm_end - 1);
+				VM_BUG_ON(!next_node);
+				next_anchor_node = container_of(next_node,
+									struct anchor_page_node, node);
+				if (end_pfn + vma_size < next_anchor_node->anchor_pfn)
+					return end_pfn;
+			}
+			scan_vma = next_vma;
+		} else
+			scan_vma = scan_vma->vm_next;
+	}
+
+	return zone->zone_start_pfn;
+}
+
+/*
+ * anchor pages decide the va pa offset in each vma
+ *
+ */
+static int find_anchor_pages_in_vma(struct mm_struct *mm,
+		struct vm_area_struct *vma, unsigned long start_addr)
+{
+	struct anchor_page_node *anchor_node;
+	unsigned long end_addr = vma->vm_end - PAGE_SIZE;
+	struct interval_tree_node *existing_anchor = NULL;
+	unsigned long scan_address = start_addr;
+	struct page *present_page = NULL;
+	struct zone *present_zone = NULL;
+	unsigned long alignment_size = PAGE_SIZE;
+
+	/* Out of range query  */
+	if (start_addr >= vma->vm_end || start_addr < vma->vm_start)
+		return -1;
+
+	/*
+	 * Clean up unrelated anchor infor
+	 *
+	 * VMA range can change and leave some anchor info out of range,
+	 * so clean it here.
+	 * It should be cleaned when vma is changed, but the code there
+	 * is too complicated.
+	 */
+	if (!RB_EMPTY_ROOT(&vma->anchor_page_rb.rb_root) &&
+		!interval_tree_iter_first(&vma->anchor_page_rb,
+		 vma->vm_start, vma->vm_end - 1)) {
+		struct interval_tree_node *node = NULL;
+
+		for (node = interval_tree_iter_first(&vma->anchor_page_rb,
+				0, (unsigned long)-1);
+			 node;) {
+			struct anchor_page_node *anchor_node = container_of(node,
+					struct anchor_page_node, node);
+			interval_tree_remove(node, &vma->anchor_page_rb);
+			node = interval_tree_iter_next(node, 0, (unsigned long)-1);
+			kfree(anchor_node);
+		}
+	}
+
+	/* no range at all  */
+	if (RB_EMPTY_ROOT(&vma->anchor_page_rb.rb_root))
+		goto insert_new_range;
+
+	/* look for first range has start_addr or after it */
+	existing_anchor = interval_tree_iter_first(&vma->anchor_page_rb,
+		start_addr, end_addr);
+
+	/* first range has start_addr or after it  */
+	if (existing_anchor) {
+		/* redundant range, do nothing */
+		if (existing_anchor->start == start_addr)
+			return 0;
+		else if (existing_anchor->start < start_addr &&
+				 existing_anchor->last >= start_addr){
+			return 0;
+		} else { /* a range after start_addr  */
+			struct anchor_page_node *existing_node = container_of(existing_anchor,
+				struct anchor_page_node, node);
+			VM_BUG_ON(!(existing_anchor->start > start_addr));
+			/* remove the existing range and insert a new one, since expanding
+			 * forward can make the range go beyond the zone limit
+			 */
+			interval_tree_remove(existing_anchor, &vma->anchor_page_rb);
+			kfree(existing_node);
+			VM_BUG_ON(!RB_EMPTY_ROOT(&vma->anchor_page_rb.rb_root));
+			goto insert_new_range;
+		}
+	} else {
+		struct interval_tree_node *prev_anchor = NULL, *cur_anchor;
+		/* there is a range before start_addr  */
+
+		/* find the range just before start_addr  */
+		for (cur_anchor = interval_tree_iter_first(&vma->anchor_page_rb,
+				vma->vm_start, start_addr - PAGE_SIZE);
+			 cur_anchor;
+			 prev_anchor = cur_anchor,
+			 cur_anchor = interval_tree_iter_next(cur_anchor,
+				vma->vm_start, start_addr - PAGE_SIZE));
+
+		interval_tree_remove(prev_anchor, &vma->anchor_page_rb);
+		prev_anchor->last = vma->vm_end;
+		interval_tree_insert(prev_anchor, &vma->anchor_page_rb);
+
+		goto out;
+	}
+
+insert_new_range: /* start_addr to end_addr  */
+	down_read(&vma->vm_mm->mmap_sem);
+	/* find the first present page and use it as the anchor page */
+	while (!present_page && scan_address < end_addr) {
+		present_page = follow_page(vma, scan_address,
+			FOLL_MIGRATION | FOLL_REMOTE);
+		scan_address += present_page?get_contig_page_size(present_page):PAGE_SIZE;
+	}
+	up_read(&vma->vm_mm->mmap_sem);
+
+	if (!present_page)
+		goto out;
+
+	anchor_node =
+			kmalloc(sizeof(struct anchor_page_node), GFP_KERNEL | __GFP_ZERO);
+	if (!anchor_node)
+		return -ENOMEM;
+
+	present_zone = page_zone(present_page);
+
+	anchor_node->node.start = start_addr;
+	anchor_node->node.last = end_addr;
+
+	anchor_node->anchor_vpn = start_addr>>PAGE_SHIFT;
+	anchor_node->anchor_pfn = get_undefragged_area(present_zone,
+			vma, start_addr, end_addr);
+
+	/* adjust VPN and PFN alignment according to VMA size */
+	if (vma->vm_end - vma->vm_start >= HPAGE_PUD_SIZE) {
+		if ((anchor_node->anchor_vpn & ((HPAGE_PUD_SIZE>>PAGE_SHIFT) - 1)) <
+			(anchor_node->anchor_pfn & ((HPAGE_PUD_SIZE>>PAGE_SHIFT) - 1)))
+			anchor_node->anchor_pfn += (HPAGE_PUD_SIZE>>PAGE_SHIFT);
+
+		anchor_node->anchor_pfn = (anchor_node->anchor_pfn & (PUD_MASK>>PAGE_SHIFT)) |
+			(anchor_node->anchor_vpn & ((HPAGE_PUD_SIZE>>PAGE_SHIFT) - 1));
+
+		alignment_size = HPAGE_PUD_SIZE;
+	} else if (vma->vm_end - vma->vm_start >= HPAGE_PMD_SIZE) {
+		if ((anchor_node->anchor_vpn & ((HPAGE_PMD_SIZE>>PAGE_SHIFT) - 1)) <
+			(anchor_node->anchor_pfn & ((HPAGE_PMD_SIZE>>PAGE_SHIFT) - 1)))
+			anchor_node->anchor_pfn += (HPAGE_PMD_SIZE>>PAGE_SHIFT);
+
+		anchor_node->anchor_pfn = (anchor_node->anchor_pfn & (PMD_MASK>>PAGE_SHIFT)) |
+			(anchor_node->anchor_vpn & ((HPAGE_PMD_SIZE>>PAGE_SHIFT) - 1));
+
+		alignment_size = HPAGE_PMD_SIZE;
+	} else
+		alignment_size = PAGE_SIZE;
+
+	/* move the range into the zone limit */
+	if (!(zone_spans_pfn(present_zone, anchor_node->anchor_pfn))) {
+		while (anchor_node->anchor_pfn >= zone_end_pfn(present_zone))
+			anchor_node->anchor_pfn -= alignment_size / PAGE_SHIFT;
+		while (anchor_node->anchor_pfn <  present_zone->zone_start_pfn)
+			anchor_node->anchor_pfn += alignment_size / PAGE_SHIFT;
+	}
+
+	interval_tree_insert(&anchor_node->node, &vma->anchor_page_rb);
+
+out:
+	return 0;
+}
+
+static inline bool is_stats_collection(enum mem_defrag_action action)
+{
+	switch (action) {
+	case MEM_DEFRAG_FULL_STATS:
+	case MEM_DEFRAG_CONTIG_STATS:
+		return true;
+	default:
+		return false;
+	}
+	return false;
+}
+
+/* comparator for sorting vma's lifetime */
+static int unsigned_long_cmp(const void *a, const void *b)
+{
+	const unsigned long *l = a, *r = b;
+
+	if (*l < *r)
+		return -1;
+	if (*l > *r)
+		return 1;
+	return 0;
+}
+
+/*
+ * scan all to-be-defragged VMA lifetime and calculate VMA defragmentation
+ * threshold.
+ */
+static void scan_all_vma_lifetime(struct defrag_scan_control *sc)
+{
+	struct mm_struct *mm = sc->mm;
+	struct vm_area_struct *vma = NULL;
+	unsigned long current_jiffies = jiffies; /* fix one jiffies  */
+	unsigned int num_vma = 0, index = 0;
+	unsigned long *vma_scan_list = NULL;
+
+	down_read(&mm->mmap_sem);
+	for (vma = find_vma(mm, 0); vma; vma = vma->vm_next)
+		/* only care about to-be-defragged vmas  */
+		if (mem_defrag_vma_check(vma))
+			++num_vma;
+
+	vma_scan_list = kmalloc(num_vma*sizeof(unsigned long),
+			GFP_KERNEL | __GFP_ZERO);
+
+	if (ZERO_OR_NULL_PTR(vma_scan_list)) {
+		sc->vma_scan_threshold = 1;
+		goto out;
+	}
+
+	for (vma = find_vma(mm, 0); vma; vma = vma->vm_next)
+		/* only care about to-be-defragged vmas  */
+		if (mem_defrag_vma_check(vma)) {
+			if (vma_scan_threshold_type == VMA_THRESHOLD_TYPE_TIME)
+				vma_scan_list[index] = current_jiffies - vma->vma_create_jiffies;
+			else if (vma_scan_threshold_type == VMA_THRESHOLD_TYPE_SIZE)
+				vma_scan_list[index] = vma->vm_end - vma->vm_start;
+			++index;
+			if (index >= num_vma)
+				break;
+		}
+
+	sort(vma_scan_list, num_vma, sizeof(unsigned long),
+		 unsigned_long_cmp, NULL);
+
+	index = (100 - vma_scan_percentile) * num_vma / 100;
+
+	sc->vma_scan_threshold = vma_scan_list[index];
+
+	kfree(vma_scan_list);
+out:
+	up_read(&mm->mmap_sem);
+}
+
+/*
+ * Scan single mm_struct.
+ * The function will down_read mmap_sem.
+ *
+ */
+static int kmem_defragd_scan_mm(struct defrag_scan_control *sc)
+{
+	struct mm_struct *mm = sc->mm;
+	struct vm_area_struct *vma = NULL;
+	unsigned long *scan_address = &sc->scan_address;
+	char *stats_buf = NULL;
+	int remain_buf_len = sc->buf_len;
+	int err = 0;
+	struct contig_stats contig_stats;
+
+
+	if (sc->out_buf &&
+		sc->buf_len) {
+		stats_buf = vzalloc(sc->buf_len);
+		if (!stats_buf)
+			goto breakouterloop;
+	}
+
+	/*down_read(&mm->mmap_sem);*/
+	if (unlikely(kmem_defragd_test_exit(mm)))
+		vma = NULL;
+	else {
+		/* get vma_scan_threshold  */
+		if (!sc->vma_scan_threshold)
+			scan_all_vma_lifetime(sc);
+
+		vma = find_vma(mm, *scan_address);
+	}
+
+	for (; vma; vma = vma->vm_next) {
+		unsigned long vstart, vend;
+		struct anchor_page_node *anchor_node = NULL;
+		int scanned_chunks = 0;
+
+
+		if (unlikely(kmem_defragd_test_exit(mm)))
+			break;
+		if (!mem_defrag_vma_check(vma)) {
+			/* collect contiguity stats for this VMA */
+			if (is_stats_collection(sc->action))
+				if (do_vma_stat(mm, vma, stats_buf, sc->buf_len - remain_buf_len,
+							&remain_buf_len))
+					goto breakouterloop;
+			*scan_address = vma->vm_end;
+			goto done_one_vma;
+		}
+
+
+		vstart = vma->vm_start;
+		vend = vma->vm_end;
+		if (vstart >= vend)
+			goto done_one_vma;
+		if (*scan_address > vend)
+			goto done_one_vma;
+		if (*scan_address < vstart)
+			*scan_address = vstart;
+
+		if (sc->action == MEM_DEFRAG_DO_DEFRAG) {
+			/* Check VMA size, skip if below the size threshold */
+			if (vma->vm_end - vma->vm_start <
+					defrag_size_threshold * HPAGE_PMD_SIZE)
+				goto done_one_vma;
+
+			/*
+			 * Check VMA lifetime or size, skip if below the lifetime/size
+			 * threshold derived from a percentile
+			 */
+			if (vma_scan_threshold_type == VMA_THRESHOLD_TYPE_TIME) {
+				if ((jiffies - vma->vma_create_jiffies) < sc->vma_scan_threshold)
+					goto done_one_vma;
+			} else if (vma_scan_threshold_type == VMA_THRESHOLD_TYPE_SIZE) {
+				if ((vma->vm_end - vma->vm_start) < sc->vma_scan_threshold)
+					goto done_one_vma;
+			}
+
+			/* Avoid repeated defrag if the vma has not been changed */
+			if (vma_no_repeat_defrag &&
+				vma->vma_defrag_jiffies > vma->vma_modify_jiffies)
+				goto done_one_vma;
+
+			/* vma contiguity stats collection */
+			if (remain_buf_len && stats_buf) {
+				int used_len;
+				int pos = sc->buf_len - remain_buf_len;
+
+				used_len = scnprintf(stats_buf + pos, remain_buf_len,
+							"vma: 0x%lx, 0x%lx, 0x%lx, -1\n",
+							(unsigned long)vma+vma->vma_create_jiffies,
+							vma->vm_start, vma->vm_end);
+
+				remain_buf_len -= used_len;
+
+				if (remain_buf_len == 1) {
+					stats_buf[pos] = '\0';
+					remain_buf_len = 0;
+				}
+			}
+
+			anchor_node = get_anchor_page_node_from_vma(vma, vma->vm_start);
+
+			if (!anchor_node) {
+				find_anchor_pages_in_vma(mm, vma, vma->vm_start);
+				anchor_node = get_anchor_page_node_from_vma(vma, vma->vm_start);
+
+				if (!anchor_node)
+					goto done_one_vma;
+			}
+		}
+
+		contig_stats = (struct contig_stats) {0};
+
+		while (*scan_address < vend) {
+			struct page *page;
+
+			cond_resched();
+
+			if (unlikely(kmem_defragd_test_exit(mm)))
+				goto breakouterloop;
+
+			if (is_stats_collection(sc->action)) {
+				down_read(&vma->vm_mm->mmap_sem);
+				page = follow_page(vma, *scan_address,
+						FOLL_MIGRATION | FOLL_REMOTE);
+
+				if (do_page_stat(mm, vma, page, *scan_address,
+							stats_buf, sc->buf_len - remain_buf_len,
+							&remain_buf_len, sc->action, &contig_stats,
+							sc->scan_in_vma)) {
+					/* reset scan_address to the beginning of the contig.
+					 * So next scan will get the whole contig.
+					 */
+					if (contig_stats.err) {
+						*scan_address = contig_stats.first_vaddr_in_chunk;
+						sc->scan_in_vma = true;
+					}
+					goto breakouterloop;
+				}
+				/* move to next address */
+				if (page)
+					*scan_address += get_contig_page_size(page);
+				else
+					*scan_address += PAGE_SIZE;
+				up_read(&vma->vm_mm->mmap_sem);
+			} else if (sc->action == MEM_DEFRAG_DO_DEFRAG) {
+				/* go to nearest 1GB aligned address  */
+				unsigned long defrag_end = min_t(unsigned long,
+							(*scan_address + HPAGE_PUD_SIZE) & HPAGE_PUD_MASK,
+							vend);
+				int defrag_result;
+
+				anchor_node = get_anchor_page_node_from_vma(vma, *scan_address);
+
+				/*  in case VMA size changes */
+				if (!anchor_node) {
+					find_anchor_pages_in_vma(mm, vma, *scan_address);
+					anchor_node = get_anchor_page_node_from_vma(vma, *scan_address);
+				}
+
+				if (!anchor_node)
+					goto done_one_vma;
+
+				/*
+				 * looping through the 1GB region and defrag 2MB range in each
+				 * iteration.
+				 */
+				while (*scan_address < defrag_end) {
+					unsigned long defrag_sub_chunk_end = min_t(unsigned long,
+							(*scan_address + HPAGE_PMD_SIZE) & HPAGE_PMD_MASK,
+							defrag_end);
+					struct defrag_result_stats defrag_stats = {0};
+continue_defrag:
+					if (!anchor_node) {
+						anchor_node = get_anchor_page_node_from_vma(vma,
+										*scan_address);
+						if (!anchor_node) {
+							find_anchor_pages_in_vma(mm, vma, *scan_address);
+							anchor_node = get_anchor_page_node_from_vma(vma,
+											*scan_address);
+						}
+						if (!anchor_node)
+							goto done_one_vma;
+					}
+
+					defrag_result = defrag_address_range(mm, vma, *scan_address,
+									defrag_sub_chunk_end,
+									pfn_to_page(anchor_node->anchor_pfn),
+									anchor_node->anchor_vpn<<PAGE_SHIFT,
+									&defrag_stats);
+
+					/*
+					 * collect defrag results to show the cause of page
+					 * migration/exchange failure
+					 */
+					if (remain_buf_len && stats_buf) {
+						int used_len;
+						int pos = sc->buf_len - remain_buf_len;
+
+						used_len = scnprintf(stats_buf + pos, remain_buf_len,
+							"[0x%lx, 0x%lx):%lu [alig:%lu, migrated:%lu, "
+							"src: not:%lu, src_thp_dst_not:%lu, src_pte_thp:%lu "
+							"dst: out_bound:%lu, dst_thp_src_not:%lu, "
+							"dst_pte_thp:%lu, isolate_free:%lu, "
+							"migrate_free:%lu, anon:%lu, file:%lu, "
+							"non-lru:%lu, non-moveable:%lu], "
+							"anchor: (%lx, %lx), range: [%lx, %lx], "
+							"vma: 0x%lx, not_defrag_vpn: %lx\n",
+							*scan_address, defrag_sub_chunk_end,
+							(defrag_sub_chunk_end - *scan_address)/PAGE_SIZE,
+							defrag_stats.aligned,
+							defrag_stats.migrated,
+							defrag_stats.src_not_present,
+							defrag_stats.src_thp_dst_not_failed,
+							defrag_stats.src_pte_thp_failed,
+							defrag_stats.dst_out_of_bound_failed,
+							defrag_stats.dst_thp_src_not_failed,
+							defrag_stats.dst_pte_thp_failed,
+							defrag_stats.dst_isolate_free_failed,
+							defrag_stats.dst_migrate_free_failed,
+							defrag_stats.dst_anon_failed,
+							defrag_stats.dst_file_failed,
+							defrag_stats.dst_non_lru_failed,
+							defrag_stats.dst_non_moveable_failed,
+							anchor_node->anchor_vpn,
+							anchor_node->anchor_pfn,
+							anchor_node->node.start,
+							anchor_node->node.last,
+							(unsigned long)vma+vma->vma_create_jiffies,
+							defrag_stats.not_defrag_vpn
+							);
+
+						remain_buf_len -= used_len;
+
+						if (remain_buf_len == 1) {
+							stats_buf[pos] = '\0';
+							remain_buf_len = 0;
+						}
+					}
+
+					/*
+					 * skip the page which cannot be defragged and restart
+					 * from the next page
+					 */
+					if (defrag_stats.not_defrag_vpn &&
+						defrag_stats.not_defrag_vpn < defrag_sub_chunk_end) {
+						VM_BUG_ON(defrag_sub_chunk_end != defrag_end &&
+							defrag_stats.not_defrag_vpn > defrag_sub_chunk_end);
+
+						*scan_address = defrag_stats.not_defrag_vpn;
+						defrag_stats.not_defrag_vpn = 0;
+						goto continue_defrag;
+					}
+
+					/* Done with current 2MB chunk */
+					*scan_address = defrag_sub_chunk_end;
+					scanned_chunks++;
+					/*
+					 * if the knob is set, break out of the defrag loop after
+					 * a preset number of 2MB chunks are defragged
+					 */
+					if (num_breakout_chunks && scanned_chunks >= num_breakout_chunks) {
+						scanned_chunks = 0;
+						goto breakouterloop;
+					}
+				}
+
+			}
+		}
+done_one_vma:
+		sc->scan_in_vma = false;
+		if (sc->action == MEM_DEFRAG_DO_DEFRAG)
+			vma->vma_defrag_jiffies = jiffies;
+	}
+breakouterloop:
+
+	/* copy stats to user space */
+	if (sc->out_buf &&
+		sc->buf_len) {
+		err = copy_to_user(sc->out_buf, stats_buf,
+				sc->buf_len - remain_buf_len);
+		sc->used_len = sc->buf_len - remain_buf_len;
+	}
+
+	if (stats_buf)
+		vfree(stats_buf);
+
+	/* 0: scan a vma complete, 1: scan a vma incomplete  */
+	return vma == NULL ? 0 : 1;
+}
+
+SYSCALL_DEFINE4(scan_process_memory, pid_t, pid, char __user *, out_buf,
+				int, buf_len, int, action)
+{
+	const struct cred *cred = current_cred(), *tcred;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	int err = 0;
+	static struct defrag_scan_control defrag_scan_control = {0};
+
+	/* Find the mm_struct */
+	rcu_read_lock();
+	task = pid ? find_task_by_vpid(pid) : current;
+	if (!task) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+	get_task_struct(task);
+
+	/*
+	 * Check if this process has the right to modify the specified
+	 * process. The right exists if the process has administrative
+	 * capabilities, superuser privileges or the same
+	 * userid as the target process.
+	 */
+	tcred = __task_cred(task);
+	if (!uid_eq(cred->euid, tcred->suid) && !uid_eq(cred->euid, tcred->uid) &&
+	    !uid_eq(cred->uid,  tcred->suid) && !uid_eq(cred->uid,  tcred->uid) &&
+	    !capable(CAP_SYS_NICE)) {
+		rcu_read_unlock();
+		err = -EPERM;
+		goto out;
+	}
+	rcu_read_unlock();
+
+	err = security_task_movememory(task);
+	if (err)
+		goto out;
+
+	mm = get_task_mm(task);
+	put_task_struct(task);
+
+	if (!mm)
+		return -EINVAL;
+
+	switch (action) {
+	case MEM_DEFRAG_SCAN:
+	case MEM_DEFRAG_CONTIG_SCAN:
+		count_vm_event(MEM_DEFRAG_SCAN_NUM);
+		/*
+		 * We allow scanning one process's address space for multiple
+		 * iterations. When we change the scanned process, reset
+		 * defrag_scan_control's mm_struct
+		 */
+		if (!defrag_scan_control.mm ||
+			defrag_scan_control.mm != mm) {
+			defrag_scan_control = (struct defrag_scan_control){0};
+			defrag_scan_control.mm = mm;
+		}
+		defrag_scan_control.out_buf = out_buf;
+		defrag_scan_control.buf_len = buf_len;
+		if (action == MEM_DEFRAG_SCAN)
+			defrag_scan_control.action = MEM_DEFRAG_FULL_STATS;
+		else if (action == MEM_DEFRAG_CONTIG_SCAN)
+			defrag_scan_control.action = MEM_DEFRAG_CONTIG_STATS;
+		else {
+			err = -EINVAL;
+			break;
+		}
+
+		defrag_scan_control.used_len = 0;
+
+		if (unlikely(!access_ok(out_buf, buf_len))) {
+			err = -EFAULT;
+			break;
+		}
+
+		/* clear mm once it is fully scanned  */
+		if (!kmem_defragd_scan_mm(&defrag_scan_control) &&
+			!defrag_scan_control.used_len)
+			defrag_scan_control.mm = NULL;
+
+		err = defrag_scan_control.used_len;
+		break;
+	case MEM_DEFRAG_MARK_SCAN_ALL:
+		set_bit(MMF_VM_MEM_DEFRAG_ALL, &mm->flags);
+		__kmem_defragd_enter(mm);
+		break;
+	case MEM_DEFRAG_CLEAR_SCAN_ALL:
+		clear_bit(MMF_VM_MEM_DEFRAG_ALL, &mm->flags);
+		break;
+	case MEM_DEFRAG_DEFRAG:
+		count_vm_event(MEM_DEFRAG_DEFRAG_NUM);
+
+		if (!defrag_scan_control.mm ||
+			defrag_scan_control.mm != mm) {
+			defrag_scan_control = (struct defrag_scan_control){0};
+			defrag_scan_control.mm = mm;
+		}
+		defrag_scan_control.action = MEM_DEFRAG_DO_DEFRAG;
+
+		defrag_scan_control.out_buf = out_buf;
+		defrag_scan_control.buf_len = buf_len;
+
+		/* clear mm once it is fully defragged */
+		if (buf_len) {
+			if (!kmem_defragd_scan_mm(&defrag_scan_control) &&
+				!defrag_scan_control.used_len) {
+				defrag_scan_control.mm = NULL;
+			}
+			err = defrag_scan_control.used_len;
+		} else {
+			err = kmem_defragd_scan_mm(&defrag_scan_control);
+			if (err == 0)
+				defrag_scan_control.mm = NULL;
+		}
+		break;
+	default:
+		err = -EINVAL;
+		break;
+	}
+
+	mmput(mm);
+	return err;
+
+out:
+	put_task_struct(task);
+	return err;
+}
+
+static unsigned int kmem_defragd_scan_mm_slot(void)
+{
+	struct mm_slot *mm_slot;
+	int scan_status = 0;
+	struct defrag_scan_control defrag_scan_control = {0};
+
+	spin_lock(&kmem_defragd_mm_lock);
+	if (kmem_defragd_scan.mm_slot)
+		mm_slot = kmem_defragd_scan.mm_slot;
+	else {
+		mm_slot = list_entry(kmem_defragd_scan.mm_head.next,
+				     struct mm_slot, mm_node);
+		kmem_defragd_scan.address = 0;
+		kmem_defragd_scan.mm_slot = mm_slot;
+	}
+	spin_unlock(&kmem_defragd_mm_lock);
+
+	defrag_scan_control.mm = mm_slot->mm;
+	defrag_scan_control.scan_address = kmem_defragd_scan.address;
+	defrag_scan_control.action = MEM_DEFRAG_DO_DEFRAG;
+
+	scan_status = kmem_defragd_scan_mm(&defrag_scan_control);
+
+	kmem_defragd_scan.address = defrag_scan_control.scan_address;
+
+	spin_lock(&kmem_defragd_mm_lock);
+	VM_BUG_ON(kmem_defragd_scan.mm_slot != mm_slot);
+	/*
+	 * Release the current mm_slot if this mm is about to die, or
+	 * if we scanned all vmas of this mm.
+	 */
+	if (kmem_defragd_test_exit(mm_slot->mm) || !scan_status) {
+		/*
+		 * Make sure that if mm_users is reaching zero while
+		 * kmem_defragd runs here, kmem_defragd_exit will find
+		 * mm_slot not pointing to the exiting mm.
+		 */
+		if (mm_slot->mm_node.next != &kmem_defragd_scan.mm_head) {
+			kmem_defragd_scan.mm_slot = list_first_entry(
+				&mm_slot->mm_node,
+				struct mm_slot, mm_node);
+			kmem_defragd_scan.address = 0;
+		} else
+			kmem_defragd_scan.mm_slot = NULL;
+
+		if (kmem_defragd_test_exit(mm_slot->mm))
+			collect_mm_slot(mm_slot);
+		else if (!scan_status) {
+			list_del(&mm_slot->mm_node);
+			list_add_tail(&mm_slot->mm_node, &kmem_defragd_scan.mm_head);
+		}
+	}
+	spin_unlock(&kmem_defragd_mm_lock);
+
+	return 0;
+}
+
+int memdefrag_madvise(struct vm_area_struct *vma,
+		     unsigned long *vm_flags, int advice)
+{
+	switch (advice) {
+	case MADV_MEMDEFRAG:
+		*vm_flags &= ~VM_NOMEMDEFRAG;
+		*vm_flags |= VM_MEMDEFRAG;
+		/*
+		 * If the vma become good for kmem_defragd to scan,
+		 * register it here without waiting a page fault that
+		 * may not happen any time soon.
+		 */
+		if (kmem_defragd_enter(vma, *vm_flags))
+			return -ENOMEM;
+		break;
+	case MADV_NOMEMDEFRAG:
+		*vm_flags &= ~VM_MEMDEFRAG;
+		*vm_flags |= VM_NOMEMDEFRAG;
+		/*
+		 * Setting VM_NOMEMDEFRAG will prevent kmem_defragd from scanning
+		 * this vma even if we leave the mm registered in kmem_defragd if
+		 * it got registered before VM_NOMEMDEFRAG was set.
+		 */
+		break;
+	}
+
+	return 0;
+}
+
+
+void __init kmem_defragd_destroy(void)
+{
+	kmem_cache_destroy(mm_slot_cache);
+}
+
+int __init kmem_defragd_init(void)
+{
+	mm_slot_cache = kmem_cache_create("kmem_defragd_mm_slot",
+					  sizeof(struct mm_slot),
+					  __alignof__(struct mm_slot), 0, NULL);
+	if (!mm_slot_cache)
+		return -ENOMEM;
+
+	return 0;
+}
+
+subsys_initcall(kmem_defragd_init);
\ No newline at end of file
diff --git a/mm/memory.c b/mm/memory.c
index e11ca9dd823f..019036e87088 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,6 +69,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/dax.h>
 #include <linux/oom.h>
+#include <linux/mem_defrag.h>
 
 #include <asm/io.h>
 #include <asm/mmu_context.h>
@@ -2926,6 +2927,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
+	/* Make it defrag  */
+	if (unlikely(kmem_defragd_enter(vma, vma->vm_flags)))
+		goto oom;
 	page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
 	if (!page)
 		goto oom;
@@ -3844,6 +3848,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	p4d_t *p4d;
 	vm_fault_t ret;
 
+	/* Zi: page faults modify vma */
+	vma->vma_modify_jiffies = jiffies;
+
 	pgd = pgd_offset(mm, address);
 	p4d = p4d_alloc(mm, pgd, address);
 	if (!p4d)
diff --git a/mm/mmap.c b/mm/mmap.c
index f901065c4c64..653dd99d5145 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -169,6 +169,28 @@ void unlink_file_vma(struct vm_area_struct *vma)
 	}
 }
 
+void free_anchor_pages(struct vm_area_struct *vma)
+{
+	struct interval_tree_node *node;
+
+	if (!vma)
+		return;
+
+	if (RB_EMPTY_ROOT(&vma->anchor_page_rb.rb_root))
+		return;
+
+	for (node = interval_tree_iter_first(&vma->anchor_page_rb,
+				0, (unsigned long)-1);
+		 node;) {
+		struct anchor_page_node *anchor_node = container_of(node,
+					struct anchor_page_node, node);
+		interval_tree_remove(node, &vma->anchor_page_rb);
+		node = interval_tree_iter_next(node, 0, (unsigned long)-1);
+		kfree(anchor_node);
+	}
+
+}
+
 /*
  * Close a vm structure and free it, returning the next.
  */
@@ -181,6 +203,7 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 		vma->vm_ops->close(vma);
 	if (vma->vm_file)
 		fput(vma->vm_file);
+	free_anchor_pages(vma);
 	mpol_put(vma_policy(vma));
 	vm_area_free(vma);
 	return next;
@@ -725,10 +748,15 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	long adjust_next = 0;
 	int remove_next = 0;
 
+	/*free_anchor_pages(vma);*/
+
+	vma->vma_modify_jiffies = jiffies;
+
 	if (next && !insert) {
 		struct vm_area_struct *exporter = NULL, *importer = NULL;
 
 		if (end >= next->vm_end) {
+			/*free_anchor_pages(next);*/
 			/*
 			 * vma expands, overlapping all the next, and
 			 * perhaps the one after too (mprotect case 6).
@@ -775,6 +803,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 				exporter = next->vm_next;
 
 		} else if (end > next->vm_start) {
+			/*free_anchor_pages(next);*/
 			/*
 			 * vma expands, overlapping part of the next:
 			 * mprotect case 5 shifting the boundary up.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 35fdde041f5c..a35605e0924a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1828,7 +1828,7 @@ void __init init_cma_reserved_pageblock(struct page *page)
  *
  * -- nyc
  */
-static inline void expand(struct zone *zone, struct page *page,
+inline void expand(struct zone *zone, struct page *page,
 	int low, int high, struct free_area *area,
 	int migratetype)
 {
@@ -1950,7 +1950,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 	set_page_owner(page, order, gfp_flags);
 }
 
-static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
+void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
 							unsigned int alloc_flags)
 {
 	int i;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 83b30edc2f7f..c18a42250a5c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1293,6 +1293,27 @@ const char * const vmstat_text[] = {
 	"swap_ra",
 	"swap_ra_hit",
 #endif
+	"memdefrag_defrag",
+	"memdefrag_scan",
+	"memdefrag_dest_free_pages",
+	"memdefrag_dest_anon_pages",
+	"memdefrag_dest_file_pages",
+	"memdefrag_dest_non_lru_pages",
+	"memdefrag_dest_free_pages_failed",
+	"memdefrag_dest_free_pages_overflow_failed",
+	"memdefrag_dest_anon_pages_failed",
+	"memdefrag_dest_file_pages_failed",
+	"memdefrag_dest_nonlru_pages_failed",
+	"memdefrag_src_anon_pages_failed",
+	"memdefrag_src_compound_pages_failed",
+	"memdefrag_dst_split_hugepage",
+#ifdef CONFIG_COMPACTION
+	"compact_migrate_pages",
+#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	"thp_collapse_migrate_pages"
+#endif
+
 #endif /* CONFIG_VM_EVENTS_COUNTERS */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA */
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 05/31] mem_defrag: split a THP if either src or dst is THP only.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (3 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 04/31] mm: add mem_defrag functionality Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 06/31] mm: Make MAX_ORDER configurable in Kconfig for buddy allocator Zi Yan
                   ` (26 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

During the process of generating physically contiguous memory, it is
possible that we want to move a THP to a place with 512 base pages.
Exchange pages has not implemented the exchange of a THP and 512 base
pages. Instead, we can split the THP and exchange 512 base pages.
This increases the chance of creating a large contiguous region.
A split THP could be promoted back after all 512 pages are moved to the
destination or if none of its subpages is moved.
In-place THP promotion will be introduced later in this patch serie.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/internal.h   |   4 ++
 mm/mem_defrag.c | 155 +++++++++++++++++++++++++++++++++++++-----------
 mm/page_alloc.c |  45 ++++++++++++++
 3 files changed, 168 insertions(+), 36 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 4fe8d1a4d7bb..70a6ef603e5b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -574,6 +574,10 @@ void expand(struct zone *zone, struct page *page,
 	int low, int high, struct free_area *area,
 	int migratetype);
 
+int expand_free_page(struct zone *zone, struct page *buddy_head,
+	struct page *page, int buddy_order, int page_order,
+	struct free_area *area, int migratetype);
+
 void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
 							unsigned int alloc_flags);
 
diff --git a/mm/mem_defrag.c b/mm/mem_defrag.c
index 414909e1c19c..4d458b125c95 100644
--- a/mm/mem_defrag.c
+++ b/mm/mem_defrag.c
@@ -643,6 +643,15 @@ static void exchange_free(struct page *freepage, unsigned long data)
 		head->num_freepages++;
 }
 
+static bool page_can_migrate(struct page *page)
+{
+	if (PageAnon(page))
+		return true;
+	if (page_mapping(page))
+		return true;
+	return false;
+}
+
 int defrag_address_range(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long start_addr, unsigned long end_addr,
 		struct page *anchor_page, unsigned long page_vaddr,
@@ -655,6 +664,7 @@ int defrag_address_range(struct mm_struct *mm, struct vm_area_struct *vma,
 	int not_present = 0;
 	bool src_thp = false;
 
+restart:
 	for (scan_address = start_addr; scan_address < end_addr;
 		 scan_address += page_size) {
 		struct page *scan_page;
@@ -683,6 +693,8 @@ int defrag_address_range(struct mm_struct *mm, struct vm_area_struct *vma,
 		if ((scan_page == compound_head(scan_page)) &&
 			PageTransHuge(scan_page) && !PageHuge(scan_page))
 			src_thp = true;
+		else
+			src_thp = false;
 
 		/* Allow THPs  */
 		if (PageCompound(scan_page) && !src_thp) {
@@ -720,13 +732,17 @@ int defrag_address_range(struct mm_struct *mm, struct vm_area_struct *vma,
 			}
 
 retry_defrag:
-			/* migrate */
-			if (PageBuddy(dest_page)) {
+				/* free pages */
+			if (page_count(dest_page) == 0 && dest_page->mapping == NULL) {
+				int buddy_page_order = 0;
+				unsigned long pfn = page_to_pfn(dest_page);
+				unsigned long buddy_pfn;
+				struct page *buddy = dest_page;
 				struct zone *zone = page_zone(dest_page);
 				spinlock_t *zone_lock = &zone->lock;
 				unsigned long zone_lock_flags;
 				unsigned long free_page_order = 0;
-				int err = 0;
+				int err = 0, expand_err = 0;
 				struct exchange_alloc_head exchange_alloc_head = {0};
 				int migratetype = get_pageblock_migratetype(dest_page);
 
@@ -734,32 +750,77 @@ int defrag_address_range(struct mm_struct *mm, struct vm_area_struct *vma,
 				INIT_LIST_HEAD(&exchange_alloc_head.freelist);
 				INIT_LIST_HEAD(&exchange_alloc_head.migratepage_list);
 
-				count_vm_events(MEM_DEFRAG_DST_FREE_PAGES, 1<<scan_page_order);
+				/* not managed pages  */
+				if (!dest_page->flags) {
+					failed += 1;
+					defrag_stats->dst_out_of_bound_failed += 1;
 
+					defrag_stats->not_defrag_vpn = scan_address + page_size;
+					goto quit_defrag;
+				}
+				/* spill order-0 pages to buddy allocator from pcplist */
+				if (!PageBuddy(dest_page) && !page_drained) {
+					drain_all_pages(zone);
+					page_drained = 1;
+					goto retry_defrag;
+				}
 				/* lock page_zone(dest_page)->lock  */
 				spin_lock_irqsave(zone_lock, zone_lock_flags);
 
-				if (!PageBuddy(dest_page)) {
+				while (!PageBuddy(buddy) && buddy_page_order < MAX_ORDER) {
+					buddy_pfn = pfn & ~((1<<buddy_page_order) - 1);
+					buddy = dest_page - (pfn - buddy_pfn);
+					buddy_page_order++;
+				}
+				if (!PageBuddy(buddy)) {
 					err = -EINVAL;
 					goto freepage_isolate_fail;
 				}
 
-				free_page_order = page_order(dest_page);
+				count_vm_events(MEM_DEFRAG_DST_FREE_PAGES, 1<<scan_page_order);
 
-				/* fail early if not enough free pages */
-				if (free_page_order < scan_page_order) {
+				free_page_order = page_order(buddy);
+
+				/* caught some transient-state page */
+				if (free_page_order < buddy_page_order) {
 					err = -ENOMEM;
 					goto freepage_isolate_fail;
 				}
 
+				/* fail early if not enough free pages */
+				if (free_page_order < scan_page_order) {
+					int ret;
+
+					spin_unlock_irqrestore(zone_lock, zone_lock_flags);
+
+					if (is_huge_zero_page(scan_page)) {
+						err = -ENOMEM;
+						goto freepage_isolate_fail_unlocked;
+					}
+					get_page(scan_page);
+					lock_page(scan_page);
+					ret = split_huge_page(scan_page);
+					unlock_page(scan_page);
+					put_page(scan_page);
+					if (ret) {
+						err = -ENOMEM;
+						goto freepage_isolate_fail_unlocked;
+					} else {
+						goto restart;
+					}
+				}
+
 				/* __isolate_free_page()  */
-				err = isolate_free_page_no_wmark(dest_page, free_page_order);
+				err = isolate_free_page_no_wmark(buddy, free_page_order);
 				if (!err)
 					goto freepage_isolate_fail;
 
-				expand(zone, dest_page, scan_page_order, free_page_order,
+				expand_err = expand_free_page(zone, buddy, dest_page,
+					free_page_order, scan_page_order,
 					&(zone->free_area[free_page_order]),
 					migratetype);
+				if (expand_err)
+					goto freepage_isolate_fail;
 
 				if (!is_migrate_isolate(migratetype))
 					__mod_zone_freepage_state(zone, -(1UL << scan_page_order),
@@ -778,7 +839,7 @@ int defrag_address_range(struct mm_struct *mm, struct vm_area_struct *vma,
 
 freepage_isolate_fail:
 				spin_unlock_irqrestore(zone_lock, zone_lock_flags);
-
+freepage_isolate_fail_unlocked:
 				if (err < 0) {
 					failed += (page_size/PAGE_SIZE);
 					defrag_stats->dst_isolate_free_failed += (page_size/PAGE_SIZE);
@@ -844,6 +905,8 @@ int defrag_address_range(struct mm_struct *mm, struct vm_area_struct *vma,
 				if ((dest_page == compound_head(dest_page)) &&
 					PageTransHuge(dest_page) && !PageHuge(dest_page))
 					dst_thp = true;
+				else
+					dst_thp = false;
 
 				if (PageCompound(dest_page) && !dst_thp) {
 					failed += get_contig_page_size(dest_page);
@@ -854,37 +917,56 @@ int defrag_address_range(struct mm_struct *mm, struct vm_area_struct *vma,
 				}
 
 				if (src_thp != dst_thp) {
-					failed += get_contig_page_size(scan_page);
-					if (src_thp && !dst_thp)
-						defrag_stats->src_thp_dst_not_failed +=
-							page_size/PAGE_SIZE;
-					else /* !src_thp && dst_thp */
-						defrag_stats->dst_thp_src_not_failed +=
-							page_size/PAGE_SIZE;
+					if (src_thp && !dst_thp) {
+						int ret;
+
+						if (!page_can_migrate(dest_page)) {
+							failed += get_contig_page_size(scan_page);
+							defrag_stats->not_defrag_vpn = scan_address + page_size;
+							goto quit_defrag;
+						}
 
+						get_page(scan_page);
+						lock_page(scan_page);
+						if (!PageCompound(scan_page) || is_huge_zero_page(scan_page)) {
+							ret = 0;
+							src_thp = false;
+							goto split_src_done;
+						}
+						ret = split_huge_page(scan_page);
+split_src_done:
+						unlock_page(scan_page);
+						put_page(scan_page);
+						if (ret)
+							defrag_stats->src_thp_dst_not_failed += page_size/PAGE_SIZE;
+						else
+							goto restart;
+					} else {/* !src_thp && dst_thp */
+						int ret;
+
+						get_page(dest_page);
+						lock_page(dest_page);
+						if (!PageCompound(dest_page) || is_huge_zero_page(dest_page)) {
+							ret = 0;
+							dst_thp = false;
+							goto split_dst_done;
+						}
+						ret = split_huge_page(dest_page);
+split_dst_done:
+						unlock_page(dest_page);
+						put_page(dest_page);
+						if (ret)
+							defrag_stats->dst_thp_src_not_failed += page_size/PAGE_SIZE;
+						else
+							goto retry_defrag;
+					}
+
+					failed += get_contig_page_size(scan_page);
 					defrag_stats->not_defrag_vpn = scan_address + page_size;
 					goto quit_defrag;
 					/*continue;*/
 				}
 
-				/* free page on pcplist */
-				if (page_count(dest_page) == 0) {
-					/* not managed pages  */
-					if (!dest_page->flags) {
-						failed += 1;
-						defrag_stats->dst_out_of_bound_failed += 1;
-
-						defrag_stats->not_defrag_vpn = scan_address + page_size;
-						goto quit_defrag;
-					}
-					/* spill order-0 pages to buddy allocator from pcplist */
-					if (!page_drained) {
-						drain_all_pages(NULL);
-						page_drained = 1;
-						goto retry_defrag;
-					}
-				}
-
 				if (PageAnon(dest_page)) {
 					count_vm_events(MEM_DEFRAG_DST_ANON_PAGES,
 							1<<scan_page_order);
@@ -895,6 +977,7 @@ int defrag_address_range(struct mm_struct *mm, struct vm_area_struct *vma,
 								1<<scan_page_order);
 						failed += 1<<scan_page_order;
 						defrag_stats->dst_anon_failed += 1<<scan_page_order;
+						/*print_page_stats(dest_page, "anonymous page");*/
 					}
 				} else if (page_mapping(dest_page)) {
 					count_vm_events(MEM_DEFRAG_DST_FILE_PAGES,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a35605e0924a..9ba2cdc320f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1855,6 +1855,51 @@ inline void expand(struct zone *zone, struct page *page,
 	}
 }
 
+inline int expand_free_page(struct zone *zone, struct page *buddy_head,
+	struct page *page, int buddy_order, int page_order, struct free_area *area,
+	int migratetype)
+{
+	unsigned long size = 1 << buddy_order;
+
+	if (!(page >= buddy_head && page < (buddy_head + (1<<buddy_order)))) {
+		int mapcount = PageSlab(buddy_head) ? 0 : page_mapcount(buddy_head);
+
+		mapcount = PageSlab(page) ? 0 : page_mapcount(page);
+		__free_one_page(buddy_head, page_to_pfn(buddy_head), zone, buddy_order,
+				migratetype);
+		return -EINVAL;
+	}
+
+	while (buddy_order > page_order) {
+		struct page *page_to_free;
+
+		area--;
+		buddy_order--;
+		size >>= 1;
+
+		if (page < (buddy_head + size))
+			page_to_free = buddy_head + size;
+		else {
+			page_to_free = buddy_head;
+			buddy_head = buddy_head + size;
+		}
+
+		/*
+		 * Mark as guard pages (or page), that will allow to
+		 * merge back to allocator when buddy will be freed.
+		 * Corresponding page table entries will not be touched,
+		 * pages will stay not present in virtual address space
+		 */
+		if (set_page_guard(zone, page_to_free, buddy_order, migratetype))
+			continue;
+
+		list_add(&page_to_free->lru, &area->free_list[migratetype]);
+		area->nr_free++;
+		set_page_order(page_to_free, buddy_order);
+	}
+	return 0;
+}
+
 static void check_new_page_bad(struct page *page)
 {
 	const char *bad_reason = NULL;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 06/31] mm: Make MAX_ORDER configurable in Kconfig for buddy allocator.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (4 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 05/31] mem_defrag: split a THP if either src or dst is THP only Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 07/31] mm: deallocate pages with order > MAX_ORDER Zi Yan
                   ` (25 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

To test 1GB THP implemented in the following patches, this patch enables
changing MAX_ORDER of the buddy allocator.

It should be dropped later when we solely rely on mem_defrag to generate
1GB THPs.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/Kconfig                 | 15 +++++++++++++++
 arch/x86/include/asm/sparsemem.h |  4 ++--
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 68261430fe6e..f766ff5651d5 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1665,6 +1665,21 @@ config X86_PMEM_LEGACY
 
 	  Say Y if unsure.
 
+config FORCE_MAX_ZONEORDER
+	int "Maximum zone order"
+	range 11 20
+	default "11"
+	help
+	  The kernel memory allocator divides physically contiguous memory
+	  blocks into "zones", where each zone is a power of two number of
+	  pages.  This option selects the largest power of two that the kernel
+	  keeps in the memory allocator.  If you need to allocate very large
+	  blocks of physically contiguous memory, then you may need to
+	  increase this value.
+
+	  This config option is actually maximum order plus one. For example,
+	  a value of 11 means that the largest free memory block is 2^10 pages.
+
 config HIGHPTE
 	bool "Allocate 3rd-level pagetables from highmem"
 	depends on HIGHMEM
diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
index 199218719a86..2df61d5ccc2d 100644
--- a/arch/x86/include/asm/sparsemem.h
+++ b/arch/x86/include/asm/sparsemem.h
@@ -21,12 +21,12 @@
 #  define MAX_PHYSADDR_BITS	36
 #  define MAX_PHYSMEM_BITS	36
 # else
-#  define SECTION_SIZE_BITS	26
+#  define SECTION_SIZE_BITS	31
 #  define MAX_PHYSADDR_BITS	32
 #  define MAX_PHYSMEM_BITS	32
 # endif
 #else /* CONFIG_X86_32 */
-# define SECTION_SIZE_BITS	27 /* matt - 128 is convenient right now */
+# define SECTION_SIZE_BITS	31 /* matt - 128 is convenient right now */
 # define MAX_PHYSADDR_BITS	(pgtable_l5_enabled() ? 52 : 44)
 # define MAX_PHYSMEM_BITS	(pgtable_l5_enabled() ? 52 : 46)
 #endif
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 07/31] mm: deallocate pages with order > MAX_ORDER.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (5 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 06/31] mm: Make MAX_ORDER configurable in Kconfig for buddy allocator Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 08/31] mm: add pagechain container for storing multiple pages Zi Yan
                   ` (24 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

When MAX_ORDER is not set to allocate 1GB pages and 1GB THPs are created
from in-place promotion, we need this to properly free 1GB THPs.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_alloc.c | 36 ++++++++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ba2cdc320f2..cfa99bb54bd6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1287,6 +1287,24 @@ void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
 	}
 }
 
+static void destroy_compound_gigantic_page(struct page *page,
+					unsigned int order)
+{
+	int i;
+	int nr_pages = 1 << order;
+	struct page *p = page + 1;
+
+	atomic_set(compound_mapcount_ptr(page), 0);
+	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+		clear_compound_head(p);
+		set_page_refcounted(p);
+	}
+
+	set_compound_order(page, 0);
+	__ClearPageHead(page);
+	set_page_refcounted(page);
+}
+
 static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
@@ -1296,11 +1314,16 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	if (!free_pages_prepare(page, order, true))
 		return;
 
-	migratetype = get_pfnblock_migratetype(page, pfn);
-	local_irq_save(flags);
-	__count_vm_events(PGFREE, 1 << order);
-	free_one_page(page_zone(page), page, pfn, order, migratetype);
-	local_irq_restore(flags);
+	if (order > MAX_ORDER) {
+		destroy_compound_gigantic_page(page, order);
+		free_contig_range(page_to_pfn(page), 1 << order);
+	} else {
+		migratetype = get_pfnblock_migratetype(page, pfn);
+		local_irq_save(flags);
+		__count_vm_events(PGFREE, 1 << order);
+		free_one_page(page_zone(page), page, pfn, order, migratetype);
+		local_irq_restore(flags);
+	}
 }
 
 static void __init __free_pages_boot_core(struct page *page, unsigned int order)
@@ -8281,6 +8304,8 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	return ret;
 }
 
+#endif
+
 void free_contig_range(unsigned long pfn, unsigned nr_pages)
 {
 	unsigned int count = 0;
@@ -8293,7 +8318,6 @@ void free_contig_range(unsigned long pfn, unsigned nr_pages)
 	}
 	WARN(count != 0, "%d pages are still in use!\n", count);
 }
-#endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 /*
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 08/31] mm: add pagechain container for storing multiple pages.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (6 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 07/31] mm: deallocate pages with order > MAX_ORDER Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 09/31] mm: thp: 1GB anonymous page implementation Zi Yan
                   ` (23 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

When depositing page table pages for 1GB THPs, we need 512 PTE pages +
1 PMD page. Instead of counting and depositing 513 pages, we can use the
PMD page as a leader page and chain the rest 512 PTE pages with ->lru.
This, however, prevents us depositing PMD pages with ->lru, which is
currently used by depositing PTE pages for 2MB THPs. So add a new
pagechain container for PMD pages.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/pagechain.h | 73 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)
 create mode 100644 include/linux/pagechain.h

diff --git a/include/linux/pagechain.h b/include/linux/pagechain.h
new file mode 100644
index 000000000000..be536142b413
--- /dev/null
+++ b/include/linux/pagechain.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * include/linux/pagechain.h
+ *
+ * In many places it is efficient to batch an operation up against multiple
+ * pages. A pagechain is a multipage container which is used for that.
+ */
+
+#ifndef _LINUX_PAGECHAIN_H
+#define _LINUX_PAGECHAIN_H
+
+#include <linux/slab.h>
+
+/* 14 pointers + two long's align the pagechain structure to a power of two */
+#define PAGECHAIN_SIZE	13
+
+struct page;
+
+struct pagechain {
+	struct list_head list;
+	unsigned int nr;
+	struct page *pages[PAGECHAIN_SIZE];
+};
+
+static inline void pagechain_init(struct pagechain *pchain)
+{
+	pchain->nr = 0;
+	INIT_LIST_HEAD(&pchain->list);
+}
+
+static inline void pagechain_reinit(struct pagechain *pchain)
+{
+	pchain->nr = 0;
+}
+
+static inline unsigned int pagechain_count(struct pagechain *pchain)
+{
+	return pchain->nr;
+}
+
+static inline unsigned int pagechain_space(struct pagechain *pchain)
+{
+	return PAGECHAIN_SIZE - pchain->nr;
+}
+
+static inline bool pagechain_empty(struct pagechain *pchain)
+{
+	return pchain->nr == 0;
+}
+
+/*
+ * Add a page to a pagechain.  Returns the number of slots still available.
+ */
+static inline unsigned int pagechain_deposit(struct pagechain *pchain, struct page *page)
+{
+	VM_BUG_ON(!pagechain_space(pchain));
+	pchain->pages[pchain->nr++] = page;
+	return pagechain_space(pchain);
+}
+
+static inline struct page *pagechain_withdraw(struct pagechain *pchain)
+{
+	if (!pagechain_count(pchain))
+		return NULL;
+	return pchain->pages[--pchain->nr];
+}
+
+void __init pagechain_cache_init(void);
+struct pagechain *pagechain_alloc(void);
+void pagechain_free(struct pagechain *pchain);
+
+#endif /* _LINUX_PAGECHAIN_H */
+
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 09/31] mm: thp: 1GB anonymous page implementation.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (7 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 08/31] mm: add pagechain container for storing multiple pages Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 10/31] mm: proc: add 1GB THP kpageflag Zi Yan
                   ` (22 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

This adds 1GB THP support for anonymous pages. Applications can get 1GB
pages during page faults when their VMAs are larger than 1GB. For
read-only 1GB zero THP, a shared 1GB zero THP is created for all
readers.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/include/asm/pgalloc.h |  58 +++++++
 arch/x86/include/asm/pgtable.h |   2 +
 arch/x86/mm/pgtable.c          |  25 +++
 drivers/base/node.c            |   4 +-
 fs/proc/meminfo.c              |   3 +-
 include/asm-generic/pgtable.h  |   3 +
 include/linux/huge_mm.h        |  17 ++-
 include/linux/mm.h             |   4 +
 include/linux/mm_types.h       |   1 +
 include/linux/mmzone.h         |   1 +
 include/linux/sched/coredump.h |   1 +
 include/linux/vm_event_item.h  |   2 +
 kernel/fork.c                  |   5 +
 mm/huge_memory.c               | 267 ++++++++++++++++++++++++++++++++-
 mm/memory.c                    |  28 +++-
 mm/page_alloc.c                |   3 +-
 mm/pgtable-generic.c           |  47 +++++-
 mm/rmap.c                      |  28 +++-
 mm/vmstat.c                    |   3 +
 19 files changed, 484 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index a281e61ec60c..6e29ad9b9d7f 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -49,6 +49,7 @@ extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
 
 extern pte_t *pte_alloc_one_kernel(struct mm_struct *);
 extern pgtable_t pte_alloc_one(struct mm_struct *);
+extern pgtable_t pte_alloc_order(struct mm_struct *, unsigned long, int);
 
 /* Should really implement gc for free page table pages. This could be
    done with a reference count in struct page. */
@@ -65,6 +66,17 @@ static inline void pte_free(struct mm_struct *mm, struct page *pte)
 	__free_page(pte);
 }
 
+static inline void pte_free_order(struct mm_struct *mm, struct page *pte,
+		int order)
+{
+	int i;
+
+	for (i = 0; i < (1<<order); i++) {
+		pgtable_page_dtor(&pte[i]);
+		__free_page(&pte[i]);
+	}
+}
+
 extern void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte);
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *pte,
@@ -123,6 +135,52 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 	free_page((unsigned long)pmd);
 }
 
+static inline pmd_t *pmd_alloc_one_page_with_ptes(struct mm_struct *mm, unsigned long addr)
+{
+	pgtable_t pte_pgtables;
+	pmd_t *pmd;
+	spinlock_t *pmd_ptl;
+	int i;
+
+	pte_pgtables = pte_alloc_order(mm, addr,
+		HPAGE_PUD_ORDER - HPAGE_PMD_ORDER);
+	if (!pte_pgtables)
+		return NULL;
+
+	pmd = pmd_alloc_one(mm, addr);
+	if (unlikely(!pmd)) {
+		pte_free_order(mm, pte_pgtables,
+			HPAGE_PUD_ORDER - HPAGE_PMD_ORDER);
+		return NULL;
+	}
+	pmd_ptl = pmd_lock(mm, pmd);
+
+	for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+		pgtable_trans_huge_deposit(mm, pmd, pte_pgtables + i);
+
+	spin_unlock(pmd_ptl);
+
+	return pmd;
+}
+
+static inline void pmd_free_page_with_ptes(struct mm_struct *mm, pmd_t *pmd)
+{
+	spinlock_t *pmd_ptl;
+	int i;
+
+	BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
+	pmd_ptl = pmd_lock(mm, pmd);
+
+	for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++) {
+		pgtable_t pte_pgtable;
+
+		pte_pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+		pte_free(mm, pte_pgtable);
+	}
+
+	spin_unlock(pmd_ptl);
+	pmd_free(mm, pmd);
+}
 extern void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd);
 
 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd,
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 40616e805292..ae3ac49c32ad 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1165,6 +1165,8 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm, unsigned long
 	return native_pmdp_get_and_clear(pmdp);
 }
 
+#define mk_pud(page, pgprot)   pfn_pud(page_to_pfn(page), (pgprot))
+
 #define __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR
 static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 					unsigned long addr, pud_t *pudp)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7bd01709a091..0a5008690d7c 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -42,6 +42,31 @@ pgtable_t pte_alloc_one(struct mm_struct *mm)
 	return pte;
 }
 
+pgtable_t pte_alloc_order(struct mm_struct *mm, unsigned long address, int order)
+{
+	struct page *pte;
+	int i;
+
+	pte = alloc_pages(__userpte_alloc_gfp, order);
+	if (!pte)
+		return NULL;
+	split_page(pte, order);
+	for (i = 1; i < (1 << order); i++)
+		set_page_private(pte + i, 0);
+
+	for (i = 0; i < (1<<order); i++) {
+		if (!pgtable_page_ctor(&pte[i])) {
+			__free_page(&pte[i]);
+			while (--i >= 0) {
+				pgtable_page_dtor(&pte[i]);
+				__free_page(&pte[i]);
+			}
+			return NULL;
+		}
+	}
+	return pte;
+}
+
 static int __init setup_userpte(char *arg)
 {
 	if (!arg)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 86d6cd92ce3d..f21d2235bf97 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -150,7 +150,9 @@ static ssize_t node_read_meminfo(struct device *dev,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		       ,
 		       nid, K(node_page_state(pgdat, NR_ANON_THPS) *
-				       HPAGE_PMD_NR),
+				       HPAGE_PMD_NR) +
+				    K(node_page_state(pgdat, NR_ANON_THPS_PUD) *
+				       HPAGE_PUD_NR),
 		       nid, K(node_page_state(pgdat, NR_SHMEM_THPS) *
 				       HPAGE_PMD_NR),
 		       nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) *
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 568d90e17c17..9d127e440e4c 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -131,7 +131,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	show_val_kb(m, "AnonHugePages:  ",
-		    global_node_page_state(NR_ANON_THPS) * HPAGE_PMD_NR);
+		    global_node_page_state(NR_ANON_THPS) * HPAGE_PMD_NR +
+			global_node_page_state(NR_ANON_THPS_PUD) * HPAGE_PUD_NR);
 	show_val_kb(m, "ShmemHugePages: ",
 		    global_node_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR);
 	show_val_kb(m, "ShmemPmdMapped: ",
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 05e61e6c843f..0f626d6177c3 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -303,10 +303,13 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				       pgtable_t pgtable);
+extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
+				       pgtable_t pgtable);
 #endif
 
 #ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
 extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
+extern pgtable_t pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
 #endif
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 381e872bfde0..c6272e6ffc35 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -18,10 +18,15 @@ extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 extern void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
+extern int do_huge_pud_anonymous_page(struct vm_fault *vmf);
 #else
 static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 {
 }
+extern int do_huge_pud_anonymous_page(struct vm_fault *vmf)
+{
+	return VM_FAULT_FALLBACK;
+}
 #endif
 
 extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
@@ -80,6 +85,9 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
 
+#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define HPAGE_PMD_SHIFT PMD_SHIFT
 #define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
@@ -214,7 +222,7 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
 static inline int hpage_nr_pages(struct page *page)
 {
 	if (unlikely(PageTransHuge(page)))
-		return HPAGE_PMD_NR;
+		return (1<<page[1].compound_order);
 	return 1;
 }
 
@@ -226,10 +234,12 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 extern vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
 
 extern struct page *huge_zero_page;
+extern struct page *huge_pud_zero_page;
 
 static inline bool is_huge_zero_page(struct page *page)
 {
-	return READ_ONCE(huge_zero_page) == page;
+	return (READ_ONCE(huge_zero_page) == page) ||
+			(READ_ONCE(huge_pud_zero_page) == page);
 }
 
 static inline bool is_huge_zero_pmd(pmd_t pmd)
@@ -239,13 +249,14 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
 
 static inline bool is_huge_zero_pud(pud_t pud)
 {
-	return false;
+	return is_huge_zero_page(pud_page(pud));
 }
 
 struct page *mm_get_huge_zero_page(struct mm_struct *mm);
 void mm_put_huge_zero_page(struct mm_struct *mm);
 
 #define mk_huge_pmd(page, prot) pmd_mkhuge(mk_pmd(page, prot))
+#define mk_huge_pud(page, prot) pud_mkhuge(mk_pud(page, prot))
 
 static inline bool thp_migration_supported(void)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5bcc1b03372a..d10dc9db2311 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -26,6 +26,7 @@
 #include <linux/page_ref.h>
 #include <linux/memremap.h>
 #include <linux/overflow.h>
+#include <linux/pagechain.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -1985,6 +1986,7 @@ static inline void pgtable_init(void)
 {
 	ptlock_cache_init();
 	pgtable_cache_init();
+	pagechain_cache_init();
 }
 
 static inline bool pgtable_page_ctor(struct page *page)
@@ -2101,6 +2103,8 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
 	return ptl;
 }
 
+#define pud_huge_pte(mm, pud) ((mm)->pud_huge_pte)
+
 extern void __init pagecache_init(void);
 extern void free_area_init(unsigned long * zones_size);
 extern void __init free_area_init_node(int nid, unsigned long * zones_size,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 32549b255d25..a5ac5946a375 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -466,6 +466,7 @@ struct mm_struct {
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 		pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
+		struct list_head pud_huge_pte; /* protected by page_table_lock */
 #ifdef CONFIG_NUMA_BALANCING
 		/*
 		 * numa_next_scan is the next time that the PTEs will be marked
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 842f9189537b..ea84d6a1802d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -177,6 +177,7 @@ enum node_stat_item {
 	NR_SHMEM_THPS,
 	NR_SHMEM_PMDMAPPED,
 	NR_ANON_THPS,
+	NR_ANON_THPS_PUD,
 	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 52ad71db6687..4893849d11eb 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -73,6 +73,7 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_OOM_VICTIM		25	/* mm is the oom victim */
 #define MMF_OOM_REAP_QUEUED	26	/* mm was queued for oom_reaper */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
+#define MMF_HUGE_PUD_ZERO_PAGE	26	/* mm has ever used the global huge pud zero page */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
 				 MMF_DISABLE_THP_MASK)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 6b32c8243616..4550667b2274 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -82,6 +82,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_DEFERRED_SPLIT_PAGE,
 		THP_SPLIT_PMD,
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+		THP_FAULT_ALLOC_PUD,
+		THP_FAULT_FALLBACK_PUD,
 		THP_SPLIT_PUD,
 #endif
 		THP_ZERO_PAGE_ALLOC,
diff --git a/kernel/fork.c b/kernel/fork.c
index dcefa978c232..fc5a925e0496 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -662,6 +662,10 @@ static void check_mm(struct mm_struct *mm)
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	VM_BUG_ON_MM(mm->pmd_huge_pte, mm);
 #endif
+	VM_BUG_ON_MM(!list_empty(&mm->pud_huge_pte) &&
+				 !pagechain_empty(list_first_entry(&mm->pud_huge_pte,
+					struct pagechain, list)),
+				mm);
 }
 
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
@@ -1003,6 +1007,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	mm->pmd_huge_pte = NULL;
 #endif
+	INIT_LIST_HEAD(&mm->pud_huge_pte);
 	mm_init_uprobes_state(mm);
 
 	if (current->mm) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ffcae07a87d3..cad4ef01f607 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -62,6 +62,8 @@ static struct shrinker deferred_split_shrinker;
 
 static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
+static atomic_t huge_pud_zero_refcount;
+struct page *huge_pud_zero_page __read_mostly;
 
 bool transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
@@ -109,6 +111,42 @@ static void put_huge_zero_page(void)
 	BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
 }
 
+static struct page *get_huge_pud_zero_page(void)
+{
+	struct page *zero_page;
+retry:
+	if (likely(atomic_inc_not_zero(&huge_pud_zero_refcount)))
+		return READ_ONCE(huge_pud_zero_page);
+
+	zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
+			HPAGE_PUD_ORDER);
+	if (!zero_page) {
+		count_vm_event(THP_ZERO_PAGE_ALLOC_FAILED);
+		return NULL;
+	}
+	count_vm_event(THP_ZERO_PAGE_ALLOC);
+	preempt_disable();
+	if (cmpxchg(&huge_pud_zero_page, NULL, zero_page)) {
+		preempt_enable();
+		__free_pages(zero_page, compound_order(zero_page));
+		goto retry;
+	}
+
+	/* We take additional reference here. It will be put back by shrinker */
+	atomic_set(&huge_pud_zero_refcount, 2);
+	preempt_enable();
+	return READ_ONCE(huge_pud_zero_page);
+}
+
+static void put_huge_pud_zero_page(void)
+{
+	/*
+	 * Counter should never go to zero here. Only shrinker can put
+	 * last reference.
+	 */
+	BUG_ON(atomic_dec_and_test(&huge_pud_zero_refcount));
+}
+
 struct page *mm_get_huge_zero_page(struct mm_struct *mm)
 {
 	if (test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
@@ -123,9 +161,23 @@ struct page *mm_get_huge_zero_page(struct mm_struct *mm)
 	return READ_ONCE(huge_zero_page);
 }
 
+struct page *mm_get_huge_pud_zero_page(struct mm_struct *mm)
+{
+	if (test_bit(MMF_HUGE_PUD_ZERO_PAGE, &mm->flags))
+		return READ_ONCE(huge_pud_zero_page);
+
+	if (!get_huge_pud_zero_page())
+		return NULL;
+
+	if (test_and_set_bit(MMF_HUGE_PUD_ZERO_PAGE, &mm->flags))
+		put_huge_pud_zero_page();
+
+	return READ_ONCE(huge_pud_zero_page);
+}
+
 void mm_put_huge_zero_page(struct mm_struct *mm)
 {
-	if (test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
+	if (test_bit(MMF_HUGE_PUD_ZERO_PAGE, &mm->flags))
 		put_huge_zero_page();
 }
 
@@ -859,6 +911,175 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
+
+static int __do_huge_pud_anonymous_page(struct vm_fault *vmf, struct page *page,
+		gfp_t gfp)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct mem_cgroup *memcg;
+	pmd_t *pmd_pgtable;
+	unsigned long haddr = vmf->address & HPAGE_PUD_MASK;
+	int ret = 0;
+
+	VM_BUG_ON_PAGE(!PageCompound(page), page);
+
+	if (mem_cgroup_try_charge(page, vma->vm_mm, gfp, &memcg, true)) {
+		put_page(page);
+		count_vm_event(THP_FAULT_FALLBACK_PUD);
+		return VM_FAULT_FALLBACK;
+	}
+
+	pmd_pgtable = pmd_alloc_one_page_with_ptes(vma->vm_mm, haddr);
+	if (unlikely(!pmd_pgtable)) {
+		ret = VM_FAULT_OOM;
+		goto release;
+	}
+
+	clear_huge_page(page, vmf->address, HPAGE_PUD_NR);
+	/*
+	 * The memory barrier inside __SetPageUptodate makes sure that
+	 * clear_huge_page writes become visible before the set_pmd_at()
+	 * write.
+	 */
+	__SetPageUptodate(page);
+
+	vmf->ptl = pud_lock(vma->vm_mm, vmf->pud);
+	if (unlikely(!pud_none(*vmf->pud))) {
+		goto unlock_release;
+	} else {
+		pud_t entry;
+		int i;
+
+		ret = check_stable_address_space(vma->vm_mm);
+		if (ret)
+			goto unlock_release;
+
+		/* Deliver the page fault to userland */
+		if (userfaultfd_missing(vma)) {
+			int ret;
+
+			spin_unlock(vmf->ptl);
+			mem_cgroup_cancel_charge(page, memcg, true);
+			put_page(page);
+			pmd_free_page_with_ptes(vma->vm_mm, pmd_pgtable);
+			ret = handle_userfault(vmf, VM_UFFD_MISSING);
+			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			return ret;
+		}
+
+		entry = mk_huge_pud(page, vma->vm_page_prot);
+		entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
+		page_add_new_anon_rmap(page, vma, haddr, true);
+		mem_cgroup_commit_charge(page, memcg, false, true);
+		lru_cache_add_active_or_unevictable(page, vma);
+		pgtable_trans_huge_pud_deposit(vma->vm_mm, vmf->pud,
+				virt_to_page(pmd_pgtable));
+		set_pud_at(vma->vm_mm, haddr, vmf->pud, entry);
+		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PUD_NR);
+		mm_inc_nr_pmds(vma->vm_mm);
+		for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+			mm_inc_nr_ptes(vma->vm_mm);
+		spin_unlock(vmf->ptl);
+		count_vm_event(THP_FAULT_ALLOC_PUD);
+	}
+
+	return 0;
+unlock_release:
+	spin_unlock(vmf->ptl);
+release:
+	if (pmd_pgtable)
+		pmd_free_page_with_ptes(vma->vm_mm, pmd_pgtable);
+	mem_cgroup_cancel_charge(page, memcg, true);
+	put_page(page);
+	return ret;
+
+}
+
+/* Caller must hold page table lock. */
+static bool set_huge_pud_zero_page(pgtable_t pmd_pgtable,
+		struct mm_struct *mm,
+		struct vm_area_struct *vma, unsigned long haddr, pud_t *pud,
+		struct page *zero_page)
+{
+	pud_t entry;
+	int i;
+
+	if (!pud_none(*pud))
+		return false;
+	entry = mk_pud(zero_page, vma->vm_page_prot);
+	entry = pud_mkhuge(entry);
+	if (pmd_pgtable)
+		pgtable_trans_huge_pud_deposit(mm, pud, pmd_pgtable);
+	set_pud_at(mm, haddr, pud, entry);
+	mm_inc_nr_pmds(mm);
+	for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+		mm_inc_nr_ptes(mm);
+	return true;
+}
+
+int do_huge_pud_anonymous_page(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	gfp_t gfp;
+	struct page *page;
+	unsigned long haddr = vmf->address & HPAGE_PUD_MASK;
+
+	if (haddr < vma->vm_start || haddr + HPAGE_PUD_SIZE > vma->vm_end)
+		return VM_FAULT_FALLBACK;
+	if (unlikely(anon_vma_prepare(vma)))
+		return VM_FAULT_OOM;
+	if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
+		return VM_FAULT_OOM;
+	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
+			!mm_forbids_zeropage(vma->vm_mm) &&
+			transparent_hugepage_use_zero_page()) {
+		pmd_t *pmd_pgtable;
+		struct page *zero_page;
+		bool set;
+		int ret;
+
+		pmd_pgtable = pmd_alloc_one_page_with_ptes(vma->vm_mm, haddr);
+		if (unlikely(!pmd_pgtable))
+			return VM_FAULT_OOM;
+		zero_page = mm_get_huge_pud_zero_page(vma->vm_mm);
+		if (unlikely(!zero_page)) {
+			pmd_free_page_with_ptes(vma->vm_mm, pmd_pgtable);
+			count_vm_event(THP_FAULT_FALLBACK_PUD);
+			return VM_FAULT_FALLBACK;
+		}
+		vmf->ptl = pud_lock(vma->vm_mm, vmf->pud);
+		ret = 0;
+		set = false;
+		if (pud_none(*vmf->pud)) {
+			ret = check_stable_address_space(vma->vm_mm);
+			if (ret) {
+				spin_unlock(vmf->ptl);
+			} else if (userfaultfd_missing(vma)) {
+				spin_unlock(vmf->ptl);
+				ret = handle_userfault(vmf, VM_UFFD_MISSING);
+				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+			} else {
+				set_huge_pud_zero_page(virt_to_page(pmd_pgtable),
+					vma->vm_mm, vma, haddr, vmf->pud, zero_page);
+				spin_unlock(vmf->ptl);
+				set = true;
+			}
+		} else
+			spin_unlock(vmf->ptl);
+		if (!set)
+			pmd_free_page_with_ptes(vma->vm_mm, pmd_pgtable);
+		return ret;
+	}
+	gfp = alloc_hugepage_direct_gfpmask(vma);
+	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PUD_ORDER);
+	if (unlikely(!page)) {
+		count_vm_event(THP_FAULT_FALLBACK_PUD);
+		return VM_FAULT_FALLBACK;
+	}
+	prep_transhuge_page(page);
+	return __do_huge_pud_anonymous_page(vmf, page, gfp);
+}
+
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
@@ -1980,12 +2201,27 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static inline void zap_pud_deposited_table(struct mm_struct *mm, pud_t *pud)
+{
+	pgtable_t pgtable;
+	int i;
+
+	pgtable = pgtable_trans_huge_pud_withdraw(mm, pud);
+	pmd_free_page_with_ptes(mm, (pmd_t *)page_address(pgtable));
+
+	mm_dec_nr_pmds(mm);
+	for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+		mm_dec_nr_ptes(mm);
+}
+
 int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pud_t *pud, unsigned long addr)
 {
 	pud_t orig_pud;
 	spinlock_t *ptl;
 
+	tlb_remove_check_page_size_change(tlb, HPAGE_PUD_SIZE);
+
 	ptl = __pud_trans_huge_lock(pud, vma);
 	if (!ptl)
 		return 0;
@@ -2001,9 +2237,34 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	if (vma_is_dax(vma)) {
 		spin_unlock(ptl);
 		/* No zero page support yet */
+	} else if (is_huge_zero_pud(orig_pud)) {
+		zap_pud_deposited_table(tlb->mm, pud);
+		spin_unlock(ptl);
+		tlb_remove_page_size(tlb, pud_page(orig_pud), HPAGE_PUD_SIZE);
 	} else {
-		/* No support for anonymous PUD pages yet */
-		BUG();
+		struct page *page = NULL;
+		int flush_needed = 1;
+
+		if (pud_present(orig_pud)) {
+			page = pud_page(orig_pud);
+			page_remove_rmap(page, true);
+			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+			VM_BUG_ON_PAGE(!PageHead(page), page);
+		} else
+			WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+
+		if (PageAnon(page)) {
+			zap_pud_deposited_table(tlb->mm, pud);
+			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PUD_NR);
+		} else {
+			if (arch_needs_pgtable_deposit())
+				zap_pud_deposited_table(tlb->mm, pud);
+			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PUD_NR);
+		}
+
+		spin_unlock(ptl);
+		if (flush_needed)
+			tlb_remove_page_size(tlb, page, HPAGE_PUD_SIZE);
 	}
 	return 1;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 019036e87088..177478d5ee47 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3710,7 +3710,7 @@ static vm_fault_t create_huge_pud(struct vm_fault *vmf)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	/* No support for anonymous transparent PUD pages yet */
 	if (vma_is_anonymous(vmf->vma))
-		return VM_FAULT_FALLBACK;
+		return do_huge_pud_anonymous_page(vmf);
 	if (vmf->vma->vm_ops->huge_fault)
 		return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -4593,3 +4593,29 @@ void ptlock_free(struct page *page)
 	kmem_cache_free(page_ptl_cachep, page->ptl);
 }
 #endif
+
+static struct kmem_cache *pagechain_cachep;
+
+void __init pagechain_cache_init(void)
+{
+	pagechain_cachep = kmem_cache_create("pagechain",
+		sizeof(struct pagechain), 0, SLAB_PANIC, NULL);
+}
+
+struct pagechain *pagechain_alloc(void)
+{
+	struct pagechain *chain;
+
+	chain = kmem_cache_alloc(pagechain_cachep, GFP_ATOMIC);
+
+	if (!chain)
+		return NULL;
+
+	pagechain_init(chain);
+	return chain;
+}
+
+void pagechain_free(struct pagechain *pchain)
+{
+	kmem_cache_free(pagechain_cachep, pchain);
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cfa99bb54bd6..a3b295ea7348 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5157,7 +5157,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(node_page_state(pgdat, NR_SHMEM_THPS) * HPAGE_PMD_NR),
 			K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)
 					* HPAGE_PMD_NR),
-			K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR),
+			K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR +
+			  node_page_state(pgdat, NR_ANON_THPS_PUD) * HPAGE_PUD_NR),
 #endif
 			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 532c29276fce..0b79568fba1c 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -9,6 +9,7 @@
 
 #include <linux/pagemap.h>
 #include <linux/hugetlb.h>
+#include <linux/pagechain.h>
 #include <asm/tlb.h>
 #include <asm-generic/pgtable.h>
 
@@ -44,7 +45,7 @@ void pmd_clear_bad(pmd_t *pmd)
 
 #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 /*
- * Only sets the access flags (dirty, accessed), as well as write 
+ * Only sets the access flags (dirty, accessed), as well as write
  * permission. Furthermore, we know it always gets set to a "more
  * permissive" setting, which allows most architectures to optimize
  * this. We return whether the PTE actually changed, which in turn
@@ -161,6 +162,23 @@ void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 		list_add(&pgtable->lru, &pmd_huge_pte(mm, pmdp)->lru);
 	pmd_huge_pte(mm, pmdp) = pgtable;
 }
+
+void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
+				pgtable_t pgtable)
+{
+	struct pagechain *chain = NULL;
+
+	assert_spin_locked(pud_lockptr(mm, pudp));
+	/* FIFO */
+	chain = list_first_entry_or_null(&pud_huge_pte(mm, pudp),
+			struct pagechain, list);
+
+	if (!chain || !pagechain_space(chain)) {
+		chain = pagechain_alloc();
+		list_add(&chain->list, &pud_huge_pte(mm, pudp));
+	}
+	pagechain_deposit(chain, pgtable);
+}
 #endif
 
 #ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
@@ -179,6 +197,33 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 		list_del(&pgtable->lru);
 	return pgtable;
 }
+
+pgtable_t pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
+{
+	pgtable_t pgtable;
+	struct pagechain *chain = NULL;
+
+	assert_spin_locked(pud_lockptr(mm, pudp));
+
+	/* FIFO */
+retry:
+	chain = list_first_entry_or_null(&pud_huge_pte(mm, pudp),
+			struct pagechain, list);
+
+	if (!chain)
+		return NULL;
+
+	if (pagechain_empty(chain)) {
+		if (list_is_singular(&chain->list))
+			return NULL;
+		list_del(&chain->list);
+		pagechain_free(chain);
+		goto retry;
+	}
+
+	pgtable = pagechain_withdraw(chain);
+	return pgtable;
+}
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
diff --git a/mm/rmap.c b/mm/rmap.c
index 0454ecc29537..dae66a4329ea 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -712,6 +712,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
+	pud_t pude;
 	pmd_t *pmd = NULL;
 	pmd_t pmde;
 
@@ -724,7 +725,10 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 		goto out;
 
 	pud = pud_offset(p4d, address);
-	if (!pud_present(*pud))
+
+	pude = *pud;
+	barrier();
+	if (!pud_present(pude) || pud_trans_huge(pude))
 		goto out;
 
 	pmd = pmd_offset(pud, address);
@@ -1121,8 +1125,12 @@ void do_page_add_anon_rmap(struct page *page,
 		 * pte lock(a spinlock) is held, which implies preemption
 		 * disabled.
 		 */
-		if (compound)
-			__inc_node_page_state(page, NR_ANON_THPS);
+		if (compound) {
+			if (nr == HPAGE_PMD_NR)
+				__inc_node_page_state(page, NR_ANON_THPS);
+			else
+				__inc_node_page_state(page, NR_ANON_THPS_PUD);
+		}
 		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
 	}
 	if (unlikely(PageKsm(page)))
@@ -1160,7 +1168,10 @@ void page_add_new_anon_rmap(struct page *page,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		/* increment count (starts at -1) */
 		atomic_set(compound_mapcount_ptr(page), 0);
-		__inc_node_page_state(page, NR_ANON_THPS);
+		if (nr == HPAGE_PMD_NR)
+			__inc_node_page_state(page, NR_ANON_THPS);
+		else
+			__inc_node_page_state(page, NR_ANON_THPS_PUD);
 	} else {
 		/* Anon THP always mapped first with PMD */
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1265,19 +1276,22 @@ static void page_remove_anon_compound_rmap(struct page *page)
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		return;
 
-	__dec_node_page_state(page, NR_ANON_THPS);
+	if (hpage_nr_pages(page) == HPAGE_PMD_NR)
+		__dec_node_page_state(page, NR_ANON_THPS);
+	else
+		__dec_node_page_state(page, NR_ANON_THPS_PUD);
 
 	if (TestClearPageDoubleMap(page)) {
 		/*
 		 * Subpages can be mapped with PTEs too. Check how many of
 		 * themi are still mapped.
 		 */
-		for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+		for (i = 0, nr = 0; i < hpage_nr_pages(page); i++) {
 			if (atomic_add_negative(-1, &page[i]._mapcount))
 				nr++;
 		}
 	} else {
-		nr = HPAGE_PMD_NR;
+		nr = hpage_nr_pages(page);
 	}
 
 	if (unlikely(PageMlocked(page)))
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c18a42250a5c..25a88693e417 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1158,6 +1158,7 @@ const char * const vmstat_text[] = {
 	"nr_shmem_hugepages",
 	"nr_shmem_pmdmapped",
 	"nr_anon_transparent_hugepages",
+	"nr_anon_transparent_pud_hugepages",
 	"nr_unstable",
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",
@@ -1259,6 +1260,8 @@ const char * const vmstat_text[] = {
 	"thp_deferred_split_page",
 	"thp_split_pmd",
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	"thp_fault_alloc_pud",
+	"thp_fault_fallback_pud",
 	"thp_split_pud",
 #endif
 	"thp_zero_page_alloc",
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 10/31] mm: proc: add 1GB THP kpageflag.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (8 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 09/31] mm: thp: 1GB anonymous page implementation Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 11/31] mm: debug: print compound page order in dump_page() Zi Yan
                   ` (21 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Bit 27 is used to identify 1GB THP.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 fs/proc/page.c                         | 2 ++
 include/uapi/linux/kernel-page-flags.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/fs/proc/page.c b/fs/proc/page.c
index 40b05e0d4274..5d1471a6082a 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -138,6 +138,8 @@ u64 stable_page_flags(struct page *page)
 			u |= 1 << KPF_ZERO_PAGE;
 			u |= 1 << KPF_THP;
 		}
+		if (compound_order(head) == HPAGE_PUD_ORDER)
+			u |= 1 << KPF_PUD_THP;
 	} else if (is_zero_pfn(page_to_pfn(page)))
 		u |= 1 << KPF_ZERO_PAGE;
 
diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index 21b9113c69da..743bd730917d 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -36,5 +36,7 @@
 #define KPF_ZERO_PAGE		24
 #define KPF_IDLE		25
 #define KPF_PGTABLE		26
+#define KPF_PUD_THP		27
+
 
 #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 11/31] mm: debug: print compound page order in dump_page().
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (9 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 10/31] mm: proc: add 1GB THP kpageflag Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 12/31] mm: stats: Separate PMD THP and PUD THP stats Zi Yan
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Since we have more than just PMD-level THPs, printing compound page
order is helpful to check the actual compound page sizes.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/debug.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/debug.c b/mm/debug.c
index 0abb987dad9b..21d211d7776c 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -68,8 +68,12 @@ void __dump_page(struct page *page, const char *reason)
 	pr_warn("page:%px count:%d mapcount:%d mapping:%px index:%#lx",
 		  page, page_ref_count(page), mapcount,
 		  page->mapping, page_to_pgoff(page));
-	if (PageCompound(page))
-		pr_cont(" compound_mapcount: %d", compound_mapcount(page));
+	if (PageCompound(page)) {
+		struct page *head = compound_head(page);
+
+		pr_cont(" compound_mapcount: %d, order: %d", compound_mapcount(page),
+				compound_order(head));
+	}
 	pr_cont("\n");
 	if (PageAnon(page))
 		pr_warn("anon ");
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 12/31] mm: stats: Separate PMD THP and PUD THP stats.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (10 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 11/31] mm: debug: print compound page order in dump_page() Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 13/31] mm: thp: 1GB THP copy on write implementation Zi Yan
                   ` (19 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

PMD THPs and PUD THPs are shown in separate stats.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 drivers/base/node.c | 5 +++--
 fs/proc/meminfo.c   | 3 ++-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index f21d2235bf97..5d947a17b61b 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -127,6 +127,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d SUnreclaim:     %8lu kB\n"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		       "Node %d AnonHugePages:  %8lu kB\n"
+		       "Node %d AnonHugePages(1GB):  %8lu kB\n"
 		       "Node %d ShmemHugePages: %8lu kB\n"
 		       "Node %d ShmemPmdMapped: %8lu kB\n"
 #endif
@@ -150,8 +151,8 @@ static ssize_t node_read_meminfo(struct device *dev,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		       ,
 		       nid, K(node_page_state(pgdat, NR_ANON_THPS) *
-				       HPAGE_PMD_NR) +
-				    K(node_page_state(pgdat, NR_ANON_THPS_PUD) *
+				       HPAGE_PMD_NR),
+			   nid, K(node_page_state(pgdat, NR_ANON_THPS_PUD) *
 				       HPAGE_PUD_NR),
 		       nid, K(node_page_state(pgdat, NR_SHMEM_THPS) *
 				       HPAGE_PMD_NR),
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 9d127e440e4c..44a4d2dbd1d4 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -131,7 +131,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	show_val_kb(m, "AnonHugePages:  ",
-		    global_node_page_state(NR_ANON_THPS) * HPAGE_PMD_NR +
+		    global_node_page_state(NR_ANON_THPS) * HPAGE_PMD_NR);
+	show_val_kb(m, "AnonHugePages(1GB):  ",
 			global_node_page_state(NR_ANON_THPS_PUD) * HPAGE_PUD_NR);
 	show_val_kb(m, "ShmemHugePages: ",
 		    global_node_page_state(NR_SHMEM_THPS) * HPAGE_PMD_NR);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 13/31] mm: thp: 1GB THP copy on write implementation.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (11 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 12/31] mm: stats: Separate PMD THP and PUD THP stats Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 14/31] mm: thp: handling 1GB THP reference bit Zi Yan
                   ` (18 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

COW on 1GB THPs will fall back to 2MB THPs if 1GB THP is not available.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/include/asm/pgalloc.h |   9 +
 include/linux/huge_mm.h        |   5 +
 mm/huge_memory.c               | 319 ++++++++++++++++++++++++++++++++-
 mm/memory.c                    |   2 +-
 4 files changed, 331 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index 6e29ad9b9d7f..ebcb022f6bb9 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -110,6 +110,15 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 
 #define pmd_pgtable(pmd) pmd_page(pmd)
 
+static inline void pud_populate_with_pgtable(struct mm_struct *mm, pud_t *pud,
+				struct page *pte)
+{
+	unsigned long pfn = page_to_pfn(pte);
+
+	paravirt_alloc_pmd(mm, pfn);
+	set_pud(pud, __pud(((pteval_t)pfn << PAGE_SHIFT) | _PAGE_TABLE));
+}
+
 #if CONFIG_PGTABLE_LEVELS > 2
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c6272e6ffc35..02419fa91e12 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -19,6 +19,7 @@ extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 extern void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
 extern int do_huge_pud_anonymous_page(struct vm_fault *vmf);
+extern int do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud);
 #else
 static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 {
@@ -27,6 +28,10 @@ extern int do_huge_pud_anonymous_page(struct vm_fault *vmf)
 {
 	return VM_FAULT_FALLBACK;
 }
+extern int do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
+{
+	return VM_FAULT_FALLBACK;
+}
 #endif
 
 extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cad4ef01f607..0a006592f3fe 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1284,7 +1284,12 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 {
 	spinlock_t *dst_ptl, *src_ptl;
 	pud_t pud;
-	int ret;
+	pmd_t *pmd_pgtable = NULL;
+	int ret = -ENOMEM;
+
+	pmd_pgtable = pmd_alloc_one_page_with_ptes(vma->vm_mm, addr);
+	if (unlikely(!pmd_pgtable))
+		goto out;
 
 	dst_ptl = pud_lock(dst_mm, dst_pud);
 	src_ptl = pud_lockptr(src_mm, src_pud);
@@ -1292,8 +1297,13 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	ret = -EAGAIN;
 	pud = *src_pud;
-	if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud)))
+	if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud))) {
+		pmd_free_page_with_ptes(dst_mm, pmd_pgtable);
 		goto out_unlock;
+	}
+
+	if (pud_devmap(pud))
+		pmd_free_page_with_ptes(dst_mm, pmd_pgtable);
 
 	/*
 	 * When page table lock is held, the huge zero pud should not be
@@ -1301,7 +1311,32 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * a page table.
 	 */
 	if (is_huge_zero_pud(pud)) {
-		/* No huge zero pud yet */
+		struct page *zero_page;
+		/*
+		 * get_huge_zero_page() will never allocate a new page here,
+		 * since we already have a zero page to copy. It just takes a
+		 * reference.
+		 */
+		zero_page = mm_get_huge_pud_zero_page(dst_mm);
+		set_huge_pud_zero_page(virt_to_page(pmd_pgtable),
+			dst_mm, vma, addr, dst_pud, zero_page);
+		ret = 0;
+		goto out_unlock;
+	}
+
+	if (pud_trans_huge(pud)) {
+		struct page *src_page;
+		int i;
+
+		src_page = pud_page(pud);
+		VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+		get_page(src_page);
+		page_dup_rmap(src_page, true);
+		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PUD_NR);
+		mm_inc_nr_pmds(dst_mm);
+		for (i = 0; i < (1<<(HPAGE_PUD_ORDER - HPAGE_PMD_ORDER)); i++)
+			mm_inc_nr_ptes(dst_mm);
+		pgtable_trans_huge_pud_deposit(dst_mm, dst_pud, virt_to_page(pmd_pgtable));
 	}
 
 	pudp_set_wrprotect(src_mm, addr, src_pud);
@@ -1312,6 +1347,7 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 out_unlock:
 	spin_unlock(src_ptl);
 	spin_unlock(dst_ptl);
+out:
 	return ret;
 }
 
@@ -1335,6 +1371,283 @@ void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 unlock:
 	spin_unlock(vmf->ptl);
 }
+
+static int do_huge_pud_wp_page_fallback(struct vm_fault *vmf, pud_t orig_pud,
+		struct page *page)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long haddr = vmf->address & HPAGE_PUD_MASK;
+	struct mem_cgroup *memcg;
+	pgtable_t pgtable, pmd_pgtable;
+	pud_t _pud;
+	int ret = 0, i, j;
+	struct page **pages;
+	struct mmu_notifier_range range;
+
+	pages = kmalloc(sizeof(struct page *) * HPAGE_PUD_NR,
+			GFP_KERNEL);
+	if (unlikely(!pages)) {
+		ret |= VM_FAULT_OOM;
+		goto out;
+	}
+
+	pmd_pgtable = pte_alloc_order(vma->vm_mm, haddr,
+		HPAGE_PUD_ORDER - HPAGE_PMD_ORDER);
+	if (!pmd_pgtable) {
+		ret |= VM_FAULT_OOM;
+		goto out_kfree_pages;
+	}
+
+	for (i = 0; i < (1<<(HPAGE_PUD_ORDER-HPAGE_PMD_ORDER)); i++) {
+		pages[i] = alloc_page_vma_node(GFP_TRANSHUGE, vma,
+					       vmf->address, page_to_nid(page));
+		if (unlikely(!pages[i] ||
+			     mem_cgroup_try_charge(pages[i], vma->vm_mm,
+				     GFP_KERNEL, &memcg, true))) {
+			if (pages[i])
+				put_page(pages[i]);
+			while (--i >= 0) {
+				memcg = (void *)page_private(pages[i]);
+				set_page_private(pages[i], 0);
+				mem_cgroup_cancel_charge(pages[i], memcg,
+						true);
+				put_page(pages[i]);
+			}
+			kfree(pages);
+			pte_free_order(vma->vm_mm, pmd_pgtable,
+				HPAGE_PMD_ORDER - HPAGE_PMD_ORDER);
+			ret |= VM_FAULT_OOM;
+			goto out;
+		}
+		count_vm_event(THP_FAULT_ALLOC);
+		set_page_private(pages[i], (unsigned long)memcg);
+		prep_transhuge_page(pages[i]);
+	}
+
+	for (i = 0; i < (1<<(HPAGE_PUD_ORDER-HPAGE_PMD_ORDER)); i++) {
+		for (j = 0; j < HPAGE_PMD_NR; j++) {
+			copy_user_highpage(pages[i] + j, page + i * HPAGE_PMD_NR + j,
+					   haddr + PAGE_SIZE * (i * HPAGE_PMD_NR + j), vma);
+			cond_resched();
+		}
+		__SetPageUptodate(pages[i]);
+	}
+
+	mmu_notifier_range_init(&range, vma->vm_mm, haddr,
+				haddr + HPAGE_PUD_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+
+	vmf->ptl = pud_lock(vma->vm_mm, vmf->pud);
+	if (unlikely(!pud_same(*vmf->pud, orig_pud)))
+		goto out_free_pages;
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+
+	/*
+	 * Leave pmd empty until pte is filled note we must notify here as
+	 * concurrent CPU thread might write to new page before the call to
+	 * mmu_notifier_invalidate_range_end() happens which can lead to a
+	 * device seeing memory write in different order than CPU.
+	 *
+	 * See Documentation/vm/mmu_notifier.txt
+	 */
+	pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
+
+	pgtable = pgtable_trans_huge_pud_withdraw(vma->vm_mm, vmf->pud);
+	pud_populate_with_pgtable(vma->vm_mm, &_pud, pgtable);
+
+	for (i = 0; i < (1<<(HPAGE_PUD_ORDER-HPAGE_PMD_ORDER));
+		 i++, haddr += (PAGE_SIZE * HPAGE_PMD_NR)) {
+		pmd_t entry;
+
+		entry = mk_huge_pmd(pages[i], vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		memcg = (void *)page_private(pages[i]);
+		set_page_private(pages[i], 0);
+		page_add_new_anon_rmap(pages[i], vmf->vma, haddr, true);
+		mem_cgroup_commit_charge(pages[i], memcg, false, true);
+		lru_cache_add_active_or_unevictable(pages[i], vma);
+		vmf->pmd = pmd_offset(&_pud, haddr);
+		VM_BUG_ON(!pmd_none(*vmf->pmd));
+		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, &pmd_pgtable[i]);
+		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
+	}
+	kfree(pages);
+
+	smp_wmb(); /* make pte visible before pmd */
+	pud_populate_with_pgtable(vma->vm_mm, vmf->pud, pgtable);
+	page_remove_rmap(page, true);
+	spin_unlock(vmf->ptl);
+
+	/*
+	 * No need to double call mmu_notifier->invalidate_range() callback as
+	 * the above pmdp_huge_clear_flush_notify() did already call it.
+	 */
+	mmu_notifier_invalidate_range_only_end(&range);
+
+	ret |= VM_FAULT_WRITE;
+	put_page(page);
+
+out:
+	return ret;
+
+out_free_pages:
+	spin_unlock(vmf->ptl);
+	mmu_notifier_invalidate_range_end(&range);
+	for (i = 0; i < (1<<(HPAGE_PUD_ORDER-HPAGE_PMD_ORDER)); i++) {
+		memcg = (void *)page_private(pages[i]);
+		set_page_private(pages[i], 0);
+		mem_cgroup_cancel_charge(pages[i], memcg, true);
+		put_page(pages[i]);
+	}
+out_kfree_pages:
+	kfree(pages);
+	goto out;
+}
+
+int do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct page *page = NULL, *new_page;
+	struct mem_cgroup *memcg;
+	unsigned long haddr = vmf->address & HPAGE_PUD_MASK;
+	struct mmu_notifier_range range;
+	gfp_t huge_gfp;			/* for allocation and charge */
+	int ret = 0;
+
+	vmf->ptl = pud_lockptr(vma->vm_mm, vmf->pud);
+	VM_BUG_ON_VMA(!vma->anon_vma, vma);
+	if (is_huge_zero_pud(orig_pud))
+		goto alloc;
+	spin_lock(vmf->ptl);
+	if (unlikely(!pud_same(*vmf->pud, orig_pud)))
+		goto out_unlock;
+
+	page = pud_page(orig_pud);
+	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
+	/*
+	 * We can only reuse the page if nobody else maps the huge page or it's
+	 * part.
+	 */
+	if (!trylock_page(page)) {
+		get_page(page);
+		spin_unlock(vmf->ptl);
+		lock_page(page);
+		spin_lock(vmf->ptl);
+		if (unlikely(!pud_same(*vmf->pud, orig_pud))) {
+			unlock_page(page);
+			put_page(page);
+			goto out_unlock;
+		}
+		put_page(page);
+	}
+	if (reuse_swap_page(page, NULL)) {
+		pud_t entry;
+
+		entry = pud_mkyoung(orig_pud);
+		entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
+		if (pudp_set_access_flags(vma, haddr, vmf->pud, entry,  1))
+			update_mmu_cache_pud(vma, vmf->address, vmf->pud);
+		ret |= VM_FAULT_WRITE;
+		unlock_page(page);
+		goto out_unlock;
+	}
+	unlock_page(page);
+	get_page(page);
+	spin_unlock(vmf->ptl);
+alloc:
+	if (transparent_hugepage_enabled(vma) &&
+	    !transparent_hugepage_debug_cow()) {
+		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
+		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PUD_ORDER);
+	} else
+		new_page = NULL;
+
+	if (likely(new_page)) {
+		prep_transhuge_page(new_page);
+	} else {
+		if (!page) {
+			WARN(1, "%s: split_huge_page\n", __func__);
+			split_huge_pud(vma, vmf->pud, vmf->address);
+			ret |= VM_FAULT_FALLBACK;
+		} else {
+			ret = do_huge_pud_wp_page_fallback(vmf, orig_pud, page);
+			if (ret & VM_FAULT_OOM) {
+				WARN(1, "%s: split_huge_page after wp fallback\n", __func__);
+				split_huge_pud(vma, vmf->pud, vmf->address);
+				ret |= VM_FAULT_FALLBACK;
+			}
+			put_page(page);
+		}
+		count_vm_event(THP_FAULT_FALLBACK_PUD);
+		goto out;
+	}
+
+	if (unlikely(mem_cgroup_try_charge(new_page, vma->vm_mm,
+					huge_gfp, &memcg, true))) {
+		put_page(new_page);
+		WARN(1, "%s: split_huge_page after mem cgroup failed\n", __func__);
+		split_huge_pud(vma, vmf->pud, vmf->address);
+		if (page)
+			put_page(page);
+		ret |= VM_FAULT_FALLBACK;
+		count_vm_event(THP_FAULT_FALLBACK_PUD);
+		goto out;
+	}
+
+	count_vm_event(THP_FAULT_ALLOC_PUD);
+
+	if (!page)
+		clear_huge_page(new_page, vmf->address, HPAGE_PUD_NR);
+	else
+		copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PUD_NR);
+	__SetPageUptodate(new_page);
+
+	mmu_notifier_range_init(&range, vma->vm_mm, haddr,
+				haddr + HPAGE_PUD_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+
+	spin_lock(vmf->ptl);
+	if (page)
+		put_page(page);
+	if (unlikely(!pud_same(*vmf->pud, orig_pud))) {
+		spin_unlock(vmf->ptl);
+		mem_cgroup_cancel_charge(new_page, memcg, true);
+		put_page(new_page);
+		goto out_mn;
+	} else {
+		pud_t entry;
+
+		entry = mk_huge_pud(new_page, vma->vm_page_prot);
+		entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
+		pudp_huge_clear_flush_notify(vma, haddr, vmf->pud);
+		page_add_new_anon_rmap(new_page, vma, haddr, true);
+		mem_cgroup_commit_charge(new_page, memcg, false, true);
+		lru_cache_add_active_or_unevictable(new_page, vma);
+		set_pud_at(vma->vm_mm, haddr, vmf->pud, entry);
+		update_mmu_cache_pud(vma, vmf->address, vmf->pud);
+		if (!page) {
+			add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PUD_NR);
+		} else {
+			VM_BUG_ON_PAGE(!PageHead(page), page);
+			page_remove_rmap(page, true);
+			put_page(page);
+		}
+		ret |= VM_FAULT_WRITE;
+	}
+	spin_unlock(vmf->ptl);
+out_mn:
+	/*
+	 * No need to double call mmu_notifier->invalidate_range() callback as
+	 * the above pmdp_huge_clear_flush_notify() did already call it.
+	 */
+	mmu_notifier_invalidate_range_only_end(&range);
+out:
+	return ret;
+out_unlock:
+	spin_unlock(vmf->ptl);
+	return ret;
+}
+
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd)
diff --git a/mm/memory.c b/mm/memory.c
index 177478d5ee47..3608b5436519 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3722,7 +3722,7 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	/* No support for anonymous transparent PUD pages yet */
 	if (vma_is_anonymous(vmf->vma))
-		return VM_FAULT_FALLBACK;
+		return do_huge_pud_wp_page(vmf, orig_pud);
 	if (vmf->vma->vm_ops->huge_fault)
 		return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PUD);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 14/31] mm: thp: handling 1GB THP reference bit.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (12 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 13/31] mm: thp: 1GB THP copy on write implementation Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 15/31] mm: thp: add 1GB THP split_huge_pud_page() function Zi Yan
                   ` (17 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Add PUD-level TLB flush ops and teach page_vma_mapped_talk about 1GB
THPs.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/include/asm/pgtable.h |  3 +++
 arch/x86/mm/pgtable.c          | 13 +++++++++++++
 include/asm-generic/pgtable.h  | 14 ++++++++++++++
 include/linux/mmu_notifier.h   | 13 +++++++++++++
 include/linux/rmap.h           |  1 +
 mm/page_vma_mapped.c           | 33 +++++++++++++++++++++++++++++----
 mm/rmap.c                      | 12 +++++++++---
 7 files changed, 82 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index ae3ac49c32ad..f99ce657d282 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1151,6 +1151,9 @@ extern int pudp_test_and_clear_young(struct vm_area_struct *vma,
 extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pmd_t *pmdp);
 
+#define __HAVE_ARCH_PUDP_CLEAR_YOUNG_FLUSH
+extern int pudp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pud_t *pudp);
 
 #define pmd_write pmd_write
 static inline int pmd_write(pmd_t pmd)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 0a5008690d7c..0edcfa8007cb 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -643,6 +643,19 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 
 	return young;
 }
+int pudp_clear_flush_young(struct vm_area_struct *vma,
+			   unsigned long address, pud_t *pudp)
+{
+	int young;
+
+	VM_BUG_ON(address & ~HPAGE_PUD_MASK);
+
+	young = pudp_test_and_clear_young(vma, address, pudp);
+	if (young)
+		flush_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+
+	return young;
+}
 #endif
 
 /**
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 0f626d6177c3..682531e0d55c 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -121,6 +121,20 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
+#ifndef __HAVE_ARCH_PUDP_CLEAR_YOUNG_FLUSH
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+extern int pudp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pud_t *pudp);
+#else
+int pudp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pud_t *pudp)
+{
+	BUILD_BUG();
+	return 0;
+}
+#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD  */
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 4050ec1c3b45..6850b9e9b2cb 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -353,6 +353,19 @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *range,
 	__young;							\
 })
 
+#define pudp_clear_flush_young_notify(__vma, __address, __pudp)		\
+({									\
+	int __young;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	__young = pudp_clear_flush_young(___vma, ___address, __pudp);	\
+	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
+						  ___address,		\
+						  ___address +		\
+							PUD_SIZE);	\
+	__young;							\
+})
+
 #define ptep_clear_young_notify(__vma, __address, __ptep)		\
 ({									\
 	int __young;							\
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 988d176472df..2b566736e3c2 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -206,6 +206,7 @@ struct page_vma_mapped_walk {
 	struct page *page;
 	struct vm_area_struct *vma;
 	unsigned long address;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 	spinlock_t *ptl;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 11df03e71288..a473553aa9a5 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -141,9 +141,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	struct page *page = pvmw->page;
 	pgd_t *pgd;
 	p4d_t *p4d;
-	pud_t *pud;
+	pud_t pude;
 	pmd_t pmde;
 
+	if (!pvmw->pte && !pvmw->pmd && pvmw->pud)
+		return not_found(pvmw);
+
 	/* The only possible pmd mapping has been handled on last iteration */
 	if (pvmw->pmd && !pvmw->pte)
 		return not_found(pvmw);
@@ -171,10 +174,31 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	p4d = p4d_offset(pgd, pvmw->address);
 	if (!p4d_present(*p4d))
 		return false;
-	pud = pud_offset(p4d, pvmw->address);
-	if (!pud_present(*pud))
+	pvmw->pud = pud_offset(p4d, pvmw->address);
+
+	/*
+	 * Make sure the pud value isn't cached in a register by the
+	 * compiler and used as a stale value after we've observed a
+	 * subsequent update.
+	 */
+	pude = READ_ONCE(*pvmw->pud);
+	if (pud_trans_huge(pude)) {
+		pvmw->ptl = pud_lock(mm, pvmw->pud);
+		if (likely(pud_trans_huge(*pvmw->pud))) {
+			if (pvmw->flags & PVMW_MIGRATION)
+				return not_found(pvmw);
+			if (pud_page(*pvmw->pud) != page)
+				return not_found(pvmw);
+			return true;
+		} else {
+			/* THP pud was split under us: handle on pmd level */
+			spin_unlock(pvmw->ptl);
+			pvmw->ptl = NULL;
+		}
+	} else if (!pud_present(pude))
 		return false;
-	pvmw->pmd = pmd_offset(pud, pvmw->address);
+
+	pvmw->pmd = pmd_offset(pvmw->pud, pvmw->address);
 	/*
 	 * Make sure the pmd value isn't cached in a register by the
 	 * compiler and used as a stale value after we've observed a
@@ -210,6 +234,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	} else if (!pmd_present(pmde)) {
 		return false;
 	}
+
 	if (!map_pte(pvmw))
 		goto next_pte;
 	while (1) {
diff --git a/mm/rmap.c b/mm/rmap.c
index dae66a4329ea..f69d81d4a956 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -789,9 +789,15 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
 					referenced++;
 			}
 		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-			if (pmdp_clear_flush_young_notify(vma, address,
-						pvmw.pmd))
-				referenced++;
+			if (pvmw.pmd) {
+				if (pmdp_clear_flush_young_notify(vma, address,
+							pvmw.pmd))
+					referenced++;
+			} else if (pvmw.pud) {
+				if (pudp_clear_flush_young_notify(vma, address,
+							pvmw.pud))
+					referenced++;
+			}
 		} else {
 			/* unexpected pmd-mapped page? */
 			WARN_ON_ONCE(1);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 15/31] mm: thp: add 1GB THP split_huge_pud_page() function.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (13 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 14/31] mm: thp: handling 1GB THP reference bit Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 16/31] mm: thp: check compound_mapcount of PMD-mapped PUD THPs at free time Zi Yan
                   ` (16 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

It mimics PMD-level THP split. In addition, to support PMD-mapped PUD
THP, PMDPageInPUD() is used. For the mapcount of PMD-mapped PUD THP,
sub_compound_mapcount() is used, which uses
(head_page+3).compound_mapcount, since each base page's mapcount is used
for PTE mapping. PagePUDDoubleMap() is used for both PUD-mapped and
PMD-mapped PUD THPs.

page_xxx_rmap() functions now have an extra page order parameter to
distinguish different THP sizes.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/include/asm/pgtable.h |  15 +
 include/asm-generic/pgtable.h  |  83 ++++
 include/linux/huge_mm.h        |  31 +-
 include/linux/memcontrol.h     |   5 +
 include/linux/mm.h             |  18 +
 include/linux/page-flags.h     |  79 +++-
 include/linux/rmap.h           |   9 +-
 include/linux/swap.h           |   2 +
 include/linux/vm_event_item.h  |   4 +
 kernel/events/uprobes.c        |   4 +-
 mm/huge_memory.c               | 695 ++++++++++++++++++++++++++++++---
 mm/hugetlb.c                   |   4 +-
 mm/khugepaged.c                |   4 +-
 mm/ksm.c                       |   4 +-
 mm/memcontrol.c                |  13 +
 mm/memory.c                    |  16 +-
 mm/migrate.c                   |   8 +-
 mm/page_alloc.c                |  18 +-
 mm/pgtable-generic.c           |  11 +
 mm/rmap.c                      | 108 +++--
 mm/swap.c                      |  38 ++
 mm/swapfile.c                  |   4 +-
 mm/userfaultfd.c               |   2 +-
 mm/util.c                      |   7 +
 mm/vmstat.c                    |   4 +
 25 files changed, 1079 insertions(+), 107 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f99ce657d282..4a6805f8f128 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1269,6 +1269,21 @@ static inline p4d_t *user_to_kernel_p4dp(p4d_t *p4dp)
 }
 #endif /* CONFIG_PAGE_TABLE_ISOLATION */
 
+#ifndef pudp_establish
+#define pudp_establish pudp_establish
+static inline pud_t pudp_establish(struct vm_area_struct *vma,
+		unsigned long address, pud_t *pudp, pud_t pud)
+{
+	if (IS_ENABLED(CONFIG_SMP)) {
+		return xchg(pudp, pud);
+	} else {
+		pud_t old = *pudp;
+		*pudp = pud;
+		return old;
+	}
+}
+#endif
+
 /*
  * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
  *
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 682531e0d55c..1ae33b6590b8 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -346,6 +346,11 @@ extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 			    pmd_t *pmdp);
 #endif
 
+#ifndef __HAVE_ARCH_PUDP_INVALIDATE
+extern pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+			    pud_t *pudp);
+#endif
+
 #ifndef __HAVE_ARCH_PTE_SAME
 static inline int pte_same(pte_t pte_a, pte_t pte_b)
 {
@@ -941,6 +946,18 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 }
 #endif
 
+#ifndef pud_read_atomic
+static inline pud_t pud_read_atomic(pud_t *pudp)
+{
+	/*
+	 * Depend on compiler for an atomic pmd read. NOTE: this is
+	 * only going to work, if the pmdval_t isn't larger than
+	 * an unsigned long.
+	 */
+	return *pudp;
+}
+#endif
+
 #ifndef arch_needs_pgtable_deposit
 #define arch_needs_pgtable_deposit() (false)
 #endif
@@ -1032,6 +1049,72 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
 
+static inline int pud_none_or_trans_huge_or_clear_bad(pud_t *pud)
+{
+	pud_t pudval = pud_read_atomic(pud);
+	/*
+	 * The barrier will stabilize the pmdval in a register or on
+	 * the stack so that it will stop changing under the code.
+	 *
+	 * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
+	 * pmd_read_atomic is allowed to return a not atomic pmdval
+	 * (for example pointing to an hugepage that has never been
+	 * mapped in the pmd). The below checks will only care about
+	 * the low part of the pmd with 32bit PAE x86 anyway, with the
+	 * exception of pmd_none(). So the important thing is that if
+	 * the low part of the pmd is found null, the high part will
+	 * be also null or the pmd_none() check below would be
+	 * confused.
+	 */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	barrier();
+#endif
+	/*
+	 * !pmd_present() checks for pmd migration entries
+	 *
+	 * The complete check uses is_pmd_migration_entry() in linux/swapops.h
+	 * But using that requires moving current function and pmd_trans_unstable()
+	 * to linux/swapops.h to resovle dependency, which is too much code move.
+	 *
+	 * !pmd_present() is equivalent to is_pmd_migration_entry() currently,
+	 * because !pmd_present() pages can only be under migration not swapped
+	 * out.
+	 *
+	 * pmd_none() is preseved for future condition checks on pmd migration
+	 * entries and not confusing with this function name, although it is
+	 * redundant with !pmd_present().
+	 */
+	if (pud_none(pudval) || pud_trans_huge(pudval))
+		return 1;
+	if (unlikely(pud_bad(pudval))) {
+		pud_clear_bad(pud);
+		return 1;
+	}
+	return 0;
+}
+
+/*
+ * This is a noop if Transparent Hugepage Support is not built into
+ * the kernel. Otherwise it is equivalent to
+ * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
+ * places that already verified the pmd is not none and they want to
+ * walk ptes while holding the mmap sem in read mode (write mode don't
+ * need this). If THP is not enabled, the pmd can't go away under the
+ * code even if MADV_DONTNEED runs, but if THP is enabled we need to
+ * run a pmd_trans_unstable before walking the ptes after
+ * split_huge_page_pmd returns (because it may have run when the pmd
+ * become null, but then a page fault can map in a THP and not a
+ * regular page).
+ */
+static inline int pud_trans_unstable(pud_t *pud)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	return pud_none_or_trans_huge_or_clear_bad(pud);
+#else
+	return 0;
+#endif
+}
+
 #ifndef CONFIG_NUMA_BALANCING
 /*
  * Technically a PTE can be PROTNONE even when not doing NUMA balancing but
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 02419fa91e12..bd5cc5e65de8 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -178,17 +178,27 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
 		bool freeze, struct page *page);
 
+bool can_split_huge_pud_page(struct page *page, int *pextra_pins);
+int split_huge_pud_page_to_list(struct page *page, struct list_head *list);
+static inline int split_huge_pud_page(struct page *page)
+{
+	return split_huge_pud_page_to_list(page, NULL);
+}
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
-		unsigned long address);
+		unsigned long address, bool freeze, struct page *page);
 
 #define split_huge_pud(__vma, __pud, __address)				\
 	do {								\
 		pud_t *____pud = (__pud);				\
 		if (pud_trans_huge(*____pud)				\
 					|| pud_devmap(*____pud))	\
-			__split_huge_pud(__vma, __pud, __address);	\
+			__split_huge_pud(__vma, __pud, __address,	\
+						false, NULL);		\
 	}  while (0)
 
+void split_huge_pud_address(struct vm_area_struct *vma, unsigned long address,
+		bool freeze, struct page *page);
+
 extern int hugepage_madvise(struct vm_area_struct *vma,
 			    unsigned long *vm_flags, int advice);
 extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
@@ -319,8 +329,25 @@ static inline void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 static inline void split_huge_pmd_address(struct vm_area_struct *vma,
 		unsigned long address, bool freeze, struct page *page) {}
 
+static inline bool
+can_split_huge_pud_page(struct page *page, int *pextra_pins)
+{
+	BUILD_BUG();
+	return false;
+}
+static inline int
+split_huge_pud_page_to_list(struct page *page, struct list_head *list)
+{
+	return 0;
+}
+static inline int split_huge_pud_page(struct page *page)
+{
+	return 0;
+}
 #define split_huge_pud(__vma, __pmd, __address)	\
 	do { } while (0)
+static inline void split_huge_pud_address(struct vm_area_struct *vma,
+		unsigned long address, bool freeze, struct page *page) {}
 
 static inline int hugepage_madvise(struct vm_area_struct *vma,
 				   unsigned long *vm_flags, int advice)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 83ae11cbd12c..fd362559d4b7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -790,6 +790,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head);
+void mem_cgroup_split_huge_pud_fixup(struct page *head);
 #endif
 
 #else /* CONFIG_MEMCG */
@@ -1098,6 +1099,10 @@ static inline void mem_cgroup_split_huge_fixup(struct page *head)
 {
 }
 
+static inline void mem_cgroup_split_huge_pud_fixup(struct page *head)
+{
+}
+
 static inline void count_memcg_events(struct mem_cgroup *memcg,
 				      enum vm_event_item idx,
 				      unsigned long count)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d10dc9db2311..af6257d05189 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -652,6 +652,24 @@ static inline int compound_mapcount(struct page *page)
 	return atomic_read(compound_mapcount_ptr(page)) + 1;
 }
 
+static inline unsigned int compound_order(struct page *page);
+static inline atomic_t *sub_compound_mapcount_ptr(struct page *page, int sub_level)
+{
+	struct page *head = compound_head(page);
+
+	VM_BUG_ON_PAGE(!PageCompound(page), page);
+	VM_BUG_ON_PAGE(compound_order(head) != HPAGE_PUD_ORDER, page);
+	VM_BUG_ON_PAGE((page - head) % HPAGE_PMD_NR, page);
+	VM_BUG_ON_PAGE(sub_level != 1, page);
+	return &page[2 + sub_level].compound_mapcount;
+}
+
+/* Only works for PUD pages */
+static inline int sub_compound_mapcount(struct page *page)
+{
+	return atomic_read(sub_compound_mapcount_ptr(page, 1)) + 1;
+}
+
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
  * both from it and to it can be tracked, using atomic_inc_and_test
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 39b4494e29f1..480e091f52ac 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -607,6 +607,23 @@ static inline int PageTransTail(struct page *page)
 	return PageTail(page);
 }
 
+#define HPAGE_PMD_SHIFT PMD_SHIFT
+#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
+
+#define HPAGE_PUD_SHIFT PUD_SHIFT
+#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
+
+static inline unsigned int compound_order(struct page *page);
+
+static inline int PMDPageInPUD(struct page *page)
+{
+	struct page *head = compound_head(page);
+	return (PageCompound(page) && compound_order(head) == HPAGE_PUD_ORDER &&
+		((page - head) % HPAGE_PMD_NR == 0));
+}
+
 /*
  * PageDoubleMap indicates that the compound page is mapped with PTEs as well
  * as PMDs.
@@ -622,30 +639,72 @@ static inline int PageTransTail(struct page *page)
  */
 static inline int PageDoubleMap(struct page *page)
 {
-	return PageHead(page) && test_bit(PG_double_map, &page[1].flags);
+	return (PageHead(page) || PMDPageInPUD(page)) &&
+		test_bit(PG_double_map, &compound_head(page)[1].flags);
 }
 
 static inline void SetPageDoubleMap(struct page *page)
 {
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	set_bit(PG_double_map, &page[1].flags);
+	VM_BUG_ON_PAGE(!PageHead(page) && !PMDPageInPUD(page), page);
+	set_bit(PG_double_map, &compound_head(page)[1].flags);
 }
 
 static inline void ClearPageDoubleMap(struct page *page)
 {
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	clear_bit(PG_double_map, &page[1].flags);
+	VM_BUG_ON_PAGE(!PageHead(page) && !PMDPageInPUD(page), page);
+	clear_bit(PG_double_map, &compound_head(page)[1].flags);
 }
 static inline int TestSetPageDoubleMap(struct page *page)
 {
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	return test_and_set_bit(PG_double_map, &page[1].flags);
+	VM_BUG_ON_PAGE(!PageHead(page) && !PMDPageInPUD(page), page);
+	return test_and_set_bit(PG_double_map, &compound_head(page)[1].flags);
 }
 
 static inline int TestClearPageDoubleMap(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageHead(page) && !PMDPageInPUD(page), page);
+	return test_and_clear_bit(PG_double_map, &compound_head(page)[1].flags);
+}
+
+/*
+ * PagePUDDoubleMap indicates that the compound page is mapped with PMDs as well
+ * as PUDs.
+ *
+ * This is required for optimization of rmap operations for THP: we can postpone
+ * per small page mapcount accounting (and its overhead from atomic operations)
+ * until the first PMD split.
+ *
+ * For the page PageDoubleMap means ->_mapcount in all sub-pages is offset up
+ * by one. This reference will go away with last compound_mapcount.
+ *
+ * See also __split_huge_pmd_locked() and page_remove_anon_compound_rmap().
+ */
+static inline int PagePUDDoubleMap(struct page *page)
+{
+	return PageHead(page) && test_bit(PG_double_map, &page[2].flags);
+}
+
+static inline void SetPagePUDDoubleMap(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+	set_bit(PG_double_map, &page[2].flags);
+}
+
+static inline void ClearPagePUDDoubleMap(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+	clear_bit(PG_double_map, &page[2].flags);
+}
+static inline int TestSetPagePUDDoubleMap(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+	return test_and_set_bit(PG_double_map, &page[2].flags);
+}
+
+static inline int TestClearPagePUDDoubleMap(struct page *page)
 {
 	VM_BUG_ON_PAGE(!PageHead(page), page);
-	return test_and_clear_bit(PG_double_map, &page[1].flags);
+	return test_and_clear_bit(PG_double_map, &page[2].flags);
 }
 
 #else
@@ -653,9 +712,13 @@ TESTPAGEFLAG_FALSE(TransHuge)
 TESTPAGEFLAG_FALSE(TransCompound)
 TESTPAGEFLAG_FALSE(TransCompoundMap)
 TESTPAGEFLAG_FALSE(TransTail)
+TESTPAGEFLAG_FALSE(PMDPageInPUD)
 PAGEFLAG_FALSE(DoubleMap)
 	TESTSETFLAG_FALSE(DoubleMap)
 	TESTCLEARFLAG_FALSE(DoubleMap)
+PAGEFLAG_FALSE(PUDDoubleMap)
+	TESTSETFLAG_FALSE(PUDDoubleMap)
+	TESTCLEARFLAG_FALSE(PUDDoubleMap)
 #endif
 
 /*
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2b566736e3c2..6adb6e835b30 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -99,6 +99,7 @@ enum ttu_flags {
 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
 					 * caller holds it */
 	TTU_SPLIT_FREEZE	= 0x100,		/* freeze pte under splitting thp */
+	TTU_SPLIT_HUGE_PUD	= 0x200,		/* split huge PUD if any */
 };
 
 #ifdef CONFIG_MMU
@@ -171,13 +172,13 @@ struct anon_vma *page_get_anon_vma(struct page *page);
  */
 void page_move_anon_rmap(struct page *, struct vm_area_struct *);
 void page_add_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, bool);
+		unsigned long, bool, int);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
-			   unsigned long, int);
+			   unsigned long, int, int);
 void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, bool);
+		unsigned long, bool, int);
 void page_add_file_rmap(struct page *, bool);
-void page_remove_rmap(struct page *, bool);
+void page_remove_rmap(struct page *, bool, int);
 
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 			    unsigned long);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 622025ac1461..1a6bac77c854 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -333,6 +333,8 @@ extern void lru_cache_add_anon(struct page *page);
 extern void lru_cache_add_file(struct page *page);
 extern void lru_add_page_tail(struct page *page, struct page *page_tail,
 			 struct lruvec *lruvec, struct list_head *head);
+extern void lru_add_pud_page_tail(struct page *page, struct page *page_tail,
+			 struct lruvec *lruvec, struct list_head *head);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 4550667b2274..df619262b1b4 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -85,6 +85,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_FAULT_ALLOC_PUD,
 		THP_FAULT_FALLBACK_PUD,
 		THP_SPLIT_PUD,
+		THP_SPLIT_PUD_PAGE,
+		THP_SPLIT_PUD_PAGE_FAILED,
+		THP_ZERO_PUD_PAGE_ALLOC,
+		THP_ZERO_PUD_PAGE_ALLOC_FAILED,
 #endif
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8aef47ee7bfa..e4819fef634f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -195,7 +195,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	VM_BUG_ON_PAGE(addr != pvmw.address, old_page);
 
 	get_page(new_page);
-	page_add_new_anon_rmap(new_page, vma, addr, false);
+	page_add_new_anon_rmap(new_page, vma, addr, false, 0);
 	mem_cgroup_commit_charge(new_page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 
@@ -209,7 +209,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	set_pte_at_notify(mm, addr, pvmw.pte,
 			mk_pte(new_page, vma->vm_page_prot));
 
-	page_remove_rmap(old_page, false);
+	page_remove_rmap(old_page, false, 0);
 	if (!page_mapped(old_page))
 		try_to_free_swap(old_page);
 	page_vma_mapped_walk_done(&pvmw);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0a006592f3fe..5f83f4c5eac7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -121,10 +121,10 @@ static struct page *get_huge_pud_zero_page(void)
 	zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
 			HPAGE_PUD_ORDER);
 	if (!zero_page) {
-		count_vm_event(THP_ZERO_PAGE_ALLOC_FAILED);
+		count_vm_event(THP_ZERO_PUD_PAGE_ALLOC_FAILED);
 		return NULL;
 	}
-	count_vm_event(THP_ZERO_PAGE_ALLOC);
+	count_vm_event(THP_ZERO_PUD_PAGE_ALLOC);
 	preempt_disable();
 	if (cmpxchg(&huge_pud_zero_page, NULL, zero_page)) {
 		preempt_enable();
@@ -660,7 +660,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-		page_add_new_anon_rmap(page, vma, haddr, true);
+		page_add_new_anon_rmap(page, vma, haddr, true, HPAGE_PMD_ORDER);
 		mem_cgroup_commit_charge(page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
@@ -969,7 +969,7 @@ static int __do_huge_pud_anonymous_page(struct vm_fault *vmf, struct page *page,
 
 		entry = mk_huge_pud(page, vma->vm_page_prot);
 		entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
-		page_add_new_anon_rmap(page, vma, haddr, true);
+		page_add_new_anon_rmap(page, vma, haddr, true, HPAGE_PUD_ORDER);
 		mem_cgroup_commit_charge(page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_pud_deposit(vma->vm_mm, vmf->pud,
@@ -1463,7 +1463,7 @@ static int do_huge_pud_wp_page_fallback(struct vm_fault *vmf, pud_t orig_pud,
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
-		page_add_new_anon_rmap(pages[i], vmf->vma, haddr, true);
+		page_add_new_anon_rmap(pages[i], vmf->vma, haddr, true, HPAGE_PMD_ORDER);
 		mem_cgroup_commit_charge(pages[i], memcg, false, true);
 		lru_cache_add_active_or_unevictable(pages[i], vma);
 		vmf->pmd = pmd_offset(&_pud, haddr);
@@ -1475,7 +1475,7 @@ static int do_huge_pud_wp_page_fallback(struct vm_fault *vmf, pud_t orig_pud,
 
 	smp_wmb(); /* make pte visible before pmd */
 	pud_populate_with_pgtable(vma->vm_mm, vmf->pud, pgtable);
-	page_remove_rmap(page, true);
+	page_remove_rmap(page, true, HPAGE_PUD_ORDER);
 	spin_unlock(vmf->ptl);
 
 	/*
@@ -1566,13 +1566,13 @@ int do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
 		prep_transhuge_page(new_page);
 	} else {
 		if (!page) {
-			WARN(1, "%s: split_huge_page\n", __func__);
+			/*WARN(1, "%s: split_huge_page\n", __func__);*/
 			split_huge_pud(vma, vmf->pud, vmf->address);
 			ret |= VM_FAULT_FALLBACK;
 		} else {
 			ret = do_huge_pud_wp_page_fallback(vmf, orig_pud, page);
 			if (ret & VM_FAULT_OOM) {
-				WARN(1, "%s: split_huge_page after wp fallback\n", __func__);
+				/*WARN(1, "%s: split_huge_page after wp fallback\n", __func__);*/
 				split_huge_pud(vma, vmf->pud, vmf->address);
 				ret |= VM_FAULT_FALLBACK;
 			}
@@ -1585,7 +1585,7 @@ int do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
 	if (unlikely(mem_cgroup_try_charge(new_page, vma->vm_mm,
 					huge_gfp, &memcg, true))) {
 		put_page(new_page);
-		WARN(1, "%s: split_huge_page after mem cgroup failed\n", __func__);
+		/*WARN(1, "%s: split_huge_page after mem cgroup failed\n", __func__);*/
 		split_huge_pud(vma, vmf->pud, vmf->address);
 		if (page)
 			put_page(page);
@@ -1620,7 +1620,7 @@ int do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
 		entry = mk_huge_pud(new_page, vma->vm_page_prot);
 		entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
 		pudp_huge_clear_flush_notify(vma, haddr, vmf->pud);
-		page_add_new_anon_rmap(new_page, vma, haddr, true);
+		page_add_new_anon_rmap(new_page, vma, haddr, true, HPAGE_PUD_ORDER);
 		mem_cgroup_commit_charge(new_page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		set_pud_at(vma->vm_mm, haddr, vmf->pud, entry);
@@ -1629,7 +1629,7 @@ int do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PUD_NR);
 		} else {
 			VM_BUG_ON_PAGE(!PageHead(page), page);
-			page_remove_rmap(page, true);
+			page_remove_rmap(page, true, HPAGE_PUD_ORDER);
 			put_page(page);
 		}
 		ret |= VM_FAULT_WRITE;
@@ -1748,7 +1748,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
-		page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
+		page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false, 0);
 		mem_cgroup_commit_charge(pages[i], memcg, false, false);
 		lru_cache_add_active_or_unevictable(pages[i], vma);
 		vmf->pte = pte_offset_map(&_pmd, haddr);
@@ -1760,7 +1760,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
 
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(vma->vm_mm, vmf->pmd, pgtable);
-	page_remove_rmap(page, true);
+	page_remove_rmap(page, true, HPAGE_PMD_ORDER);
 	spin_unlock(vmf->ptl);
 
 	/*
@@ -1900,7 +1900,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
-		page_add_new_anon_rmap(new_page, vma, haddr, true);
+		page_add_new_anon_rmap(new_page, vma, haddr, true, HPAGE_PMD_ORDER);
 		mem_cgroup_commit_charge(new_page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
@@ -1909,7 +1909,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		} else {
 			VM_BUG_ON_PAGE(!PageHead(page), page);
-			page_remove_rmap(page, true);
+			page_remove_rmap(page, true, HPAGE_PMD_ORDER);
 			put_page(page);
 		}
 		ret |= VM_FAULT_WRITE;
@@ -2282,9 +2282,9 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		if (pmd_present(orig_pmd)) {
 			page = pmd_page(orig_pmd);
-			page_remove_rmap(page, true);
+			page_remove_rmap(page, true, HPAGE_PMD_ORDER);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
-			VM_BUG_ON_PAGE(!PageHead(page), page);
+			VM_BUG_ON_PAGE(!PageHead(page) && !PMDPageInPUD(page), page);
 		} else if (thp_migration_supported()) {
 			swp_entry_t entry;
 
@@ -2560,7 +2560,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 		if (pud_present(orig_pud)) {
 			page = pud_page(orig_pud);
-			page_remove_rmap(page, true);
+			page_remove_rmap(page, true, HPAGE_PUD_ORDER);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
 		} else
@@ -2582,9 +2582,60 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	return 1;
 }
 
+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+		unsigned long haddr, pmd_t *pmd);
+
+static void __split_huge_zero_page_pud(struct vm_area_struct *vma,
+		unsigned long haddr, pud_t *pud)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgtable_t pgtable;
+	pud_t _pud;
+	int i;
+
+	/*
+	 * Leave pmd empty until pte is filled note that it is fine to delay
+	 * notification until mmu_notifier_invalidate_range_end() as we are
+	 * replacing a zero pmd write protected page with a zero pte write
+	 * protected page.
+	 *
+	 * See Documentation/vm/mmu_notifier.txt
+	 */
+	pudp_huge_clear_flush(vma, haddr, pud);
+
+	pgtable = pgtable_trans_huge_pud_withdraw(mm, pud);
+	pud_populate_with_pgtable(mm, &_pud, pgtable);
+
+	for (i = 0; i < (1<<(HPAGE_PUD_ORDER-HPAGE_PMD_ORDER));
+		 i++, haddr += PMD_SIZE) {
+		pmd_t *pmd = pmd_offset(&_pud, haddr), entry;
+		struct page *zero_page = mm_get_huge_zero_page(mm);
+
+		if (unlikely(!zero_page)) {
+			VM_BUG_ON(1);
+			__split_huge_zero_page_pmd(vma, haddr, pmd);
+			continue;
+		}
+
+		VM_BUG_ON(!pmd_none(*pmd));
+		entry = mk_huge_pmd(zero_page, vma->vm_page_prot);
+		set_pmd_at(mm, haddr, pmd, entry);
+	}
+	smp_wmb(); /* make pte visible before pmd */
+	pud_populate_with_pgtable(mm, pud, pgtable);
+}
+
 static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
-		unsigned long haddr)
+		unsigned long haddr, bool freeze)
 {
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
+	pgtable_t pgtable;
+	pud_t _pud, old_pud;
+	bool young, write, dirty, soft_dirty;
+	unsigned long addr;
+	int i;
+
 	VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
 	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
@@ -2592,22 +2643,149 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
 
 	count_vm_event(THP_SPLIT_PUD);
 
-	pudp_huge_clear_flush_notify(vma, haddr, pud);
+	if (!vma_is_anonymous(vma)) {
+		_pud = pudp_huge_clear_flush_notify(vma, haddr, pud);
+		/*
+		 * We are going to unmap this huge page. So
+		 * just go ahead and zap it
+		 */
+		if (arch_needs_pgtable_deposit())
+			zap_pud_deposited_table(mm, pud);
+		if (vma_is_dax(vma))
+			return;
+		page = pud_page(_pud);
+		if (!PageReferenced(page) && pud_young(_pud))
+			SetPageReferenced(page);
+		page_remove_rmap(page, true, HPAGE_PUD_ORDER);
+		put_page(page);
+		add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PUD_NR);
+		return;
+	} else if (is_huge_zero_pud(*pud)) {
+		/*
+		 * FIXME: Do we want to invalidate secondary mmu by calling
+		 * mmu_notifier_invalidate_range() see comments below inside
+		 * __split_huge_pmd() ?
+		 *
+		 * We are going from a zero huge page write protected to zero
+		 * small page also write protected so it does not seems useful
+		 * to invalidate secondary mmu at this time.
+		 */
+		return __split_huge_zero_page_pud(vma, haddr, pud);
+	}
+
+	/* See the comment above pmdp_invalidate() in __split_huge_pmd_locked() */
+	old_pud = pudp_invalidate(vma, haddr, pud);
+
+	page = pud_page(old_pud);
+	VM_BUG_ON_PAGE(!page_count(page), page);
+	page_ref_add(page, (1<<(HPAGE_PUD_ORDER-HPAGE_PMD_ORDER)) - 1);
+	if (pud_dirty(old_pud))
+		SetPageDirty(page);
+	write = pud_write(old_pud);
+	young = pud_young(old_pud);
+	dirty = pud_dirty(old_pud);
+	soft_dirty = pud_soft_dirty(old_pud);
+
+	pgtable = pgtable_trans_huge_pud_withdraw(mm, pud);
+	pud_populate_with_pgtable(mm, &_pud, pgtable);
+
+	for (i = 0, addr = haddr; i < HPAGE_PUD_NR;
+		 i += HPAGE_PMD_NR, addr += PMD_SIZE) {
+		pmd_t entry, *pmd;
+		/*
+		 * Note that NUMA hinting access restrictions are not
+		 * transferred to avoid any possibility of altering
+		 * permissions across VMAs.
+		 */
+		if (freeze) {
+			swp_entry_t swp_entry;
+
+			swp_entry = make_migration_entry(page + i, write);
+			entry = swp_entry_to_pmd(swp_entry);
+			if (soft_dirty)
+				entry = pmd_swp_mksoft_dirty(entry);
+		} else {
+			entry = mk_huge_pmd(page + i, READ_ONCE(vma->vm_page_prot));
+			entry = maybe_pmd_mkwrite(entry, vma);
+			if (!write)
+				entry = pmd_wrprotect(entry);
+			if (!young)
+				entry = pmd_mkold(entry);
+			if (soft_dirty)
+				entry = pmd_mksoft_dirty(entry);
+		}
+		pmd = pmd_offset(&_pud, addr);
+		VM_BUG_ON(!pmd_none(*pmd));
+		set_pmd_at(mm, addr, pmd, entry);
+		/* distinguish between pud compound_mapcount and pmd compound_mapcount */
+		if (atomic_inc_and_test(sub_compound_mapcount_ptr(&page[i], 1)))
+			/* first pmd-mapped pud page */
+			__inc_node_page_state(page, NR_ANON_THPS);
+	}
+
+	/*
+	 * Set PG_double_map before dropping compound_mapcount to avoid
+	 * false-negative page_mapped().
+	 */
+	if (compound_mapcount(page) > 1 && !TestSetPagePUDDoubleMap(page)) {
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+		/* distinguish between pud compound_mapcount and pmd compound_mapcount */
+			atomic_inc(sub_compound_mapcount_ptr(&page[i], 1));
+	}
+
+	if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
+		/* Last compound_mapcount is gone. */
+		__dec_node_page_state(page, NR_ANON_THPS_PUD);
+		if (TestClearPagePUDDoubleMap(page)) {
+			/* No need in mapcount reference anymore */
+			for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+		/* distinguish between pud compound_mapcount and pmd compound_mapcount */
+				atomic_dec(sub_compound_mapcount_ptr(&page[i], 1));
+		}
+	}
+
+	smp_wmb(); /* make pte visible before pmd */
+	pud_populate_with_pgtable(mm, pud, pgtable);
+
+	if (freeze) {
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR) {
+			/*page_remove_rmap(page + i, true, HPAGE_PMD_ORDER);*/
+			atomic_dec(sub_compound_mapcount_ptr(&page[i], 1));
+			__dec_node_page_state(page, NR_ANON_THPS);
+			__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -HPAGE_PMD_NR);
+			put_page(page + i);
+		}
+	}
 }
 
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
-		unsigned long address)
+		unsigned long address, bool freeze, struct page *page)
 {
 	spinlock_t *ptl;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long haddr = address & HPAGE_PUD_MASK;
 	struct mmu_notifier_range range;
 
 	mmu_notifier_range_init(&range, vma->vm_mm, address & HPAGE_PUD_MASK,
 				(address & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
-	ptl = pud_lock(vma->vm_mm, pud);
-	if (unlikely(!pud_trans_huge(*pud) && !pud_devmap(*pud)))
+	ptl = pud_lock(mm, pud);
+
+	/*
+	 * If caller asks to setup a migration entries, we need a page to check
+	 * pmd against. Otherwise we can end up replacing wrong page.
+	 */
+	VM_BUG_ON(freeze && !page);
+	if (page && page != pud_page(*pud))
 		goto out;
-	__split_huge_pud_locked(vma, pud, range.start);
+
+	if (pud_trans_huge(*pud)) {
+		page = pud_page(*pud);
+		if (PageMlocked(page))
+			clear_page_mlock(page);
+	} else if (unlikely(!pud_devmap(*pud)))
+		goto out;
+	__split_huge_pud_locked(vma, pud, haddr, freeze);
 
 out:
 	spin_unlock(ptl);
@@ -2617,6 +2795,369 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 	 */
 	mmu_notifier_invalidate_range_only_end(&range);
 }
+
+void split_huge_pud_address(struct vm_area_struct *vma, unsigned long address,
+		bool freeze, struct page *page)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+
+	pgd = pgd_offset(vma->vm_mm, address);
+	if (!pgd_present(*pgd))
+		return;
+
+	p4d = p4d_offset(pgd, address);
+	if (!p4d_present(*p4d))
+		return;
+
+	pud = pud_offset(p4d, address);
+
+	__split_huge_pud(vma, pud, address, freeze, page);
+}
+
+static void freeze_pud_page(struct page *page)
+{
+	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
+		TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PUD;
+	bool unmap_success;
+
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+
+	if (PageAnon(page))
+		ttu_flags |= TTU_SPLIT_FREEZE;
+
+	unmap_success = try_to_unmap(page, ttu_flags);
+	VM_BUG_ON_PAGE(!unmap_success, page);
+}
+
+static void unfreeze_pud_page(struct page *page)
+{
+	int i;
+
+	VM_BUG_ON(!PageTransHuge(page));
+	if (compound_order(page) == HPAGE_PUD_ORDER) {
+		remove_migration_ptes(page, page, true);
+	} else if (compound_order(page) == HPAGE_PMD_ORDER) {
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+			remove_migration_ptes(page + i, page + i, true);
+	} else
+		VM_BUG_ON_PAGE(1, page);
+}
+
+static void __split_huge_pud_page_tail(struct page *head, int tail,
+		struct lruvec *lruvec, struct list_head *list)
+{
+	struct page *page_tail = head + tail;
+	/*int page_tail_mapcount = sub_compound_mapcount(page_tail);*/
+
+	VM_BUG_ON_PAGE(page_ref_count(page_tail) != 0, page_tail);
+
+	/*atomic_set(sub_compound_mapcount_ptr(page_tail, 1), -1);*/
+
+	clear_compound_head(page_tail);
+	prep_compound_page(page_tail, HPAGE_PMD_ORDER);
+	prep_transhuge_page(page_tail);
+
+	/* move sub PMD page mapcount */
+	/*atomic_set(compound_mapcount_ptr(page_tail), page_tail_mapcount);*/
+	/*
+	 * tail_page->_refcount is zero and not changing from under us. But
+	 * get_page_unless_zero() may be running from under us on the
+	 * tail_page. If we used atomic_set() below instead of atomic_inc() or
+	 * atomic_add(), we would then run atomic_set() concurrently with
+	 * get_page_unless_zero(), and atomic_set() is implemented in C not
+	 * using locked ops. spin_unlock on x86 sometime uses locked ops
+	 * because of PPro errata 66, 92, so unless somebody can guarantee
+	 * atomic_set() here would be safe on all archs (and not only on x86),
+	 * it's safer to use atomic_inc()/atomic_add().
+	 */
+	if (PageAnon(head) && !PageSwapCache(head)) {
+		page_ref_inc(page_tail);
+	} else {
+		VM_BUG_ON(1);
+		/* Additional pin to radix tree */
+		page_ref_add(page_tail, 2);
+	}
+
+	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+	page_tail->flags |= (head->flags &
+			((1L << PG_referenced) |
+			 (1L << PG_swapbacked) |
+			 (1L << PG_swapcache) |
+			 (1L << PG_mlocked) |
+			 (1L << PG_uptodate) |
+			 (1L << PG_active) |
+			 (1L << PG_locked) |
+			 (1L << PG_unevictable) |
+			 (1L << PG_dirty) |
+			 /* preserve THP */
+			 (1L << PG_head)));
+
+	/*
+	 * After clearing PageTail the gup refcount can be released.
+	 * Page flags also must be visible before we make the page non-compound.
+	 */
+	smp_wmb();
+
+	if (page_is_young(head))
+		set_page_young(page_tail);
+	if (page_is_idle(head))
+		set_page_idle(page_tail);
+
+	/* ->mapping in first tail page is compound_mapcount */
+	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
+			page_tail);
+	page_tail->mapping = head->mapping;
+
+	page_tail->index = head->index + tail;
+	page_cpupid_xchg_last(page_tail, page_cpupid_last(head));
+	lru_add_pud_page_tail(head, page_tail, lruvec, list);
+}
+
+static void __split_huge_pud_page(struct page *page, struct list_head *list,
+		unsigned long flags)
+{
+	struct page *head = compound_head(page);
+	struct zone *zone = page_zone(head);
+	struct lruvec *lruvec;
+	pgoff_t end = -1;
+	int i;
+
+	lruvec = mem_cgroup_page_lruvec(head, zone->zone_pgdat);
+
+	/* complete memcg works before add pages to LRU */
+	mem_cgroup_split_huge_pud_fixup(head);
+
+	if (!PageAnon(page)) {
+		VM_BUG_ON(1);
+		end = DIV_ROUND_UP(i_size_read(head->mapping->host), PAGE_SIZE);
+	}
+
+	for (i = HPAGE_PUD_NR - HPAGE_PMD_NR; i >= 1; i -= HPAGE_PMD_NR) {
+		__split_huge_pud_page_tail(head, i, lruvec, list);
+		/* Some pages can be beyond i_size: drop them from page cache */
+		if (head[i].index >= end) {
+			VM_BUG_ON(1);
+			__ClearPageDirty(head + i);
+			__delete_from_page_cache(head + i, NULL);
+			if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
+				shmem_uncharge(head->mapping->host, 1);
+			put_page(head + i);
+		}
+	}
+	/* reset head page order  */
+	prep_compound_page(head, HPAGE_PMD_ORDER);
+	prep_transhuge_page(head);
+
+	/* See comment in __split_huge_page_tail() */
+	if (PageAnon(head)) {
+		/* Additional pin to radix tree of swap cache */
+		if (PageSwapCache(head)) {
+			VM_BUG_ON(1);
+			page_ref_add(head, 2);
+		} else
+			page_ref_inc(head);
+	} else {
+		VM_BUG_ON(1);
+		/* Additional pin to radix tree */
+		page_ref_add(head, 2);
+		xa_unlock(&head->mapping->i_pages);
+	}
+
+	spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags);
+
+	unfreeze_pud_page(head);
+
+	for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR) {
+		struct page *subpage = head + i;
+
+		if (subpage == page)
+			continue;
+		unlock_page(subpage);
+
+		/*
+		 * Subpages may be freed if there wasn't any mapping
+		 * like if add_to_swap() is running on a lru page that
+		 * had its mapping zapped. And freeing these pages
+		 * requires taking the lru_lock so we do the put_page
+		 * of the tail pages after the split is complete.
+		 */
+		put_page(subpage);
+	}
+}
+/* Racy check whether the huge page can be split */
+bool can_split_huge_pud_page(struct page *page, int *pextra_pins)
+{
+	int extra_pins;
+
+	/* Additional pins from radix tree */
+	if (PageAnon(page))
+		extra_pins = PageSwapCache(page) ? HPAGE_PUD_NR : 0;
+	else
+		extra_pins = HPAGE_PUD_NR;
+	if (pextra_pins)
+		*pextra_pins = extra_pins;
+	return total_mapcount(page) == page_count(page) - extra_pins - 1;
+}
+
+/*
+ * This function splits huge page into normal pages. @page can point to any
+ * subpage of huge page to split. Split doesn't change the position of @page.
+ *
+ * Only caller must hold pin on the @page, otherwise split fails with -EBUSY.
+ * The huge page must be locked.
+ *
+ * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
+ *
+ * Both head page and tail pages will inherit mapping, flags, and so on from
+ * the hugepage.
+ *
+ * GUP pin and PG_locked transferred to @page. Rest subpages can be freed if
+ * they are not mapped.
+ *
+ * Returns 0 if the hugepage is split successfully.
+ * Returns -EBUSY if the page is pinned or if anon_vma disappeared from under
+ * us.
+ */
+int split_huge_pud_page_to_list(struct page *page, struct list_head *list)
+{
+	struct page *head = compound_head(page);
+	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
+	struct anon_vma *anon_vma = NULL;
+	struct address_space *mapping = NULL;
+	int count, mapcount, extra_pins, ret;
+	bool mlocked;
+	unsigned long flags;
+
+	VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(!PageCompound(page), page);
+
+	if (PageWriteback(page))
+		return -EBUSY;
+
+	if (PageAnon(head)) {
+		/*
+		 * The caller does not necessarily hold an mmap_sem that would
+		 * prevent the anon_vma disappearing so we first we take a
+		 * reference to it and then lock the anon_vma for write. This
+		 * is similar to page_lock_anon_vma_read except the write lock
+		 * is taken to serialise against parallel split or collapse
+		 * operations.
+		 */
+		anon_vma = page_get_anon_vma(head);
+		if (!anon_vma) {
+			ret = -EBUSY;
+			goto out;
+		}
+		mapping = NULL;
+		anon_vma_lock_write(anon_vma);
+	} else {
+		VM_BUG_ON(1);
+		mapping = head->mapping;
+
+		/* Truncated ? */
+		if (!mapping) {
+			ret = -EBUSY;
+			goto out;
+		}
+
+		anon_vma = NULL;
+		i_mmap_lock_read(mapping);
+	}
+
+	/*
+	 * Racy check if we can split the page, before freeze_pud_page() will
+	 * split PUDs
+	 */
+	if (!can_split_huge_pud_page(head, &extra_pins)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	mlocked = PageMlocked(page);
+	freeze_pud_page(head);
+	VM_BUG_ON_PAGE(compound_mapcount(head), head);
+
+	/* Make sure the page is not on per-CPU pagevec as it takes pin */
+	if (mlocked)
+		lru_add_drain();
+
+	/* prevent PageLRU to go away from under us, and freeze lru stats */
+	spin_lock_irqsave(zone_lru_lock(page_zone(head)), flags);
+
+	if (mapping) {
+		void **pslot;
+
+		VM_BUG_ON(1);
+
+		xa_lock(&mapping->i_pages);
+		pslot = radix_tree_lookup_slot(&mapping->i_pages,
+				page_index(head));
+		/*
+		 * Check if the head page is present in radix tree.
+		 * We assume all tail are present too, if head is there.
+		 */
+		if (radix_tree_deref_slot_protected(pslot,
+					&mapping->i_pages.xa_lock) != head)
+			goto fail;
+	}
+
+	/* Prevent deferred_split_scan() touching ->_refcount */
+	spin_lock(&pgdata->split_queue_lock);
+	count = page_count(head);
+	mapcount = total_mapcount(head);
+	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
+		if (!list_empty(page_deferred_list(head))) {
+			pgdata->split_queue_len--;
+			list_del(page_deferred_list(head));
+		}
+		if (mapping) {
+			VM_BUG_ON(1);
+			__dec_node_page_state(page, NR_SHMEM_THPS);
+		}
+		spin_unlock(&pgdata->split_queue_lock);
+		__split_huge_pud_page(page, list, flags);
+		if (PageSwapCache(head)) {
+			swp_entry_t entry = { .val = page_private(head) };
+
+			VM_BUG_ON(1);
+
+			ret = split_swap_cluster(entry);
+		} else
+			ret = 0;
+	} else {
+		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
+			pr_alert("total_mapcount: %u, page_count(): %u\n",
+					mapcount, count);
+			if (PageTail(page))
+				dump_page(head, NULL);
+			dump_page(page, "total_mapcount(head) > 0");
+			VM_BUG_ON(1);
+		}
+		spin_unlock(&pgdata->split_queue_lock);
+fail:
+		if (mapping) {
+			VM_BUG_ON(1);
+			xa_unlock(&mapping->i_pages);
+		}
+		spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags);
+		unfreeze_pud_page(head);
+		ret = -EBUSY;
+	}
+
+out_unlock:
+	if (anon_vma) {
+		anon_vma_unlock_write(anon_vma);
+		put_anon_vma(anon_vma);
+	}
+	if (mapping)
+		i_mmap_unlock_read(mapping);
+out:
+	count_vm_event(!ret ? THP_SPLIT_PUD_PAGE : THP_SPLIT_PUD_PAGE_FAILED);
+	return ret;
+}
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
 static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
@@ -2687,7 +3228,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			set_page_dirty(page);
 		if (!PageReferenced(page) && pmd_young(_pmd))
 			SetPageReferenced(page);
-		page_remove_rmap(page, true);
+		page_remove_rmap(page, true, HPAGE_PMD_ORDER);
 		put_page(page);
 		add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
 		return;
@@ -2787,12 +3328,19 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * Set PG_double_map before dropping compound_mapcount to avoid
 	 * false-negative page_mapped().
 	 */
-	if (compound_mapcount(page) > 1 && !TestSetPageDoubleMap(page)) {
+	if (((PMDPageInPUD(page) &&
+		sub_compound_mapcount(page) >
+			(1 + PagePUDDoubleMap(compound_head(page)))) ||
+		compound_mapcount(page) > 1)
+		&& !TestSetPageDoubleMap(page)) {
 		for (i = 0; i < HPAGE_PMD_NR; i++)
 			atomic_inc(&page[i]._mapcount);
 	}
 
-	if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
+	if ((PMDPageInPUD(page) &&
+		atomic_add_negative(-(1 + PagePUDDoubleMap(compound_head(page))),
+			sub_compound_mapcount_ptr(page, 1))) ||
+		atomic_add_negative(-1, compound_mapcount_ptr(page))) {
 		/* Last compound_mapcount is gone. */
 		__dec_node_page_state(page, NR_ANON_THPS);
 		if (TestClearPageDoubleMap(page)) {
@@ -2807,7 +3355,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (freeze) {
 		for (i = 0; i < HPAGE_PMD_NR; i++) {
-			page_remove_rmap(page + i, false);
+			page_remove_rmap(page + i, false, 0);
 			put_page(page + i);
 		}
 	}
@@ -2892,6 +3440,11 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 	 * previously contain an hugepage: check if we need to split
 	 * an huge pmd.
 	 */
+	if (start & ~HPAGE_PUD_MASK &&
+	    (start & HPAGE_PUD_MASK) >= vma->vm_start &&
+	    (start & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE <= vma->vm_end)
+		split_huge_pud_address(vma, start, false, NULL);
+
 	if (start & ~HPAGE_PMD_MASK &&
 	    (start & HPAGE_PMD_MASK) >= vma->vm_start &&
 	    (start & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
@@ -2902,6 +3455,11 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 	 * previously contain an hugepage: check if we need to split
 	 * an huge pmd.
 	 */
+	if (end & ~HPAGE_PUD_MASK &&
+	    (end & HPAGE_PUD_MASK) >= vma->vm_start &&
+	    (end & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE <= vma->vm_end)
+		split_huge_pud_address(vma, end, false, NULL);
+
 	if (end & ~HPAGE_PMD_MASK &&
 	    (end & HPAGE_PMD_MASK) >= vma->vm_start &&
 	    (end & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
@@ -2916,6 +3474,11 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 		struct vm_area_struct *next = vma->vm_next;
 		unsigned long nstart = next->vm_start;
 		nstart += adjust_next << PAGE_SHIFT;
+		if (nstart & ~HPAGE_PUD_MASK &&
+		    (nstart & HPAGE_PUD_MASK) >= next->vm_start &&
+		    (nstart & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE <= next->vm_end)
+			split_huge_pud_address(next, nstart, false, NULL);
+
 		if (nstart & ~HPAGE_PMD_MASK &&
 		    (nstart & HPAGE_PMD_MASK) >= next->vm_start &&
 		    (nstart & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= next->vm_end)
@@ -3084,12 +3647,23 @@ int total_mapcount(struct page *page)
 	if (PageHuge(page))
 		return compound;
 	ret = compound;
-	for (i = 0; i < HPAGE_PMD_NR; i++)
-		ret += atomic_read(&page[i]._mapcount) + 1;
+	/* if PMD, read all base page, if PUD, read the sub_compound_mapcount()*/
+	if (compound_order(page) == HPAGE_PMD_ORDER) {
+		for (i = 0; i < hpage_nr_pages(page); i++)
+			ret += atomic_read(&page[i]._mapcount) + 1;
+	} else if (compound_order(page) == HPAGE_PUD_ORDER) {
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+			ret += sub_compound_mapcount(&page[i]);
+		for (i = 0; i < hpage_nr_pages(page); i++)
+			ret += atomic_read(&page[i]._mapcount) + 1;
+	} else
+		VM_BUG_ON_PAGE(1, page);
 	/* File pages has compound_mapcount included in _mapcount */
+	/* both PUD and PMD has HPAGE_PMD_NR sub pages */
 	if (!PageAnon(page))
 		return ret - compound * HPAGE_PMD_NR;
-	if (PageDoubleMap(page))
+	/* both PUD and PMD has HPAGE_PMD_NR sub pages */
+	if (PagePUDDoubleMap(page) || PageDoubleMap(page))
 		ret -= HPAGE_PMD_NR;
 	return ret;
 }
@@ -3135,13 +3709,38 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
 	page = compound_head(page);
 
 	_total_mapcount = ret = 0;
-	for (i = 0; i < HPAGE_PMD_NR; i++) {
-		mapcount = atomic_read(&page[i]._mapcount) + 1;
-		ret = max(ret, mapcount);
-		_total_mapcount += mapcount;
-	}
-	if (PageDoubleMap(page)) {
+	/* if PMD, read all base page, if PUD, read the sub_compound_mapcount()*/
+	if (compound_order(page) == HPAGE_PMD_ORDER) {
+		for (i = 0; i < hpage_nr_pages(page); i++) {
+			mapcount = atomic_read(&page[i]._mapcount) + 1;
+			ret = max(ret, mapcount);
+			_total_mapcount += mapcount;
+		}
+	} else if (compound_order(page) == HPAGE_PUD_ORDER) {
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR) {
+			int j;
+
+			mapcount = sub_compound_mapcount(&page[i]);
+			ret = max(ret, mapcount);
+			_total_mapcount += mapcount;
+
+			/* Triple mapped at base page size */
+			for (j = 0; j < HPAGE_PMD_NR; j++) {
+				mapcount = atomic_read(&page[i + j]._mapcount) + 1;
+				ret = max(ret, mapcount);
+				_total_mapcount += mapcount;
+			}
+
+			if (PageDoubleMap(&page[i])) {
+				ret -= 1;
+				_total_mapcount -= HPAGE_PMD_NR;
+			}
+		}
+	} else
+		VM_BUG_ON_PAGE(1, page);
+	if (PageDoubleMap(page) || PagePUDDoubleMap(page)) {
 		ret -= 1;
+		/* both PUD and PMD has HPAGE_PMD_NR sub pages */
 		_total_mapcount -= HPAGE_PMD_NR;
 	}
 	mapcount = compound_mapcount(page);
@@ -3360,6 +3959,9 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
 	return READ_ONCE(pgdata->split_queue_len);
 }
 
+#define deferred_list_entry(x) (compound_head(list_entry((void *)x, \
+					struct page, mapping)))
+
 static unsigned long deferred_split_scan(struct shrinker *shrink,
 		struct shrink_control *sc)
 {
@@ -3372,8 +3974,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	spin_lock_irqsave(&pgdata->split_queue_lock, flags);
 	/* Take pin on all head pages to avoid freeing them under us */
 	list_for_each_safe(pos, next, &pgdata->split_queue) {
-		page = list_entry((void *)pos, struct page, mapping);
-		page = compound_head(page);
+		page = deferred_list_entry(pos);
 		if (get_page_unless_zero(page)) {
 			list_move(page_deferred_list(page), &list);
 		} else {
@@ -3387,12 +3988,18 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
 
 	list_for_each_safe(pos, next, &list) {
-		page = list_entry((void *)pos, struct page, mapping);
+		page = deferred_list_entry(pos);
 		if (!trylock_page(page))
 			goto next;
 		/* split_huge_page() removes page from list on success */
-		if (!split_huge_page(page))
-			split++;
+		if (compound_order(page) == HPAGE_PUD_ORDER) {
+			if (!split_huge_pud_page(page))
+				split++;
+		} else if (compound_order(page) == HPAGE_PMD_ORDER) {
+			if (!split_huge_page(page))
+				split++;
+		} else
+			VM_BUG_ON_PAGE(1, page);
 		unlock_page(page);
 next:
 		put_page(page);
@@ -3499,7 +4106,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
 	set_pmd_at(mm, address, pvmw->pmd, pmdswp);
-	page_remove_rmap(page, true);
+	page_remove_rmap(page, true, HPAGE_PMD_ORDER);
 	put_page(page);
 }
 
@@ -3525,7 +4132,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 
 	flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
 	if (PageAnon(new))
-		page_add_anon_rmap(new, vma, mmun_start, true);
+		page_add_anon_rmap(new, vma, mmun_start, true, HPAGE_PMD_ORDER);
 	else
 		page_add_file_rmap(new, true);
 	set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index afef61656c1e..0db6c31440e8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3418,7 +3418,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			set_page_dirty(page);
 
 		hugetlb_count_sub(pages_per_huge_page(h), mm);
-		page_remove_rmap(page, true);
+		page_remove_rmap(page, true, huge_page_order(h));
 
 		spin_unlock(ptl);
 		tlb_remove_page_size(tlb, page, huge_page_size(h));
@@ -3643,7 +3643,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		mmu_notifier_invalidate_range(mm, range.start, range.end);
 		set_huge_pte_at(mm, haddr, ptep,
 				make_huge_pte(vma, new_page, 1));
-		page_remove_rmap(old_page, true);
+		page_remove_rmap(old_page, true, huge_page_order(h));
 		hugepage_add_new_anon_rmap(new_page, vma, haddr);
 		/* Make the old page be freed below */
 		new_page = old_page;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index aedaa9f75806..3acfddcba714 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -674,7 +674,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			 * superfluous.
 			 */
 			pte_clear(vma->vm_mm, address, _pte);
-			page_remove_rmap(src_page, false);
+			page_remove_rmap(src_page, false, 0);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
 		}
@@ -1073,7 +1073,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	page_add_new_anon_rmap(new_page, vma, address, true);
+	page_add_new_anon_rmap(new_page, vma, address, true, HPAGE_PMD_ORDER);
 	mem_cgroup_commit_charge(new_page, memcg, false, true);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
diff --git a/mm/ksm.c b/mm/ksm.c
index dc1ec06b71a0..68f1d0f8be22 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1154,7 +1154,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	 */
 	if (!is_zero_pfn(page_to_pfn(kpage))) {
 		get_page(kpage);
-		page_add_anon_rmap(kpage, vma, addr, false);
+		page_add_anon_rmap(kpage, vma, addr, false, 0);
 		newpte = mk_pte(kpage, vma->vm_page_prot);
 	} else {
 		newpte = pte_mkspecial(pfn_pte(page_to_pfn(kpage),
@@ -1178,7 +1178,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	ptep_clear_flush(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, newpte);
 
-	page_remove_rmap(page, false);
+	page_remove_rmap(page, false, 0);
 	if (!page_mapped(page))
 		try_to_free_swap(page);
 	put_page(page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index af7f18b32389..ae3ff6a4da8c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2678,6 +2678,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 
 	__mod_memcg_state(head->mem_cgroup, MEMCG_RSS_HUGE, -HPAGE_PMD_NR);
 }
+
+void mem_cgroup_split_huge_pud_fixup(struct page *head)
+{
+	int i;
+
+	if (mem_cgroup_disabled())
+		return;
+
+	for (i = HPAGE_PMD_NR; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+		head[i].mem_cgroup = head->mem_cgroup;
+
+	/*__mod_memcg_state(head->mem_cgroup, MEMCG_RSS_HUGE, -HPAGE_PUD_NR);*/
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #ifdef CONFIG_MEMCG_SWAP
diff --git a/mm/memory.c b/mm/memory.c
index 3608b5436519..c875cc1a2600 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1088,7 +1088,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 					mark_page_accessed(page);
 			}
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
+			page_remove_rmap(page, false, 0);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(__tlb_remove_page(tlb, page))) {
@@ -1116,7 +1116,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
+			page_remove_rmap(page, false, 0);
 			put_page(page);
 			continue;
 		}
@@ -2300,7 +2300,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		 * thread doing COW.
 		 */
 		ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
-		page_add_new_anon_rmap(new_page, vma, vmf->address, false);
+		page_add_new_anon_rmap(new_page, vma, vmf->address, false, 0);
 		mem_cgroup_commit_charge(new_page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		/*
@@ -2333,7 +2333,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			 * mapcount is visible. So transitively, TLBs to
 			 * old page will be flushed before it can be reused.
 			 */
-			page_remove_rmap(old_page, false);
+			page_remove_rmap(old_page, false, 0);
 		}
 
 		/* Free the old page.. */
@@ -2816,11 +2816,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 	/* ksm created a completely new copy */
 	if (unlikely(page != swapcache && swapcache)) {
-		page_add_new_anon_rmap(page, vma, vmf->address, false);
+		page_add_new_anon_rmap(page, vma, vmf->address, false, 0);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	} else {
-		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
+		do_page_add_anon_rmap(page, vma, vmf->address, exclusive, 0);
 		mem_cgroup_commit_charge(page, memcg, true, false);
 		activate_page(page);
 	}
@@ -2967,7 +2967,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	}
 
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, vmf->address, false);
+	page_add_new_anon_rmap(page, vma, vmf->address, false, 0);
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, vma);
 setpte:
@@ -3241,7 +3241,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-		page_add_new_anon_rmap(page, vma, vmf->address, false);
+		page_add_new_anon_rmap(page, vma, vmf->address, false, 0);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	} else {
diff --git a/mm/migrate.c b/mm/migrate.c
index b8c79aa62134..f7e5d88210ee 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -268,7 +268,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 			set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 
 			if (PageAnon(new))
-				page_add_anon_rmap(new, vma, pvmw.address, false);
+				page_add_anon_rmap(new, vma, pvmw.address, false, 0);
 			else
 				page_add_file_rmap(new, false);
 		}
@@ -2067,7 +2067,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 	page_ref_unfreeze(page, 2);
 	mlock_migrate_page(new_page, page);
-	page_remove_rmap(page, true);
+	page_remove_rmap(page, true, HPAGE_PMD_ORDER);
 	set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED);
 
 	spin_unlock(ptl);
@@ -2297,7 +2297,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			 * drop page refcount. Page won't be freed, as we took
 			 * a reference just above.
 			 */
-			page_remove_rmap(page, false);
+			page_remove_rmap(page, false, 0);
 			put_page(page);
 
 			if (pte_present(pte))
@@ -2688,7 +2688,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	}
 
 	inc_mm_counter(mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, addr, false);
+	page_add_new_anon_rmap(page, vma, addr, false, 0);
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	if (!is_zone_device_page(page))
 		lru_cache_add_active_or_unevictable(page, vma);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a3b295ea7348..dbcccc022b30 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -626,6 +626,9 @@ void prep_compound_page(struct page *page, unsigned int order)
 		set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
+	if (order == HPAGE_PUD_ORDER)
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+			atomic_set(sub_compound_mapcount_ptr(&page[i], 1), -1);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
@@ -1001,6 +1004,13 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
 		 */
 		break;
 	default:
+		/* sub_compound_map_ptr store here */
+		if (compound_order(head_page) == HPAGE_PUD_ORDER &&
+			(page - head_page) % HPAGE_PMD_NR == 3) {
+			if (unlikely(atomic_read(&page->compound_mapcount) != -1))
+				bad_page(page, "nonzero sub_compound_mapcount", 0);
+			break;
+		}
 		if (page->mapping != TAIL_MAPPING) {
 			bad_page(page, "corrupted mapping in tail page", 0);
 			goto out;
@@ -1041,8 +1051,14 @@ static __always_inline bool free_pages_prepare(struct page *page,
 
 		VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
 
-		if (compound)
+		if (compound) {
 			ClearPageDoubleMap(page);
+			if (order == HPAGE_PUD_ORDER) {
+				ClearPagePUDDoubleMap(page);
+				for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+					ClearPageDoubleMap(&page[i]);
+			}
+		}
 		for (i = 1; i < (1 << order); i++) {
 			if (compound)
 				bad += free_tail_pages_check(page, page + i);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 0b79568fba1c..95af1d67f209 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -236,6 +236,17 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 }
 #endif
 
+#ifndef __HAVE_ARCH_PUDP_INVALIDATE
+pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		     pud_t *pudp)
+{
+	pud_t old = pudp_establish(vma, address, pudp, pud_mknotpresent(*pudp));
+
+	flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+	return old;
+}
+#endif
+
 #ifndef pmdp_collapse_flush
 pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 			  pmd_t *pmdp)
diff --git a/mm/rmap.c b/mm/rmap.c
index f69d81d4a956..79908cfc518a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1097,9 +1097,9 @@ static void __page_check_anon_rmap(struct page *page,
  * (but PageKsm is never downgraded to PageAnon).
  */
 void page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, bool compound)
+	struct vm_area_struct *vma, unsigned long address, bool compound, int order)
 {
-	do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0);
+	do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0, order);
 }
 
 /*
@@ -1108,7 +1108,7 @@ void page_add_anon_rmap(struct page *page,
  * Everybody else should continue to use page_add_anon_rmap above.
  */
 void do_page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, int flags)
+	struct vm_area_struct *vma, unsigned long address, int flags, int order)
 {
 	bool compound = flags & RMAP_COMPOUND;
 	bool first;
@@ -1117,7 +1117,18 @@ void do_page_add_anon_rmap(struct page *page,
 		atomic_t *mapcount;
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-		mapcount = compound_mapcount_ptr(page);
+		if (compound_order(page) == HPAGE_PUD_ORDER) {
+			if (order == HPAGE_PUD_ORDER) {
+				mapcount = compound_mapcount_ptr(page);
+			} else if (order == HPAGE_PMD_ORDER) {
+				VM_BUG_ON(!PMDPageInPUD(page));
+				mapcount = sub_compound_mapcount_ptr(page, 1);
+			} else
+				VM_BUG_ON(1);
+		} else if (compound_order(page) == HPAGE_PMD_ORDER) {
+			mapcount = compound_mapcount_ptr(page);
+		} else
+			VM_BUG_ON(1);
 		first = atomic_inc_and_test(mapcount);
 	} else {
 		first = atomic_inc_and_test(&page->_mapcount);
@@ -1132,7 +1143,7 @@ void do_page_add_anon_rmap(struct page *page,
 		 * disabled.
 		 */
 		if (compound) {
-			if (nr == HPAGE_PMD_NR)
+			if (order == HPAGE_PMD_ORDER)
 				__inc_node_page_state(page, NR_ANON_THPS);
 			else
 				__inc_node_page_state(page, NR_ANON_THPS_PUD);
@@ -1164,7 +1175,7 @@ void do_page_add_anon_rmap(struct page *page,
  * Page does not have to be locked.
  */
 void page_add_new_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, bool compound)
+	struct vm_area_struct *vma, unsigned long address, bool compound, int order)
 {
 	int nr = compound ? hpage_nr_pages(page) : 1;
 
@@ -1174,10 +1185,15 @@ void page_add_new_anon_rmap(struct page *page,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		/* increment count (starts at -1) */
 		atomic_set(compound_mapcount_ptr(page), 0);
-		if (nr == HPAGE_PMD_NR)
-			__inc_node_page_state(page, NR_ANON_THPS);
-		else
+		if (order == HPAGE_PUD_ORDER) {
+			VM_BUG_ON(compound_order(page) != HPAGE_PUD_ORDER);
+			/* Anon THP always mapped first with PMD */
 			__inc_node_page_state(page, NR_ANON_THPS_PUD);
+		} else if (order == HPAGE_PMD_ORDER) {
+			VM_BUG_ON(compound_order(page) != HPAGE_PMD_ORDER);
+			__inc_node_page_state(page, NR_ANON_THPS);
+		} else
+			VM_BUG_ON(1);
 	} else {
 		/* Anon THP always mapped first with PMD */
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1268,12 +1284,40 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 	unlock_page_memcg(page);
 }
 
-static void page_remove_anon_compound_rmap(struct page *page)
+static void page_remove_anon_compound_rmap(struct page *page, int order)
 {
-	int i, nr;
-
-	if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
-		return;
+	int i, nr = 0;
+	struct page *head = compound_head(page);
+
+	if (compound_order(head) == HPAGE_PUD_ORDER) {
+		if (order == HPAGE_PMD_ORDER) {
+			VM_BUG_ON(!PMDPageInPUD(page));
+			if (atomic_add_negative(-1, sub_compound_mapcount_ptr(page, 1))) {
+				if (TestClearPageDoubleMap(page)) {
+					/*
+					 * Subpages can be mapped with PTEs too. Check how many of
+					 * themi are still mapped.
+					 */
+					for (i = 0; i < hpage_nr_pages(head); i++) {
+						if (atomic_add_negative(-1, &head[i]._mapcount))
+							nr++;
+					}
+				}
+				__dec_node_page_state(page, NR_ANON_THPS);
+			}
+			nr += HPAGE_PMD_NR;
+			__mod_node_page_state(page_pgdat(head), NR_ANON_MAPPED, -nr);
+			return;
+		} else {
+			VM_BUG_ON(order != HPAGE_PUD_ORDER);
+			if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
+				return;
+		}
+	} else if (compound_order(head) == HPAGE_PMD_ORDER) {
+		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
+			return;
+	} else
+		VM_BUG_ON_PAGE(1, page);
 
 	/* Hugepages are not counted in NR_ANON_PAGES for now. */
 	if (unlikely(PageHuge(page)))
@@ -1282,30 +1326,44 @@ static void page_remove_anon_compound_rmap(struct page *page)
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		return;
 
-	if (hpage_nr_pages(page) == HPAGE_PMD_NR)
+	if (order == HPAGE_PMD_ORDER)
 		__dec_node_page_state(page, NR_ANON_THPS);
-	else
+	else if (order == HPAGE_PUD_ORDER)
 		__dec_node_page_state(page, NR_ANON_THPS_PUD);
+	else
+		VM_BUG_ON(1);
 
-	if (TestClearPageDoubleMap(page)) {
+	/* PMD-mapped PUD THP is handled above */
+	if (TestClearPagePUDDoubleMap(head)) {
+		VM_BUG_ON(!(compound_order(head) == HPAGE_PUD_ORDER || head == page));
+		/*
+		 * Subpages can be mapped with PMDs too. Check how many of
+		 * themi are still mapped.
+		 */
+		for (i = 0, nr = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR) {
+			if (atomic_add_negative(-1, sub_compound_mapcount_ptr(&head[i], 1)))
+				nr += HPAGE_PMD_NR;
+		}
+	} else if (TestClearPageDoubleMap(head)) {
+		VM_BUG_ON(compound_order(head) != HPAGE_PMD_ORDER);
 		/*
 		 * Subpages can be mapped with PTEs too. Check how many of
 		 * themi are still mapped.
 		 */
-		for (i = 0, nr = 0; i < hpage_nr_pages(page); i++) {
-			if (atomic_add_negative(-1, &page[i]._mapcount))
+		for (i = 0, nr = 0; i < hpage_nr_pages(head); i++) {
+			if (atomic_add_negative(-1, &head[i]._mapcount))
 				nr++;
 		}
 	} else {
-		nr = hpage_nr_pages(page);
+		nr = hpage_nr_pages(head);
 	}
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
 
 	if (nr) {
-		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
-		deferred_split_huge_page(page);
+		__mod_node_page_state(page_pgdat(head), NR_ANON_MAPPED, -nr);
+		deferred_split_huge_page(head);
 	}
 }
 
@@ -1316,13 +1374,13 @@ static void page_remove_anon_compound_rmap(struct page *page)
  *
  * The caller needs to hold the pte lock.
  */
-void page_remove_rmap(struct page *page, bool compound)
+void page_remove_rmap(struct page *page, bool compound, int order)
 {
 	if (!PageAnon(page))
 		return page_remove_file_rmap(page, compound);
 
 	if (compound)
-		return page_remove_anon_compound_rmap(page);
+		return page_remove_anon_compound_rmap(page, order);
 
 	/* page still mapped by someone else? */
 	if (!atomic_add_negative(-1, &page->_mapcount))
@@ -1672,7 +1730,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 *
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
-		page_remove_rmap(subpage, PageHuge(page));
+		page_remove_rmap(subpage, PageHuge(page), 0);
 		put_page(page);
 	}
 
diff --git a/mm/swap.c b/mm/swap.c
index 4929bc1be60e..79de59875280 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -851,6 +851,44 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 	if (!PageUnevictable(page))
 		update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
 }
+
+/* used by __split_pud_huge_page_tail() */
+void lru_add_pud_page_tail(struct page *page, struct page *page_tail,
+		       struct lruvec *lruvec, struct list_head *list)
+{
+	const int file = 0;
+
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+	VM_BUG_ON(NR_CPUS != 1 &&
+		  !spin_is_locked(&lruvec_pgdat(lruvec)->lru_lock));
+
+	if (!list)
+		SetPageLRU(page_tail);
+
+	if (likely(PageLRU(page)))
+		list_add_tail(&page_tail->lru, &page->lru);
+	else if (list) {
+		/* page reclaim is reclaiming a huge page */
+		get_page(page_tail);
+		list_add_tail(&page_tail->lru, list);
+	} else {
+		struct list_head *list_head;
+		/*
+		 * Head page has not yet been counted, as an hpage,
+		 * so we must account for each subpage individually.
+		 *
+		 * Use the standard add function to put page_tail on the list,
+		 * but then correct its position so they all end up in order.
+		 */
+		add_page_to_lru_list(page_tail, lruvec, page_lru(page_tail));
+		list_head = page_tail->lru.prev;
+		list_move_tail(&page_tail->lru, list_head);
+	}
+
+	if (!PageUnevictable(page))
+		update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
diff --git a/mm/swapfile.c b/mm/swapfile.c
index dbac1d49469d..742caaea2aa5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1775,10 +1775,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	if (page == swapcache) {
-		page_add_anon_rmap(page, vma, addr, false);
+		page_add_anon_rmap(page, vma, addr, false, 0);
 		mem_cgroup_commit_charge(page, memcg, true, false);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, addr, false);
+		page_add_new_anon_rmap(page, vma, addr, false, 0);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d59b5a73dfb3..e49537f6000e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -90,7 +90,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release_uncharge_unlock;
 
 	inc_mm_counter(dst_mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
+	page_add_new_anon_rmap(page, dst_vma, dst_addr, false, 0);
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, dst_vma);
 
diff --git a/mm/util.c b/mm/util.c
index 1ea055138043..1b1b6dd386d1 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -536,8 +536,15 @@ struct address_space *page_mapping_file(struct page *page)
 int __page_mapcount(struct page *page)
 {
 	int ret;
+	struct page *head = compound_head(page);
 
 	ret = atomic_read(&page->_mapcount) + 1;
+	if (compound_order(head) == HPAGE_PUD_ORDER) {
+		struct page *sub_compound_page = head +
+			(((page - head) / HPAGE_PMD_NR) * HPAGE_PMD_NR);
+
+		ret += sub_compound_mapcount(sub_compound_page);
+	}
 	/*
 	 * For file THP page->_mapcount contains total number of mapping
 	 * of the page: no need to look into compound_mapcount.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 25a88693e417..1d185cf748a6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1263,6 +1263,10 @@ const char * const vmstat_text[] = {
 	"thp_fault_alloc_pud",
 	"thp_fault_fallback_pud",
 	"thp_split_pud",
+	"thp_split_pud_page",
+	"thp_split_pud_page_failed",
+	"thp_zero_pud_page_alloc",
+	"thp_zero_pud_page_alloc_failed",
 #endif
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 16/31] mm: thp: check compound_mapcount of PMD-mapped PUD THPs at free time.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (14 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 15/31] mm: thp: add 1GB THP split_huge_pud_page() function Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 17/31] mm: thp: split properly PMD-mapped PUD THP to PTE-mapped PUD THP Zi Yan
                   ` (15 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

PMD mappings on a PUD THP should be zero when the page is freed.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dbcccc022b30..b87a2ca0a97c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1007,8 +1007,10 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
 		/* sub_compound_map_ptr store here */
 		if (compound_order(head_page) == HPAGE_PUD_ORDER &&
 			(page - head_page) % HPAGE_PMD_NR == 3) {
-			if (unlikely(atomic_read(&page->compound_mapcount) != -1))
+			if (unlikely(atomic_read(&page->compound_mapcount) != -1)) {
+				pr_err("sub_compound_mapcount: %d\n", atomic_read(&page->compound_mapcount) + 1);
 				bad_page(page, "nonzero sub_compound_mapcount", 0);
+			}
 			break;
 		}
 		if (page->mapping != TAIL_MAPPING) {
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 17/31] mm: thp: split properly PMD-mapped PUD THP to PTE-mapped PUD THP.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (15 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 16/31] mm: thp: check compound_mapcount of PMD-mapped PUD THPs at free time Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 18/31] mm: page_vma_walk: teach it about PMD-mapped " Zi Yan
                   ` (14 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Page count increase needs to goto the head of the PUD page.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5f83f4c5eac7..bbdbc9ae06bf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3198,7 +3198,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long haddr, bool freeze)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct page *page;
+	struct page *page, *head;
 	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
 	bool young, write, soft_dirty, pmd_migration = false;
@@ -3285,7 +3285,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		soft_dirty = pmd_soft_dirty(old_pmd);
 	}
 	VM_BUG_ON_PAGE(!page_count(page), page);
-	page_ref_add(page, HPAGE_PMD_NR - 1);
+	head = compound_head(page);
+	page_ref_add(head, HPAGE_PMD_NR - 1);
 
 	/*
 	 * Withdraw the table only after we mark the pmd entry invalid.
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 18/31] mm: page_vma_walk: teach it about PMD-mapped PUD THP.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (16 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 17/31] mm: thp: split properly PMD-mapped PUD THP to PTE-mapped PUD THP Zi Yan
@ 2019-02-15 22:08 ` " Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 19/31] mm: thp: 1GB THP support in try_to_unmap() Zi Yan
                   ` (13 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

We now have PMD-mapped PUD THP and PTE-mapped PUD THP, page_vma_walk
should handle them properly.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_vma_mapped.c | 116 ++++++++++++++++++++++++++++++-------------
 1 file changed, 82 insertions(+), 34 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index a473553aa9a5..fde47dae0b9c 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -52,6 +52,22 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
 	return true;
 }
 
+static bool map_pmd(struct page_vma_mapped_walk *pvmw)
+{
+	pmd_t pmde;
+
+	pvmw->pmd = pmd_offset(pvmw->pud, pvmw->address);
+	pmde = READ_ONCE(*pvmw->pmd);
+	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
+		pvmw->ptl = pmd_lock(pvmw->vma->vm_mm, pvmw->pmd);
+		return true;
+	} else if (!pmd_present(pmde))
+		return false;
+
+	pvmw->ptl = pmd_lock(pvmw->vma->vm_mm, pvmw->pmd);
+	return true;
+}
+
 static inline bool pfn_in_hpage(struct page *hpage, unsigned long pfn)
 {
 	unsigned long hpage_pfn = page_to_pfn(hpage);
@@ -111,6 +127,38 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 	return pfn_in_hpage(pvmw->page, pfn);
 }
 
+/* 0: not mapped, 1: pmd_page, 2: pmd  */
+static int check_pmd(struct page_vma_mapped_walk *pvmw)
+{
+	unsigned long pfn;
+
+	if (likely(pmd_trans_huge(*pvmw->pmd))) {
+		if (pvmw->flags & PVMW_MIGRATION)
+			return 0;
+		pfn = pmd_pfn(*pvmw->pmd);
+		if (!pfn_in_hpage(pvmw->page, pfn))
+			return 0;
+		return 1;
+	} else if (!pmd_present(*pvmw->pmd)) {
+		if (thp_migration_supported()) {
+			if (!(pvmw->flags & PVMW_MIGRATION))
+				return 0;
+			if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
+				swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
+
+				pfn = migration_entry_to_pfn(entry);
+				if (!pfn_in_hpage(pvmw->page, pfn))
+					return 0;
+				return 1;
+			}
+		}
+		return 0;
+	}
+	/* THP pmd was split under us: handle on pte level */
+	spin_unlock(pvmw->ptl);
+	pvmw->ptl = NULL;
+	return 2;
+}
 /**
  * page_vma_mapped_walk - check if @pvmw->page is mapped in @pvmw->vma at
  * @pvmw->address
@@ -142,14 +190,14 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t pude;
-	pmd_t pmde;
+	int pmd_res;
 
 	if (!pvmw->pte && !pvmw->pmd && pvmw->pud)
 		return not_found(pvmw);
 
 	/* The only possible pmd mapping has been handled on last iteration */
 	if (pvmw->pmd && !pvmw->pte)
-		return not_found(pvmw);
+		goto next_pmd;
 
 	if (pvmw->pte)
 		goto next_pte;
@@ -198,43 +246,43 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	} else if (!pud_present(pude))
 		return false;
 
-	pvmw->pmd = pmd_offset(pvmw->pud, pvmw->address);
-	/*
-	 * Make sure the pmd value isn't cached in a register by the
-	 * compiler and used as a stale value after we've observed a
-	 * subsequent update.
-	 */
-	pmde = READ_ONCE(*pvmw->pmd);
-	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
-		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
-		if (likely(pmd_trans_huge(*pvmw->pmd))) {
-			if (pvmw->flags & PVMW_MIGRATION)
-				return not_found(pvmw);
-			if (pmd_page(*pvmw->pmd) != page)
-				return not_found(pvmw);
+	if (!map_pmd(pvmw))
+		goto next_pmd;
+	/* pmd locked after map_pmd  */
+	while (1) {
+		pmd_res = check_pmd(pvmw);
+		if (pmd_res == 1) /* pmd_page */
 			return true;
-		} else if (!pmd_present(*pvmw->pmd)) {
-			if (thp_migration_supported()) {
-				if (!(pvmw->flags & PVMW_MIGRATION))
-					return not_found(pvmw);
-				if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
-					swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
-
-					if (migration_entry_to_page(entry) != page)
-						return not_found(pvmw);
-					return true;
+		else if (pmd_res == 2) /* pmd entry  */
+			goto pte_level;
+next_pmd:
+		/* Only PMD-mapped PUD THP has next pmd  */
+		if (!(PageTransHuge(pvmw->page) && compound_order(pvmw->page) == HPAGE_PUD_ORDER))
+			return not_found(pvmw);
+		do {
+			pvmw->address += HPAGE_PMD_SIZE;
+			if (pvmw->address >= pvmw->vma->vm_end ||
+			    pvmw->address >=
+					__vma_address(pvmw->page, pvmw->vma) +
+					hpage_nr_pages(pvmw->page) * PAGE_SIZE)
+				return not_found(pvmw);
+			/* Did we cross page table boundary? */
+			if (pvmw->address % PUD_SIZE == 0) {
+				if (pvmw->ptl) {
+					spin_unlock(pvmw->ptl);
+					pvmw->ptl = NULL;
 				}
+				goto restart;
+			} else {
+				pvmw->pmd++;
 			}
-			return not_found(pvmw);
-		} else {
-			/* THP pmd was split under us: handle on pte level */
-			spin_unlock(pvmw->ptl);
-			pvmw->ptl = NULL;
-		}
-	} else if (!pmd_present(pmde)) {
-		return false;
+		} while (pmd_none(*pvmw->pmd));
+
+		if (!pvmw->ptl)
+			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
 	}
 
+pte_level:
 	if (!map_pte(pvmw))
 		goto next_pte;
 	while (1) {
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 19/31] mm: thp: 1GB THP support in try_to_unmap().
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (17 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 18/31] mm: page_vma_walk: teach it about PMD-mapped " Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 20/31] mm: thp: split 1GB THPs at page reclaim Zi Yan
                   ` (12 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Unmap different subpages in different sized THPs properly in the
try_to_unmap() function.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/migrate.c |   2 +-
 mm/rmap.c    | 140 +++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 103 insertions(+), 39 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index f7e5d88210ee..7deb64d75adb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -223,7 +223,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 		/* PMD-mapped THP migration entry */
-		if (!pvmw.pte) {
+		if (!pvmw.pte && pvmw.pmd) {
 			VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);
 			remove_migration_pmd(&pvmw, new);
 			continue;
diff --git a/mm/rmap.c b/mm/rmap.c
index 79908cfc518a..39f446a6775d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1031,7 +1031,7 @@ void page_move_anon_rmap(struct page *page, struct vm_area_struct *vma)
  * __page_set_anon_rmap - set up new anonymous rmap
  * @page:	Page or Hugepage to add to rmap
  * @vma:	VM area to add page to.
- * @address:	User virtual address of the mapping	
+ * @address:	User virtual address of the mapping
  * @exclusive:	the page is exclusively owned by the current process
  */
 static void __page_set_anon_rmap(struct page *page,
@@ -1423,7 +1423,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		.address = address,
 	};
 	pte_t pteval;
-	struct page *subpage;
+	pmd_t pmdval;
+	pud_t pudval;
+	struct page *subpage = NULL;
 	bool ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)arg;
@@ -1436,6 +1438,11 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	    is_zone_device_page(page) && !is_device_private_page(page))
 		return true;
 
+	if (flags & TTU_SPLIT_HUGE_PUD) {
+		split_huge_pud_address(vma, address,
+				flags & TTU_SPLIT_FREEZE, page);
+	}
+
 	if (flags & TTU_SPLIT_HUGE_PMD) {
 		split_huge_pmd_address(vma, address,
 				flags & TTU_SPLIT_FREEZE, page);
@@ -1465,7 +1472,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	while (page_vma_mapped_walk(&pvmw)) {
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 		/* PMD-mapped THP migration entry */
-		if (!pvmw.pte && (flags & TTU_MIGRATION)) {
+		if (!pvmw.pte && pvmw.pmd && (flags & TTU_MIGRATION)) {
 			VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);
 
 			set_pmd_migration_entry(&pvmw, page);
@@ -1497,9 +1504,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		}
 
 		/* Unexpected PMD-mapped THP? */
-		VM_BUG_ON_PAGE(!pvmw.pte, page);
 
-		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+		if (pvmw.pte)
+			subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+		else if (!pvmw.pte && pvmw.pmd)
+			subpage = page - page_to_pfn(page) + pmd_pfn(*pvmw.pmd);
+		else if (!pvmw.pte && !pvmw.pmd && pvmw.pud)
+			subpage = page - page_to_pfn(page) + pud_pfn(*pvmw.pud);
+		VM_BUG_ON(!subpage);
 		address = pvmw.address;
 
 		if (PageHuge(page)) {
@@ -1556,16 +1568,26 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		}
 
 		if (!(flags & TTU_IGNORE_ACCESS)) {
-			if (ptep_clear_flush_young_notify(vma, address,
-						pvmw.pte)) {
-				ret = false;
-				page_vma_mapped_walk_done(&pvmw);
-				break;
+			if ((pvmw.pte &&
+				 ptep_clear_flush_young_notify(vma, address, pvmw.pte)) ||
+				((!pvmw.pte && pvmw.pmd) &&
+				 pmdp_clear_flush_young_notify(vma, address, pvmw.pmd)) ||
+				((!pvmw.pte && !pvmw.pmd && pvmw.pud) &&
+				 pudp_clear_flush_young_notify(vma, address, pvmw.pud))
+				) {
+					ret = false;
+					page_vma_mapped_walk_done(&pvmw);
+					break;
 			}
 		}
 
 		/* Nuke the page table entry. */
-		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+		if (pvmw.pte)
+			flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+		else if (!pvmw.pte && pvmw.pmd)
+			flush_cache_page(vma, address, pmd_pfn(*pvmw.pmd));
+		else if (!pvmw.pte && !pvmw.pmd && pvmw.pud)
+			flush_cache_page(vma, address, pud_pfn(*pvmw.pud));
 		if (should_defer_flush(mm, flags)) {
 			/*
 			 * We clear the PTE but do not flush so potentially
@@ -1575,16 +1597,34 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 * transition on a cached TLB entry is written through
 			 * and traps if the PTE is unmapped.
 			 */
-			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+			if (pvmw.pte) {
+				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+
+				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+			} else if (!pvmw.pte && pvmw.pmd) {
+				pmdval = pmdp_huge_get_and_clear(mm, address, pvmw.pmd);
 
-			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+				set_tlb_ubc_flush_pending(mm, pmd_dirty(pmdval));
+			} else if (!pvmw.pte && !pvmw.pmd && pvmw.pud) {
+				pudval = pudp_huge_get_and_clear(mm, address, pvmw.pud);
+
+				set_tlb_ubc_flush_pending(mm, pud_dirty(pudval));
+			}
 		} else {
-			pteval = ptep_clear_flush(vma, address, pvmw.pte);
+			if (pvmw.pte)
+				pteval = ptep_clear_flush(vma, address, pvmw.pte);
+			else if (!pvmw.pte && pvmw.pmd)
+				pmdval = pmdp_huge_clear_flush(vma, address, pvmw.pmd);
+			else if (!pvmw.pte && !pvmw.pmd && pvmw.pud)
+				pudval = pudp_huge_clear_flush(vma, address, pvmw.pud);
 		}
 
 		/* Move the dirty bit to the page. Now the pte is gone. */
-		if (pte_dirty(pteval))
-			set_page_dirty(page);
+			if ((pvmw.pte && pte_dirty(pteval)) ||
+				((!pvmw.pte && pvmw.pmd) && pmd_dirty(pmdval)) ||
+				((!pvmw.pte && !pvmw.pmd && pvmw.pud) && pud_dirty(pudval))
+				)
+				set_page_dirty(page);
 
 		/* Update high watermark before we lower rss */
 		update_hiwater_rss(mm);
@@ -1620,33 +1660,57 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
 				(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
 			swp_entry_t entry;
-			pte_t swp_pte;
 
-			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
-				set_pte_at(mm, address, pvmw.pte, pteval);
-				ret = false;
-				page_vma_mapped_walk_done(&pvmw);
-				break;
-			}
+			if (pvmw.pte) {
+				pte_t swp_pte;
 
-			/*
-			 * Store the pfn of the page in a special migration
-			 * pte. do_swap_page() will wait until the migration
-			 * pte is removed and then restart fault handling.
-			 */
-			entry = make_migration_entry(subpage,
-					pte_write(pteval));
-			swp_pte = swp_entry_to_pte(entry);
-			if (pte_soft_dirty(pteval))
-				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-			set_pte_at(mm, address, pvmw.pte, swp_pte);
-			/*
-			 * No need to invalidate here it will synchronize on
-			 * against the special swap migration pte.
-			 */
+				if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+					set_pte_at(mm, address, pvmw.pte, pteval);
+					ret = false;
+					page_vma_mapped_walk_done(&pvmw);
+					break;
+				}
+
+				/*
+				 * Store the pfn of the page in a special migration
+				 * pte. do_swap_page() will wait until the migration
+				 * pte is removed and then restart fault handling.
+				 */
+				entry = make_migration_entry(subpage,
+						pte_write(pteval));
+				swp_pte = swp_entry_to_pte(entry);
+				if (pte_soft_dirty(pteval))
+					swp_pte = pte_swp_mksoft_dirty(swp_pte);
+				set_pte_at(mm, address, pvmw.pte, swp_pte);
+				/*
+				 * No need to invalidate here it will synchronize on
+				 * against the special swap migration pte.
+				 */
+			} else if (!pvmw.pte && pvmw.pmd) {
+				pmd_t swp_pmd;
+				/*
+				 * Store the pfn of the page in a special migration
+				 * pte. do_swap_page() will wait until the migration
+				 * pte is removed and then restart fault handling.
+				 */
+				entry = make_migration_entry(subpage,
+						pmd_write(pmdval));
+				swp_pmd = swp_entry_to_pmd(entry);
+				if (pmd_soft_dirty(pmdval))
+					swp_pmd = pmd_swp_mksoft_dirty(swp_pmd);
+				set_pmd_at(mm, address, pvmw.pmd, swp_pmd);
+				/*
+				 * No need to invalidate here it will synchronize on
+				 * against the special swap migration pte.
+				 */
+			} else if (!pvmw.pte && !pvmw.pmd && pvmw.pud) {
+				VM_BUG_ON(1);
+			}
 		} else if (PageAnon(page)) {
 			swp_entry_t entry = { .val = page_private(subpage) };
 			pte_t swp_pte;
+
+			VM_BUG_ON(!pvmw.pte);
 			/*
 			 * Store the swap location in the pte.
 			 * See handle_pte_fault() ...
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 20/31] mm: thp: split 1GB THPs at page reclaim.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (18 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 19/31] mm: thp: 1GB THP support in try_to_unmap() Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 21/31] mm: thp: 1GB zero page shrinker Zi Yan
                   ` (11 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

We cannot swap 1GB THPs, so split them before swap them out.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/swap_slots.c |  2 ++
 mm/vmscan.c     | 55 ++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 63a7b4563a57..797c804ff905 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -315,6 +315,8 @@ swp_entry_t get_swap_page(struct page *page)
 	entry.val = 0;
 
 	if (PageTransHuge(page)) {
+		if (compound_order(page) == HPAGE_PUD_ORDER)
+			return entry;
 		if (IS_ENABLED(CONFIG_THP_SWAP))
 			get_swap_pages(1, &entry, HPAGE_PMD_NR);
 		goto out;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a714c4f800e9..a2a91c1d3dae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1288,25 +1288,47 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				if (!(sc->gfp_mask & __GFP_IO))
 					goto keep_locked;
 				if (PageTransHuge(page)) {
-					/* cannot split THP, skip it */
-					if (!can_split_huge_page(page, NULL))
-						goto activate_locked;
-					/*
-					 * Split pages without a PMD map right
-					 * away. Chances are some or all of the
-					 * tail pages can be freed without IO.
-					 */
-					if (!compound_mapcount(page) &&
-					    split_huge_page_to_list(page,
-								    page_list))
+					if (compound_order(page) == HPAGE_PUD_ORDER) {
+						/* cannot split THP, skip it */
+						if (!can_split_huge_pud_page(page, NULL))
+							goto activate_locked;
+						/*
+						 * Split pages without a PMD map right
+						 * away. Chances are some or all of the
+						 * tail pages can be freed without IO.
+						 */
+						if (!compound_mapcount(page) &&
+							split_huge_pud_page_to_list(page,
+										page_list))
+							goto activate_locked;
+					}
+					if (compound_order(page) == HPAGE_PMD_ORDER) {
+						/* cannot split THP, skip it */
+						if (!can_split_huge_page(page, NULL))
+							goto activate_locked;
+						/*
+						 * Split pages without a PMD map right
+						 * away. Chances are some or all of the
+						 * tail pages can be freed without IO.
+						 */
+						if (!compound_mapcount(page) &&
+							split_huge_page_to_list(page,
+										page_list))
+							goto activate_locked;
+					}
+				}
+				if (compound_order(page) == HPAGE_PUD_ORDER) {
+					if (split_huge_pud_page_to_list(page,
+									page_list))
 						goto activate_locked;
 				}
 				if (!add_to_swap(page)) {
 					if (!PageTransHuge(page))
 						goto activate_locked;
 					/* Fallback to swap normal pages */
+					VM_BUG_ON_PAGE(compound_order(page) != HPAGE_PMD_ORDER, page);
 					if (split_huge_page_to_list(page,
-								    page_list))
+									page_list))
 						goto activate_locked;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 					count_vm_event(THP_SWPOUT_FALLBACK);
@@ -1321,6 +1343,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				mapping = page_mapping(page);
 			}
 		} else if (unlikely(PageTransHuge(page))) {
+			VM_BUG_ON_PAGE(compound_order(page) != HPAGE_PMD_ORDER, page);
 			/* Split file THP */
 			if (split_huge_page_to_list(page, page_list))
 				goto keep_locked;
@@ -1333,8 +1356,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		if (page_mapped(page)) {
 			enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH;
 
-			if (unlikely(PageTransHuge(page)))
-				flags |= TTU_SPLIT_HUGE_PMD;
+			if (unlikely(PageTransHuge(page))) {
+				if (compound_order(page) == HPAGE_PMD_ORDER)
+					flags |= TTU_SPLIT_HUGE_PMD;
+				else if (compound_order(page) == HPAGE_PUD_ORDER)
+					flags |= TTU_SPLIT_HUGE_PUD;
+			}
 			if (!try_to_unmap(page, flags)) {
 				nr_unmap_fail++;
 				goto activate_locked;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 21/31] mm: thp: 1GB zero page shrinker.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (19 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 20/31] mm: thp: split 1GB THPs at page reclaim Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 22/31] mm: thp: 1GB THP follow_p*d_page() support Zi Yan
                   ` (10 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Remove 1GB zero page when we are under memory pressure.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bbdbc9ae06bf..41adc103ead1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -207,6 +207,32 @@ static struct shrinker huge_zero_page_shrinker = {
 	.seeks = DEFAULT_SEEKS,
 };
 
+static unsigned long shrink_huge_pud_zero_page_count(struct shrinker *shrink,
+					struct shrink_control *sc)
+{
+	/* we can free zero page only if last reference remains */
+	return atomic_read(&huge_pud_zero_refcount) == 1 ? HPAGE_PUD_NR : 0;
+}
+
+static unsigned long shrink_huge_pud_zero_page_scan(struct shrinker *shrink,
+				       struct shrink_control *sc)
+{
+	if (atomic_cmpxchg(&huge_pud_zero_refcount, 1, 0) == 1) {
+		struct page *zero_page = xchg(&huge_pud_zero_page, NULL);
+		BUG_ON(zero_page == NULL);
+		__free_pages(zero_page, compound_order(zero_page));
+		return HPAGE_PUD_NR;
+	}
+
+	return 0;
+}
+
+static struct shrinker huge_pud_zero_page_shrinker = {
+	.count_objects = shrink_huge_pud_zero_page_count,
+	.scan_objects = shrink_huge_pud_zero_page_scan,
+	.seeks = DEFAULT_SEEKS,
+};
+
 #ifdef CONFIG_SYSFS
 static ssize_t enabled_show(struct kobject *kobj,
 			    struct kobj_attribute *attr, char *buf)
@@ -474,6 +500,9 @@ static int __init hugepage_init(void)
 	err = register_shrinker(&huge_zero_page_shrinker);
 	if (err)
 		goto err_hzp_shrinker;
+	err = register_shrinker(&huge_pud_zero_page_shrinker);
+	if (err)
+		goto err_hpzp_shrinker;
 	err = register_shrinker(&deferred_split_shrinker);
 	if (err)
 		goto err_split_shrinker;
@@ -496,6 +525,8 @@ static int __init hugepage_init(void)
 err_khugepaged:
 	unregister_shrinker(&deferred_split_shrinker);
 err_split_shrinker:
+	unregister_shrinker(&huge_pud_zero_page_shrinker);
+err_hpzp_shrinker:
 	unregister_shrinker(&huge_zero_page_shrinker);
 err_hzp_shrinker:
 	khugepaged_destroy();
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 22/31] mm: thp: 1GB THP follow_p*d_page() support.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (20 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 21/31] mm: thp: 1GB zero page shrinker Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 23/31] mm: support 1GB THP pagemap support Zi Yan
                   ` (9 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Add follow_page support for 1GB THPs.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/huge_mm.h | 11 +++++++
 mm/gup.c                | 60 ++++++++++++++++++++++++++++++++-
 mm/huge_memory.c        | 73 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 142 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index bd5cc5e65de8..b1acada9ce8c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -20,6 +20,10 @@ extern int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 extern void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
 extern int do_huge_pud_anonymous_page(struct vm_fault *vmf);
 extern int do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud);
+extern struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
+					  unsigned long addr,
+					  pud_t *pud,
+					  unsigned int flags);
 #else
 static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 {
@@ -32,6 +36,13 @@ extern int do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
 {
 	return VM_FAULT_FALLBACK;
 }
+struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
+					  unsigned long addr,
+					  pud_t *pud,
+					  unsigned int flags)
+{
+	return NULL;
+}
 #endif
 
 extern vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
diff --git a/mm/gup.c b/mm/gup.c
index 05acd7e2eb22..0ad0509b03fc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -348,10 +348,68 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
 		if (page)
 			return page;
 	}
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	if (likely(!pud_trans_huge(*pud))) {
+		if (unlikely(pud_bad(*pud)))
+			return no_page_table(vma, flags);
+		return follow_pmd_mask(vma, address, pud, flags, ctx);
+	}
+
+	ptl = pud_lock(mm, pud);
+
+	if (unlikely(!pud_trans_huge(*pud))) {
+		spin_unlock(ptl);
+		if (unlikely(pud_bad(*pud)))
+			return no_page_table(vma, flags);
+		return follow_pmd_mask(vma, address, pud, flags, ctx);
+	}
+
+	if (flags & FOLL_SPLIT) {
+		int ret;
+		pmd_t *pmd = NULL;
+
+		page = pud_page(*pud);
+		if (is_huge_zero_page(page)) {
+
+			spin_unlock(ptl);
+			ret = 0;
+			split_huge_pud(vma, pud, address);
+			pmd = pmd_offset(pud, address);
+			split_huge_pmd(vma, pmd, address);
+			if (pmd_trans_unstable(pmd))
+				ret = -EBUSY;
+		} else {
+			get_page(page);
+			spin_unlock(ptl);
+			lock_page(page);
+			ret = split_huge_pud_page(page);
+			if (!ret)
+				ret = split_huge_page(page);
+			else {
+				unlock_page(page);
+				put_page(page);
+				goto out;
+			}
+			unlock_page(page);
+			put_page(page);
+			if (pud_none(*pud))
+				return no_page_table(vma, flags);
+			pmd = pmd_offset(pud, address);
+		}
+out:
+		return ret ? ERR_PTR(ret) :
+			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+	}
+	page = follow_trans_huge_pud(vma, address, pud, flags);
+	spin_unlock(ptl);
+	ctx->page_mask = HPAGE_PUD_NR - 1;
+	return page;
+#else
 	if (unlikely(pud_bad(*pud)))
 		return no_page_table(vma, flags);
-
 	return follow_pmd_mask(vma, address, pud, flags, ctx);
+#endif
 }
 
 static struct page *follow_p4d_mask(struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 41adc103ead1..191261771452 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1309,6 +1309,77 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	return page;
 }
 
+/*
+ * FOLL_FORCE can write to even unwritable pmd's, but only
+ * after we've gone through a COW cycle and they are dirty.
+ */
+static inline bool can_follow_write_pud(pud_t pud, unsigned int flags)
+{
+	return pud_write(pud) ||
+	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pud_dirty(pud));
+}
+
+struct page *follow_trans_huge_pud(struct vm_area_struct *vma,
+				   unsigned long addr,
+				   pud_t *pud,
+				   unsigned int flags)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page = NULL;
+
+	assert_spin_locked(pud_lockptr(mm, pud));
+
+	if (flags & FOLL_WRITE && !can_follow_write_pud(*pud, flags))
+		goto out;
+
+	/* Avoid dumping huge zero page */
+	if ((flags & FOLL_DUMP) && is_huge_zero_pud(*pud))
+		return ERR_PTR(-EFAULT);
+
+	/* Full NUMA hinting faults to serialise migration in fault paths */
+	/*&& pud_protnone(*pmd)*/
+	if ((flags & FOLL_NUMA))
+		goto out;
+
+	page = pud_page(*pud);
+	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+	if (flags & FOLL_TOUCH)
+		touch_pud(vma, addr, pud, flags);
+	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
+		/*
+		 * We don't mlock() pte-mapped THPs. This way we can avoid
+		 * leaking mlocked pages into non-VM_LOCKED VMAs.
+		 *
+		 * For anon THP:
+		 *
+		 * We do the same thing as PMD-level THP.
+		 *
+		 * For file THP:
+		 *
+		 * No support yet.
+		 *
+		 */
+
+		if (PageAnon(page) && compound_mapcount(page) != 1)
+			goto skip_mlock;
+		if (PagePUDDoubleMap(page) || !page->mapping)
+			goto skip_mlock;
+		if (!trylock_page(page))
+			goto skip_mlock;
+		lru_add_drain();
+		if (page->mapping && !PagePUDDoubleMap(page))
+			mlock_vma_page(page);
+		unlock_page(page);
+	}
+skip_mlock:
+	page += (addr & ~HPAGE_PUD_MASK) >> PAGE_SHIFT;
+	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
+	if (flags & FOLL_GET)
+		get_page(page);
+
+out:
+	return page;
+}
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
 		  struct vm_area_struct *vma)
@@ -1991,7 +2062,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 		goto out;
 
 	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page) && !PMDPageInPUD(page), page);
 	if (flags & FOLL_TOUCH)
 		touch_pmd(vma, addr, pmd, flags);
 	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 23/31] mm: support 1GB THP pagemap support.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (21 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 22/31] mm: thp: 1GB THP follow_p*d_page() support Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 24/31] sysctl: add an option to only print the head page virtual address Zi Yan
                   ` (8 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Print page flags properly.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 fs/proc/task_mmu.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f0ec9edab2f3..ccf8ce760283 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1373,6 +1373,45 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	return err;
 }
 
+static int pagemap_pud_range(pud_t *pudp, unsigned long addr, unsigned long end,
+			     struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+	struct pagemapread *pm = walk->private;
+	int err = 0;
+	u64 flags = 0, frame = 0;
+	pud_t pud = *pudp;
+	struct page *page = NULL;
+
+	if (vma->vm_flags & VM_SOFTDIRTY)
+		flags |= PM_SOFT_DIRTY;
+
+	if (pud_present(pud)) {
+		page = pud_page(pud);
+
+		flags |= PM_PRESENT;
+		if (pud_soft_dirty(pud))
+			flags |= PM_SOFT_DIRTY;
+		if (pm->show_pfn)
+			frame = pud_pfn(pud) +
+				((addr & ~PMD_MASK) >> PAGE_SHIFT);
+	}
+
+	if (page && page_mapcount(page) == 1)
+		flags |= PM_MMAP_EXCLUSIVE;
+
+	for (; addr != end; addr += PAGE_SIZE) {
+		pagemap_entry_t pme = make_pme(frame, flags);
+
+		err = add_to_pagemap(addr, &pme, pm);
+		if (err)
+			break;
+		if (pm->show_pfn && (flags & PM_PRESENT))
+			frame++;
+	}
+	return err;
+}
+
 #ifdef CONFIG_HUGETLB_PAGE
 /* This function walks within one hugetlb entry in the single call */
 static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
@@ -1479,6 +1518,9 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
 	if (!pm.buffer)
 		goto out_mm;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	pagemap_walk.pud_entry = pagemap_pud_range;
+#endif
 	pagemap_walk.pmd_entry = pagemap_pmd_range;
 	pagemap_walk.pte_hole = pagemap_pte_hole;
 #ifdef CONFIG_HUGETLB_PAGE
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 24/31] sysctl: add an option to only print the head page virtual address.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (22 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 23/31] mm: support 1GB THP pagemap support Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 25/31] mm: thp: add a knob to enable/disable 1GB THPs Zi Yan
                   ` (7 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

It can help distinguish between PUD-mapped, PMD-mapped THPs, and
PTE-mapped THPs.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 fs/proc/task_mmu.c |  7 +++++--
 kernel/sysctl.c    | 11 +++++++++++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ccf8ce760283..5106d5a07576 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -27,6 +27,9 @@
 
 #define SEQ_PUT_DEC(str, val) \
 		seq_put_decimal_ull_width(m, str, (val) << (PAGE_SHIFT-10), 8)
+
+int only_print_head_pfn;
+
 void task_mem(struct seq_file *m, struct mm_struct *mm)
 {
 	unsigned long text, lib, swap, anon, file, shmem;
@@ -1308,7 +1311,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 				flags |= PM_SOFT_DIRTY;
 			if (pm->show_pfn)
 				frame = pmd_pfn(pmd) +
-					((addr & ~PMD_MASK) >> PAGE_SHIFT);
+					(only_print_head_pfn?0:((addr & ~PMD_MASK) >> PAGE_SHIFT));
 		}
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 		else if (is_swap_pmd(pmd)) {
@@ -1394,7 +1397,7 @@ static int pagemap_pud_range(pud_t *pudp, unsigned long addr, unsigned long end,
 			flags |= PM_SOFT_DIRTY;
 		if (pm->show_pfn)
 			frame = pud_pfn(pud) +
-				((addr & ~PMD_MASK) >> PAGE_SHIFT);
+				(only_print_head_pfn?0:((addr & ~PUD_MASK) >> PAGE_SHIFT));
 	}
 
 	if (page && page_mapcount(page) == 1)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6bf0be1af7e0..762535a2c7d1 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -122,6 +122,8 @@ extern int vma_no_repeat_defrag;
 extern int num_breakout_chunks;
 extern int defrag_size_threshold;
 
+extern int only_print_head_pfn;
+
 /* Constants used for minimum and  maximum */
 #ifdef CONFIG_LOCKUP_DETECTOR
 static int sixty = 60;
@@ -1750,6 +1752,15 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero,
 	},
+	{
+		.procname	= "only_print_head_pfn",
+		.data		= &only_print_head_pfn,
+		.maxlen		= sizeof(only_print_head_pfn),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
 	{ }
 };
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 25/31] mm: thp: add a knob to enable/disable 1GB THPs.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (23 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 24/31] sysctl: add an option to only print the head page virtual address Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 26/31] mm: thp: promote PTE-mapped THP to PMD-mapped THP Zi Yan
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

It does not affect existing 1GB THPs. It is similar to the knob for
2MB THPs.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/huge_mm.h | 14 ++++++++++++++
 mm/huge_memory.c        | 42 ++++++++++++++++++++++++++++++++++++++++-
 mm/memory.c             |  2 +-
 3 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b1acada9ce8c..687c7d59df8b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -84,6 +84,8 @@ enum transparent_hugepage_flag {
 #ifdef CONFIG_DEBUG_VM
 	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
 #endif
+	TRANSPARENT_PUD_HUGEPAGE_FLAG,
+	TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG,
 };
 
 struct kobject;
@@ -146,6 +148,18 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 }
 
 bool transparent_hugepage_enabled(struct vm_area_struct *vma);
+static inline bool transparent_pud_hugepage_enabled(struct vm_area_struct *vma)
+{
+	if (transparent_hugepage_enabled(vma)) {
+		if (transparent_hugepage_flags & (1 << TRANSPARENT_PUD_HUGEPAGE_FLAG))
+			return true;
+		if (transparent_hugepage_flags &
+					(1 << TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))
+			return !!(vma->vm_flags & VM_HUGEPAGE);
+	}
+
+	return false;
+}
 
 #define transparent_hugepage_use_zero_page()				\
 	(transparent_hugepage_flags &					\
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 191261771452..fa3e12b17621 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -50,9 +50,11 @@
 unsigned long transparent_hugepage_flags __read_mostly =
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS
 	(1<<TRANSPARENT_HUGEPAGE_FLAG)|
+	(1<<TRANSPARENT_PUD_HUGEPAGE_FLAG)|
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
 	(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
+	(1<<TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG)|
 #endif
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
@@ -276,6 +278,43 @@ static ssize_t enabled_store(struct kobject *kobj,
 static struct kobj_attribute enabled_attr =
 	__ATTR(enabled, 0644, enabled_show, enabled_store);
 
+static ssize_t enabled_1gb_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	if (test_bit(TRANSPARENT_PUD_HUGEPAGE_FLAG, &transparent_hugepage_flags))
+		return sprintf(buf, "[always] madvise never\n");
+	else if (test_bit(TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags))
+		return sprintf(buf, "always [madvise] never\n");
+	else
+		return sprintf(buf, "always madvise [never]\n");
+}
+
+static ssize_t enabled_1gb_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	ssize_t ret = count;
+
+	if (!memcmp("always", buf,
+		    min(sizeof("always")-1, count))) {
+		clear_bit(TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_PUD_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("madvise", buf,
+			   min(sizeof("madvise")-1, count))) {
+		clear_bit(TRANSPARENT_PUD_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+	} else if (!memcmp("never", buf,
+			   min(sizeof("never")-1, count))) {
+		clear_bit(TRANSPARENT_PUD_HUGEPAGE_FLAG, &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_PUD_HUGEPAGE_REQ_MADV_FLAG, &transparent_hugepage_flags);
+	} else
+		ret = -EINVAL;
+
+	return ret;
+}
+static struct kobj_attribute enabled_1gb_attr =
+	__ATTR(enabled_1gb, 0644, enabled_1gb_show, enabled_1gb_store);
+
 ssize_t single_hugepage_flag_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf,
 				enum transparent_hugepage_flag flag)
@@ -405,6 +444,7 @@ static struct kobj_attribute debug_cow_attr =
 
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
+	&enabled_1gb_attr.attr,
 	&defrag_attr.attr,
 	&use_zero_page_attr.attr,
 	&hpage_pmd_size_attr.attr,
@@ -1657,7 +1697,7 @@ int do_huge_pud_wp_page(struct vm_fault *vmf, pud_t orig_pud)
 	get_page(page);
 	spin_unlock(vmf->ptl);
 alloc:
-	if (transparent_hugepage_enabled(vma) &&
+	if (transparent_pud_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
 		huge_gfp = alloc_hugepage_direct_gfpmask(vma);
 		new_page = alloc_hugepage_vma(huge_gfp, vma, haddr, HPAGE_PUD_ORDER);
diff --git a/mm/memory.c b/mm/memory.c
index c875cc1a2600..5b8ad19cc439 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3859,7 +3859,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	vmf.pud = pud_alloc(mm, p4d, address);
 	if (!vmf.pud)
 		return VM_FAULT_OOM;
-	if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) {
+	if (pud_none(*vmf.pud) && transparent_pud_hugepage_enabled(vma)) {
 		ret = create_huge_pud(&vmf);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 26/31] mm: thp: promote PTE-mapped THP to PMD-mapped THP.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (24 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 25/31] mm: thp: add a knob to enable/disable 1GB THPs Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 27/31] mm: thp: promote PMD-mapped PUD pages to PUD-mapped PUD pages Zi Yan
                   ` (5 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

First promote 512 base pages to a PTE-mapped THP, then promote the
PTE-mapped THP to a PMD-mapped THP.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/khugepaged.h |   1 +
 mm/filemap.c               |   8 +
 mm/huge_memory.c           | 419 +++++++++++++++++++++++++++++++++++++
 mm/internal.h              |   6 +
 mm/khugepaged.c            |   2 +-
 5 files changed, 435 insertions(+), 1 deletion(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 082d1d2a5216..675c5ee99698 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -55,6 +55,7 @@ static inline int khugepaged_enter(struct vm_area_struct *vma,
 				return -ENOMEM;
 	return 0;
 }
+void release_pte_pages(pte_t *pte, pte_t *_pte);
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 9f5e323e883e..54babad945ad 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1236,6 +1236,14 @@ static inline bool clear_bit_unlock_is_negative_byte(long nr, volatile void *mem
 
 #endif
 
+void __unlock_page(struct page *page)
+{
+	BUILD_BUG_ON(PG_waiters != 7);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	if (clear_bit_unlock_is_negative_byte(PG_locked, &page->flags))
+		wake_up_page_bit(page, PG_locked);
+}
+
 /**
  * unlock_page - unlock a locked page
  * @page: the page
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fa3e12b17621..f856f7e39095 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4284,3 +4284,422 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	update_mmu_cache_pmd(vma, address, pvmw->pmd);
 }
 #endif
+
+/* promote HPAGE_PMD_SIZE range into a PMD map.
+ * mmap_sem needs to be down_write.
+ */
+int promote_huge_pmd_address(struct vm_area_struct *vma, unsigned long haddr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd, _pmd;
+	pte_t *pte, *_pte;
+	spinlock_t *pmd_ptl, *pte_ptl;
+	struct mmu_notifier_range range;
+	pgtable_t pgtable;
+	struct page *page, *head;
+	unsigned long address = haddr;
+	int ret = -EBUSY;
+
+	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
+
+	if (haddr < vma->vm_start || (haddr + HPAGE_PMD_SIZE) > vma->vm_end)
+		return -EINVAL;
+
+	pmd = mm_find_pmd(mm, haddr);
+	if (!pmd || pmd_trans_huge(*pmd))
+		goto out;
+
+	anon_vma_lock_write(vma->anon_vma);
+
+	pte = pte_offset_map(pmd, haddr);
+	pte_ptl = pte_lockptr(mm, pmd);
+
+	head = page = vm_normal_page(vma, haddr, *pte);
+	if (!page || !PageTransCompound(page))
+		goto out_unlock;
+	VM_BUG_ON(page != compound_head(page));
+	lock_page(head);
+
+	mmu_notifier_range_init(&range, mm, haddr, haddr + HPAGE_PMD_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
+	/*
+	 * After this gup_fast can't run anymore. This also removes
+	 * any huge TLB entry from the CPU so we won't allow
+	 * huge and small TLB entries for the same virtual address
+	 * to avoid the risk of CPU bugs in that area.
+	 */
+
+	_pmd = pmdp_collapse_flush(vma, haddr, pmd);
+	spin_unlock(pmd_ptl);
+	mmu_notifier_invalidate_range_end(&range);
+
+	/* remove ptes */
+	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
+				_pte++, page++, address += PAGE_SIZE) {
+		pte_t pteval = *_pte;
+
+		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+			pr_err("pte none or zero pfn during pmd promotion\n");
+			if (is_zero_pfn(pte_pfn(pteval))) {
+				/*
+				 * ptl mostly unnecessary.
+				 */
+				spin_lock(pte_ptl);
+				/*
+				 * paravirt calls inside pte_clear here are
+				 * superfluous.
+				 */
+				pte_clear(vma->vm_mm, address, _pte);
+				spin_unlock(pte_ptl);
+			}
+		} else {
+			/*
+			 * ptl mostly unnecessary, but preempt has to
+			 * be disabled to update the per-cpu stats
+			 * inside page_remove_rmap().
+			 */
+			spin_lock(pte_ptl);
+			/*
+			 * paravirt calls inside pte_clear here are
+			 * superfluous.
+			 */
+			pte_clear(vma->vm_mm, address, _pte);
+			atomic_dec(&page->_mapcount);
+			/*page_remove_rmap(page, false, 0);*/
+			if (atomic_read(&page->_mapcount) > -1) {
+				SetPageDoubleMap(head);
+				pr_info("page double mapped");
+			}
+			spin_unlock(pte_ptl);
+		}
+	}
+	page_ref_sub(head, HPAGE_PMD_NR - 1);
+
+	pte_unmap(pte);
+	pgtable = pmd_pgtable(_pmd);
+
+	_pmd = mk_huge_pmd(head, vma->vm_page_prot);
+	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+
+	/*
+	 * spin_lock() below is not the equivalent of smp_wmb(), so
+	 * this is needed to avoid the copy_huge_page writes to become
+	 * visible after the set_pmd_at() write.
+	 */
+	smp_wmb();
+
+	spin_lock(pmd_ptl);
+	VM_BUG_ON(!pmd_none(*pmd));
+	atomic_inc(compound_mapcount_ptr(head));
+	__inc_node_page_state(head, NR_ANON_THPS);
+	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	set_pmd_at(mm, haddr, pmd, _pmd);
+	update_mmu_cache_pmd(vma, haddr, pmd);
+	spin_unlock(pmd_ptl);
+	unlock_page(head);
+	ret = 0;
+
+out_unlock:
+	anon_vma_unlock_write(vma->anon_vma);
+out:
+	return ret;
+}
+
+/* Racy check whether the huge page can be split */
+static bool can_promote_huge_page(struct page *page)
+{
+	int extra_pins;
+
+	/* Additional pins from radix tree */
+	if (PageAnon(page))
+		extra_pins = PageSwapCache(page) ? 1 : 0;
+	else
+		return false;
+	if (PageSwapCache(page))
+		return false;
+	if (PageWriteback(page))
+		return false;
+	return total_mapcount(page) == page_count(page) - extra_pins - 1;
+}
+
+/* write a __promote_huge_page_isolate(struct vm_area_struct *vma,
+ * unsigned long address, pte_t *pte) to isolate all subpages into a list,
+ * then call promote_list_to_huge_page() to promote in-place
+ */
+
+static int __promote_huge_page_isolate(struct vm_area_struct *vma,
+					unsigned long haddr, pte_t *pte,
+					struct page **head, struct list_head *subpage_list)
+{
+	struct page *page = NULL;
+	pte_t *_pte;
+	bool writable = false;
+	unsigned long address = haddr;
+
+	*head = NULL;
+	lru_add_drain();
+	for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
+	     _pte++, address += PAGE_SIZE) {
+		pte_t pteval = *_pte;
+
+		if (pte_none(pteval) || (pte_present(pteval) &&
+				is_zero_pfn(pte_pfn(pteval))))
+			goto out;
+		if (!pte_present(pteval))
+			goto out;
+		page = vm_normal_page(vma, address, pteval);
+		if (unlikely(!page))
+			goto out;
+
+		if (address == haddr) {
+			*head = page;
+			if (page_to_pfn(page) & ((1<<HPAGE_PMD_ORDER) - 1))
+				goto out;
+		}
+
+		if ((*head + (address - haddr)/PAGE_SIZE) != page)
+			goto out;
+
+		if (PageCompound(page))
+			goto out;
+
+		if (PageMlocked(page))
+			goto out;
+
+		VM_BUG_ON_PAGE(!PageAnon(page), page);
+
+		/*
+		 * We can do it before isolate_lru_page because the
+		 * page can't be freed from under us. NOTE: PG_lock
+		 * is needed to serialize against split_huge_page
+		 * when invoked from the VM.
+		 */
+		if (!trylock_page(page))
+			goto out;
+
+		/*
+		 * cannot use mapcount: can't collapse if there's a gup pin.
+		 * The page must only be referenced by the scanned process
+		 * and page swap cache.
+		 */
+		if (page_count(page) != page_mapcount(page) + PageSwapCache(page)) {
+			unlock_page(page);
+			goto out;
+		}
+		if (pte_write(pteval)) {
+			writable = true;
+		} else {
+			if (PageSwapCache(page) &&
+			    !reuse_swap_page(page, NULL)) {
+				unlock_page(page);
+				goto out;
+			}
+			/*
+			 * Page is not in the swap cache. It can be collapsed
+			 * into a THP.
+			 */
+		}
+
+		/*
+		 * Isolate the page to avoid collapsing an hugepage
+		 * currently in use by the VM.
+		 */
+		if (isolate_lru_page(page)) {
+			unlock_page(page);
+			goto out;
+		}
+
+		inc_node_page_state(page,
+				NR_ISOLATED_ANON + page_is_file_cache(page));
+		VM_BUG_ON_PAGE(!PageLocked(page), page);
+		VM_BUG_ON_PAGE(PageLRU(page), page);
+	}
+	if (likely(writable)) {
+		int i;
+
+		for (i = 0; i < HPAGE_PMD_NR; i++) {
+			struct page *p = *head + i;
+
+			list_add_tail(&p->lru, subpage_list);
+			VM_BUG_ON_PAGE(!PageLocked(p), p);
+		}
+		return 1;
+	} else {
+		/*result = SCAN_PAGE_RO;*/
+	}
+
+out:
+	release_pte_pages(pte, _pte);
+	return 0;
+}
+
+/*
+ * This function promotes normal pages into a huge page. @list point to all
+ * subpages of huge page to promote, @head point to the head page.
+ *
+ * Only caller must hold pin on the pages on @list, otherwise promotion
+ * fails with -EBUSY. All subpages must be locked.
+ *
+ * Both head page and tail pages will inherit mapping, flags, and so on from
+ * the hugepage.
+ *
+ * GUP pin and PG_locked transferred to @page. *
+ *
+ * Returns 0 if the hugepage is promoted successfully.
+ * Returns -EBUSY if any subpage is pinned or if anon_vma disappeared from
+ * under us.
+ */
+int promote_list_to_huge_page(struct page *head, struct list_head *list)
+{
+	struct anon_vma *anon_vma = NULL;
+	int ret = 0;
+	DECLARE_BITMAP(subpage_bitmap, HPAGE_PMD_NR);
+	struct page *subpage;
+	int i;
+
+	/* no file-backed page support yet */
+	if (PageAnon(head)) {
+		/*
+		 * The caller does not necessarily hold an mmap_sem that would
+		 * prevent the anon_vma disappearing so we first we take a
+		 * reference to it and then lock the anon_vma for write. This
+		 * is similar to page_lock_anon_vma_read except the write lock
+		 * is taken to serialise against parallel split or collapse
+		 * operations.
+		 */
+		anon_vma = page_get_anon_vma(head);
+		if (!anon_vma) {
+			ret = -EBUSY;
+			goto out;
+		}
+		anon_vma_lock_write(anon_vma);
+	} else
+		return -EBUSY;
+
+	/* Racy check each subpage to see if any has extra pin */
+	list_for_each_entry(subpage, list, lru) {
+		if (can_promote_huge_page(subpage))
+			bitmap_set(subpage_bitmap, subpage - head, 1);
+	}
+	/* Proceed only if none of subpages has extra pin.  */
+	if (!bitmap_full(subpage_bitmap, HPAGE_PMD_NR)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	list_for_each_entry(subpage, list, lru) {
+		enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
+			TTU_RMAP_LOCKED;
+		bool unmap_success;
+
+		if (PageAnon(subpage))
+			ttu_flags |= TTU_SPLIT_FREEZE;
+
+		unmap_success = try_to_unmap(subpage, ttu_flags);
+		VM_BUG_ON_PAGE(!unmap_success, subpage);
+	}
+
+	/* Take care of migration wait list:
+	 * make compound page first, since it is impossible to move waiting
+	 * process from subpage queues to the head page queue.
+	 */
+	set_compound_page_dtor(head, COMPOUND_PAGE_DTOR);
+	set_compound_order(head, HPAGE_PMD_ORDER);
+	__SetPageHead(head);
+	for (i = 1; i < HPAGE_PMD_NR; i++) {
+		struct page *p = head + i;
+
+		p->index = 0;
+		p->mapping = TAIL_MAPPING;
+		p->mem_cgroup = NULL;
+		ClearPageActive(p);
+		/* move subpage refcount to head page */
+		page_ref_add(head, page_count(p) - 1);
+		set_page_count(p, 0);
+		set_compound_head(p, head);
+	}
+	atomic_set(compound_mapcount_ptr(head), -1);
+	prep_transhuge_page(head);
+
+	remap_page(head);
+
+	if (!mem_cgroup_disabled())
+		mod_memcg_state(head->mem_cgroup, MEMCG_RSS_HUGE, HPAGE_PMD_NR);
+
+	for (i = 1; i < HPAGE_PMD_NR; i++) {
+		struct page *subpage = head + i;
+		__unlock_page(subpage);
+	}
+
+	INIT_LIST_HEAD(&head->lru);
+	unlock_page(head);
+	putback_lru_page(head);
+
+	mod_node_page_state(page_pgdat(head),
+			NR_ISOLATED_ANON + page_is_file_cache(head), -HPAGE_PMD_NR);
+out_unlock:
+	if (anon_vma) {
+		anon_vma_unlock_write(anon_vma);
+		put_anon_vma(anon_vma);
+	}
+out:
+	return ret;
+}
+
+static int promote_huge_page_isolate(struct vm_area_struct *vma,
+					unsigned long haddr,
+					struct page **head, struct list_head *subpage_list)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *pte_ptl;
+	int ret = -EBUSY;
+
+	pmd = mm_find_pmd(mm, haddr);
+	if (!pmd || pmd_trans_huge(*pmd))
+		goto out;
+
+	anon_vma_lock_write(vma->anon_vma);
+
+	pte = pte_offset_map(pmd, haddr);
+	pte_ptl = pte_lockptr(mm, pmd);
+
+	spin_lock(pte_ptl);
+	ret = __promote_huge_page_isolate(vma, haddr, pte, head, subpage_list);
+	spin_unlock(pte_ptl);
+
+	if (unlikely(!ret)) {
+		pte_unmap(pte);
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+	ret = 0;
+	/*
+	 * All pages are isolated and locked so anon_vma rmap
+	 * can't run anymore.
+	 */
+out_unlock:
+	anon_vma_unlock_write(vma->anon_vma);
+out:
+	return ret;
+}
+
+/* assume mmap_sem is down_write, wrapper for madvise */
+int promote_huge_page_address(struct vm_area_struct *vma, unsigned long haddr)
+{
+	LIST_HEAD(subpage_list);
+	struct page *head;
+
+	if (haddr & (HPAGE_PMD_SIZE - 1))
+		return -EINVAL;
+
+	if (haddr < vma->vm_start || (haddr + HPAGE_PMD_SIZE) > vma->vm_end)
+		return -EINVAL;
+
+	if (promote_huge_page_isolate(vma, haddr, &head, &subpage_list))
+		return -EBUSY;
+
+	return promote_list_to_huge_page(head, &subpage_list);
+}
diff --git a/mm/internal.h b/mm/internal.h
index 70a6ef603e5b..c5e5a0f1cc58 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -581,4 +581,10 @@ int expand_free_page(struct zone *zone, struct page *buddy_head,
 void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
 							unsigned int alloc_flags);
 
+void __unlock_page(struct page *page);
+
+int promote_huge_pmd_address(struct vm_area_struct *vma, unsigned long haddr);
+
+int promote_huge_page_address(struct vm_area_struct *vma, unsigned long haddr);
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3acfddcba714..ff059353ebc3 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -508,7 +508,7 @@ static void release_pte_page(struct page *page)
 	putback_lru_page(page);
 }
 
-static void release_pte_pages(pte_t *pte, pte_t *_pte)
+void release_pte_pages(pte_t *pte, pte_t *_pte)
 {
 	while (--_pte >= pte) {
 		pte_t pteval = *_pte;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 27/31] mm: thp: promote PMD-mapped PUD pages to PUD-mapped PUD pages.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (25 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 26/31] mm: thp: promote PTE-mapped THP to PMD-mapped THP Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 28/31] mm: vmstats: add page promotion stats Zi Yan
                   ` (4 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

First promote 512 PMD-mapped THPs to a PMD-mapped PUD THP, then promote
a PMD-mapped PUD THP to a PUD-mapped PUD THP.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 arch/x86/include/asm/pgalloc.h |   2 +
 include/asm-generic/pgtable.h  |  10 +
 mm/huge_memory.c               | 497 ++++++++++++++++++++++++++++++++-
 mm/internal.h                  |   2 +
 mm/pgtable-generic.c           |  20 ++
 mm/rmap.c                      |  23 +-
 6 files changed, 540 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index ebcb022f6bb9..153a6749f92b 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -119,6 +119,8 @@ static inline void pud_populate_with_pgtable(struct mm_struct *mm, pud_t *pud,
 	set_pud(pud, __pud(((pteval_t)pfn << PAGE_SHIFT) | _PAGE_TABLE));
 }
 
+#define pud_pgtable(pud) pud_page(pud)
+
 #if CONFIG_PGTABLE_LEVELS > 2
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 1ae33b6590b8..9984c75d64ce 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -302,6 +302,8 @@ static inline void pudp_set_wrprotect(struct mm_struct *mm,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 				 unsigned long address, pmd_t *pmdp);
+extern pud_t pudp_collapse_flush(struct vm_area_struct *vma,
+				 unsigned long address, pud_t *pudp);
 #else
 static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 					unsigned long address,
@@ -310,7 +312,15 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 	BUILD_BUG();
 	return *pmdp;
 }
+static inline pud_t pudp_collapse_flush(struct vm_area_struct *vma,
+					unsigned long address,
+					pud_t *pudp)
+{
+	BUILD_BUG();
+	return *pudp;
+}
 #define pmdp_collapse_flush pmdp_collapse_flush
+#define pudp_collapse_flush pudp_collapse_flush
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f856f7e39095..67fd1821f4dc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2958,7 +2958,7 @@ void split_huge_pud_address(struct vm_area_struct *vma, unsigned long address,
 	__split_huge_pud(vma, pud, address, freeze, page);
 }
 
-static void freeze_pud_page(struct page *page)
+static void unmap_pud_page(struct page *page)
 {
 	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
 		TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PUD;
@@ -2973,7 +2973,7 @@ static void freeze_pud_page(struct page *page)
 	VM_BUG_ON_PAGE(!unmap_success, page);
 }
 
-static void unfreeze_pud_page(struct page *page)
+static void remap_pud_page(struct page *page)
 {
 	int i;
 
@@ -3109,7 +3109,7 @@ static void __split_huge_pud_page(struct page *page, struct list_head *list,
 
 	spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags);
 
-	unfreeze_pud_page(head);
+	remap_pud_page(head);
 
 	for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR) {
 		struct page *subpage = head + i;
@@ -3210,7 +3210,7 @@ int split_huge_pud_page_to_list(struct page *page, struct list_head *list)
 	}
 
 	/*
-	 * Racy check if we can split the page, before freeze_pud_page() will
+	 * Racy check if we can split the page, before unmap_pud_page() will
 	 * split PUDs
 	 */
 	if (!can_split_huge_pud_page(head, &extra_pins)) {
@@ -3219,7 +3219,7 @@ int split_huge_pud_page_to_list(struct page *page, struct list_head *list)
 	}
 
 	mlocked = PageMlocked(page);
-	freeze_pud_page(head);
+	unmap_pud_page(head);
 	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
 	/* Make sure the page is not on per-CPU pagevec as it takes pin */
@@ -3285,7 +3285,7 @@ int split_huge_pud_page_to_list(struct page *page, struct list_head *list)
 			xa_unlock(&mapping->i_pages);
 		}
 		spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags);
-		unfreeze_pud_page(head);
+		remap_pud_page(head);
 		ret = -EBUSY;
 	}
 
@@ -4703,3 +4703,488 @@ int promote_huge_page_address(struct vm_area_struct *vma, unsigned long haddr)
 
 	return promote_list_to_huge_page(head, &subpage_list);
 }
+
+static pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud = NULL;
+	pud_t pude;
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	p4d = p4d_offset(pgd, address);
+	if (!p4d_present(*p4d))
+		goto out;
+
+	pud = pud_offset(p4d, address);
+
+	pude = *pud;
+	barrier();
+	if (!pud_present(pude) || pud_trans_huge(pude))
+		pud = NULL;
+out:
+	return pud;
+}
+
+/* promote HPAGE_PUD_SIZE range into a PUD map.
+ * mmap_sem needs to be down_write.
+ */
+int promote_huge_pud_address(struct vm_area_struct *vma, unsigned long haddr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pud_t *pud, _pud;
+	pmd_t *pmd, *_pmd;
+	spinlock_t *pud_ptl, *pmd_ptl;
+	struct mmu_notifier_range range;
+	pgtable_t pgtable;
+	struct page *page, *head;
+	unsigned long address = haddr;
+	int ret = -EBUSY;
+
+	VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
+
+	if (haddr < vma->vm_start || (haddr + HPAGE_PUD_SIZE) > vma->vm_end)
+		return -EINVAL;
+
+	pud = mm_find_pud(mm, haddr);
+	if (!pud)
+		goto out;
+
+	anon_vma_lock_write(vma->anon_vma);
+
+	pmd = pmd_offset(pud, haddr);
+	pmd_ptl = pmd_lockptr(mm, pmd);
+
+	head = page = vm_normal_page_pmd(vma, haddr, *pmd);
+	if (!page || !PageTransCompound(page) ||
+		compound_order(page) != HPAGE_PUD_ORDER)
+		goto out_unlock;
+	VM_BUG_ON(head != compound_head(page));
+	lock_page(head);
+
+	mmu_notifier_range_init(&range, mm, haddr, haddr + HPAGE_PUD_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+	pud_ptl = pud_lock(mm, pud);
+	/*
+	 * After this gup_fast can't run anymore. This also removes
+	 * any huge TLB entry from the CPU so we won't allow
+	 * huge and small TLB entries for the same virtual address
+	 * to avoid the risk of CPU bugs in that area.
+	 */
+
+	_pud = pudp_collapse_flush(vma, haddr, pud);
+	spin_unlock(pud_ptl);
+	mmu_notifier_invalidate_range_end(&range);
+
+	/* remove ptes */
+	for (_pmd = pmd; _pmd < pmd + (1<<(HPAGE_PUD_ORDER-HPAGE_PMD_ORDER));
+				_pmd++, page += HPAGE_PMD_NR, address += HPAGE_PMD_SIZE) {
+		pmd_t pmdval = *_pmd;
+
+		if (pmd_none(pmdval) || is_zero_pfn(pmd_pfn(pmdval))) {
+			if (is_zero_pfn(pmd_pfn(pmdval))) {
+				/*
+				 * ptl mostly unnecessary.
+				 */
+				spin_lock(pmd_ptl);
+				/*
+				 * paravirt calls inside pte_clear here are
+				 * superfluous.
+				 */
+				pmd_clear(_pmd);
+				spin_unlock(pmd_ptl);
+			}
+		} else {
+			/*
+			 * ptl mostly unnecessary, but preempt has to
+			 * be disabled to update the per-cpu stats
+			 * inside page_remove_rmap().
+			 */
+			spin_lock(pmd_ptl);
+			/*
+			 * paravirt calls inside pte_clear here are
+			 * superfluous.
+			 */
+			pmd_clear(_pmd);
+			atomic_dec(sub_compound_mapcount_ptr(page, 1));
+			__dec_node_page_state(page, NR_ANON_THPS);
+			spin_unlock(pmd_ptl);
+		}
+	}
+	page_ref_sub(head, (1<<(HPAGE_PUD_ORDER-HPAGE_PMD_ORDER)) - 1);
+
+	pgtable = pud_pgtable(_pud);
+
+	_pud = mk_huge_pud(head, vma->vm_page_prot);
+	_pud = maybe_pud_mkwrite(pud_mkdirty(_pud), vma);
+
+	/*
+	 * spin_lock() below is not the equivalent of smp_wmb(), so
+	 * this is needed to avoid the copy_huge_page writes to become
+	 * visible after the set_pmd_at() write.
+	 */
+	smp_wmb();
+
+	spin_lock(pud_ptl);
+	BUG_ON(!pud_none(*pud));
+	pgtable_trans_huge_pud_deposit(mm, pud, pgtable);
+	set_pud_at(mm, haddr, pud, _pud);
+	update_mmu_cache_pud(vma, haddr, pud);
+	__inc_node_page_state(head, NR_ANON_THPS_PUD);
+	atomic_inc(compound_mapcount_ptr(head));
+	spin_unlock(pud_ptl);
+	unlock_page(head);
+	ret = 0;
+
+out_unlock:
+	anon_vma_unlock_write(vma->anon_vma);
+out:
+	return ret;
+}
+
+/* Racy check whether the huge page can be split */
+static bool can_promote_huge_pud_page(struct page *page)
+{
+	int extra_pins;
+
+	/* Additional pins from radix tree */
+	if (PageAnon(page))
+		extra_pins = PageSwapCache(page) ? 1 : 0;
+	else
+		return false;
+	if (PageSwapCache(page))
+		return false;
+	if (PageWriteback(page))
+		return false;
+	return total_mapcount(page) == page_count(page) - extra_pins - 1;
+}
+
+
+static void release_pmd_page(struct page *page)
+{
+	mod_node_page_state(page_pgdat(page),
+		NR_ISOLATED_ANON + page_is_file_cache(page),
+		-hpage_nr_pages(page));
+	unlock_page(page);
+	putback_lru_page(page);
+}
+
+void release_pmd_pages(pmd_t *pmd, pmd_t *_pmd)
+{
+	while (--_pmd >= pmd) {
+		pmd_t pmdval = *_pmd;
+
+		if (!pmd_none(pmdval) && !is_zero_pfn(pmd_pfn(pmdval)))
+			release_pmd_page(pmd_page(pmdval));
+	}
+}
+
+/* write a __promote_huge_page_isolate(struct vm_area_struct *vma,
+ * unsigned long address, pte_t *pte) to isolate all subpages into a list,
+ * then call promote_list_to_huge_page() to promote in-place
+ */
+
+static int __promote_huge_pud_page_isolate(struct vm_area_struct *vma,
+					unsigned long haddr, pmd_t *pmd,
+					struct page **head, struct list_head *subpage_list)
+{
+	struct page *page = NULL;
+	pmd_t *_pmd;
+	bool writable = false;
+	unsigned long address = haddr;
+
+	*head = NULL;
+
+	lru_add_drain();
+	for (_pmd = pmd; _pmd < pmd+PTRS_PER_PMD;
+	     _pmd++, address += HPAGE_PMD_SIZE) {
+		pmd_t pmdval = *_pmd;
+
+		if (pmd_none(pmdval) || (pmd_trans_huge(pmdval) &&
+				is_zero_pfn(pmd_pfn(pmdval))))
+			goto out;
+		if (!pmd_present(pmdval))
+			goto out;
+		page = vm_normal_page_pmd(vma, address, pmdval);
+		if (unlikely(!page))
+			goto out;
+
+		if (address == haddr) {
+			*head = page;
+			if (page_to_pfn(page) & ((1<<HPAGE_PUD_ORDER) - 1))
+				goto out;
+		}
+
+		if ((*head + (address - haddr)/PAGE_SIZE) != page)
+			goto out;
+
+		if (!PageCompound(page) || compound_order(page) != HPAGE_PMD_ORDER)
+			goto out;
+
+		if (PageMlocked(page))
+			goto out;
+
+		VM_BUG_ON_PAGE(!PageAnon(page), page);
+
+		/*
+		 * We can do it before isolate_lru_page because the
+		 * page can't be freed from under us. NOTE: PG_lock
+		 * is needed to serialize against split_huge_page
+		 * when invoked from the VM.
+		 */
+		if (!trylock_page(page))
+			goto out;
+
+		/*
+		 * cannot use mapcount: can't collapse if there's a gup pin.
+		 * The page must only be referenced by the scanned process
+		 * and page swap cache.
+		 */
+		if (page_count(page) != page_mapcount(page) + PageSwapCache(page)) {
+			unlock_page(page);
+			goto out;
+		}
+		if (pmd_write(pmdval)) {
+			writable = true;
+		} else {
+			if (PageSwapCache(page) &&
+			    !reuse_swap_page(page, NULL)) {
+				unlock_page(page);
+				goto out;
+			}
+			/*
+			 * Page is not in the swap cache. It can be collapsed
+			 * into a THP.
+			 */
+		}
+
+		/*
+		 * Isolate the page to avoid collapsing an hugepage
+		 * currently in use by the VM.
+		 */
+		if (isolate_lru_page(page)) {
+			unlock_page(page);
+			goto out;
+		}
+
+		mod_node_page_state(page_pgdat(page),
+				NR_ISOLATED_ANON + page_is_file_cache(page),
+				hpage_nr_pages(page));
+		VM_BUG_ON_PAGE(!PageLocked(page), page);
+		VM_BUG_ON_PAGE(PageLRU(page), page);
+	}
+	if (likely(writable)) {
+		int i;
+
+		for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR) {
+			struct page *p = *head + i;
+
+			list_add_tail(&p->lru, subpage_list);
+			VM_BUG_ON_PAGE(!PageLocked(p), p);
+		}
+		return 1;
+	} else {
+		/*result = SCAN_PAGE_RO;*/
+	}
+
+out:
+	release_pmd_pages(pmd, _pmd);
+	return 0;
+}
+
+static int promote_huge_pud_page_isolate(struct vm_area_struct *vma,
+					unsigned long haddr,
+					struct page **head, struct list_head *subpage_list)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pud_t *pud;
+	pmd_t *pmd;
+	spinlock_t *pmd_ptl;
+	int ret = -EBUSY;
+
+	pud = mm_find_pud(mm, haddr);
+	if (!pud)
+		goto out;
+
+	anon_vma_lock_write(vma->anon_vma);
+
+	pmd = pmd_offset(pud, haddr);
+	if (!pmd)
+		goto out_unlock;
+	pmd_ptl = pmd_lockptr(mm, pmd);
+
+	spin_lock(pmd_ptl);
+	ret = __promote_huge_pud_page_isolate(vma, haddr, pmd, head, subpage_list);
+	spin_unlock(pmd_ptl);
+
+	if (unlikely(!ret)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+	ret = 0;
+	/*
+	 * All pages are isolated and locked so anon_vma rmap
+	 * can't run anymore.
+	 */
+out_unlock:
+	anon_vma_unlock_write(vma->anon_vma);
+out:
+	return ret;
+}
+
+/*
+ * This function promotes normal pages into a huge page. @list point to all
+ * subpages of huge page to promote, @head point to the head page.
+ *
+ * Only caller must hold pin on the pages on @list, otherwise promotion
+ * fails with -EBUSY. All subpages must be locked.
+ *
+ * Both head page and tail pages will inherit mapping, flags, and so on from
+ * the hugepage.
+ *
+ * GUP pin and PG_locked transferred to @page. *
+ *
+ * Returns 0 if the hugepage is promoted successfully.
+ * Returns -EBUSY if any subpage is pinned or if anon_vma disappeared from
+ * under us.
+ */
+int promote_list_to_huge_pud_page(struct page *head, struct list_head *list)
+{
+	struct anon_vma *anon_vma = NULL;
+	int ret = 0;
+	DECLARE_BITMAP(subpage_bitmap, HPAGE_PMD_NR);
+	struct page *subpage;
+	int i;
+
+	/* no file-backed page support yet */
+	if (PageAnon(head)) {
+		/*
+		 * The caller does not necessarily hold an mmap_sem that would
+		 * prevent the anon_vma disappearing so we first we take a
+		 * reference to it and then lock the anon_vma for write. This
+		 * is similar to page_lock_anon_vma_read except the write lock
+		 * is taken to serialise against parallel split or collapse
+		 * operations.
+		 */
+		anon_vma = page_get_anon_vma(head);
+		if (!anon_vma) {
+			ret = -EBUSY;
+			goto out;
+		}
+		anon_vma_lock_write(anon_vma);
+	} else {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	/* Racy check each subpage to see if any has extra pin */
+	list_for_each_entry(subpage, list, lru) {
+		if (can_promote_huge_pud_page(subpage))
+			bitmap_set(subpage_bitmap, (subpage - head)/HPAGE_PMD_NR, 1);
+	}
+	/* Proceed only if none of subpages has extra pin.  */
+	if (!bitmap_full(subpage_bitmap, HPAGE_PMD_NR)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	list_for_each_entry(subpage, list, lru) {
+		enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
+			TTU_RMAP_LOCKED;
+		bool unmap_success;
+		struct pglist_data *pgdata = NULL;
+
+		if (PageAnon(subpage))
+			ttu_flags |= TTU_SPLIT_FREEZE;
+
+		unmap_success = try_to_unmap(subpage, ttu_flags);
+		VM_BUG_ON_PAGE(!unmap_success, subpage);
+
+		/* remove subpages from page_deferred_list */
+		pgdata = NODE_DATA(page_to_nid(subpage));
+		spin_lock(&pgdata->split_queue_lock);
+		if (!list_empty(page_deferred_list(subpage))) {
+			pgdata->split_queue_len--;
+			list_del_init(page_deferred_list(subpage));
+		}
+		spin_unlock(&pgdata->split_queue_lock);
+	}
+
+	/*first_compound_mapcount = compound_mapcount(head);*/
+	/* Take care of migration wait list:
+	 * make compound page first, since it is impossible to move waiting
+	 * process from subpage queues to the head page queue.
+	 */
+	set_compound_page_dtor(head, COMPOUND_PAGE_DTOR);
+	set_compound_order(head, HPAGE_PUD_ORDER);
+	__SetPageHead(head);
+	list_del(&head->lru);
+	for (i = 1; i < HPAGE_PUD_NR; i++) {
+		struct page *p = head + i;
+
+		if (i % HPAGE_PMD_NR == 0) {
+			list_del(&p->lru);
+			/* move subpage refcount to head page */
+			page_ref_add(head, page_count(p) - 1);
+		}
+		p->index = 0;
+		p->mapping = TAIL_MAPPING;
+		p->mem_cgroup = NULL;
+		ClearPageActive(p);
+		set_page_count(p, 0);
+		set_compound_head(p, head);
+	}
+	atomic_set(compound_mapcount_ptr(head), -1);
+	for (i = 0; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR)
+		atomic_set(sub_compound_mapcount_ptr(&head[i], 1), -1);
+	prep_transhuge_page(head);
+	/* Set first PMD-mapped page sub_compound_mapcount */
+
+	remap_pud_page(head);
+
+	for (i = HPAGE_PMD_NR; i < HPAGE_PUD_NR; i += HPAGE_PMD_NR) {
+		struct page *subpage = head + i;
+
+		__unlock_page(subpage);
+	}
+
+	INIT_LIST_HEAD(&head->lru);
+	unlock_page(head);
+	putback_lru_page(head);
+
+	mod_node_page_state(page_pgdat(head),
+			NR_ISOLATED_ANON + page_is_file_cache(head), -HPAGE_PUD_NR);
+out_unlock:
+	if (anon_vma) {
+		anon_vma_unlock_write(anon_vma);
+		put_anon_vma(anon_vma);
+	}
+out:
+	while (!list_empty(list)) {
+		struct page *p = list_first_entry(list, struct page, lru);
+		list_del(&p->lru);
+		unlock_page(p);
+		putback_lru_page(p);
+	}
+	return ret;
+}
+
+/* assume mmap_sem is down_write, wrapper for madvise */
+int promote_huge_pud_page_address(struct vm_area_struct *vma, unsigned long haddr)
+{
+	LIST_HEAD(subpage_list);
+	struct page *head;
+
+	if (haddr & (HPAGE_PUD_SIZE - 1))
+		return -EINVAL;
+	if (haddr < vma->vm_start || (haddr + HPAGE_PUD_SIZE) > vma->vm_end)
+		return -EINVAL;
+
+	if (promote_huge_pud_page_isolate(vma, haddr, &head, &subpage_list))
+		return -EBUSY;
+
+	return promote_list_to_huge_pud_page(head, &subpage_list);
+}
diff --git a/mm/internal.h b/mm/internal.h
index c5e5a0f1cc58..6d5ebcdcde4c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -584,7 +584,9 @@ void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
 void __unlock_page(struct page *page);
 
 int promote_huge_pmd_address(struct vm_area_struct *vma, unsigned long haddr);
+int promote_huge_pud_address(struct vm_area_struct *vma, unsigned long haddr);
 
 int promote_huge_page_address(struct vm_area_struct *vma, unsigned long haddr);
+int promote_huge_pud_page_address(struct vm_area_struct *vma, unsigned long haddr);
 
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 95af1d67f209..99c4fb526c04 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -266,4 +266,24 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 	return pmd;
 }
 #endif
+
+#ifndef pudp_collapse_flush
+pud_t pudp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
+			  pud_t *pudp)
+{
+	/*
+	 * pud and hugepage pte format are same. So we could
+	 * use the same function.
+	 */
+	pud_t pud;
+
+	VM_BUG_ON(address & ~HPAGE_PUD_MASK);
+	VM_BUG_ON(pud_trans_huge(*pudp));
+	pud = pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
+
+	/* collapse entails shooting down ptes not pmd */
+	flush_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+	return pud;
+}
+#endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/mm/rmap.c b/mm/rmap.c
index 39f446a6775d..49ccbf0cfe4d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1112,12 +1112,13 @@ void do_page_add_anon_rmap(struct page *page,
 {
 	bool compound = flags & RMAP_COMPOUND;
 	bool first;
+	struct page *head = compound_head(page);
 
 	if (compound) {
 		atomic_t *mapcount;
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-		if (compound_order(page) == HPAGE_PUD_ORDER) {
+		VM_BUG_ON_PAGE(!PMDPageInPUD(page) && !PageTransHuge(page), page);
+		if (compound_order(head) == HPAGE_PUD_ORDER) {
 			if (order == HPAGE_PUD_ORDER) {
 				mapcount = compound_mapcount_ptr(page);
 			} else if (order == HPAGE_PMD_ORDER) {
@@ -1125,7 +1126,7 @@ void do_page_add_anon_rmap(struct page *page,
 				mapcount = sub_compound_mapcount_ptr(page, 1);
 			} else
 				VM_BUG_ON(1);
-		} else if (compound_order(page) == HPAGE_PMD_ORDER) {
+		} else if (compound_order(head) == HPAGE_PMD_ORDER) {
 			mapcount = compound_mapcount_ptr(page);
 		} else
 			VM_BUG_ON(1);
@@ -1135,7 +1136,8 @@ void do_page_add_anon_rmap(struct page *page,
 	}
 
 	if (first) {
-		int nr = compound ? hpage_nr_pages(page) : 1;
+		/*int nr = compound ? hpage_nr_pages(page) : 1;*/
+		int nr = 1<<order;
 		/*
 		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
 		 * these counters are not modified in interrupt context, and
@@ -1429,6 +1431,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	bool ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)arg;
+	int order = 0;
 
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
@@ -1505,12 +1508,16 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 		/* Unexpected PMD-mapped THP? */
 
-		if (pvmw.pte)
+		if (pvmw.pte) {
 			subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
-		else if (!pvmw.pte && pvmw.pmd)
+			order = 0;
+		} else if (!pvmw.pte && pvmw.pmd) {
 			subpage = page - page_to_pfn(page) + pmd_pfn(*pvmw.pmd);
-		else if (!pvmw.pte && !pvmw.pmd && pvmw.pud)
+			order = HPAGE_PMD_ORDER;
+		} else if (!pvmw.pte && !pvmw.pmd && pvmw.pud) {
 			subpage = page - page_to_pfn(page) + pud_pfn(*pvmw.pud);
+			order = HPAGE_PUD_ORDER;
+		}
 		VM_BUG_ON(!subpage);
 		address = pvmw.address;
 
@@ -1794,7 +1801,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 *
 		 * See Documentation/vm/mmu_notifier.rst
 		 */
-		page_remove_rmap(subpage, PageHuge(page), 0);
+		page_remove_rmap(subpage, PageHuge(page) || order >= HPAGE_PMD_ORDER, order);
 		put_page(page);
 	}
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 28/31] mm: vmstats: add page promotion stats.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (26 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 27/31] mm: thp: promote PMD-mapped PUD pages to PUD-mapped PUD pages Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 29/31] mm: madvise: add madvise options to split PMD and PUD THPs Zi Yan
                   ` (3 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Count all four types of page promotion.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/vm_event_item.h | 4 ++++
 mm/huge_memory.c              | 8 ++++++++
 mm/vmstat.c                   | 4 ++++
 3 files changed, 16 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index df619262b1b4..f352e5cbfc9c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -81,6 +81,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_SPLIT_PAGE_FAILED,
 		THP_DEFERRED_SPLIT_PAGE,
 		THP_SPLIT_PMD,
+		THP_PROMOTE_PMD,
+		THP_PROMOTE_PAGE,
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 		THP_FAULT_ALLOC_PUD,
 		THP_FAULT_FALLBACK_PUD,
@@ -89,6 +91,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_SPLIT_PUD_PAGE_FAILED,
 		THP_ZERO_PUD_PAGE_ALLOC,
 		THP_ZERO_PUD_PAGE_ALLOC_FAILED,
+		THP_PROMOTE_PUD,
+		THP_PROMOTE_PUD_PAGE,
 #endif
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 67fd1821f4dc..911463c98bcc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4403,6 +4403,8 @@ int promote_huge_pmd_address(struct vm_area_struct *vma, unsigned long haddr)
 out_unlock:
 	anon_vma_unlock_write(vma->anon_vma);
 out:
+	if (!ret)
+		count_vm_event(THP_PROMOTE_PMD);
 	return ret;
 }
 
@@ -4644,6 +4646,8 @@ int promote_list_to_huge_page(struct page *head, struct list_head *list)
 		put_anon_vma(anon_vma);
 	}
 out:
+	if (!ret)
+		count_vm_event(THP_PROMOTE_PAGE);
 	return ret;
 }
 
@@ -4842,6 +4846,8 @@ int promote_huge_pud_address(struct vm_area_struct *vma, unsigned long haddr)
 out_unlock:
 	anon_vma_unlock_write(vma->anon_vma);
 out:
+	if (!ret)
+		count_vm_event(THP_PROMOTE_PUD);
 	return ret;
 }
 
@@ -5169,6 +5175,8 @@ int promote_list_to_huge_pud_page(struct page *head, struct list_head *list)
 		unlock_page(p);
 		putback_lru_page(p);
 	}
+	if (!ret)
+		count_vm_event(THP_PROMOTE_PUD_PAGE);
 	return ret;
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1d185cf748a6..7dd1ff5805ef 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1259,6 +1259,8 @@ const char * const vmstat_text[] = {
 	"thp_split_page_failed",
 	"thp_deferred_split_page",
 	"thp_split_pmd",
+	"thp_promote_pmd",
+	"thp_promote_page",
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 	"thp_fault_alloc_pud",
 	"thp_fault_fallback_pud",
@@ -1267,6 +1269,8 @@ const char * const vmstat_text[] = {
 	"thp_split_pud_page_failed",
 	"thp_zero_pud_page_alloc",
 	"thp_zero_pud_page_alloc_failed",
+	"thp_promote_pud",
+	"thp_promote_pud_page",
 #endif
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 29/31] mm: madvise: add madvise options to split PMD and PUD THPs.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (27 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 28/31] mm: vmstats: add page promotion stats Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 30/31] mm: mem_defrag: thp: PMD THP and PUD THP in-place promotion support Zi Yan
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/uapi/asm-generic/mman-common.h |  12 +++
 mm/madvise.c                           | 106 +++++++++++++++++++++++++
 2 files changed, 118 insertions(+)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index d1ec94a1970d..33db8b6a2ce0 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -69,6 +69,18 @@
 #define MADV_MEMDEFRAG	20		/* Worth backing with hugepages */
 #define MADV_NOMEMDEFRAG	21		/* Not worth backing with hugepages */
 
+#define MADV_SPLITHUGEPAGE	24		/* Split huge page in range once */
+#define MADV_PROMOTEHUGEPAGE	25		/* Promote range into huge page */
+
+#define MADV_SPLITHUGEMAP	26		/* Split huge page table entry in range once */
+#define MADV_PROMOTEHUGEMAP	27		/* Promote range into huge page table entry */
+
+#define MADV_SPLITHUGEPUDPAGE	28		/* Split huge page in range once */
+#define MADV_PROMOTEHUGEPUDPAGE	29		/* Promote range into huge page */
+
+#define MADV_SPLITHUGEPUDMAP	30		/* Split huge page table entry in range once */
+#define MADV_PROMOTEHUGEPUDMAP	31		/* Promote range into huge page table entry */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 9cef96d633e8..be3818c06e17 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -624,6 +624,95 @@ static long madvise_memdefrag(struct vm_area_struct *vma,
 	*prev = vma;
 	return memdefrag_madvise(vma, &vma->vm_flags, behavior);
 }
+
+static long madvise_split_promote_hugepage(struct vm_area_struct *vma,
+		     struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end, int behavior)
+{
+	struct page *page;
+	unsigned long addr = start, haddr;
+	int ret = 0;
+	*prev = vma;
+
+	while (addr < end && !ret) {
+		switch (behavior) {
+		case MADV_SPLITHUGEMAP:
+			split_huge_pmd_address(vma, addr, false, NULL);
+			addr += HPAGE_PMD_SIZE;
+			break;
+		case MADV_SPLITHUGEPUDMAP:
+			split_huge_pud_address(vma, addr, false, NULL);
+			addr += HPAGE_PUD_SIZE;
+			break;
+		case MADV_SPLITHUGEPAGE:
+			page = follow_page(vma, addr, FOLL_GET);
+			if (page) {
+				lock_page(page);
+				if (split_huge_page(page)) {
+					pr_debug("%s: fail to split page\n", __func__);
+					ret = -EBUSY;
+				}
+				unlock_page(page);
+				put_page(page);
+			} else
+				ret = -ENODEV;
+			addr += HPAGE_PMD_SIZE;
+			break;
+		case MADV_SPLITHUGEPUDPAGE:
+			page = follow_page(vma, addr, FOLL_GET);
+			if (page) {
+				lock_page(page);
+				if (split_huge_pud_page(page)) {
+					pr_debug("%s: fail to split pud page\n", __func__);
+					ret = -EBUSY;
+				}
+				unlock_page(page);
+				put_page(page);
+			} else
+				ret = -ENODEV;
+			addr += HPAGE_PUD_SIZE;
+			break;
+		case MADV_PROMOTEHUGEMAP:
+			haddr = addr & HPAGE_PMD_MASK;
+			if (haddr >= start && (haddr + HPAGE_PMD_SIZE) <= end)
+				promote_huge_pmd_address(vma, haddr);
+			else
+				ret = -ENODEV;
+			addr += HPAGE_PMD_SIZE;
+			break;
+		case MADV_PROMOTEHUGEPUDMAP:
+			haddr = addr & HPAGE_PUD_MASK;
+			if (haddr >= start && (haddr + HPAGE_PUD_SIZE) <= end)
+				promote_huge_pud_address(vma, haddr);
+			else
+				ret = -ENODEV;
+			addr += HPAGE_PUD_SIZE;
+			break;
+		case MADV_PROMOTEHUGEPAGE:
+			haddr = addr & HPAGE_PMD_MASK;
+			if (haddr >= start && (haddr + HPAGE_PMD_SIZE) <= end)
+				promote_huge_page_address(vma, haddr);
+			else
+				ret = -ENODEV;
+			addr += HPAGE_PMD_SIZE;
+			break;
+		case MADV_PROMOTEHUGEPUDPAGE:
+			haddr = addr & HPAGE_PUD_MASK;
+			if (haddr >= start && (haddr + HPAGE_PUD_SIZE) <= end)
+				promote_huge_pud_page_address(vma, haddr);
+			else
+				ret = -ENODEV;
+			addr += HPAGE_PUD_SIZE;
+			break;
+		default:
+			ret = -EINVAL;
+			break;
+		}
+	}
+
+	return ret;
+}
+
 #ifdef CONFIG_MEMORY_FAILURE
 /*
  * Error injection support for memory error handling.
@@ -708,6 +797,15 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	case MADV_MEMDEFRAG:
 	case MADV_NOMEMDEFRAG:
 		return madvise_memdefrag(vma, prev, start, end, behavior);
+	case MADV_SPLITHUGEPAGE:
+	case MADV_PROMOTEHUGEPAGE:
+	case MADV_SPLITHUGEMAP:
+	case MADV_PROMOTEHUGEMAP:
+	case MADV_SPLITHUGEPUDPAGE:
+	case MADV_PROMOTEHUGEPUDPAGE:
+	case MADV_SPLITHUGEPUDMAP:
+	case MADV_PROMOTEHUGEPUDMAP:
+		return madvise_split_promote_hugepage(vma, prev, start, end, behavior);
 	default:
 		return madvise_behavior(vma, prev, start, end, behavior);
 	}
@@ -744,6 +842,14 @@ madvise_behavior_valid(int behavior)
 #endif
 	case MADV_MEMDEFRAG:
 	case MADV_NOMEMDEFRAG:
+	case MADV_SPLITHUGEPAGE:
+	case MADV_PROMOTEHUGEPAGE:
+	case MADV_SPLITHUGEMAP:
+	case MADV_PROMOTEHUGEMAP:
+	case MADV_SPLITHUGEPUDPAGE:
+	case MADV_PROMOTEHUGEPUDPAGE:
+	case MADV_SPLITHUGEPUDMAP:
+	case MADV_PROMOTEHUGEPUDMAP:
 		return true;
 
 	default:
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 30/31] mm: mem_defrag: thp: PMD THP and PUD THP in-place promotion support.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (28 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 29/31] mm: madvise: add madvise options to split PMD and PUD THPs Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-15 22:08 ` [RFC PATCH 31/31] sysctl: toggle to promote PUD-mapped 1GB THP or not Zi Yan
  2019-02-20  1:42 ` [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Mike Kravetz
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

PMD THPs will get PMD page table entry promotion as well.
PUD THPs only gets PUD page table entry promotion when the toggle is
on, which is off by default. Since 1GB THP performs not so good due to
shortage of 1GB TLB entries.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/mem_defrag.c | 79 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 73 insertions(+), 6 deletions(-)

diff --git a/mm/mem_defrag.c b/mm/mem_defrag.c
index 4d458b125c95..d7a579924d12 100644
--- a/mm/mem_defrag.c
+++ b/mm/mem_defrag.c
@@ -56,6 +56,7 @@ struct defrag_result_stats {
 	unsigned long dst_non_lru_failed;
 	unsigned long dst_non_moveable_failed;
 	unsigned long not_defrag_vpn;
+	unsigned int aligned_max_order;
 };
 
 enum {
@@ -689,6 +690,10 @@ int defrag_address_range(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		page_size = get_contig_page_size(scan_page);
 
+		if (compound_order(compound_head(scan_page)) == HPAGE_PUD_ORDER) {
+			defrag_stats->aligned_max_order = HPAGE_PUD_ORDER;
+			goto quit_defrag;
+		}
 		/* PTE-mapped THP not allowed  */
 		if ((scan_page == compound_head(scan_page)) &&
 			PageTransHuge(scan_page) && !PageHuge(scan_page))
@@ -714,6 +719,8 @@ int defrag_address_range(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* already in the contiguous pos  */
 		if (page_dist == (long long)(scan_page - anchor_page)) {
 			defrag_stats->aligned += (page_size/PAGE_SIZE);
+			defrag_stats->aligned_max_order = max(defrag_stats->aligned_max_order,
+				compound_order(scan_page));
 			continue;
 		} else { /* migrate pages according to the anchor pages in the vma.  */
 			struct page *dest_page = anchor_page + page_dist;
@@ -901,6 +908,10 @@ int defrag_address_range(struct mm_struct *mm, struct vm_area_struct *vma,
 			} else { /* exchange  */
 				int err = -EBUSY;
 
+				if (compound_order(compound_head(dest_page)) == HPAGE_PUD_ORDER) {
+					defrag_stats->aligned_max_order = HPAGE_PUD_ORDER;
+					goto quit_defrag;
+				}
 				/* PTE-mapped THP not allowed  */
 				if ((dest_page == compound_head(dest_page)) &&
 					PageTransHuge(dest_page) && !PageHuge(dest_page))
@@ -1486,10 +1497,13 @@ static int kmem_defragd_scan_mm(struct defrag_scan_control *sc)
 				up_read(&vma->vm_mm->mmap_sem);
 			} else if (sc->action == MEM_DEFRAG_DO_DEFRAG) {
 				/* go to nearest 1GB aligned address  */
+				unsigned long defrag_begin = *scan_address;
 				unsigned long defrag_end = min_t(unsigned long,
 							(*scan_address + HPAGE_PUD_SIZE) & HPAGE_PUD_MASK,
 							vend);
 				int defrag_result;
+				int nr_fails_in_1gb_range = 0;
+				int skip_promotion = 0;
 
 				anchor_node = get_anchor_page_node_from_vma(vma, *scan_address);
 
@@ -1583,14 +1597,47 @@ static int kmem_defragd_scan_mm(struct defrag_scan_control *sc)
 					 * skip the page which cannot be defragged and restart
 					 * from the next page
 					 */
-					if (defrag_stats.not_defrag_vpn &&
-						defrag_stats.not_defrag_vpn < defrag_sub_chunk_end) {
+					if (defrag_stats.not_defrag_vpn) {
 						VM_BUG_ON(defrag_sub_chunk_end != defrag_end &&
 							defrag_stats.not_defrag_vpn > defrag_sub_chunk_end);
-
-						*scan_address = defrag_stats.not_defrag_vpn;
-						defrag_stats.not_defrag_vpn = 0;
-						goto continue_defrag;
+						find_anchor_pages_in_vma(mm, vma, defrag_stats.not_defrag_vpn);
+
+						nr_fails_in_1gb_range += 1;
+						if (defrag_stats.not_defrag_vpn < defrag_sub_chunk_end) {
+							/* reset and continue  */
+							*scan_address = defrag_stats.not_defrag_vpn;
+							defrag_stats.not_defrag_vpn = 0;
+							goto continue_defrag;
+						}
+					} else {
+						/* defrag works for the whole chunk,
+						 * promote to THP in place
+						 */
+						if (!defrag_result &&
+							/* skip existing THPs */
+							defrag_stats.aligned_max_order < HPAGE_PMD_ORDER &&
+							!(*scan_address & (HPAGE_PMD_SIZE-1)) &&
+							!(defrag_sub_chunk_end & (HPAGE_PMD_SIZE-1))) {
+							int ret = 0;
+							/* find a range to promote pmd */
+							down_write(&mm->mmap_sem);
+							ret = promote_huge_page_address(vma, *scan_address);
+							if (!ret) {
+								/*
+								 * promote to 2MB THP successful, but it is
+								 * still PTE pointed
+								 */
+								/* promote PTE-mapped THP to PMD-mapped */
+								promote_huge_pmd_address(vma, *scan_address);
+							}
+							up_write(&mm->mmap_sem);
+						}
+						/* skip PUD pages */
+						if (defrag_stats.aligned_max_order == HPAGE_PUD_ORDER) {
+							*scan_address = defrag_end;
+							skip_promotion = 1;
+							continue;
+						}
 					}
 
 					/* Done with current 2MB chunk */
@@ -1606,6 +1653,26 @@ static int kmem_defragd_scan_mm(struct defrag_scan_control *sc)
 					}
 				}
 
+				/* defrag works for the whole chunk, promote to PUD THP in place */
+				if (!nr_fails_in_1gb_range &&
+					!skip_promotion && /* avoid existing THP */
+					!(defrag_begin & (HPAGE_PUD_SIZE-1)) &&
+					!(defrag_end & (HPAGE_PUD_SIZE-1))) {
+					int ret = 0;
+					/* find a range to promote pud */
+					down_write(&mm->mmap_sem);
+					ret = promote_huge_pud_page_address(vma, defrag_begin);
+					if (!ret) {
+						/*
+						 * promote to 1GB THP successful, but it is
+						 * still PMD pointed
+						 */
+						/* promote PMD-mapped THP to PUD-mapped */
+						if (mem_defrag_promote_1gb_thp)
+							promote_huge_pud_address(vma, defrag_begin);
+					}
+					up_write(&mm->mmap_sem);
+				}
 			}
 		}
 done_one_vma:
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 31/31] sysctl: toggle to promote PUD-mapped 1GB THP or not.
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (29 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 30/31] mm: mem_defrag: thp: PMD THP and PUD THP in-place promotion support Zi Yan
@ 2019-02-15 22:08 ` Zi Yan
  2019-02-20  1:42 ` [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Mike Kravetz
  31 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:08 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Only promotion PMD THP by default.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 kernel/sysctl.c | 11 +++++++++++
 mm/mem_defrag.c | 17 +++++++++++++----
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 762535a2c7d1..20263d2c39b9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -121,6 +121,7 @@ extern int vma_scan_threshold_type;
 extern int vma_no_repeat_defrag;
 extern int num_breakout_chunks;
 extern int defrag_size_threshold;
+extern int mem_defrag_promote_thp;
 
 extern int only_print_head_pfn;
 
@@ -135,6 +136,7 @@ static int zero;
 static int __maybe_unused one = 1;
 static int __maybe_unused two = 2;
 static int __maybe_unused four = 4;
+static int __maybe_unused fifteen = 15;
 static unsigned long one_ul = 1;
 static int one_hundred = 100;
 static int one_thousand = 1000;
@@ -1761,6 +1763,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+	{
+		.procname	= "mem_defrag_promote_thp",
+		.data		= &mem_defrag_promote_thp,
+		.maxlen		= sizeof(mem_defrag_promote_thp),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &fifteen,
+	},
 	{ }
 };
 
diff --git a/mm/mem_defrag.c b/mm/mem_defrag.c
index d7a579924d12..7cfa99351925 100644
--- a/mm/mem_defrag.c
+++ b/mm/mem_defrag.c
@@ -64,12 +64,18 @@ enum {
 	VMA_THRESHOLD_TYPE_SIZE,
 };
 
+#define PROMOTE_PMD_MAP  (0x8)
+#define PROMOTE_PMD_PAGE (0x4)
+#define PROMOTE_PUD_MAP  (0x2)
+#define PROMOTE_PUD_PAGE (0x1)
+
 int num_breakout_chunks;
 int vma_scan_percentile = 100;
 int vma_scan_threshold_type = VMA_THRESHOLD_TYPE_TIME;
 int vma_no_repeat_defrag;
 int kmem_defragd_always;
 int defrag_size_threshold = 5;
+int mem_defrag_promote_thp = (PROMOTE_PMD_MAP|PROMOTE_PMD_PAGE);
 static DEFINE_SPINLOCK(kmem_defragd_mm_lock);
 
 #define MM_SLOTS_HASH_BITS 10
@@ -1613,7 +1619,8 @@ static int kmem_defragd_scan_mm(struct defrag_scan_control *sc)
 						/* defrag works for the whole chunk,
 						 * promote to THP in place
 						 */
-						if (!defrag_result &&
+						if ((mem_defrag_promote_thp & PROMOTE_PMD_PAGE) &&
+							!defrag_result &&
 							/* skip existing THPs */
 							defrag_stats.aligned_max_order < HPAGE_PMD_ORDER &&
 							!(*scan_address & (HPAGE_PMD_SIZE-1)) &&
@@ -1628,7 +1635,8 @@ static int kmem_defragd_scan_mm(struct defrag_scan_control *sc)
 								 * still PTE pointed
 								 */
 								/* promote PTE-mapped THP to PMD-mapped */
-								promote_huge_pmd_address(vma, *scan_address);
+								if (mem_defrag_promote_thp & PROMOTE_PMD_MAP)
+									promote_huge_pmd_address(vma, *scan_address);
 							}
 							up_write(&mm->mmap_sem);
 						}
@@ -1654,7 +1662,8 @@ static int kmem_defragd_scan_mm(struct defrag_scan_control *sc)
 				}
 
 				/* defrag works for the whole chunk, promote to PUD THP in place */
-				if (!nr_fails_in_1gb_range &&
+				if ((mem_defrag_promote_thp & PROMOTE_PUD_PAGE) &&
+					!nr_fails_in_1gb_range &&
 					!skip_promotion && /* avoid existing THP */
 					!(defrag_begin & (HPAGE_PUD_SIZE-1)) &&
 					!(defrag_end & (HPAGE_PUD_SIZE-1))) {
@@ -1668,7 +1677,7 @@ static int kmem_defragd_scan_mm(struct defrag_scan_control *sc)
 						 * still PMD pointed
 						 */
 						/* promote PMD-mapped THP to PUD-mapped */
-						if (mem_defrag_promote_1gb_thp)
+						if (mem_defrag_promote_thp & PROMOTE_PUD_MAP)
 							promote_huge_pud_address(vma, defrag_begin);
 					}
 					up_write(&mm->mmap_sem);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
  2019-02-15 22:08 ` [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages Zi Yan
@ 2019-02-17 11:29   ` Matthew Wilcox
  2019-02-18 17:31     ` Zi Yan
  2019-02-21 21:10   ` Jerome Glisse
  1 sibling, 1 reply; 50+ messages in thread
From: Matthew Wilcox @ 2019-02-17 11:29 UTC (permalink / raw)
  To: ziy
  Cc: linux-mm, linux-kernel, Dave Hansen, Michal Hocko,
	Kirill A . Shutemov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	John Hubbard, Mark Hairgrove, Nitin Gupta, David Nellans

On Fri, Feb 15, 2019 at 02:08:26PM -0800, Zi Yan wrote:
> +struct page_flags {
> +	unsigned int page_error :1;
> +	unsigned int page_referenced:1;
> +	unsigned int page_uptodate:1;
> +	unsigned int page_active:1;
> +	unsigned int page_unevictable:1;
> +	unsigned int page_checked:1;
> +	unsigned int page_mappedtodisk:1;
> +	unsigned int page_dirty:1;
> +	unsigned int page_is_young:1;
> +	unsigned int page_is_idle:1;
> +	unsigned int page_swapcache:1;
> +	unsigned int page_writeback:1;
> +	unsigned int page_private:1;
> +	unsigned int __pad:3;
> +};

I'm not sure how to feel about this.  It's a bit fragile versus somebody adding
new page flags.  I don't know whether it's needed or whether you can just
copy page->flags directly because you're holding PageLock.

> +static void exchange_page(char *to, char *from)
> +{
> +	u64 tmp;
> +	int i;
> +
> +	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
> +		tmp = *((u64 *)(from + i));
> +		*((u64 *)(from + i)) = *((u64 *)(to + i));
> +		*((u64 *)(to + i)) = tmp;
> +	}
> +}

I have a suspicion you'd be better off allocating a temporary page and
using copy_page().  Some architectures have put a lot of effort into
making copy_page() run faster.

> +		xa_lock_irq(&to_mapping->i_pages);
> +
> +		to_pslot = radix_tree_lookup_slot(&to_mapping->i_pages,
> +			page_index(to_page));

This needs to be converted to the XArray.  radix_tree_lookup_slot() is
going away soon.  You probably need:

	XA_STATE(to_xas, &to_mapping->i_pages, page_index(to_page));

This is a lot of code and I'm still trying to get my head aroud it all.
Thanks for putting in this work; it's good to see this approach being
explored.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
  2019-02-17 11:29   ` Matthew Wilcox
@ 2019-02-18 17:31     ` Zi Yan
  2019-02-18 17:42       ` Vlastimil Babka
  0 siblings, 1 reply; 50+ messages in thread
From: Zi Yan @ 2019-02-18 17:31 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, linux-kernel, Dave Hansen, Michal Hocko,
	Kirill A . Shutemov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	John Hubbard, Mark Hairgrove, Nitin Gupta, David Nellans

On 17 Feb 2019, at 3:29, Matthew Wilcox wrote:

> On Fri, Feb 15, 2019 at 02:08:26PM -0800, Zi Yan wrote:
>> +struct page_flags {
>> +	unsigned int page_error :1;
>> +	unsigned int page_referenced:1;
>> +	unsigned int page_uptodate:1;
>> +	unsigned int page_active:1;
>> +	unsigned int page_unevictable:1;
>> +	unsigned int page_checked:1;
>> +	unsigned int page_mappedtodisk:1;
>> +	unsigned int page_dirty:1;
>> +	unsigned int page_is_young:1;
>> +	unsigned int page_is_idle:1;
>> +	unsigned int page_swapcache:1;
>> +	unsigned int page_writeback:1;
>> +	unsigned int page_private:1;
>> +	unsigned int __pad:3;
>> +};
>
> I'm not sure how to feel about this.  It's a bit fragile versus 
> somebody adding
> new page flags.  I don't know whether it's needed or whether you can 
> just
> copy page->flags directly because you're holding PageLock.

I agree with you that current way of copying page flags individually 
could miss
new page flags. I will try to come up with something better. Copying 
page->flags as a whole
might not simply work, since the upper part of page->flags has the page 
node information,
which should not be changed. I think I need to add a helper function to 
just copy/exchange
all page flags, like calling migrate_page_stats() twice.

>> +static void exchange_page(char *to, char *from)
>> +{
>> +	u64 tmp;
>> +	int i;
>> +
>> +	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
>> +		tmp = *((u64 *)(from + i));
>> +		*((u64 *)(from + i)) = *((u64 *)(to + i));
>> +		*((u64 *)(to + i)) = tmp;
>> +	}
>> +}
>
> I have a suspicion you'd be better off allocating a temporary page and
> using copy_page().  Some architectures have put a lot of effort into
> making copy_page() run faster.

When I am doing exchange_pages() between two NUMA nodes on a x86_64 
machine,
I actually can saturate the QPI bandwidth with this operation. I think 
cache
prefetching was doing its job.

The purpose of proposing exchange_pages() is to avoid allocating any new 
page,
so that we would not trigger any potential page reclaim or memory 
compaction.
Allocating a temporary page defeats the purpose.

>
>> +		xa_lock_irq(&to_mapping->i_pages);
>> +
>> +		to_pslot = radix_tree_lookup_slot(&to_mapping->i_pages,
>> +			page_index(to_page));
>
> This needs to be converted to the XArray.  radix_tree_lookup_slot() is
> going away soon.  You probably need:
>
> 	XA_STATE(to_xas, &to_mapping->i_pages, page_index(to_page));

Thank you for pointing this out. I will do the change.

>
> This is a lot of code and I'm still trying to get my head aroud it 
> all.
> Thanks for putting in this work; it's good to see this approach being
> explored.

Thank you for taking a look at the code.

--
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
  2019-02-18 17:31     ` Zi Yan
@ 2019-02-18 17:42       ` Vlastimil Babka
  2019-02-18 17:51         ` Zi Yan
  0 siblings, 1 reply; 50+ messages in thread
From: Vlastimil Babka @ 2019-02-18 17:42 UTC (permalink / raw)
  To: Zi Yan, Matthew Wilcox
  Cc: linux-mm, linux-kernel, Dave Hansen, Michal Hocko,
	Kirill A . Shutemov, Andrew Morton, Mel Gorman, John Hubbard,
	Mark Hairgrove, Nitin Gupta, David Nellans

On 2/18/19 6:31 PM, Zi Yan wrote:
> The purpose of proposing exchange_pages() is to avoid allocating any new 
> page,
> so that we would not trigger any potential page reclaim or memory 
> compaction.
> Allocating a temporary page defeats the purpose.

Compaction can only happen for order > 0 temporary pages. Even if you used
single order = 0 page to gradually exchange e.g. a THP, it should be better than
u64. Allocating order = 0 should be a non-issue. If it's an issue, then the
system is in a bad state and physically contiguous layout is a secondary concern.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
  2019-02-18 17:42       ` Vlastimil Babka
@ 2019-02-18 17:51         ` Zi Yan
  2019-02-18 17:52           ` Matthew Wilcox
  0 siblings, 1 reply; 50+ messages in thread
From: Zi Yan @ 2019-02-18 17:51 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Matthew Wilcox, linux-mm, linux-kernel, Dave Hansen,
	Michal Hocko, Kirill A . Shutemov, Andrew Morton, Mel Gorman,
	John Hubbard, Mark Hairgrove, Nitin Gupta, David Nellans

On 18 Feb 2019, at 9:42, Vlastimil Babka wrote:

> On 2/18/19 6:31 PM, Zi Yan wrote:
>> The purpose of proposing exchange_pages() is to avoid allocating any 
>> new
>> page,
>> so that we would not trigger any potential page reclaim or memory
>> compaction.
>> Allocating a temporary page defeats the purpose.
>
> Compaction can only happen for order > 0 temporary pages. Even if you 
> used
> single order = 0 page to gradually exchange e.g. a THP, it should be 
> better than
> u64. Allocating order = 0 should be a non-issue. If it's an issue, 
> then the
> system is in a bad state and physically contiguous layout is a 
> secondary concern.

You are right if we only need to allocate one order-0 page. But this 
also means
we can only exchange two pages at a time. We need to add a lock to make 
sure
the temporary page is used exclusively or we need to keep allocating 
temporary pages
when multiple exchange_pages() are happening at the same time.

--
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
  2019-02-18 17:51         ` Zi Yan
@ 2019-02-18 17:52           ` Matthew Wilcox
  2019-02-18 17:59             ` Zi Yan
  0 siblings, 1 reply; 50+ messages in thread
From: Matthew Wilcox @ 2019-02-18 17:52 UTC (permalink / raw)
  To: Zi Yan
  Cc: Vlastimil Babka, linux-mm, linux-kernel, Dave Hansen,
	Michal Hocko, Kirill A . Shutemov, Andrew Morton, Mel Gorman,
	John Hubbard, Mark Hairgrove, Nitin Gupta, David Nellans

On Mon, Feb 18, 2019 at 09:51:33AM -0800, Zi Yan wrote:
> On 18 Feb 2019, at 9:42, Vlastimil Babka wrote:
> > On 2/18/19 6:31 PM, Zi Yan wrote:
> > > The purpose of proposing exchange_pages() is to avoid allocating any
> > > new
> > > page,
> > > so that we would not trigger any potential page reclaim or memory
> > > compaction.
> > > Allocating a temporary page defeats the purpose.
> > 
> > Compaction can only happen for order > 0 temporary pages. Even if you
> > used
> > single order = 0 page to gradually exchange e.g. a THP, it should be
> > better than
> > u64. Allocating order = 0 should be a non-issue. If it's an issue, then
> > the
> > system is in a bad state and physically contiguous layout is a secondary
> > concern.
> 
> You are right if we only need to allocate one order-0 page. But this also
> means
> we can only exchange two pages at a time. We need to add a lock to make sure
> the temporary page is used exclusively or we need to keep allocating
> temporary pages
> when multiple exchange_pages() are happening at the same time.

You allocate one temporary page per thread that's doing an exchange_page().


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
  2019-02-18 17:52           ` Matthew Wilcox
@ 2019-02-18 17:59             ` Zi Yan
  2019-02-19  7:42               ` Anshuman Khandual
  0 siblings, 1 reply; 50+ messages in thread
From: Zi Yan @ 2019-02-18 17:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Vlastimil Babka, linux-mm, linux-kernel, Dave Hansen,
	Michal Hocko, Kirill A . Shutemov, Andrew Morton, Mel Gorman,
	John Hubbard, Mark Hairgrove, Nitin Gupta, David Nellans

[-- Attachment #1: Type: text/plain, Size: 1302 bytes --]

On 18 Feb 2019, at 9:52, Matthew Wilcox wrote:

> On Mon, Feb 18, 2019 at 09:51:33AM -0800, Zi Yan wrote:
>> On 18 Feb 2019, at 9:42, Vlastimil Babka wrote:
>>> On 2/18/19 6:31 PM, Zi Yan wrote:
>>>> The purpose of proposing exchange_pages() is to avoid allocating any
>>>> new
>>>> page,
>>>> so that we would not trigger any potential page reclaim or memory
>>>> compaction.
>>>> Allocating a temporary page defeats the purpose.
>>>
>>> Compaction can only happen for order > 0 temporary pages. Even if you
>>> used
>>> single order = 0 page to gradually exchange e.g. a THP, it should be
>>> better than
>>> u64. Allocating order = 0 should be a non-issue. If it's an issue, then
>>> the
>>> system is in a bad state and physically contiguous layout is a secondary
>>> concern.
>>
>> You are right if we only need to allocate one order-0 page. But this also
>> means
>> we can only exchange two pages at a time. We need to add a lock to make sure
>> the temporary page is used exclusively or we need to keep allocating
>> temporary pages
>> when multiple exchange_pages() are happening at the same time.
>
> You allocate one temporary page per thread that's doing an exchange_page().

Yeah, you are right. I think at most I need NR_CPU order-0 pages. I will try
it. Thanks.

--
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
  2019-02-18 17:59             ` Zi Yan
@ 2019-02-19  7:42               ` Anshuman Khandual
  2019-02-19 12:56                 ` Matthew Wilcox
  0 siblings, 1 reply; 50+ messages in thread
From: Anshuman Khandual @ 2019-02-19  7:42 UTC (permalink / raw)
  To: Zi Yan, Matthew Wilcox
  Cc: Vlastimil Babka, linux-mm, linux-kernel, Dave Hansen,
	Michal Hocko, Kirill A . Shutemov, Andrew Morton, Mel Gorman,
	John Hubbard, Mark Hairgrove, Nitin Gupta, David Nellans



On 02/18/2019 11:29 PM, Zi Yan wrote:
> On 18 Feb 2019, at 9:52, Matthew Wilcox wrote:
> 
>> On Mon, Feb 18, 2019 at 09:51:33AM -0800, Zi Yan wrote:
>>> On 18 Feb 2019, at 9:42, Vlastimil Babka wrote:
>>>> On 2/18/19 6:31 PM, Zi Yan wrote:
>>>>> The purpose of proposing exchange_pages() is to avoid allocating any
>>>>> new
>>>>> page,
>>>>> so that we would not trigger any potential page reclaim or memory
>>>>> compaction.
>>>>> Allocating a temporary page defeats the purpose.
>>>>
>>>> Compaction can only happen for order > 0 temporary pages. Even if you
>>>> used
>>>> single order = 0 page to gradually exchange e.g. a THP, it should be
>>>> better than
>>>> u64. Allocating order = 0 should be a non-issue. If it's an issue, then
>>>> the
>>>> system is in a bad state and physically contiguous layout is a secondary
>>>> concern.
>>>
>>> You are right if we only need to allocate one order-0 page. But this also
>>> means
>>> we can only exchange two pages at a time. We need to add a lock to make sure
>>> the temporary page is used exclusively or we need to keep allocating
>>> temporary pages
>>> when multiple exchange_pages() are happening at the same time.
>>
>> You allocate one temporary page per thread that's doing an exchange_page().
> 
> Yeah, you are right. I think at most I need NR_CPU order-0 pages. I will try
> it. Thanks.

But the location of this temp page matters as well because you would like to
saturate the inter node interface. It needs to be either of the nodes where
the source or destination page belongs. Any other node would generate two
internode copy process which is not what you intend here I guess.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
  2019-02-19  7:42               ` Anshuman Khandual
@ 2019-02-19 12:56                 ` Matthew Wilcox
  2019-02-20  4:38                   ` Anshuman Khandual
  0 siblings, 1 reply; 50+ messages in thread
From: Matthew Wilcox @ 2019-02-19 12:56 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Zi Yan, Vlastimil Babka, linux-mm, linux-kernel, Dave Hansen,
	Michal Hocko, Kirill A . Shutemov, Andrew Morton, Mel Gorman,
	John Hubbard, Mark Hairgrove, Nitin Gupta, David Nellans

On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:
> But the location of this temp page matters as well because you would like to
> saturate the inter node interface. It needs to be either of the nodes where
> the source or destination page belongs. Any other node would generate two
> internode copy process which is not what you intend here I guess.

That makes no sense.  It should be allocated on the local node of the CPU
performing the copy.  If the CPU is in node A, the destination is in node B
and the source is in node C, then you're doing 4k worth of reads from node C,
4k worth of reads from node B, 4k worth of writes to node C followed by
4k worth of writes to node B.  Eventually the 4k of dirty cachelines on
node A will be written back from cache to the local memory (... or not,
if that page gets reused for some other purpose first).

If you allocate the page on node B or node C, that's an extra 4k of writes
to be sent across the inter-node link.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/31] Generating physically contiguous memory after page allocation
  2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
                   ` (30 preceding siblings ...)
  2019-02-15 22:08 ` [RFC PATCH 31/31] sysctl: toggle to promote PUD-mapped 1GB THP or not Zi Yan
@ 2019-02-20  1:42 ` Mike Kravetz
  2019-02-20  2:33   ` Zi Yan
  31 siblings, 1 reply; 50+ messages in thread
From: Mike Kravetz @ 2019-02-20  1:42 UTC (permalink / raw)
  To: ziy, linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans

On 2/15/19 2:08 PM, Zi Yan wrote:

Thanks for working on this issue!

I have not yet had a chance to take a look at the code.  However, I do have
some general questions/comments on the approach.

> Patch structure 
> ---- 
> 
> The patchset I developed to generate physically contiguous memory/arbitrary
> sized pages merely moves pages around. There are three components in this
> patchset:
> 
> 1) a new page migration mechanism, called exchange pages, that exchanges the
> content of two in-use pages instead of performing two back-to-back page
> migration. It saves on overheads and avoids page reclaim and memory compaction
> in the page allocation path, although it is not strictly required if enough
> free memory is available in the system.
> 
> 2) a new mechanism that utilizes both page migration and exchange pages to
> produce physically contiguous memory/arbitrary sized pages without allocating
> any new pages, unlike what khugepaged does. It works on per-VMA basis, creating
> physically contiguous memory out of each VMA, which is virtually contiguous.
> A simple range tree is used to ensure no two VMAs are overlapping with each
> other in the physical address space.

This appears to be a new approach to generating contiguous areas.  Previous
attempts had relied on finding a contiguous area that can then be used for
various purposes including user mappings.  Here, you take an existing mapping
and make it contiguous.  [RFC PATCH 04/31] mm: add mem_defrag functionality
talks about creating a (VPN, PFN) anchor pair for each vma and then using
this pair as the base for creating a contiguous area.

I'm curious, how 'fixed' is the anchor?  As you know, there could be a
non-movable page in the PFN range.  As a result, you will not be able to
create a contiguous area starting at that PFN.  In such a case, do we try
another PFN?  I know this could result in much page shuffling.  I'm just
trying to figure out how we satisfy a user who really wants a contiguous
area.  Is there some method to keep trying?

My apologies if this is addressed in the code.  This was just one of the
first thoughts that came to mine when giving the series a quick look.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/31] Generating physically contiguous memory after page allocation
  2019-02-20  1:42 ` [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Mike Kravetz
@ 2019-02-20  2:33   ` Zi Yan
  2019-02-20  3:18     ` Mike Kravetz
  0 siblings, 1 reply; 50+ messages in thread
From: Zi Yan @ 2019-02-20  2:33 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Dave Hansen, Michal Hocko,
	Kirill A . Shutemov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	John Hubbard, Mark Hairgrove, Nitin Gupta, David Nellans

On 19 Feb 2019, at 17:42, Mike Kravetz wrote:

> On 2/15/19 2:08 PM, Zi Yan wrote:
>
> Thanks for working on this issue!
>
> I have not yet had a chance to take a look at the code.  However, I do 
> have
> some general questions/comments on the approach.

Thanks for replying. The code is very intrusive and has a lot of hacks, 
so it is
OK for us to discuss the general idea first. :)


>> Patch structure
>> ----
>>
>> The patchset I developed to generate physically contiguous 
>> memory/arbitrary
>> sized pages merely moves pages around. There are three components in 
>> this
>> patchset:
>>
>> 1) a new page migration mechanism, called exchange pages, that 
>> exchanges the
>> content of two in-use pages instead of performing two back-to-back 
>> page
>> migration. It saves on overheads and avoids page reclaim and memory 
>> compaction
>> in the page allocation path, although it is not strictly required if 
>> enough
>> free memory is available in the system.
>>
>> 2) a new mechanism that utilizes both page migration and exchange 
>> pages to
>> produce physically contiguous memory/arbitrary sized pages without 
>> allocating
>> any new pages, unlike what khugepaged does. It works on per-VMA 
>> basis, creating
>> physically contiguous memory out of each VMA, which is virtually 
>> contiguous.
>> A simple range tree is used to ensure no two VMAs are overlapping 
>> with each
>> other in the physical address space.
>
> This appears to be a new approach to generating contiguous areas.  
> Previous
> attempts had relied on finding a contiguous area that can then be used 
> for
> various purposes including user mappings.  Here, you take an existing 
> mapping
> and make it contiguous.  [RFC PATCH 04/31] mm: add mem_defrag 
> functionality
> talks about creating a (VPN, PFN) anchor pair for each vma and then 
> using
> this pair as the base for creating a contiguous area.
>
> I'm curious, how 'fixed' is the anchor?  As you know, there could be a
> non-movable page in the PFN range.  As a result, you will not be able 
> to
> create a contiguous area starting at that PFN.  In such a case, do we 
> try
> another PFN?  I know this could result in much page shuffling.  I'm 
> just
> trying to figure out how we satisfy a user who really wants a 
> contiguous
> area.  Is there some method to keep trying?

Good question. The anchor is determined on a per-VMA basis, which can be 
changed easily,
but in this patchiest, I used a very simple strategy — making all VMAs 
not overlapping
in the physical address space to get maximum overall contiguity and not 
changing anchors
even if non-moveable pages are encountered when generating physically 
contiguous pages.

Basically, first VMA1 in the virtual address space has its anchor as 
(VMA1_start_VPN, ZONE_start_PFN),
second VMA1 has its anchor as (VMA2_start_VPN, ZONE_start_PFN + 
VMA1_size), and so on.
This makes all VMA not overlapping in physical address space during 
contiguous memory
generation. When there is a non-moveable page, the anchor will not be 
changed, because
no matter whether we assign a new anchor or not, the contiguous pages 
stops at
the non-moveable page. If we are trying to get a new anchor, more effort 
is needed to
avoid overlapping new anchor with existing contiguous pages. Any 
overlapping will
nullify the existing contiguous pages.

To satisfy a user who wants a contiguous area with N pages, the minimal 
distance between
any two non-moveable pages should be bigger than N pages in the system 
memory. Otherwise,
nothing would work. If there is such an area (PFN1, PFN1+N) in the 
physical address space,
you can set the anchor to (VPN_USER, PFN1) and use exchange_pages() to 
generate a contiguous
area with N pages. Instead, alloc_contig_pages(PFN1, PFN1+N, …) could 
also work, but
only at page allocation time. It also requires the system has N free 
pages when
alloc_contig_pages() are migrating the pages in (PFN1, PFN1+N) away, or 
you need to swap
pages to make the space.

Let me know if this makes sense to you.

--
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/31] Generating physically contiguous memory after page allocation
  2019-02-20  2:33   ` Zi Yan
@ 2019-02-20  3:18     ` Mike Kravetz
  2019-02-20  5:19       ` Zi Yan
  0 siblings, 1 reply; 50+ messages in thread
From: Mike Kravetz @ 2019-02-20  3:18 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-kernel, Dave Hansen, Michal Hocko,
	Kirill A . Shutemov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	John Hubbard, Mark Hairgrove, Nitin Gupta, David Nellans

On 2/19/19 6:33 PM, Zi Yan wrote:
> On 19 Feb 2019, at 17:42, Mike Kravetz wrote:
> 
>> On 2/15/19 2:08 PM, Zi Yan wrote:
>>
>> Thanks for working on this issue!
>>
>> I have not yet had a chance to take a look at the code.  However, I do have
>> some general questions/comments on the approach.
> 
> Thanks for replying. The code is very intrusive and has a lot of hacks, so it is
> OK for us to discuss the general idea first. :)
> 
> 
>>> Patch structure
>>> ----
>>>
>>> The patchset I developed to generate physically contiguous memory/arbitrary
>>> sized pages merely moves pages around. There are three components in this
>>> patchset:
>>>
>>> 1) a new page migration mechanism, called exchange pages, that exchanges the
>>> content of two in-use pages instead of performing two back-to-back page
>>> migration. It saves on overheads and avoids page reclaim and memory compaction
>>> in the page allocation path, although it is not strictly required if enough
>>> free memory is available in the system.
>>>
>>> 2) a new mechanism that utilizes both page migration and exchange pages to
>>> produce physically contiguous memory/arbitrary sized pages without allocating
>>> any new pages, unlike what khugepaged does. It works on per-VMA basis, creating
>>> physically contiguous memory out of each VMA, which is virtually contiguous.
>>> A simple range tree is used to ensure no two VMAs are overlapping with each
>>> other in the physical address space.
>>
>> This appears to be a new approach to generating contiguous areas.  Previous
>> attempts had relied on finding a contiguous area that can then be used for
>> various purposes including user mappings.  Here, you take an existing mapping
>> and make it contiguous.  [RFC PATCH 04/31] mm: add mem_defrag functionality
>> talks about creating a (VPN, PFN) anchor pair for each vma and then using
>> this pair as the base for creating a contiguous area.
>>
>> I'm curious, how 'fixed' is the anchor?  As you know, there could be a
>> non-movable page in the PFN range.  As a result, you will not be able to
>> create a contiguous area starting at that PFN.  In such a case, do we try
>> another PFN?  I know this could result in much page shuffling.  I'm just
>> trying to figure out how we satisfy a user who really wants a contiguous
>> area.  Is there some method to keep trying?
> 
> Good question. The anchor is determined on a per-VMA basis, which can be changed
> easily,
> but in this patchiest, I used a very simple strategy — making all VMAs not
> overlapping
> in the physical address space to get maximum overall contiguity and not changing
> anchors
> even if non-moveable pages are encountered when generating physically contiguous
> pages.
> 
> Basically, first VMA1 in the virtual address space has its anchor as
> (VMA1_start_VPN, ZONE_start_PFN),
> second VMA1 has its anchor as (VMA2_start_VPN, ZONE_start_PFN + VMA1_size), and
> so on.
> This makes all VMA not overlapping in physical address space during contiguous
> memory
> generation. When there is a non-moveable page, the anchor will not be changed,
> because
> no matter whether we assign a new anchor or not, the contiguous pages stops at
> the non-moveable page. If we are trying to get a new anchor, more effort is
> needed to
> avoid overlapping new anchor with existing contiguous pages. Any overlapping will
> nullify the existing contiguous pages.
> 
> To satisfy a user who wants a contiguous area with N pages, the minimal distance
> between
> any two non-moveable pages should be bigger than N pages in the system memory.
> Otherwise,
> nothing would work. If there is such an area (PFN1, PFN1+N) in the physical
> address space,
> you can set the anchor to (VPN_USER, PFN1) and use exchange_pages() to generate
> a contiguous
> area with N pages. Instead, alloc_contig_pages(PFN1, PFN1+N, …) could also work,
> but
> only at page allocation time. It also requires the system has N free pages when
> alloc_contig_pages() are migrating the pages in (PFN1, PFN1+N) away, or you need
> to swap
> pages to make the space.
> 
> Let me know if this makes sense to you.
> 

Yes, that is how I expected the implementation would work.  Thank you.

Another high level question.  One of the benefits of this approach is
that exchanging pages does not require N free pages as you describe
above.  This assumes that the vma which we are trying to make contiguous
is already populated.  If it is not populated, then you also need to
have N free pages.  Correct?  If this is true, then is the expected use
case to first populate a vma, and then try to make contiguous?  I would
assume that if it is not populated and a request to make contiguous is
given, we should try to allocate/populate the vma with contiguous pages
at that time?
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
  2019-02-19 12:56                 ` Matthew Wilcox
@ 2019-02-20  4:38                   ` Anshuman Khandual
  2019-03-14  2:39                     ` Zi Yan
  0 siblings, 1 reply; 50+ messages in thread
From: Anshuman Khandual @ 2019-02-20  4:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zi Yan, Vlastimil Babka, linux-mm, linux-kernel, Dave Hansen,
	Michal Hocko, Kirill A . Shutemov, Andrew Morton, Mel Gorman,
	John Hubbard, Mark Hairgrove, Nitin Gupta, David Nellans



On 02/19/2019 06:26 PM, Matthew Wilcox wrote:
> On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:
>> But the location of this temp page matters as well because you would like to
>> saturate the inter node interface. It needs to be either of the nodes where
>> the source or destination page belongs. Any other node would generate two
>> internode copy process which is not what you intend here I guess.
> That makes no sense.  It should be allocated on the local node of the CPU
> performing the copy.  If the CPU is in node A, the destination is in node B
> and the source is in node C, then you're doing 4k worth of reads from node C,
> 4k worth of reads from node B, 4k worth of writes to node C followed by
> 4k worth of writes to node B.  Eventually the 4k of dirty cachelines on
> node A will be written back from cache to the local memory (... or not,
> if that page gets reused for some other purpose first).
> 
> If you allocate the page on node B or node C, that's an extra 4k of writes
> to be sent across the inter-node link.

Thats right there will be an extra remote write. My assumption was that the CPU
performing the copy belongs to either node B or node C.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/31] Generating physically contiguous memory after page allocation
  2019-02-20  3:18     ` Mike Kravetz
@ 2019-02-20  5:19       ` Zi Yan
  2019-02-20  5:27         ` Mike Kravetz
  0 siblings, 1 reply; 50+ messages in thread
From: Zi Yan @ 2019-02-20  5:19 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Dave Hansen, Michal Hocko,
	Kirill A . Shutemov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	John Hubbard, Mark Hairgrove, Nitin Gupta, David Nellans

On 19 Feb 2019, at 19:18, Mike Kravetz wrote:

> On 2/19/19 6:33 PM, Zi Yan wrote:
>> On 19 Feb 2019, at 17:42, Mike Kravetz wrote:
>>
>>> On 2/15/19 2:08 PM, Zi Yan wrote:
>>>
>>> Thanks for working on this issue!
>>>
>>> I have not yet had a chance to take a look at the code.  However, I 
>>> do have
>>> some general questions/comments on the approach.
>>
>> Thanks for replying. The code is very intrusive and has a lot of 
>> hacks, so it is
>> OK for us to discuss the general idea first. :)
>>
>>
>>>> Patch structure
>>>> ----
>>>>
>>>> The patchset I developed to generate physically contiguous 
>>>> memory/arbitrary
>>>> sized pages merely moves pages around. There are three components 
>>>> in this
>>>> patchset:
>>>>
>>>> 1) a new page migration mechanism, called exchange pages, that 
>>>> exchanges the
>>>> content of two in-use pages instead of performing two back-to-back 
>>>> page
>>>> migration. It saves on overheads and avoids page reclaim and memory 
>>>> compaction
>>>> in the page allocation path, although it is not strictly required 
>>>> if enough
>>>> free memory is available in the system.
>>>>
>>>> 2) a new mechanism that utilizes both page migration and exchange 
>>>> pages to
>>>> produce physically contiguous memory/arbitrary sized pages without 
>>>> allocating
>>>> any new pages, unlike what khugepaged does. It works on per-VMA 
>>>> basis, creating
>>>> physically contiguous memory out of each VMA, which is virtually 
>>>> contiguous.
>>>> A simple range tree is used to ensure no two VMAs are overlapping 
>>>> with each
>>>> other in the physical address space.
>>>
>>> This appears to be a new approach to generating contiguous areas.  
>>> Previous
>>> attempts had relied on finding a contiguous area that can then be 
>>> used for
>>> various purposes including user mappings.  Here, you take an 
>>> existing mapping
>>> and make it contiguous.  [RFC PATCH 04/31] mm: add mem_defrag 
>>> functionality
>>> talks about creating a (VPN, PFN) anchor pair for each vma and then 
>>> using
>>> this pair as the base for creating a contiguous area.
>>>
>>> I'm curious, how 'fixed' is the anchor?  As you know, there could be 
>>> a
>>> non-movable page in the PFN range.  As a result, you will not be 
>>> able to
>>> create a contiguous area starting at that PFN.  In such a case, do 
>>> we try
>>> another PFN?  I know this could result in much page shuffling.  I'm 
>>> just
>>> trying to figure out how we satisfy a user who really wants a 
>>> contiguous
>>> area.  Is there some method to keep trying?
>>
>> Good question. The anchor is determined on a per-VMA basis, which can 
>> be changed
>> easily,
>> but in this patchiest, I used a very simple strategy — making all 
>> VMAs not
>> overlapping
>> in the physical address space to get maximum overall contiguity and 
>> not changing
>> anchors
>> even if non-moveable pages are encountered when generating physically 
>> contiguous
>> pages.
>>
>> Basically, first VMA1 in the virtual address space has its anchor as
>> (VMA1_start_VPN, ZONE_start_PFN),
>> second VMA1 has its anchor as (VMA2_start_VPN, ZONE_start_PFN + 
>> VMA1_size), and
>> so on.
>> This makes all VMA not overlapping in physical address space during 
>> contiguous
>> memory
>> generation. When there is a non-moveable page, the anchor will not be 
>> changed,
>> because
>> no matter whether we assign a new anchor or not, the contiguous pages 
>> stops at
>> the non-moveable page. If we are trying to get a new anchor, more 
>> effort is
>> needed to
>> avoid overlapping new anchor with existing contiguous pages. Any 
>> overlapping will
>> nullify the existing contiguous pages.
>>
>> To satisfy a user who wants a contiguous area with N pages, the 
>> minimal distance
>> between
>> any two non-moveable pages should be bigger than N pages in the 
>> system memory.
>> Otherwise,
>> nothing would work. If there is such an area (PFN1, PFN1+N) in the 
>> physical
>> address space,
>> you can set the anchor to (VPN_USER, PFN1) and use exchange_pages() 
>> to generate
>> a contiguous
>> area with N pages. Instead, alloc_contig_pages(PFN1, PFN1+N, …) 
>> could also work,
>> but
>> only at page allocation time. It also requires the system has N free 
>> pages when
>> alloc_contig_pages() are migrating the pages in (PFN1, PFN1+N) away, 
>> or you need
>> to swap
>> pages to make the space.
>>
>> Let me know if this makes sense to you.
>>
>
> Yes, that is how I expected the implementation would work.  Thank you.
>
> Another high level question.  One of the benefits of this approach is
> that exchanging pages does not require N free pages as you describe
> above.  This assumes that the vma which we are trying to make 
> contiguous
> is already populated.  If it is not populated, then you also need to
> have N free pages.  Correct?  If this is true, then is the expected 
> use
> case to first populate a vma, and then try to make contiguous?  I 
> would
> assume that if it is not populated and a request to make contiguous is
> given, we should try to allocate/populate the vma with contiguous 
> pages
> at that time?

Yes, I assume the pages within the VMA are already populated but not 
contiguous yet.

My approach considers memory contiguity as an on-demand resource. In 
some phases
of an application, accelerators or RDMA controllers would 
process/transfer data in one
or more VMAs, at which time contiguous memory can help reduce address 
translation
overheads or lift certain constraints. And different VMAs could be 
processed at
different program phases, thus it might be hard to get contiguous memory 
for all
these VMAs at the allocation time using alloc_contig_pages(). My 
approach can
help get contiguous memory later, when the demand comes.

For some cases, you definitely can use alloc_contig_pages() to give 
users
a contiguous area at page allocation time, if you know the user is going 
to use this
area for accelerator data processing or as a RDMA buffer and the area 
size is fixed.

In addition, we can also use khugepaged approach, having a daemon 
periodically
scan VMAs and use alloc_contig_pages() to convert non-contiguous pages 
in a VMA
to contiguous pages, but it would require N free pages during the 
conversion.

In sum, my approach complements alloc_contig_pages() and provides more 
flexibility.
It is not trying to replaces alloc_contig_pages().


--
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 00/31] Generating physically contiguous memory after page allocation
  2019-02-20  5:19       ` Zi Yan
@ 2019-02-20  5:27         ` Mike Kravetz
  0 siblings, 0 replies; 50+ messages in thread
From: Mike Kravetz @ 2019-02-20  5:27 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, linux-kernel, Dave Hansen, Michal Hocko,
	Kirill A . Shutemov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	John Hubbard, Mark Hairgrove, Nitin Gupta, David Nellans

On 2/19/19 9:19 PM, Zi Yan wrote:
> On 19 Feb 2019, at 19:18, Mike Kravetz wrote:
>> Another high level question.  One of the benefits of this approach is
>> that exchanging pages does not require N free pages as you describe
>> above.  This assumes that the vma which we are trying to make contiguous
>> is already populated.  If it is not populated, then you also need to
>> have N free pages.  Correct?  If this is true, then is the expected use
>> case to first populate a vma, and then try to make contiguous?  I would
>> assume that if it is not populated and a request to make contiguous is
>> given, we should try to allocate/populate the vma with contiguous pages
>> at that time?
> 
> Yes, I assume the pages within the VMA are already populated but not contiguous
> yet.
> 
> My approach considers memory contiguity as an on-demand resource. In some phases
> of an application, accelerators or RDMA controllers would process/transfer data
> in one
> or more VMAs, at which time contiguous memory can help reduce address translation
> overheads or lift certain constraints. And different VMAs could be processed at
> different program phases, thus it might be hard to get contiguous memory for all
> these VMAs at the allocation time using alloc_contig_pages(). My approach can
> help get contiguous memory later, when the demand comes.
> 
> For some cases, you definitely can use alloc_contig_pages() to give users
> a contiguous area at page allocation time, if you know the user is going to use
> this
> area for accelerator data processing or as a RDMA buffer and the area size is
> fixed.
> 
> In addition, we can also use khugepaged approach, having a daemon periodically
> scan VMAs and use alloc_contig_pages() to convert non-contiguous pages in a VMA
> to contiguous pages, but it would require N free pages during the conversion.
> 
> In sum, my approach complements alloc_contig_pages() and provides more flexibility.
> It is not trying to replaces alloc_contig_pages().

Thank you for the explanation.  That makes sense.  I have mostly been
thinking about contiguous memory from an allocation perspective and did
not really consider other use cases.

-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
  2019-02-15 22:08 ` [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages Zi Yan
  2019-02-17 11:29   ` Matthew Wilcox
@ 2019-02-21 21:10   ` Jerome Glisse
  2019-02-21 21:25     ` Zi Yan
  1 sibling, 1 reply; 50+ messages in thread
From: Jerome Glisse @ 2019-02-21 21:10 UTC (permalink / raw)
  To: ziy
  Cc: linux-mm, linux-kernel, Dave Hansen, Michal Hocko,
	Kirill A . Shutemov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	John Hubbard, Mark Hairgrove, Nitin Gupta, David Nellans

On Fri, Feb 15, 2019 at 02:08:26PM -0800, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> In stead of using two migrate_pages(), a single exchange_pages() would
> be sufficient and without allocating new pages.

So i believe it would be better to arrange the code differently instead
of having one function that special case combination, define function for
each one ie:
    exchange_anon_to_share()
    exchange_anon_to_anon()
    exchange_share_to_share()

Then you could define function to test if a page is in correct states:
    can_exchange_anon_page() // return true if page can be exchange
    can_exchange_share_page()

In fact both of this function can be factor out as common helpers with the
existing migrate code within migrate.c This way we would have one place
only where we need to handle all the special casing, test and exceptions.

Other than that i could not spot anything obviously wrong but i did not
spent enough time to check everything. Re-architecturing the code like
i propose above would make this a lot easier to review i believe.

Cheers,
Jérôme

> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  include/linux/ksm.h |   5 +
>  mm/Makefile         |   1 +
>  mm/exchange.c       | 846 ++++++++++++++++++++++++++++++++++++++++++++
>  mm/internal.h       |   6 +
>  mm/ksm.c            |  35 ++
>  mm/migrate.c        |   4 +-
>  6 files changed, 895 insertions(+), 2 deletions(-)
>  create mode 100644 mm/exchange.c

[...]

> +	from_page_count = page_count(from_page);
> +	from_map_count = page_mapcount(from_page);
> +	to_page_count = page_count(to_page);
> +	to_map_count = page_mapcount(to_page);
> +	from_flags = from_page->flags;
> +	to_flags = to_page->flags;
> +	from_mapping = from_page->mapping;
> +	to_mapping = to_page->mapping;
> +	from_index = from_page->index;
> +	to_index = to_page->index;

Those are not use anywhere ...


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
  2019-02-21 21:10   ` Jerome Glisse
@ 2019-02-21 21:25     ` Zi Yan
  0 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-21 21:25 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-mm, linux-kernel, Dave Hansen, Michal Hocko,
	Kirill A . Shutemov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	John Hubbard, Mark Hairgrove, Nitin Gupta, David Nellans

On 21 Feb 2019, at 13:10, Jerome Glisse wrote:

> On Fri, Feb 15, 2019 at 02:08:26PM -0800, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> In stead of using two migrate_pages(), a single exchange_pages() would
>> be sufficient and without allocating new pages.
>
> So i believe it would be better to arrange the code differently instead
> of having one function that special case combination, define function for
> each one ie:
>     exchange_anon_to_share()
>     exchange_anon_to_anon()
>     exchange_share_to_share()
>
> Then you could define function to test if a page is in correct states:
>     can_exchange_anon_page() // return true if page can be exchange
>     can_exchange_share_page()
>
> In fact both of this function can be factor out as common helpers with the
> existing migrate code within migrate.c This way we would have one place
> only where we need to handle all the special casing, test and exceptions.
>
> Other than that i could not spot anything obviously wrong but i did not
> spent enough time to check everything. Re-architecturing the code like
> i propose above would make this a lot easier to review i believe.
>

Thank you for reviewing the patch. Your suggestions are very helpful.
I will restructure the code to help people review it.


>> +	from_page_count = page_count(from_page);
>> +	from_map_count = page_mapcount(from_page);
>> +	to_page_count = page_count(to_page);
>> +	to_map_count = page_mapcount(to_page);
>> +	from_flags = from_page->flags;
>> +	to_flags = to_page->flags;
>> +	from_mapping = from_page->mapping;
>> +	to_mapping = to_page->mapping;
>> +	from_index = from_page->index;
>> +	to_index = to_page->index;
>
> Those are not use anywhere ...

Will remove them. Thanks.

--
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
  2019-02-20  4:38                   ` Anshuman Khandual
@ 2019-03-14  2:39                     ` Zi Yan
  0 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-03-14  2:39 UTC (permalink / raw)
  To: Anshuman Khandual, Matthew Wilcox, Vlastimil Babka
  Cc: linux-mm, linux-kernel, Dave Hansen, Michal Hocko,
	Kirill A . Shutemov, Andrew Morton, Mel Gorman, John Hubbard,
	Mark Hairgrove, Nitin Gupta, David Nellans

On 19 Feb 2019, at 20:38, Anshuman Khandual wrote:

> On 02/19/2019 06:26 PM, Matthew Wilcox wrote:
>> On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:
>>> But the location of this temp page matters as well because you would 
>>> like to
>>> saturate the inter node interface. It needs to be either of the 
>>> nodes where
>>> the source or destination page belongs. Any other node would 
>>> generate two
>>> internode copy process which is not what you intend here I guess.
>> That makes no sense.  It should be allocated on the local node of the 
>> CPU
>> performing the copy.  If the CPU is in node A, the destination is in 
>> node B
>> and the source is in node C, then you're doing 4k worth of reads from 
>> node C,
>> 4k worth of reads from node B, 4k worth of writes to node C followed 
>> by
>> 4k worth of writes to node B.  Eventually the 4k of dirty cachelines 
>> on
>> node A will be written back from cache to the local memory (... or 
>> not,
>> if that page gets reused for some other purpose first).
>>
>> If you allocate the page on node B or node C, that's an extra 4k of 
>> writes
>> to be sent across the inter-node link.
>
> Thats right there will be an extra remote write. My assumption was 
> that the CPU
> performing the copy belongs to either node B or node C.


I have some interesting throughput results for exchange per u64 and 
exchange per 4KB page.
What I discovered is that using a 4KB page as the temporary storage for 
exchanging
2MB THPs does not improve the throughput. On contrary, when we are 
exchanging more than 2^4=16 THPs,
exchanging per 4KB page has lower throughput than exchanging per u64. 
Please see results below.

The experiments are done on a two socket machine with two Intel Xeon 
E5-2640 v3 CPUs.
All exchanges are done via the QPI link across two sockets.


Results
===

Throughput (GB/s) of exchanging 2 order-N 2MB pages between two NUMA 
nodes

| 2mb_page_order | 0    | 1    | 2    | 3    | 4    | 5    | 6    | 7    
| 8    | 9
|     u64        | 5.31 | 5.58 | 5.89 | 5.69 | 8.97 | 9.51 | 9.21 | 9.50 
| 9.57 | 9.62
|     per_page   | 5.85 | 6.48 | 6.20 | 5.26 | 7.22 | 7.25 | 7.28 | 7.30 
| 7.32 | 7.31

Normalized throughput (to per_page)

  2mb_page_order | 0    | 1    | 2    | 3    | 4    | 5    | 6    | 7    
| 8    | 9
      u64        | 0.90 | 0.86 | 0.94 | 1.08 | 1.24 | 1.31 |1.26  | 1.30 
| 1.30 | 1.31



Exchange page code
===

For exchanging per u64, I use the following function:

static void exchange_page(char *to, char *from)
{
	u64 tmp;
	int i;

	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
		tmp = *((u64 *)(from + i));
		*((u64 *)(from + i)) = *((u64 *)(to + i));
		*((u64 *)(to + i)) = tmp;
	}
}


For exchange per 4KB, I use the following function:

static void exchange_page2(char *to, char *from)
{
	int cpu = smp_processor_id();

	VM_BUG_ON(!in_atomic());

	if (!page_tmp[cpu]) {
		int nid = cpu_to_node(cpu);
		struct page *page_tmp_page = alloc_pages_node(nid, GFP_KERNEL, 0);
		if (!page_tmp_page) {
			exchange_page(to, from);
			return;
		}
		page_tmp[cpu] = kmap(page_tmp_page);
	}

	copy_page(page_tmp[cpu], to);
	copy_page(to, from);
	copy_page(from, page_tmp[cpu]);
}

where page_tmp is pre-allocated local to each CPU and alloc_pages_node() 
above
is for hot-added CPUs, which is not used in the tests.


The kernel is available at: https://gitlab.com/ziy/linux-contig-mem-rfc
To do a comparison, you can clone this repo: 
https://gitlab.com/ziy/thp-migration-bench,
then make, ./run_test.sh, and ./get_results.sh using the kernel from 
above.

Let me know if I missed anything or did something wrong. Thanks.


--
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH 00/31] Generating physically contiguous memory after page allocation
@ 2019-02-15 22:03 Zi Yan
  0 siblings, 0 replies; 50+ messages in thread
From: Zi Yan @ 2019-02-15 22:03 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

Hi all,

This patchset produces physically contiguous memory by moving in-use pages
without allocating any new pages. It targets two scenarios that complements
khugepaged use cases: 1) avoiding page reclaim and memory compaction when the
system is under memory pressure because this patchset does not allocate any new
pages, 2) generating pages larger than 2^MAX_ORDER without changing the buddy
allocator.

To demonstrate its use, I add very basic 1GB THP support and enable promoting
512 2MB THPs to a 1GB THP in my patchset. Promoting 512 4KB pages to a 2MB
THP is also implemented.

The patches are on top of v5.0-rc5. They are posted as part of my upcoming
LSF/MM proposal.

Motivation 
---- 

The goal of this patchset is to provide alternative way of generating physically
contiguous memory and making it available as arbitrary sized large pages. This
patchset generates physically contiguous memory/arbitrary size pages after pages
are allocated by moving virtually-contiguous pages to become physically
contiguous at any size, thus it does not require changes to memory allocators.
On the other hand, it works only for moveable pages, so it also faces the same
fragmentation issues as memory compaction, i.e., if non-moveable pages spread
across the entire memory, this patchset can only generate contiguity between
any two non-moveable pages. 

Large pages and physically contiguous memory are important to devices, such as
GPUs, FPGAs, NICs and RDMA controllers, because they can often achieve better
performance when operating on large pages. The same can be said of CPU
performance, of course, but there is an important difference: GPUs and
high-throughput devices often take a more severe performance hit, in the event
of a TLB miss and subsequent page table walks, as compared to a CPU. The effect
is sufficiently large that such devices *really* want a highly reliable way to
allocate large pages to minimize the number of potential TLB misses and the time
spent on the induced page table walks. 

Vendors (like Oracle, Mellanox, IBM, NVIDIA) are interested in generating
physically contiguous memory beyond THP sizes and looking for solutions [1],[2],[3].
This patchset provides an alternative approach, compared to allocating
physically contiguous memory at page allocation time, to generating physically
contiguous memory after pages are allocated. This approach can avoid page
reclaim and memory compaction, which happen during the process of page
allocation, but still produces comparable physically contiguous memory. 

In terms of THPs, it helps, but we are interested in even larger contiguous
ranges (or page size support) to further reduce the address translation overheads.
With this patchset, we can generate pages larger than PMD-level THPs without
requiring MAX_ORDER changes in the buddy allocators. 


Patch structure 
---- 

The patchset I developed to generate physically contiguous memory/arbitrary
sized pages merely moves pages around. There are three components in this
patchset:

1) a new page migration mechanism, called exchange pages, that exchanges the
content of two in-use pages instead of performing two back-to-back page
migration. It saves on overheads and avoids page reclaim and memory compaction
in the page allocation path, although it is not strictly required if enough
free memory is available in the system.

2) a new mechanism that utilizes both page migration and exchange pages to
produce physically contiguous memory/arbitrary sized pages without allocating
any new pages, unlike what khugepaged does. It works on per-VMA basis, creating
physically contiguous memory out of each VMA, which is virtually contiguous.
A simple range tree is used to ensure no two VMAs are overlapping with each
other in the physical address space.

3) a use case of the new physically contiguous memory producing mechanism that
generates 1GB THPs by migrating and exchanging pages and promoting 512
contiguous 2MB THPs to a 1GB THP, although even larger physically contiguous
memory ranges can be generated. The 1GB THP implement is very basic, which can
handle 1GB THP faults when buddy allocator is modified to allocate 1GB pages,
support 1GB THP split to 2MB THP and in-place promotion from 2MB THP to 1GB THP,
and PMD/PTE-mapped 1GB THP. These are not fully tested.


[1] https://lwn.net/Articles/736170/ 
[2] https://lwn.net/Articles/753167/ 
[3] https://blogs.nvidia.com/blog/2018/06/08/worlds-fastest-exascale-ai-supercomputer-summit/ 

Zi Yan (31):
  mm: migrate: Add exchange_pages to exchange two lists of pages.
  mm: migrate: Add THP exchange support.
  mm: migrate: Add tmpfs exchange support.
  mm: add mem_defrag functionality.
  mem_defrag: split a THP if either src or dst is THP only.
  mm: Make MAX_ORDER configurable in Kconfig for buddy allocator.
  mm: deallocate pages with order > MAX_ORDER.
  mm: add pagechain container for storing multiple pages.
  mm: thp: 1GB anonymous page implementation.
  mm: proc: add 1GB THP kpageflag.
  mm: debug: print compound page order in dump_page().
  mm: stats: Separate PMD THP and PUD THP stats.
  mm: thp: 1GB THP copy on write implementation.
  mm: thp: handling 1GB THP reference bit.
  mm: thp: add 1GB THP split_huge_pud_page() function.
  mm: thp: check compound_mapcount of PMD-mapped PUD THPs at free time.
  mm: thp: split properly PMD-mapped PUD THP to PTE-mapped PUD THP.
  mm: page_vma_walk: teach it about PMD-mapped PUD THP.
  mm: thp: 1GB THP support in try_to_unmap().
  mm: thp: split 1GB THPs at page reclaim.
  mm: thp: 1GB zero page shrinker.
  mm: thp: 1GB THP follow_p*d_page() support.
  mm: support 1GB THP pagemap support.
  sysctl: add an option to only print the head page virtual address.
  mm: thp: add a knob to enable/disable 1GB THPs.
  mm: thp: promote PTE-mapped THP to PMD-mapped THP.
  mm: thp: promote PMD-mapped PUD pages to PUD-mapped PUD pages.
  mm: vmstats: add page promotion stats.
  mm: madvise: add madvise options to split PMD and PUD THPs.
  mm: mem_defrag: thp: PMD THP and PUD THP in-place promotion support.
  sysctl: toggle to promote PUD-mapped 1GB THP or not.

 arch/x86/Kconfig                       |   15 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 arch/x86/include/asm/pgalloc.h         |   69 +
 arch/x86/include/asm/pgtable.h         |   20 +
 arch/x86/include/asm/sparsemem.h       |    4 +-
 arch/x86/mm/pgtable.c                  |   38 +
 drivers/base/node.c                    |    3 +
 fs/exec.c                              |    4 +
 fs/proc/meminfo.c                      |    2 +
 fs/proc/page.c                         |    2 +
 fs/proc/task_mmu.c                     |   47 +-
 include/asm-generic/pgtable.h          |  110 +
 include/linux/huge_mm.h                |   78 +-
 include/linux/khugepaged.h             |    1 +
 include/linux/ksm.h                    |    5 +
 include/linux/mem_defrag.h             |   60 +
 include/linux/memcontrol.h             |    5 +
 include/linux/mm.h                     |   34 +
 include/linux/mm_types.h               |    5 +
 include/linux/mmu_notifier.h           |   13 +
 include/linux/mmzone.h                 |    1 +
 include/linux/page-flags.h             |   79 +-
 include/linux/pagechain.h              |   73 +
 include/linux/rmap.h                   |   10 +-
 include/linux/sched/coredump.h         |    4 +
 include/linux/swap.h                   |    2 +
 include/linux/syscalls.h               |    3 +
 include/linux/vm_event_item.h          |   33 +
 include/uapi/asm-generic/mman-common.h |   15 +
 include/uapi/linux/kernel-page-flags.h |    2 +
 kernel/events/uprobes.c                |    4 +-
 kernel/fork.c                          |   14 +
 kernel/sysctl.c                        |  101 +-
 mm/Makefile                            |    2 +
 mm/compaction.c                        |   17 +-
 mm/debug.c                             |    8 +-
 mm/exchange.c                          |  878 +++++++
 mm/filemap.c                           |    8 +
 mm/gup.c                               |   60 +-
 mm/huge_memory.c                       | 3360 ++++++++++++++++++++----
 mm/hugetlb.c                           |    4 +-
 mm/internal.h                          |   46 +
 mm/khugepaged.c                        |    7 +-
 mm/ksm.c                               |   39 +-
 mm/madvise.c                           |  121 +
 mm/mem_defrag.c                        | 1941 ++++++++++++++
 mm/memcontrol.c                        |   13 +
 mm/memory.c                            |   55 +-
 mm/migrate.c                           |   14 +-
 mm/mmap.c                              |   29 +
 mm/page_alloc.c                        |  108 +-
 mm/page_vma_mapped.c                   |  129 +-
 mm/pgtable-generic.c                   |   78 +-
 mm/rmap.c                              |  283 +-
 mm/swap.c                              |   38 +
 mm/swap_slots.c                        |    2 +
 mm/swapfile.c                          |    4 +-
 mm/userfaultfd.c                       |    2 +-
 mm/util.c                              |    7 +
 mm/vmscan.c                            |   55 +-
 mm/vmstat.c                            |   32 +
 61 files changed, 7452 insertions(+), 745 deletions(-)
 create mode 100644 include/linux/mem_defrag.h
 create mode 100644 include/linux/pagechain.h
 create mode 100644 mm/exchange.c
 create mode 100644 mm/mem_defrag.c

--
2.20.1


^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, back to index

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-15 22:08 [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Zi Yan
2019-02-15 22:08 ` [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages Zi Yan
2019-02-17 11:29   ` Matthew Wilcox
2019-02-18 17:31     ` Zi Yan
2019-02-18 17:42       ` Vlastimil Babka
2019-02-18 17:51         ` Zi Yan
2019-02-18 17:52           ` Matthew Wilcox
2019-02-18 17:59             ` Zi Yan
2019-02-19  7:42               ` Anshuman Khandual
2019-02-19 12:56                 ` Matthew Wilcox
2019-02-20  4:38                   ` Anshuman Khandual
2019-03-14  2:39                     ` Zi Yan
2019-02-21 21:10   ` Jerome Glisse
2019-02-21 21:25     ` Zi Yan
2019-02-15 22:08 ` [RFC PATCH 02/31] mm: migrate: Add THP exchange support Zi Yan
2019-02-15 22:08 ` [RFC PATCH 03/31] mm: migrate: Add tmpfs " Zi Yan
2019-02-15 22:08 ` [RFC PATCH 04/31] mm: add mem_defrag functionality Zi Yan
2019-02-15 22:08 ` [RFC PATCH 05/31] mem_defrag: split a THP if either src or dst is THP only Zi Yan
2019-02-15 22:08 ` [RFC PATCH 06/31] mm: Make MAX_ORDER configurable in Kconfig for buddy allocator Zi Yan
2019-02-15 22:08 ` [RFC PATCH 07/31] mm: deallocate pages with order > MAX_ORDER Zi Yan
2019-02-15 22:08 ` [RFC PATCH 08/31] mm: add pagechain container for storing multiple pages Zi Yan
2019-02-15 22:08 ` [RFC PATCH 09/31] mm: thp: 1GB anonymous page implementation Zi Yan
2019-02-15 22:08 ` [RFC PATCH 10/31] mm: proc: add 1GB THP kpageflag Zi Yan
2019-02-15 22:08 ` [RFC PATCH 11/31] mm: debug: print compound page order in dump_page() Zi Yan
2019-02-15 22:08 ` [RFC PATCH 12/31] mm: stats: Separate PMD THP and PUD THP stats Zi Yan
2019-02-15 22:08 ` [RFC PATCH 13/31] mm: thp: 1GB THP copy on write implementation Zi Yan
2019-02-15 22:08 ` [RFC PATCH 14/31] mm: thp: handling 1GB THP reference bit Zi Yan
2019-02-15 22:08 ` [RFC PATCH 15/31] mm: thp: add 1GB THP split_huge_pud_page() function Zi Yan
2019-02-15 22:08 ` [RFC PATCH 16/31] mm: thp: check compound_mapcount of PMD-mapped PUD THPs at free time Zi Yan
2019-02-15 22:08 ` [RFC PATCH 17/31] mm: thp: split properly PMD-mapped PUD THP to PTE-mapped PUD THP Zi Yan
2019-02-15 22:08 ` [RFC PATCH 18/31] mm: page_vma_walk: teach it about PMD-mapped " Zi Yan
2019-02-15 22:08 ` [RFC PATCH 19/31] mm: thp: 1GB THP support in try_to_unmap() Zi Yan
2019-02-15 22:08 ` [RFC PATCH 20/31] mm: thp: split 1GB THPs at page reclaim Zi Yan
2019-02-15 22:08 ` [RFC PATCH 21/31] mm: thp: 1GB zero page shrinker Zi Yan
2019-02-15 22:08 ` [RFC PATCH 22/31] mm: thp: 1GB THP follow_p*d_page() support Zi Yan
2019-02-15 22:08 ` [RFC PATCH 23/31] mm: support 1GB THP pagemap support Zi Yan
2019-02-15 22:08 ` [RFC PATCH 24/31] sysctl: add an option to only print the head page virtual address Zi Yan
2019-02-15 22:08 ` [RFC PATCH 25/31] mm: thp: add a knob to enable/disable 1GB THPs Zi Yan
2019-02-15 22:08 ` [RFC PATCH 26/31] mm: thp: promote PTE-mapped THP to PMD-mapped THP Zi Yan
2019-02-15 22:08 ` [RFC PATCH 27/31] mm: thp: promote PMD-mapped PUD pages to PUD-mapped PUD pages Zi Yan
2019-02-15 22:08 ` [RFC PATCH 28/31] mm: vmstats: add page promotion stats Zi Yan
2019-02-15 22:08 ` [RFC PATCH 29/31] mm: madvise: add madvise options to split PMD and PUD THPs Zi Yan
2019-02-15 22:08 ` [RFC PATCH 30/31] mm: mem_defrag: thp: PMD THP and PUD THP in-place promotion support Zi Yan
2019-02-15 22:08 ` [RFC PATCH 31/31] sysctl: toggle to promote PUD-mapped 1GB THP or not Zi Yan
2019-02-20  1:42 ` [RFC PATCH 00/31] Generating physically contiguous memory after page allocation Mike Kravetz
2019-02-20  2:33   ` Zi Yan
2019-02-20  3:18     ` Mike Kravetz
2019-02-20  5:19       ` Zi Yan
2019-02-20  5:27         ` Mike Kravetz
  -- strict thread matches above, loose matches on Subject: below --
2019-02-15 22:03 Zi Yan

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git