* [RFC PATCH 00/31] Generating physically contiguous memory after page allocation
@ 2019-02-15 22:08 Zi Yan
From: Zi Yan @ 2019-02-15 22:08 UTC
  To: linux-mm, linux-kernel
  Cc: Dave Hansen, Michal Hocko, Kirill A . Shutemov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove,
	Nitin Gupta, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Hi all,

This patchset produces physically contiguous memory by moving in-use pages
without allocating any new pages. It targets two scenarios that complement
khugepaged use cases: 1) avoiding page reclaim and memory compaction when the
system is under memory pressure, since this patchset does not allocate any new
pages, and 2) generating pages larger than 2^MAX_ORDER without changing the
buddy allocator.

To demonstrate its use, I add very basic 1GB THP support and enable promoting
512 contiguous 2MB THPs to a 1GB THP in my patchset. Promoting 512 4KB pages
to a 2MB THP is also implemented.
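
To illustrate the intended usage, here is a minimal userspace sketch. It
assumes a 1GB anonymous region backed by 2MB THPs; the defrag/promotion
trigger wired up by this patchset is not spelled out here, so the syscall at
the end is only a commented-out placeholder with a made-up name:

#include <string.h>
#include <sys/mman.h>

#define GB (1UL << 30)

int main(void)
{
	/* Anonymous 1GB region; transparent huge pages, not hugetlbfs.
	 * (1GB alignment handling is omitted for brevity.) */
	void *buf = mmap(NULL, GB, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Ask the kernel to back this range with THPs. */
	madvise(buf, GB, MADV_HUGEPAGE);

	/* Fault everything in, getting 2MB THPs where possible. */
	memset(buf, 0, GB);

	/*
	 * Trigger defragmentation and in-place promotion to a 1GB THP.
	 * The interface name below is a placeholder, not the real one:
	 *
	 *	syscall(__NR_mm_defrag, getpid(), buf, GB);
	 */
	return 0;
}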

The patches are on top of v5.0-rc5. They are posted as part of my upcoming
LSF/MM proposal.

Motivation 
---- 

The goal of this patchset is to provide an alternative way of generating
physically contiguous memory and making it available as arbitrarily sized
large pages. It creates physical contiguity after pages are allocated, by
moving virtually contiguous pages so that they become physically contiguous at
any size, and thus requires no changes to memory allocators. On the other
hand, it works only for movable pages, so it faces the same fragmentation
issues as memory compaction: if non-movable pages are spread across the entire
memory, this patchset can only generate contiguity between any two non-movable
pages.
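
Conceptually, the mechanism works as sketched below: for each VMA an "anchor"
page frame is chosen, and every page in the VMA is moved to the page frame at
anchor + its page offset, either by migrating into a free page or by
exchanging with an in-use movable page. This is only a sketch with made-up
helper names, not the code in the patches:

static int defrag_vma_sketch(struct vm_area_struct *vma,
			     unsigned long anchor_pfn)
{
	unsigned long addr;

	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
		unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
		struct page *src = follow_page(vma, addr, FOLL_GET);
		struct page *dst = pfn_to_page(anchor_pfn + off);

		if (IS_ERR_OR_NULL(src))
			continue;
		if (src == dst) {			/* already in place */
			put_page(src);
			continue;
		}
		if (PageBuddy(dst))
			migrate_to_free_page(src, dst);	/* hypothetical */
		else if (is_movable(dst))		/* hypothetical */
			exchange_with_page(src, dst);	/* component 1 below */
		/* else: a non-movable page caps the contiguous range here */
		put_page(src);
	}
	return 0;
}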

Large pages and physically contiguous memory are important to devices, such as
GPUs, FPGAs, NICs and RDMA controllers, because they can often achieve better
performance when operating on large pages. The same can be said of CPU
performance, of course, but there is an important difference: GPUs and
high-throughput devices often take a more severe performance hit, in the event
of a TLB miss and subsequent page table walks, as compared to a CPU. The effect
is sufficiently large that such devices *really* want a highly reliable way to
allocate large pages to minimize the number of potential TLB misses and the time
spent on the induced page table walks. 

Vendors (like Oracle, Mellanox, IBM, and NVIDIA) are interested in generating
physically contiguous memory beyond THP sizes and are looking for solutions
[1],[2],[3]. Compared to allocating physically contiguous memory at page
allocation time, this patchset generates it after pages are allocated. This
approach avoids the page reclaim and memory compaction that can happen during
page allocation, while still producing comparable physically contiguous
memory.

THPs already help, but we are interested in even larger contiguous ranges (or
larger page size support) to further reduce address translation overheads.
With this patchset, we can generate pages larger than PMD-level THPs without
requiring MAX_ORDER changes in the buddy allocator.


Patch structure 
---- 

The patchset I developed to generate physically contiguous memory/arbitrarily
sized pages merely moves pages around. There are three components in this
patchset:

1) a new page migration mechanism, called exchange pages, that exchanges the
content of two in-use pages instead of performing two back-to-back page
migrations. It saves on overheads and avoids the page reclaim and memory
compaction that can occur in the page allocation path, although it is not
strictly required if enough free memory is available in the system. (A
simplified sketch follows this list.)

2) a new mechanism, mem_defrag, that uses both page migration and exchange
pages to produce physically contiguous memory/arbitrarily sized pages without
allocating any new pages, unlike what khugepaged does. It works on a per-VMA
basis, creating physically contiguous memory out of each VMA, which is already
virtually contiguous. A simple range tree is used to ensure that no two VMAs
overlap with each other in the physical address space.

3) a use case of the new physically contiguous memory producing mechanism that
generates 1GB THPs by migrating and exchanging pages, promoting 512 contiguous
2MB THPs to a 1GB THP, although even larger physically contiguous memory
ranges can be generated. The 1GB THP implementation is very basic: it can
handle 1GB THP faults when the buddy allocator is modified to allocate 1GB
pages, and it supports splitting a 1GB THP into 2MB THPs, in-place promotion
from 2MB THPs to a 1GB THP, and PMD/PTE-mapped 1GB THPs. These are not fully
tested. (A sketch of the promotion check also follows below.)
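
For component 1, the essence of exchange pages is sketched below: swap the
data of two in-use pages in place so that neither needs a newly allocated
destination page. Locking, page flag and ->mapping swaps, and rmap updates are
all omitted; this is a simplified illustration, not the mm/exchange.c code:

static void exchange_page_data_sketch(struct page *a, struct page *b)
{
	char tmp[64];	/* small bounce buffer; swap in chunks */
	char *va = kmap(a);
	char *vb = kmap(b);
	int i;

	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
		memcpy(tmp, va + i, sizeof(tmp));
		memcpy(va + i, vb + i, sizeof(tmp));
		memcpy(vb + i, tmp, sizeof(tmp));
	}

	kunmap(b);
	kunmap(a);

	/*
	 * The real exchange also swaps the two pages' flags and
	 * ->mapping, and rewrites both sets of reverse mappings so
	 * every PTE ends up pointing at the exchanged locations.
	 */
}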
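For component 3, the promotion precondition can be sketched as follows: a 1GB
THP can be assembled in place once 512 2MB THPs sit back-to-back on a
1GB-aligned physical range. Again a simplified illustration, not the actual
promotion code:

static bool ready_for_pud_promotion_sketch(unsigned long start_pfn)
{
	int i;

	/* The range must begin on a 1GB physical boundary. */
	if (start_pfn & ((1UL << (PUD_SHIFT - PAGE_SHIFT)) - 1))
		return false;

	/* All 512 constituent 2MB pages must be PMD THPs sitting
	 * back-to-back in the physical address space. */
	for (i = 0; i < PTRS_PER_PMD; i++) {
		struct page *page = pfn_to_page(start_pfn + i * HPAGE_PMD_NR);

		if (!PageHead(page) ||
		    compound_order(page) != HPAGE_PMD_ORDER)
			return false;
	}
	return true;
}

If the check passes, the 512 PMD entries can be collapsed into a single PUD
entry mapping the whole 1GB range.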


[1] https://lwn.net/Articles/736170/ 
[2] https://lwn.net/Articles/753167/ 
[3] https://blogs.nvidia.com/blog/2018/06/08/worlds-fastest-exascale-ai-supercomputer-summit/ 

Zi Yan (31):
  mm: migrate: Add exchange_pages to exchange two lists of pages.
  mm: migrate: Add THP exchange support.
  mm: migrate: Add tmpfs exchange support.
  mm: add mem_defrag functionality.
  mem_defrag: split a THP if either src or dst is THP only.
  mm: Make MAX_ORDER configurable in Kconfig for buddy allocator.
  mm: deallocate pages with order > MAX_ORDER.
  mm: add pagechain container for storing multiple pages.
  mm: thp: 1GB anonymous page implementation.
  mm: proc: add 1GB THP kpageflag.
  mm: debug: print compound page order in dump_page().
  mm: stats: Separate PMD THP and PUD THP stats.
  mm: thp: 1GB THP copy on write implementation.
  mm: thp: handling 1GB THP reference bit.
  mm: thp: add 1GB THP split_huge_pud_page() function.
  mm: thp: check compound_mapcount of PMD-mapped PUD THPs at free time.
  mm: thp: split properly PMD-mapped PUD THP to PTE-mapped PUD THP.
  mm: page_vma_walk: teach it about PMD-mapped PUD THP.
  mm: thp: 1GB THP support in try_to_unmap().
  mm: thp: split 1GB THPs at page reclaim.
  mm: thp: 1GB zero page shrinker.
  mm: thp: 1GB THP follow_p*d_page() support.
  mm: add 1GB THP pagemap support.
  sysctl: add an option to only print the head page virtual address.
  mm: thp: add a knob to enable/disable 1GB THPs.
  mm: thp: promote PTE-mapped THP to PMD-mapped THP.
  mm: thp: promote PMD-mapped PUD pages to PUD-mapped PUD pages.
  mm: vmstats: add page promotion stats.
  mm: madvise: add madvise options to split PMD and PUD THPs.
  mm: mem_defrag: thp: PMD THP and PUD THP in-place promotion support.
  sysctl: toggle to promote PUD-mapped 1GB THP or not.

 arch/x86/Kconfig                       |   15 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 arch/x86/include/asm/pgalloc.h         |   69 +
 arch/x86/include/asm/pgtable.h         |   20 +
 arch/x86/include/asm/sparsemem.h       |    4 +-
 arch/x86/mm/pgtable.c                  |   38 +
 drivers/base/node.c                    |    3 +
 fs/exec.c                              |    4 +
 fs/proc/meminfo.c                      |    2 +
 fs/proc/page.c                         |    2 +
 fs/proc/task_mmu.c                     |   47 +-
 include/asm-generic/pgtable.h          |  110 +
 include/linux/huge_mm.h                |   78 +-
 include/linux/khugepaged.h             |    1 +
 include/linux/ksm.h                    |    5 +
 include/linux/mem_defrag.h             |   60 +
 include/linux/memcontrol.h             |    5 +
 include/linux/mm.h                     |   34 +
 include/linux/mm_types.h               |    5 +
 include/linux/mmu_notifier.h           |   13 +
 include/linux/mmzone.h                 |    1 +
 include/linux/page-flags.h             |   79 +-
 include/linux/pagechain.h              |   73 +
 include/linux/rmap.h                   |   10 +-
 include/linux/sched/coredump.h         |    4 +
 include/linux/swap.h                   |    2 +
 include/linux/syscalls.h               |    3 +
 include/linux/vm_event_item.h          |   33 +
 include/uapi/asm-generic/mman-common.h |   15 +
 include/uapi/linux/kernel-page-flags.h |    2 +
 kernel/events/uprobes.c                |    4 +-
 kernel/fork.c                          |   14 +
 kernel/sysctl.c                        |  101 +-
 mm/Makefile                            |    2 +
 mm/compaction.c                        |   17 +-
 mm/debug.c                             |    8 +-
 mm/exchange.c                          |  878 +++++++
 mm/filemap.c                           |    8 +
 mm/gup.c                               |   60 +-
 mm/huge_memory.c                       | 3360 ++++++++++++++++++++----
 mm/hugetlb.c                           |    4 +-
 mm/internal.h                          |   46 +
 mm/khugepaged.c                        |    7 +-
 mm/ksm.c                               |   39 +-
 mm/madvise.c                           |  121 +
 mm/mem_defrag.c                        | 1941 ++++++++++++++
 mm/memcontrol.c                        |   13 +
 mm/memory.c                            |   55 +-
 mm/migrate.c                           |   14 +-
 mm/mmap.c                              |   29 +
 mm/page_alloc.c                        |  108 +-
 mm/page_vma_mapped.c                   |  129 +-
 mm/pgtable-generic.c                   |   78 +-
 mm/rmap.c                              |  283 +-
 mm/swap.c                              |   38 +
 mm/swap_slots.c                        |    2 +
 mm/swapfile.c                          |    4 +-
 mm/userfaultfd.c                       |    2 +-
 mm/util.c                              |    7 +
 mm/vmscan.c                            |   55 +-
 mm/vmstat.c                            |   32 +
 61 files changed, 7452 insertions(+), 745 deletions(-)
 create mode 100644 include/linux/mem_defrag.h
 create mode 100644 include/linux/pagechain.h
 create mode 100644 mm/exchange.c
 create mode 100644 mm/mem_defrag.c

--
2.20.1

