nvdimm.lists.linux.dev archive mirror
* [PATCH v2 00/14] mm, sparse-vmemmap: Introduce compound pagemaps
@ 2021-06-17 18:44 Joao Martins
  2021-06-17 18:44 ` [PATCH v2 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
                   ` (13 more replies)
  0 siblings, 14 replies; 23+ messages in thread
From: Joao Martins @ 2021-06-17 18:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Changes since v1 [0]:

 (New patches 7, 10, 11)
 * Remove occurrences of 'we' in the commit descriptions (now for real) [Dan]
 * Add comment on top of compound_head() for fsdax (Patch 1) [Dan]
 * Massage commit descriptions of cleanup/refactor patches to reflect
 that it's in preparation for bigger infra in sparse-vmemmap. (Patch 2,3,5) [Dan]
 * Greatly improve all commit messages in terms of grammar/wording and clarity. [Dan]
 * Rename variable/helpers from dev_pagemap::align to @geometry, reflecting
 that it's not the same thing as dev_dax->align, Patch 4 [Dan]
 * Move compound page init logic into separate memmap_init_compound() helper, Patch 4 [Dan]
 * Simplify patch 9 as a result of doing compound initialization differently [Dan]
 * Rename @pfn_align variable in memmap_init_zone_device to @pfns_per_compound [Dan]
 * Rename Subject of patch 6 [Dan]
 * Move hugetlb_vmemmap.c comment block to Documentation/vm Patch 7 [Dan]
 * Add some type-safety to @block and use 'struct page *' rather than
 void, Patch 8 [Dan]
 * Add some comments to less obvious parts on 1G compound page case, Patch 8 [Dan]
 * Remove vmemmap lookup function in place of
 pmd_off_k() + pte_offset_kernel() given some guarantees on section onlining
 serialization, Patch 8
 * Add a comment to get_page() mentioning where/how it is freed, Patch 8 [Dan]
 * Add docs about device-dax usage of tail dedup technique in newly added
 compound_pagemaps.rst doc entry.
 * Add cleanup patch for device-dax for ensuring dev_dax::pgmap is always set [Dan]
 * Add cleanup patch for device-dax for using ALIGN() [Dan]
 * Store pinned head in separate @pinned_head variable and fix error case, patch 13 [Dan]
 * Add comment on difference of @next value for PageCompound(), patch 13 [Dan]
 * Move PUD compound page to be last patch [Dan]
 * Add vmemmap layout for PUD compound geometry in compound_pagemaps.rst doc, patch 14 [Dan]

[0] https://lore.kernel.org/linux-mm/20210325230938.30752-1-joao.m.martins@oracle.com/

Full changelog of previous versions at the bottom of cover letter.

---

This series attempts to minimize 'struct page' overhead by
pursuing an approach similar to Muchun Song's series "Free some vmemmap
pages of hugetlb page"[0] (which is now in mmotm), but applied to
devmap/ZONE_DEVICE.

[0] https://lore.kernel.org/linux-mm/20210308102807.59745-1-songmuchun@bytedance.com/

The link above describes it quite nicely, but the idea is to reuse tail
page vmemmap areas, particularly the area which only describes tail pages.
A vmemmap page describes 64 struct pages, so the first page for a given
ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
vmemmap page would contain only tail pages, and that's what gets reused across
the rest of the subsection/section. The bigger the page size, the bigger the
savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).

This series also goes one step further on 1GB pages and *also* reuses PMD pages
which only contain tail pages, which allows keeping parity with the current
hugepage based memmap. This further lets us more than halve the overhead with
1GB pages (40M -> 16M per TB).

In terms of savings, per 1TB of memory, the struct page cost would go down
with compound pagemap:

* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
* with 1G pages we lose 16MB instead of 16G (0.0014% instead of 1.5% of total memory)
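
The per-page savings and the 2M totals above follow from simple arithmetic. The
sketch below is a user-space model, not kernel code. It assumes 4K base pages
and a 64-byte struct page (consistent with "a vmemmap page describes 64 struct
pages"), and every name in it is hypothetical:

```c
#include <stdint.h>

/* Assumptions for this model: 4K base pages, 64-byte struct page. */
#define MODEL_BASE_PAGE_SIZE   4096ULL
#define MODEL_STRUCT_PAGE_SIZE 64ULL

/* vmemmap pages needed to describe one compound page when nothing is reused. */
static uint64_t vmemmap_pages_per_compound(uint64_t compound_size)
{
	uint64_t nr_struct_pages = compound_size / MODEL_BASE_PAGE_SIZE;

	return nr_struct_pages * MODEL_STRUCT_PAGE_SIZE / MODEL_BASE_PAGE_SIZE;
}

/*
 * vmemmap pages saved per compound page: only the first vmemmap page
 * (head + 63 tails) and one tail-only vmemmap page are kept.
 */
static uint64_t vmemmap_pages_saved(uint64_t compound_size)
{
	return vmemmap_pages_per_compound(compound_size) - 2;
}

/* struct page bytes needed for total_size of memory with base pages. */
static uint64_t memmap_bytes_base(uint64_t total_size)
{
	return total_size / MODEL_BASE_PAGE_SIZE * MODEL_STRUCT_PAGE_SIZE;
}

/* vmemmap data bytes when each compound page keeps only two vmemmap pages. */
static uint64_t memmap_bytes_dedup(uint64_t total_size, uint64_t compound_size)
{
	return total_size / compound_size * 2 * MODEL_BASE_PAGE_SIZE;
}
```

Under these assumptions vmemmap_pages_saved() yields 6 for a 2M page and 4094
for a 1G page, and 1TB costs 16G with base pages versus 4G with 2M compound
pages. The model only counts vmemmap data pages, so it does not attempt to
reproduce the 1G totals quoted above.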

Along the way I've extended it past 'struct page' overhead, *trying* to address a
few performance issues we knew about for pmem, specifically on
{pin,get}_user_pages_fast with device-dax vmas, which are really
slow even on the fast variants. THP is great on -fast variants but all except
hugetlbfs perform rather poorly on non-fast gup. I have deferred the
__get_user_pages() improvements, though (to a follow up series I have stashed,
as it's orthogonal to device-dax: THP suffers from the same syndrome).

So to summarize what the series does:

Patch 1: Prepare hwpoisoning to work with dax compound pages.

Patches 2-4: Have memmap_init_zone_device() initialize its metadata as compound
pages. We split the current utility function of prep_compound_page() into head
and tail and use those two helpers where appropriate to take advantage of caches
being warm after __init_single_page(). Since the RFC this also further speeds
up init time, from 190ms down to 80ms.

Patches 5-12, 14: Much like Muchun's series, we reuse PTE (and PMD) tail page vmemmap
areas across a given page size (namely @align, as it is referred to by the
remaining memremap/dax code) and enable memremap to initialize the ZONE_DEVICE
pages as compound pages of a given @align order. The main difference, though, is
that contrary to the hugetlbfs series, there's no vmemmap for the area, because we
are populating it as opposed to remapping it. IOW no freeing of pages of
already initialized vmemmap like the case for hugetlbfs, which simplifies the
logic (besides not being arch-specific). After these, there's a quite visible
improvement in region bootstrap of pmem memmap, given that we would initialize
fewer struct pages depending on the page size with DRAM backed struct pages.
altmap sees no difference in bootstrap. Patch 14 comes last as it's an
improvement, not mandated for the initial functionality. Also move the very nice
docs of hugetlb_vmemmap.c into a Documentation/vm/ entry.

    NVDIMM namespace bootstrap improves from ~268-358 ms to ~78-100/<1ms on 128G NVDIMMs
    with 2M and 1G respectively.

Patch 13: Optimize grabbing the page refcount given that we
are working with compound pages, i.e. we do 1 increment to the head
page for a given set of N subpages as opposed to N individual writes.
{get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
improves considerably with DRAM stored struct pages. It also *greatly*
improves pinning with altmap. Results with gup_test:

                                                   before     after
    (16G get_user_pages_fast 2M page size)         ~59 ms -> ~6.1 ms
    (16G pin_user_pages_fast 2M page size)         ~87 ms -> ~6.2 ms
    (16G get_user_pages_fast altmap 2M page size) ~494 ms -> ~9 ms
    (16G pin_user_pages_fast altmap 2M page size) ~494 ms -> ~10 ms

    altmap performance gets especially interesting when pinning a pmem dimm:

                                                   before     after
    (128G get_user_pages_fast 2M page size)         ~492 ms -> ~49 ms
    (128G pin_user_pages_fast 2M page size)         ~493 ms -> ~50 ms
    (128G get_user_pages_fast altmap 2M page size)  ~3.91 s -> ~70 ms
    (128G pin_user_pages_fast altmap 2M page size)  ~3.97 s -> ~74 ms
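
The difference between the two refcounting schemes in Patch 13 can be modelled
in user space. This is only an illustrative sketch using C11 atomics;
'struct fake_page' and both helpers are hypothetical stand-ins and do not
mirror the actual mm/gup.c changes:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the refcount matters here. */
struct fake_page {
	atomic_int refcount;
};

/* Base pages: one atomic add per subpage, i.e. N separate writes. */
static void grab_refs_per_page(struct fake_page *pages, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		atomic_fetch_add(&pages[i].refcount, 1);
}

/* Compound page: a single atomic add of N on the head covers all subpages. */
static void grab_refs_on_head(struct fake_page *head, size_t nr)
{
	atomic_fetch_add(&head->refcount, (int)nr);
}
```

The second helper touches a single cacheline regardless of N, which is the
effect the numbers above are measuring.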

I have deferred the __get_user_pages() patch to outside this series
(https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.martins@oracle.com/),
as I found a simpler way to address it that is also applicable to
THP. I will submit that as a follow up to this series.

Patches apply on top of linux-next tag next-20210617 (commit 7d9c6b8147bd).

Comments and suggestions very much appreciated!

Older Changelog,

 RFC[1] -> v1:
 (New patches 1-3, 5-8 but the diffstat isn't that different)
 * Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
 * Fix/Massage commit messages to be more clear and remove the 'we' occurrences (Dan, John, Matthew)
 * Use pfn_align to be clear it's nr of pages for @align value (John, Dan)
 * Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
 * Remove the gup_device_compound_huge special path and have the same code
   work both ways while special casing when devmap page is compound (Jason, John)
 * Avoid usage of vmemmap_populate_basepages() and introduce a first class
   loop that doesn't care about passing an altmap for memmap reuse. (Dan)
 * Completely rework the vmemmap_populate_compound() to avoid the sparse_add_section
   hack into passing block across sparse_add_section calls. It's a lot easier to
   follow and more explicit in what it does.
 * Replace the vmemmap refactoring with adding a @pgmap argument and moving
   parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a result)
 * Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
 * Improve memmap_init_zone_device() to initialize compound pages when
   struct pages are cache warm. That led to an even further speedup
   over the RFC series, from 190ms -> 80-120ms. Patches 2 and 3 are the new ones
   as a result (Dan)
 * Remove PGMAP_COMPOUND and use @align as the property to detect whether
   or not to reuse vmemmap areas (Dan)

[1] https://lore.kernel.org/linux-mm/20201208172901.17384-1-joao.m.martins@oracle.com/

Thanks,
	Joao

Joao Martins (14):
  memory-failure: fetch compound_head after pgmap_pfn_valid()
  mm/page_alloc: split prep_compound_page into head and tail subparts
  mm/page_alloc: refactor memmap_init_zone_device() page init
  mm/memremap: add ZONE_DEVICE support for compound pages
  mm/sparse-vmemmap: add a pgmap argument to section activation
  mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to
    helper
  mm/hugetlb_vmemmap: move comment block to Documentation/vm
  mm/sparse-vmemmap: populate compound pagemaps
  mm/page_alloc: reuse tail struct pages for compound pagemaps
  device-dax: use ALIGN() for determining pgoff
  device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  device-dax: compound pagemap support
  mm/gup: grab head page refcount once for group of subpages
  mm/sparse-vmemmap: improve memory savings for compound pud geometry

 Documentation/vm/compound_pagemaps.rst | 300 +++++++++++++++++++++++++
 Documentation/vm/index.rst             |   1 +
 drivers/dax/device.c                   |  58 +++--
 include/linux/memory_hotplug.h         |   5 +-
 include/linux/memremap.h               |  17 ++
 include/linux/mm.h                     |   8 +-
 mm/gup.c                               |  53 +++--
 mm/hugetlb_vmemmap.c                   | 162 +------------
 mm/memory-failure.c                    |   6 +
 mm/memory_hotplug.c                    |   3 +-
 mm/memremap.c                          |   9 +-
 mm/page_alloc.c                        | 148 ++++++++----
 mm/sparse-vmemmap.c                    | 226 +++++++++++++++++--
 mm/sparse.c                            |  24 +-
 14 files changed, 743 insertions(+), 277 deletions(-)
 create mode 100644 Documentation/vm/compound_pagemaps.rst

-- 
2.17.1



* [PATCH v2 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-06-17 18:44 [PATCH v2 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
@ 2021-06-17 18:44 ` Joao Martins
  2021-06-20 23:56   ` HORIGUCHI NAOYA(堀口 直也)
  2021-06-17 18:44 ` [PATCH v2 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 23+ messages in thread
From: Joao Martins @ 2021-06-17 18:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
dax_lock_page()).  For pagemap with compound pages fetch the
compound_head in case a tail page memory failure is being handled.

Currently this is a nop, but with the advent of compound pages in
dev_pagemap it allows memory_failure_dev_pagemap() to keep working.

Reported-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/memory-failure.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e684b3d5c6a6..f1be578e488f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1519,6 +1519,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		goto out;
 	}
 
+	/*
+	 * Pages instantiated by device-dax (not filesystem-dax)
+	 * may be compound pages.
+	 */
+	page = compound_head(page);
+
 	/*
 	 * Prevent the inode from being freed while we are interrogating
 	 * the address_space, typically this would be handled by
-- 
2.17.1



* [PATCH v2 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts
  2021-06-17 18:44 [PATCH v2 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  2021-06-17 18:44 ` [PATCH v2 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
@ 2021-06-17 18:44 ` Joao Martins
  2021-07-13  0:02   ` Mike Kravetz
  2021-06-17 18:44 ` [PATCH v2 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 23+ messages in thread
From: Joao Martins @ 2021-06-17 18:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Split the utility function prep_compound_page() into head and tail
counterparts, and use them accordingly.

This is in preparation for sharing the storage for / deduplicating
compound page metadata.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/page_alloc.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8836e54721ae..95967ce55829 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -741,24 +741,34 @@ void free_compound_page(struct page *page)
 	free_the_page(page, compound_order(page));
 }
 
+static void prep_compound_head(struct page *page, unsigned int order)
+{
+	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
+	set_compound_order(page, order);
+	atomic_set(compound_mapcount_ptr(page), -1);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+}
+
+static void prep_compound_tail(struct page *head, int tail_idx)
+{
+	struct page *p = head + tail_idx;
+
+	set_page_count(p, 0);
+	p->mapping = TAIL_MAPPING;
+	set_compound_head(p, head);
+}
+
 void prep_compound_page(struct page *page, unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
 
 	__SetPageHead(page);
-	for (i = 1; i < nr_pages; i++) {
-		struct page *p = page + i;
-		set_page_count(p, 0);
-		p->mapping = TAIL_MAPPING;
-		set_compound_head(p, page);
-	}
+	for (i = 1; i < nr_pages; i++)
+		prep_compound_tail(page, i);
 
-	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
-	set_compound_order(page, order);
-	atomic_set(compound_mapcount_ptr(page), -1);
-	if (hpage_pincount_available(page))
-		atomic_set(compound_pincount_ptr(page), 0);
+	prep_compound_head(page, order);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
-- 
2.17.1



* [PATCH v2 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init
  2021-06-17 18:44 [PATCH v2 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  2021-06-17 18:44 ` [PATCH v2 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
  2021-06-17 18:44 ` [PATCH v2 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
@ 2021-06-17 18:44 ` Joao Martins
  2021-06-17 18:44 ` [PATCH v2 04/14] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-06-17 18:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Move struct page init to a helper function __init_zone_device_page().

This is in preparation for sharing the storage for / deduplicating
compound page metadata.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/page_alloc.c | 74 +++++++++++++++++++++++++++----------------------
 1 file changed, 41 insertions(+), 33 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 95967ce55829..1264c025becb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6565,6 +6565,46 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
 }
 
 #ifdef CONFIG_ZONE_DEVICE
+static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
+					  unsigned long zone_idx, int nid,
+					  struct dev_pagemap *pgmap)
+{
+
+	__init_single_page(page, pfn, zone_idx, nid);
+
+	/*
+	 * Mark page reserved as it will need to wait for onlining
+	 * phase for it to be fully associated with a zone.
+	 *
+	 * We can use the non-atomic __set_bit operation for setting
+	 * the flag as we are still initializing the pages.
+	 */
+	__SetPageReserved(page);
+
+	/*
+	 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
+	 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
+	 * ever freed or placed on a driver-private list.
+	 */
+	page->pgmap = pgmap;
+	page->zone_device_data = NULL;
+
+	/*
+	 * Mark the block movable so that blocks are reserved for
+	 * movable at startup. This will force kernel allocations
+	 * to reserve their blocks rather than leaking throughout
+	 * the address space during boot when many long-lived
+	 * kernel allocations are made.
+	 *
+	 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
+	 * because this is done early in section_activate()
+	 */
+	if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
+		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+		cond_resched();
+	}
+}
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6593,39 +6633,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
-		__init_single_page(page, pfn, zone_idx, nid);
-
-		/*
-		 * Mark page reserved as it will need to wait for onlining
-		 * phase for it to be fully associated with a zone.
-		 *
-		 * We can use the non-atomic __set_bit operation for setting
-		 * the flag as we are still initializing the pages.
-		 */
-		__SetPageReserved(page);
-
-		/*
-		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
-		 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
-		 * ever freed or placed on a driver-private list.
-		 */
-		page->pgmap = pgmap;
-		page->zone_device_data = NULL;
-
-		/*
-		 * Mark the block movable so that blocks are reserved for
-		 * movable at startup. This will force kernel allocations
-		 * to reserve their blocks rather than leaking throughout
-		 * the address space during boot when many long-lived
-		 * kernel allocations are made.
-		 *
-		 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
-		 * because this is done early in section_activate()
-		 */
-		if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
-			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-			cond_resched();
-		}
+		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1



* [PATCH v2 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-06-17 18:44 [PATCH v2 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (2 preceding siblings ...)
  2021-06-17 18:44 ` [PATCH v2 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
@ 2021-06-17 18:44 ` Joao Martins
  2021-06-17 18:44 ` [PATCH v2 05/14] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-06-17 18:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Add a new @geometry property for struct dev_pagemap which specifies that a
pagemap is composed of a set of compound pages of size @geometry, instead of
base pages. When a compound page geometry is requested, all but the first
page are initialised as tail pages instead of order-0 pages.

For certain ZONE_DEVICE users like device-dax which have a fixed page size,
this creates an opportunity to optimize GUP and GUP-fast walkers, treating
it the same way as THP or hugetlb pages.
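
As an illustration only, the geometry arithmetic used by the helpers in this
patch can be modelled in user space. The sketch assumes 4K base pages, and the
model_* names are hypothetical stand-ins for pgmap_geometry(),
pgmap_pfn_geometry(), order_base_2() and the percpu_ref_get_many() count:

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12u                       /* assumption: 4K base pages */
#define MODEL_PAGE_SIZE  (1ULL << MODEL_PAGE_SHIFT)

/* A zero geometry falls back to base pages, as pgmap_geometry() does. */
static uint64_t model_pgmap_geometry(uint64_t geometry)
{
	return geometry ? geometry : MODEL_PAGE_SIZE;
}

/* Number of pfns covered by one compound page of the given geometry. */
static uint64_t model_pfn_geometry(uint64_t geometry)
{
	return model_pgmap_geometry(geometry) >> MODEL_PAGE_SHIFT;
}

/* Smallest order such that (1 << order) >= nr_pages. */
static unsigned int model_order(uint64_t nr_pages)
{
	unsigned int order = 0;

	while ((1ULL << order) < nr_pages)
		order++;
	return order;
}

/* References taken for a range: one per compound page, else one per pfn. */
static uint64_t model_refs_for_range(uint64_t nr_pfns, uint64_t geometry)
{
	uint64_t pfns_per_compound = model_pfn_geometry(geometry);

	if (pfns_per_compound > 1)
		return nr_pfns / pfns_per_compound;
	return nr_pfns;
}
```

For a 2M geometry this gives 512 pfns per compound page and order 9, and a 1G
range takes 512 references rather than 262144.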

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/memremap.h | 17 +++++++++++++++++
 mm/memremap.c            |  8 ++++++--
 mm/page_alloc.c          | 34 +++++++++++++++++++++++++++++++++-
 3 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 119f130ef8f1..e5ab6d4525c1 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -99,6 +99,10 @@ struct dev_pagemap_ops {
  * @done: completion for @internal_ref
  * @type: memory type: see MEMORY_* in memory_hotplug.h
  * @flags: PGMAP_* flags to specify defailed behavior
+ * @geometry: structural definition of how the vmemmap metadata is populated.
+ *	A zero or PAGE_SIZE defaults to using base pages as the memmap metadata
+ *	representation. A bigger value but also multiple of PAGE_SIZE will set
+ *	up compound struct pages representative of the requested geometry size.
  * @ops: method table
  * @owner: an opaque pointer identifying the entity that manages this
  *	instance.  Used by various helpers to make sure that no
@@ -114,6 +118,7 @@ struct dev_pagemap {
 	struct completion done;
 	enum memory_type type;
 	unsigned int flags;
+	unsigned long geometry;
 	const struct dev_pagemap_ops *ops;
 	void *owner;
 	int nr_range;
@@ -130,6 +135,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
 	return NULL;
 }
 
+static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
+{
+	if (!pgmap || !pgmap->geometry)
+		return PAGE_SIZE;
+	return pgmap->geometry;
+}
+
+static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap)
+{
+	return PHYS_PFN(pgmap_geometry(pgmap));
+}
+
 #ifdef CONFIG_ZONE_DEVICE
 bool pfn_zone_device_reserved(unsigned long pfn);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
diff --git a/mm/memremap.c b/mm/memremap.c
index 805d761740c4..ffcb924eb6a5 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
 				PHYS_PFN(range->start),
 				PHYS_PFN(range_len(range)), pgmap);
-	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
-			- pfn_first(pgmap, range_id));
+	if (pgmap_geometry(pgmap) > PAGE_SIZE)
+		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
+			- pfn_first(pgmap, range_id)) / pgmap_pfn_geometry(pgmap));
+	else
+		percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
+				- pfn_first(pgmap, range_id));
 	return 0;
 
 err_add_memory:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1264c025becb..42611c206d0a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6605,6 +6605,31 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	}
 }
 
+static void __ref memmap_init_compound(struct page *page, unsigned long pfn,
+					unsigned long zone_idx, int nid,
+					struct dev_pagemap *pgmap,
+					unsigned long nr_pages)
+{
+	unsigned int order_align = order_base_2(nr_pages);
+	unsigned long i;
+
+	__SetPageHead(page);
+
+	for (i = 1; i < nr_pages; i++) {
+		__init_zone_device_page(page + i, pfn + i, zone_idx,
+					nid, pgmap);
+		prep_compound_tail(page, i);
+
+		/*
+		 * The first and second tail pages need to
+		 * initialized first, hence the head page is
+		 * prepared last.
+		 */
+		if (i == 2)
+			prep_compound_head(page, order_align);
+	}
+}
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6613,6 +6638,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	unsigned long pfn, end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
+	unsigned int pfns_per_compound = pgmap_pfn_geometry(pgmap);
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
@@ -6630,10 +6656,16 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		nr_pages = end_pfn - start_pfn;
 	}
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
 		struct page *page = pfn_to_page(pfn);
 
 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+
+		if (pfns_per_compound == 1)
+			continue;
+
+		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
+				     pfns_per_compound);
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1



* [PATCH v2 05/14] mm/sparse-vmemmap: add a pgmap argument to section activation
  2021-06-17 18:44 [PATCH v2 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (3 preceding siblings ...)
  2021-06-17 18:44 ` [PATCH v2 04/14] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
@ 2021-06-17 18:44 ` Joao Martins
  2021-06-17 18:44 ` [PATCH v2 06/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-06-17 18:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

In support of using compound pages for devmap mappings, plumb the pgmap
down to the vmemmap_populate implementation. Note that while altmap is
retrievable from pgmap the memory hotplug code passes altmap without
pgmap[*], so both need to be independently plumbed.

So in addition to @altmap, pass @pgmap to sparse section populate
functions namely:

	sparse_add_section
	  section_activate
	    populate_section_memmap
	      __populate_section_memmap

Passing @pgmap allows __populate_section_memmap() both to fetch the
geometry with which the memmap metadata is created, and to let
sparse-vmemmap fetch pgmap ranges to correlate with a given section and decide
whether to just reuse tail pages from previously onlined sections.

[*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/memory_hotplug.h |  5 ++++-
 include/linux/mm.h             |  3 ++-
 mm/memory_hotplug.c            |  3 ++-
 mm/sparse-vmemmap.c            |  3 ++-
 mm/sparse.c                    | 24 +++++++++++++++---------
 5 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index a7fd2c3ccb77..9b1bca80224d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -14,6 +14,7 @@ struct mem_section;
 struct memory_block;
 struct resource;
 struct vmem_altmap;
+struct dev_pagemap;
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 struct page *pfn_to_online_page(unsigned long pfn);
@@ -60,6 +61,7 @@ typedef int __bitwise mhp_t;
 struct mhp_params {
 	struct vmem_altmap *altmap;
 	pgprot_t pgprot;
+	struct dev_pagemap *pgmap;
 };
 
 bool mhp_range_allowed(u64 start, u64 size, bool need_mapping);
@@ -333,7 +335,8 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
 				       unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern int sparse_add_section(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap);
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap);
 extern void sparse_remove_section(struct mem_section *ms,
 		unsigned long pfn, unsigned long nr_pages,
 		unsigned long map_offset, struct vmem_altmap *altmap);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a127d93612fa..bb3b814e1860 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3083,7 +3083,8 @@ int vmemmap_remap_alloc(unsigned long start, unsigned long end,
 
 void *sparse_buffer_alloc(unsigned long size);
 struct page * __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap);
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8cb75b26ea4f..c728a8ff38ad 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -268,7 +268,8 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages = min(end_pfn - pfn,
 				   SECTION_ALIGN_UP(pfn + 1) - pfn);
-		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap);
+		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap,
+					 params->pgmap);
 		if (err)
 			break;
 		cond_resched();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index bdce883f9286..80d3ba30d345 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -603,7 +603,8 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 }
 
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
diff --git a/mm/sparse.c b/mm/sparse.c
index 6326cdf36c4f..5310be6171f1 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -453,7 +453,8 @@ static unsigned long __init section_map_size(void)
 }
 
 struct page __init *__populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long size = section_map_size();
 	struct page *map = sparse_buffer_alloc(size);
@@ -552,7 +553,7 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 			break;
 
 		map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
-				nid, NULL);
+				nid, NULL, NULL);
 		if (!map) {
 			pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
 			       __func__, nid);
@@ -657,9 +658,10 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 static struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
-	return __populate_section_memmap(pfn, nr_pages, nid, altmap);
+	return __populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
 }
 
 static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
@@ -728,7 +730,8 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
 }
 #else
 struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	return kvmalloc_node(array_size(sizeof(struct page),
 					PAGES_PER_SECTION), GFP_KERNEL, nid);
@@ -851,7 +854,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 }
 
 static struct page * __meminit section_activate(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	struct mem_section *ms = __pfn_to_section(pfn);
 	struct mem_section_usage *usage = NULL;
@@ -883,7 +887,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
 	if (nr_pages < PAGES_PER_SECTION && early_section(ms))
 		return pfn_to_page(pfn);
 
-	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap);
+	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
 	if (!memmap) {
 		section_deactivate(pfn, nr_pages, altmap);
 		return ERR_PTR(-ENOMEM);
@@ -898,6 +902,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  * @start_pfn: start pfn of the memory range
  * @nr_pages: number of pfns to add in the section
  * @altmap: device page map
+ * @pgmap: device page map object that owns the section
  *
  * This is only intended for hotplug.
  *
@@ -911,7 +916,8 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  * * -ENOMEM	- Out of memory.
  */
 int __meminit sparse_add_section(int nid, unsigned long start_pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
 	struct mem_section *ms;
@@ -922,7 +928,7 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 	if (ret < 0)
 		return ret;
 
-	memmap = section_activate(nid, start_pfn, nr_pages, altmap);
+	memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
 	if (IS_ERR(memmap))
 		return PTR_ERR(memmap);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2 06/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper
  2021-06-17 18:44 [PATCH v2 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (4 preceding siblings ...)
  2021-06-17 18:44 ` [PATCH v2 05/14] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
@ 2021-06-17 18:44 ` Joao Martins
  2021-06-17 18:45 ` [PATCH v2 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-06-17 18:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

In preparation for describing a memmap with compound pages, move the
actual pte population logic into a separate function
vmemmap_populate_address() and have vmemmap_populate_basepages() walk
through all base pages it needs to populate.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/sparse-vmemmap.c | 44 ++++++++++++++++++++++++++------------------
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 80d3ba30d345..76f4158f6301 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -570,33 +570,41 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
-					 int node, struct vmem_altmap *altmap)
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+					      struct vmem_altmap *altmap)
 {
-	unsigned long addr = start;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
+	pgd = vmemmap_pgd_populate(addr, node);
+	if (!pgd)
+		return -ENOMEM;
+	p4d = vmemmap_p4d_populate(pgd, addr, node);
+	if (!p4d)
+		return -ENOMEM;
+	pud = vmemmap_pud_populate(p4d, addr, node);
+	if (!pud)
+		return -ENOMEM;
+	pmd = vmemmap_pmd_populate(pud, addr, node);
+	if (!pmd)
+		return -ENOMEM;
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+	if (!pte)
+		return -ENOMEM;
+	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+}
+
+int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
+					 int node, struct vmem_altmap *altmap)
+{
+	unsigned long addr = start;
+
 	for (; addr < end; addr += PAGE_SIZE) {
-		pgd = vmemmap_pgd_populate(addr, node);
-		if (!pgd)
-			return -ENOMEM;
-		p4d = vmemmap_p4d_populate(pgd, addr, node);
-		if (!p4d)
-			return -ENOMEM;
-		pud = vmemmap_pud_populate(p4d, addr, node);
-		if (!pud)
-			return -ENOMEM;
-		pmd = vmemmap_pmd_populate(pud, addr, node);
-		if (!pmd)
-			return -ENOMEM;
-		pte = vmemmap_pte_populate(pmd, addr, node, altmap);
-		if (!pte)
+		if (vmemmap_populate_address(addr, node, altmap))
 			return -ENOMEM;
-		vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
 	}
 
 	return 0;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm
  2021-06-17 18:44 [PATCH v2 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (5 preceding siblings ...)
  2021-06-17 18:44 ` [PATCH v2 06/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
@ 2021-06-17 18:45 ` Joao Martins
  2021-06-21 13:12   ` [External] " Muchun Song
  2021-06-17 18:45 ` [PATCH v2 08/14] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 23+ messages in thread
From: Joao Martins @ 2021-06-17 18:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

In preparation for device-dax using the hugetlbfs compound page tail
deduplication technique, move the comment block explanation into a
common place in Documentation/vm.

Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 Documentation/vm/compound_pagemaps.rst | 170 +++++++++++++++++++++++++
 Documentation/vm/index.rst             |   1 +
 mm/hugetlb_vmemmap.c                   | 162 +----------------------
 3 files changed, 172 insertions(+), 161 deletions(-)
 create mode 100644 Documentation/vm/compound_pagemaps.rst

diff --git a/Documentation/vm/compound_pagemaps.rst b/Documentation/vm/compound_pagemaps.rst
new file mode 100644
index 000000000000..6b1af50e8201
--- /dev/null
+++ b/Documentation/vm/compound_pagemaps.rst
@@ -0,0 +1,170 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _commpound_pagemaps:
+
+==================================
+Free some vmemmap pages of HugeTLB
+==================================
+
+The struct page structures (page structs) are used to describe a physical
+page frame. By default, there is a one-to-one mapping from a page frame to
+its corresponding page struct.
+
+HugeTLB pages consist of multiple base page size pages and are supported by
+many architectures. See hugetlbpage.rst in the Documentation directory for
+more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB
+are currently supported. Since the base page size on x86 is 4KB, a 2MB
+HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
+262144 base pages. For each base page, there is a corresponding page struct.
+
+Within the HugeTLB subsystem, only the first 4 page structs are used to
+contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
+this upper limit. The only 'useful' information in the remaining page structs
+is the compound_head field, and this field is the same for all tail pages.
+
+By removing redundant page structs for HugeTLB pages, memory can be returned
+to the buddy allocator for other uses.
+
+Different architectures support different HugeTLB page sizes. For example,
+the following table lists the HugeTLB page sizes supported by the x86 and
+arm64 architectures. Because arm64 supports 4k, 16k, and 64k base pages as
+well as contiguous entries, it supports a wider range of HugeTLB page
+sizes.
+
++--------------+-----------+-----------------------------------------------+
+| Architecture | Page Size |                HugeTLB Page Size              |
++--------------+-----------+-----------+-----------+-----------+-----------+
+|    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
++--------------+-----------+-----------+-----------+-----------+-----------+
+|              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
+|              +-----------+-----------+-----------+-----------+-----------+
+|    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
+|              +-----------+-----------+-----------+-----------+-----------+
+|              |   64KB    |    2MB    |  512MB    |    16GB   |           |
++--------------+-----------+-----------+-----------+-----------+-----------+
+
+When the system boots up, every HugeTLB page has more than one struct page
+struct, whose total size is (unit: pages):
+
+   struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
+
+Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
+of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
+relationship.
+
+   HugeTLB_Size = n * PAGE_SIZE
+
+Then,
+
+   struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
+               = n * sizeof(struct page) / PAGE_SIZE
+
+We can use huge mapping at the pud/pmd level for the HugeTLB page.
+
+For a HugeTLB page mapped at the pmd level:
+
+   struct_size = n * sizeof(struct page) / PAGE_SIZE
+               = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
+               = sizeof(struct page) / sizeof(pte_t)
+               = 64 / 8
+               = 8 (pages)
+
+Where n is the number of pte entries that one page can contain. So the value
+of n is (PAGE_SIZE / sizeof(pte_t)).
+
+This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
+is 8. This optimization is also applicable only when the size of struct page
+is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
+x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, its
+struct page structs occupy 8 page frames, whose size depends on the size of
+the base page.
+
+For a HugeTLB page mapped at the pud level:
+
+   struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
+               = PAGE_SIZE / 8 * 8 (pages)
+               = PAGE_SIZE (pages)
+
+Where the struct_size(pmd) is the size of the struct page structs of a
+HugeTLB page of the pmd level mapping.
+
+E.g.: On x86_64, the struct page structs of a 2MB HugeTLB page occupy 8 page
+frames, while those of a 1GB HugeTLB page occupy 4096.
+
+Next, we take the pmd level mapping of the HugeTLB page as an example to
+show the internal implementation of this optimization. There are 8 pages of
+struct page structs associated with a HugeTLB page which is pmd mapped.
+
+Here is how things look before optimization.
+
+    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
+ +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ |           |                     |     0     | -------------> |     0     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     1     | -------------> |     1     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     2     | -------------> |     2     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     3     | -------------> |     3     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     4     | -------------> |     4     |
+ |    PMD    |                     +-----------+                +-----------+
+ |   level   |                     |     5     | -------------> |     5     |
+ |  mapping  |                     +-----------+                +-----------+
+ |           |                     |     6     | -------------> |     6     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     7     | -------------> |     7     |
+ |           |                     +-----------+                +-----------+
+ |           |
+ |           |
+ |           |
+ +-----------+
+
+The value of page->compound_head is the same for all tail pages. The first
+page of page structs (page 0) associated with the HugeTLB page contains the 4
+page structs necessary to describe the HugeTLB. The only use of the remaining
+pages of page structs (page 1 to page 7) is to point to page->compound_head.
+Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
+will be used for each HugeTLB page. This will allow us to free the remaining
+6 pages to the buddy allocator.
+
+Here is how things look after remapping.
+
+    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
+ +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ |           |                     |     0     | -------------> |     0     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     1     | -------------> |     1     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
+ |           |                     +-----------+                   | | | | |
+ |           |                     |     3     | ------------------+ | | | |
+ |           |                     +-----------+                     | | | |
+ |           |                     |     4     | --------------------+ | | |
+ |    PMD    |                     +-----------+                       | | |
+ |   level   |                     |     5     | ----------------------+ | |
+ |  mapping  |                     +-----------+                         | |
+ |           |                     |     6     | ------------------------+ |
+ |           |                     +-----------+                           |
+ |           |                     |     7     | --------------------------+
+ |           |                     +-----------+
+ |           |
+ |           |
+ |           |
+ +-----------+
+
+When a HugeTLB page is freed to the buddy system, we must allocate 6 pages
+for vmemmap pages and restore the previous mapping relationship.
+
+A HugeTLB page mapped at the pud level is handled similarly. The same
+approach can be used to free (PAGE_SIZE - 2) vmemmap pages.
+
+Apart from HugeTLB pages mapped at the pmd/pud level, some architectures
+(e.g. aarch64) provide a contiguous bit in the translation table entries
+that hints to the MMU that the entry is one of a contiguous set of
+entries that can be cached in a single TLB entry.
+
+The contiguous bit is used to increase the mapping size at the pmd and pte
+(last) level. So this type of HugeTLB page can be optimized only when its
+size of the struct page structs is greater than 2 pages.
+
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index eff5fbd492d0..19f981a73a54 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -31,6 +31,7 @@ descriptions of data structures and algorithms.
    active_mm
    arch_pgtable_helpers
    balance
+   compound_pagemaps
    cleancache
    free_page_reporting
    frontswap
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index c540c21e26f5..69d1f0a90e02 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -6,167 +6,7 @@
  *
  *     Author: Muchun Song <songmuchun@bytedance.com>
  *
- * The struct page structures (page structs) are used to describe a physical
- * page frame. By default, there is a one-to-one mapping from a page frame to
- * it's corresponding page struct.
- *
- * HugeTLB pages consist of multiple base page size pages and is supported by
- * many architectures. See hugetlbpage.rst in the Documentation directory for
- * more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB
- * are currently supported. Since the base page size on x86 is 4KB, a 2MB
- * HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
- * 4096 base pages. For each base page, there is a corresponding page struct.
- *
- * Within the HugeTLB subsystem, only the first 4 page structs are used to
- * contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
- * this upper limit. The only 'useful' information in the remaining page structs
- * is the compound_head field, and this field is the same for all tail pages.
- *
- * By removing redundant page structs for HugeTLB pages, memory can be returned
- * to the buddy allocator for other uses.
- *
- * Different architectures support different HugeTLB pages. For example, the
- * following table is the HugeTLB page size supported by x86 and arm64
- * architectures. Because arm64 supports 4k, 16k, and 64k base pages and
- * supports contiguous entries, so it supports many kinds of sizes of HugeTLB
- * page.
- *
- * +--------------+-----------+-----------------------------------------------+
- * | Architecture | Page Size |                HugeTLB Page Size              |
- * +--------------+-----------+-----------+-----------+-----------+-----------+
- * |    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
- * +--------------+-----------+-----------+-----------+-----------+-----------+
- * |              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
- * |              +-----------+-----------+-----------+-----------+-----------+
- * |    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
- * |              +-----------+-----------+-----------+-----------+-----------+
- * |              |   64KB    |    2MB    |  512MB    |    16GB   |           |
- * +--------------+-----------+-----------+-----------+-----------+-----------+
- *
- * When the system boot up, every HugeTLB page has more than one struct page
- * structs which size is (unit: pages):
- *
- *    struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
- *
- * Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
- * of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
- * relationship.
- *
- *    HugeTLB_Size = n * PAGE_SIZE
- *
- * Then,
- *
- *    struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
- *                = n * sizeof(struct page) / PAGE_SIZE
- *
- * We can use huge mapping at the pud/pmd level for the HugeTLB page.
- *
- * For the HugeTLB page of the pmd level mapping, then
- *
- *    struct_size = n * sizeof(struct page) / PAGE_SIZE
- *                = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
- *                = sizeof(struct page) / sizeof(pte_t)
- *                = 64 / 8
- *                = 8 (pages)
- *
- * Where n is how many pte entries which one page can contains. So the value of
- * n is (PAGE_SIZE / sizeof(pte_t)).
- *
- * This optimization only supports 64-bit system, so the value of sizeof(pte_t)
- * is 8. And this optimization also applicable only when the size of struct page
- * is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
- * x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
- * size of struct page structs of it is 8 page frames which size depends on the
- * size of the base page.
- *
- * For the HugeTLB page of the pud level mapping, then
- *
- *    struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
- *                = PAGE_SIZE / 8 * 8 (pages)
- *                = PAGE_SIZE (pages)
- *
- * Where the struct_size(pmd) is the size of the struct page structs of a
- * HugeTLB page of the pmd level mapping.
- *
- * E.g.: A 2MB HugeTLB page on x86_64 consists in 8 page frames while 1GB
- * HugeTLB page consists in 4096.
- *
- * Next, we take the pmd level mapping of the HugeTLB page as an example to
- * show the internal implementation of this optimization. There are 8 pages
- * struct page structs associated with a HugeTLB page which is pmd mapped.
- *
- * Here is how things look before optimization.
- *
- *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
- * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
- * |           |                     |     0     | -------------> |     0     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     1     | -------------> |     1     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     2     | -------------> |     2     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     3     | -------------> |     3     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     4     | -------------> |     4     |
- * |    PMD    |                     +-----------+                +-----------+
- * |   level   |                     |     5     | -------------> |     5     |
- * |  mapping  |                     +-----------+                +-----------+
- * |           |                     |     6     | -------------> |     6     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     7     | -------------> |     7     |
- * |           |                     +-----------+                +-----------+
- * |           |
- * |           |
- * |           |
- * +-----------+
- *
- * The value of page->compound_head is the same for all tail pages. The first
- * page of page structs (page 0) associated with the HugeTLB page contains the 4
- * page structs necessary to describe the HugeTLB. The only use of the remaining
- * pages of page structs (page 1 to page 7) is to point to page->compound_head.
- * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
- * will be used for each HugeTLB page. This will allow us to free the remaining
- * 6 pages to the buddy allocator.
- *
- * Here is how things look after remapping.
- *
- *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
- * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
- * |           |                     |     0     | -------------> |     0     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     1     | -------------> |     1     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
- * |           |                     +-----------+                   | | | | |
- * |           |                     |     3     | ------------------+ | | | |
- * |           |                     +-----------+                     | | | |
- * |           |                     |     4     | --------------------+ | | |
- * |    PMD    |                     +-----------+                       | | |
- * |   level   |                     |     5     | ----------------------+ | |
- * |  mapping  |                     +-----------+                         | |
- * |           |                     |     6     | ------------------------+ |
- * |           |                     +-----------+                           |
- * |           |                     |     7     | --------------------------+
- * |           |                     +-----------+
- * |           |
- * |           |
- * |           |
- * +-----------+
- *
- * When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
- * vmemmap pages and restore the previous mapping relationship.
- *
- * For the HugeTLB page of the pud level mapping. It is similar to the former.
- * We also can use this approach to free (PAGE_SIZE - 2) vmemmap pages.
- *
- * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
- * (e.g. aarch64) provides a contiguous bit in the translation table entries
- * that hints to the MMU to indicate that it is one of a contiguous set of
- * entries that can be cached in a single TLB entry.
- *
- * The contiguous bit is used to increase the mapping size at the pmd and pte
- * (last) level. So this type of HugeTLB page can be optimized only when its
- * size of the struct page structs is greater than 2 pages.
+ * See Documentation/vm/compound_pagemaps.rst
  */
 #define pr_fmt(fmt)	"HugeTLB: " fmt
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2 08/14] mm/sparse-vmemmap: populate compound pagemaps
  2021-06-17 18:44 [PATCH v2 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (6 preceding siblings ...)
  2021-06-17 18:45 ` [PATCH v2 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
@ 2021-06-17 18:45 ` Joao Martins
  2021-06-17 18:45 ` [PATCH v2 09/14] mm/page_alloc: reuse tail struct pages for " Joao Martins
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-06-17 18:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

A compound pagemap is a dev_pagemap with @align > PAGE_SIZE and it
means that pages are mapped at a given huge page alignment and use
compound pages as opposed to order-0 pages.

Take advantage of the fact that most tail pages look the same (except
the first two) to minimize struct page overhead. Allocate one page for
the vmemmap area that contains the head page and a separate one for
the next 64 pages. The rest of the subsections then reuse this tail
vmemmap page to initialize the rest of the tail pages.

Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
when initializing a compound pagemap with a big enough @align (e.g. 1G
PUD) it will cross various sections. To be able to reuse tail pages
across sections belonging to the same gigantic page, fetch the
@range being mapped (nr_ranges + 1). If the section being mapped is
not at offset 0 of the @align, then look up the PFN of the struct page
address that precedes it and use that to populate the entire
section.
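
The offset check described above can be sketched in plain C. This is
only an illustration, not the code in this patch: the helper names and
PAGES_PER_SECTION_SKETCH are hypothetical, and it assumes that the
mapped range starts on a compound page boundary and that the number of
PFNs per compound page is a power of two.

```c
#include <stdbool.h>

/* Assumed for illustration: 128M sections with 4K base pages. */
#define PAGES_PER_SECTION_SKETCH (1UL << 15)

/*
 * Offset, in PFNs, of @pfn within the compound page that contains it.
 * @pfns_per_compound must be a power of two (hypothetical helper).
 */
static unsigned long compound_offset(unsigned long pfn,
				     unsigned long pfns_per_compound)
{
	return pfn & (pfns_per_compound - 1);
}

/*
 * A section that does not start at offset 0 of its compound page can
 * reuse the tail vmemmap page set up by a preceding section instead of
 * allocating a new one (hypothetical helper).
 */
static bool section_reuses_previous_tail(unsigned long section_start_pfn,
					 unsigned long pfns_per_compound)
{
	return compound_offset(section_start_pfn, pfns_per_compound) != 0;
}
```

With a 1G compound page (1UL << 18 PFNs) and the section size assumed
above, only the first of the 8 sections starts at offset 0. The other
7 would take the reuse path.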

On compound pagemaps with 2M align, this mechanism lets 6 pages be
saved out of the 8 PFNs necessary to set the subsection's 512 struct
pages being mapped. On a 1G compound pagemap it saves 4094 pages.
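
The arithmetic behind those figures can be checked with a small
standalone sketch. It is not kernel code: the constants are
assumptions (4K base pages, a 64-byte struct page, and two vmemmap
pages kept per compound page) and the helper names are made up for
illustration.

```c
/* Assumed for illustration: x86-64 base page size. */
#define BASE_PAGE_SIZE		4096UL
/* Assumed for illustration: sizeof(struct page) on x86-64. */
#define STRUCT_PAGE_SIZE	64UL
/* The head vmemmap page plus the one tail vmemmap page that is shared. */
#define KEPT_VMEMMAP_PAGES	2UL

/* vmemmap pages needed to hold every struct page of one compound page. */
static unsigned long vmemmap_pages(unsigned long compound_size)
{
	unsigned long nr_struct_pages = compound_size / BASE_PAGE_SIZE;

	return nr_struct_pages * STRUCT_PAGE_SIZE / BASE_PAGE_SIZE;
}

/* vmemmap pages that no longer need backing memory of their own. */
static unsigned long vmemmap_pages_saved(unsigned long compound_size)
{
	return vmemmap_pages(compound_size) - KEPT_VMEMMAP_PAGES;
}
```

Under these assumptions, vmemmap_pages_saved(2UL << 20) gives 6 out of
8 and vmemmap_pages_saved(1UL << 30) gives 4094 out of 4096, matching
the numbers above.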

Altmap isn't supported yet, given various restrictions in the altmap pfn
allocator, thus fall back to the already in use vmemmap_populate(). It
is worth noting that altmap for devmap mappings was there to relieve the
pressure of inordinate amounts of memmap space to map terabytes of pmem.
With compound pages the motivation for altmaps for pmem gets reduced.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 Documentation/vm/compound_pagemaps.rst |  27 ++++-
 include/linux/mm.h                     |   2 +-
 mm/memremap.c                          |   1 +
 mm/sparse-vmemmap.c                    | 133 +++++++++++++++++++++++--
 4 files changed, 151 insertions(+), 12 deletions(-)

diff --git a/Documentation/vm/compound_pagemaps.rst b/Documentation/vm/compound_pagemaps.rst
index 6b1af50e8201..c81123327eea 100644
--- a/Documentation/vm/compound_pagemaps.rst
+++ b/Documentation/vm/compound_pagemaps.rst
@@ -2,9 +2,12 @@
 
 .. _commpound_pagemaps:
 
-==================================
-Free some vmemmap pages of HugeTLB
-==================================
+=================================================
+Free some vmemmap pages of HugeTLB and Device DAX
+=================================================
+
+HugeTLB
+=======
 
 The struct page structures (page structs) are used to describe a physical
 page frame. By default, there is a one-to-one mapping from a page frame to
@@ -168,3 +171,21 @@ The contiguous bit is used to increase the mapping size at the pmd and pte
 (last) level. So this type of HugeTLB page can be optimized only when its
 size of the struct page structs is greater than 2 pages.
 
+Device DAX
+==========
+
+The device-dax interface uses the same tail deduplication technique explained
+in the previous chapter, except when used with the vmemmap in the device (altmap).
+
+The differences with HugeTLB are relatively minor.
+
+The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
+PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
+
+There's no remapping of vmemmap given that device-dax memory is not part of
+System RAM ranges initialized at boot, hence the tail deduplication happens
+at a later stage when we populate the sections.
+
+It only uses 3 page structs for storing all information as opposed
+to 4 on HugeTLB pages. This does not affect memory savings between both.
+
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bb3b814e1860..f1454525f4a8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3090,7 +3090,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-			    struct vmem_altmap *altmap);
+			    struct vmem_altmap *altmap, struct page *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
 struct vmem_altmap;
 void *vmemmap_alloc_block_buf(unsigned long size, int node,
diff --git a/mm/memremap.c b/mm/memremap.c
index ffcb924eb6a5..9198fdace903 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 {
 	struct mhp_params params = {
 		.altmap = pgmap_altmap(pgmap),
+		.pgmap = pgmap,
 		.pgprot = PAGE_KERNEL,
 	};
 	const int nr_range = pgmap->nr_range;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 76f4158f6301..aacc6148aec3 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -495,16 +495,31 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 }
 
 pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-				       struct vmem_altmap *altmap)
+				       struct vmem_altmap *altmap,
+				       struct page *block)
 {
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte)) {
 		pte_t entry;
 		void *p;
 
-		p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
-		if (!p)
-			return NULL;
+		if (!block) {
+			p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+			if (!p)
+				return NULL;
+		} else {
+			/*
+			 * When a PTE/PMD entry is freed from the init_mm
+			 * there's a free_pages() call to this page allocated
+			 * above. Thus this get_page() is paired with the
+			 * put_page_testzero() on the freeing path.
+			 * This can only be called by certain ZONE_DEVICE paths,
+			 * and through vmemmap_populate_compound_pages() when
+			 * slab is available.
+			 */
+			get_page(block);
+			p = page_to_virt(block);
+		}
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
 		set_pte_at(&init_mm, addr, pte, entry);
 	}
@@ -571,7 +586,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 }
 
 static int __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap)
+					      struct vmem_altmap *altmap,
+					      struct page *reuse, struct page **page)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -591,10 +607,14 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pmd = vmemmap_pmd_populate(pud, addr, node);
 	if (!pmd)
 		return -ENOMEM;
-	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
 	if (!pte)
 		return -ENOMEM;
 	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+
+	if (page)
+		*page = pte_page(*pte);
+	return 0;
 }
 
 int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
@@ -603,7 +623,97 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 	unsigned long addr = start;
 
 	for (; addr < end; addr += PAGE_SIZE) {
-		if (vmemmap_populate_address(addr, node, altmap))
+		if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int __meminit vmemmap_populate_range(unsigned long start,
+					    unsigned long end,
+					    int node, struct page *page)
+{
+	unsigned long addr = start;
+
+	for (; addr < end; addr += PAGE_SIZE) {
+		if (vmemmap_populate_address(addr, node, NULL, page, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
+						  struct page **page)
+{
+	return vmemmap_populate_address(addr, node, NULL, NULL, page);
+}
+
+static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
+						     unsigned long start,
+						     unsigned long end, int node,
+						     struct dev_pagemap *pgmap)
+{
+	unsigned long offset, size, addr;
+
+	/*
+	 * For compound pages bigger than section size (e.g. x86 1G compound
+	 * pages with 2M subsection size) fill the rest of sections as tail
+	 * pages.
+	 *
+	 * Note that memremap_pages() resets @nr_range value and will increment
+	 * it after each range successful onlining. Thus the value of @nr_range
+	 * at section memmap populate corresponds to the in-progress range
+	 * being onlined here.
+	 */
+	offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
+	if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
+	    pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
+		pte_t *ptep;
+
+		addr = start - PAGE_SIZE;
+
+		/*
+		 * Sections are populated sequentially and in succession, meaning
+		 * this section being populated wouldn't start if the
+		 * preceding one wasn't successful. So there is a guarantee that
+		 * the previous struct pages are mapped when trying to lookup
+		 * the last tail page.
+		 */
+		ptep = pte_offset_kernel(pmd_off_k(addr), addr);
+		if (!ptep)
+			return -ENOMEM;
+
+		/*
+		 * Reuse the page that was populated in the prior iteration
+		 * with just tail struct pages.
+		 */
+		return vmemmap_populate_range(start, end, node,
+					      pte_page(*ptep));
+	}
+
+	size = min(end - start, pgmap_pfn_geometry(pgmap) * sizeof(struct page));
+	for (addr = start; addr < end; addr += size) {
+		unsigned long next = addr, last = addr + size;
+		struct page *block;
+
+		/* Populate the head page vmemmap page */
+		if (vmemmap_populate_page(addr, node, NULL))
+			return -ENOMEM;
+
+		/* Populate the tail pages vmemmap page */
+		block = NULL;
+		next = addr + PAGE_SIZE;
+		if (vmemmap_populate_page(next, node, &block))
+			return -ENOMEM;
+
+		/*
+		 * Reuse the previous page for the rest of tail pages
+		 * See layout diagram in Documentation/vm/compound_pagemaps.rst
+		 */
+		next += PAGE_SIZE;
+		if (vmemmap_populate_range(next, last, node, block))
 			return -ENOMEM;
 	}
 
@@ -616,12 +726,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
+	unsigned int geometry = pgmap_geometry(pgmap);
+	int r;
 
 	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
 		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
 		return NULL;
 
-	if (vmemmap_populate(start, end, nid, altmap))
+	if (geometry > PAGE_SIZE && !altmap)
+		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
+	else
+		r = vmemmap_populate(start, end, nid, altmap);
+
+	if (r < 0)
 		return NULL;
 
 	return pfn_to_page(pfn);
-- 
2.17.1


* [PATCH v2 09/14] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-06-17 18:44 [PATCH v2 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (7 preceding siblings ...)
  2021-06-17 18:45 ` [PATCH v2 08/14] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
@ 2021-06-17 18:45 ` Joao Martins
  2021-06-17 18:45 ` [PATCH v2 10/14] device-dax: use ALIGN() for determining pgoff Joao Martins
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-06-17 18:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Currently memmap_init_zone_device() ends up initializing 32768 pages
when it only needs to initialize 128 given tail page reuse. That
number is worse with 1GB compound page geometries, 262144 instead of
128. Update memmap_init_zone_device() to skip redundant
initialization, detailed below.

When a pgmap @geometry is set, all pages are mapped at a given huge page
alignment and use compound pages to describe them as opposed to a
struct per 4K.

With @geometry > PAGE_SIZE and when struct pages are stored in ram
(!altmap) most tail pages are reused. Consequently, the number of unique
struct pages is a lot smaller than the total number of struct pages
being mapped.

The altmap path is left alone since it does not support memory savings
based on compound pagemap geometries.
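
As a rough illustration of the clamp described above, here is a minimal
userspace sketch. It assumes 4K base pages and a 64-byte struct page
(typical for x86_64, but an assumption here), and pages_to_init() is a
hypothetical helper rather than a kernel function:

```c
#include <assert.h>

#define PAGE_SIZE        4096UL	/* assumption: x86_64 base page size */
#define STRUCT_PAGE_SIZE 64UL	/* assumption: sizeof(struct page) */

/* Hypothetical helper: struct pages initialized per compound page. */
static unsigned long pages_to_init(unsigned long geometry, int has_altmap)
{
	unsigned long nr_pages = geometry / PAGE_SIZE;
	/* Head vmemmap page plus the first tail vmemmap page. */
	unsigned long unique = 2 * (PAGE_SIZE / STRUCT_PAGE_SIZE);

	assert(geometry >= PAGE_SIZE);

	/* Only the !altmap case reuses tail pages, so only it is clamped. */
	if (!has_altmap && nr_pages > unique)
		nr_pages = unique;
	return nr_pages;
}
```

Under these assumptions both a 2M and a 1G geometry initialize 128 struct
pages per compound page without an altmap, while the altmap case keeps
initializing every struct page (262144 for 1G).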

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/page_alloc.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 42611c206d0a..cf4c2cd32874 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6608,11 +6608,23 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 static void __ref memmap_init_compound(struct page *page, unsigned long pfn,
 					unsigned long zone_idx, int nid,
 					struct dev_pagemap *pgmap,
+					struct vmem_altmap *altmap,
 					unsigned long nr_pages)
 {
 	unsigned int order_align = order_base_2(nr_pages);
 	unsigned long i;
 
+	/*
+	 * With compound page geometry and when struct pages are stored in ram
+	 * (!altmap) most tail pages are reused. Consequently, the amount of
+	 * unique struct pages to initialize is a lot smaller than the total
+	 * amount of struct pages being mapped.
+	 * See vmemmap_populate_compound_pages().
+	 */
+	if (!altmap)
+		nr_pages = min_t(unsigned long, nr_pages,
+				 2 * (PAGE_SIZE/sizeof(struct page)));
+
 	__SetPageHead(page);
 
 	for (i = 1; i < nr_pages; i++) {
@@ -6665,7 +6677,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 			continue;
 
 		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
-				     pfns_per_compound);
+				     altmap, pfns_per_compound);
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1


* [PATCH v2 10/14] device-dax: use ALIGN() for determining pgoff
  2021-06-17 18:44 [PATCH v2 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (8 preceding siblings ...)
  2021-06-17 18:45 ` [PATCH v2 09/14] mm/page_alloc: reuse tail struct pages for " Joao Martins
@ 2021-06-17 18:45 ` Joao Martins
  2021-06-17 18:45 ` [PATCH v2 11/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-06-17 18:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Rather than calculating @pgoff manually, switch to ALIGN() instead.
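
For reference, a small sketch of how the rounding directions compare. The
macros below are simplified userspace stand-ins for the kernel's ALIGN()
(rounds up) and ALIGN_DOWN() (rounds down), assuming a power-of-two
alignment; open_coded() is a hypothetical wrapper around the expression
being replaced:

```c
#include <assert.h>

/* Simplified stand-ins for the kernel macros; @a must be a power of two. */
#define ALIGN_UP_SKETCH(x, a)   (((x) + ((a) - 1)) & ~((a) - 1))
#define ALIGN_DOWN_SKETCH(x, a) ((x) & ~((a) - 1))

/* The open-coded form: vmf->address & ~(fault_size - 1). */
static unsigned long open_coded(unsigned long address, unsigned long fault_size)
{
	/* The mask trick is only valid for power-of-two sizes. */
	assert((fault_size & (fault_size - 1)) == 0);
	return address & ~(fault_size - 1);
}
```

For an address that is already aligned all three forms agree. For an
address inside a huge page, only the rounding-down forms return the start
of that huge page, so the choice of helper matters for @pgoff.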

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index dd8222a42808..0b82159b3564 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -234,8 +234,8 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		 * mapped. No need to consider the zero page, or racing
 		 * conflicting mappings.
 		 */
-		pgoff = linear_page_index(vmf->vma, vmf->address
-				& ~(fault_size - 1));
+		pgoff = linear_page_index(vmf->vma,
+				ALIGN_DOWN(vmf->address, fault_size));
 		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
 			struct page *page;
 
-- 
2.17.1


* [PATCH v2 11/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  2021-06-17 18:44 [PATCH v2 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (9 preceding siblings ...)
  2021-06-17 18:45 ` [PATCH v2 10/14] device-dax: use ALIGN() for determining pgoff Joao Martins
@ 2021-06-17 18:45 ` Joao Martins
  2021-06-17 18:45 ` [PATCH v2 12/14] device-dax: compound pagemap support Joao Martins
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-06-17 18:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Right now, only static dax regions have a valid @pgmap pointer in their
struct dev_dax. Dynamic dax regions, however, do not.

In preparation for device-dax compound pagemap support, make sure that
dev_dax pgmap field is set after it has been allocated and initialized.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/device.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 0b82159b3564..6e348b5f9d45 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -426,6 +426,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 	}
 
 	pgmap->type = MEMORY_DEVICE_GENERIC;
+	dev_dax->pgmap = pgmap;
+
 	addr = devm_memremap_pages(dev, pgmap);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
-- 
2.17.1


* [PATCH v2 12/14] device-dax: compound pagemap support
  2021-06-17 18:44 [PATCH v2 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (10 preceding siblings ...)
  2021-06-17 18:45 ` [PATCH v2 11/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
@ 2021-06-17 18:45 ` Joao Martins
  2021-06-17 18:45 ` [PATCH v2 13/14] mm/gup: grab head page refcount once for group of subpages Joao Martins
  2021-06-17 18:45 ` [PATCH v2 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry Joao Martins
  13 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-06-17 18:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Use the newly added compound pagemap facility, which maps the assigned dax
ranges as compound pages at a page size of @align. This means that
region/namespace bootstrap takes considerably less time, given that
considerably fewer pages are initialized.

On setups with 128G NVDIMMs the initialization with DRAM stored struct
pages improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less
than 1 ms with 1G pages.

dax devices are created with a fixed @align (huge page size), which is
also enforced at mmap() of the device. Consequently, faults also happen at
the @align specified at creation, and that does not change throughout the
dax device lifetime. MCEs poison a whole dax huge page, and splits
likewise occur at the configured page size.
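
To give a feel for why bootstrap gets cheaper, the sketch below counts how
many struct pages would be initialized for a whole device. It assumes 4K
base pages, a 64-byte struct page, DRAM stored struct pages (no altmap)
and at most two unique vmemmap pages per compound page;
device_pages_to_init() is a hypothetical helper, not a kernel function:

```c
#include <assert.h>

#define PAGE_SIZE        4096ULL	/* assumption: x86_64 base page size */
#define STRUCT_PAGE_SIZE 64ULL		/* assumption: sizeof(struct page) */

/* Hypothetical helper: struct pages initialized for a whole device. */
static unsigned long long device_pages_to_init(unsigned long long dev_size,
					       unsigned long long align)
{
	unsigned long long per_compound = align / PAGE_SIZE;
	unsigned long long unique = 2 * (PAGE_SIZE / STRUCT_PAGE_SIZE);

	assert(align >= PAGE_SIZE && dev_size % align == 0);

	/* With tail page reuse only the unique struct pages are touched. */
	if (per_compound > unique)
		per_compound = unique;
	return (dev_size / align) * per_compound;
}
```

For a 128G device this gives 33554432 struct pages with 4K pages, 8388608
with 2M pages and 16384 with 1G pages, which is in line with the direction
of the timings quoted above.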

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/device.c | 56 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 43 insertions(+), 13 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 6e348b5f9d45..149627c922cc 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -192,6 +192,42 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 }
 #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
+static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
+			     unsigned long fault_size,
+			     struct address_space *f_mapping)
+{
+	unsigned long i;
+	pgoff_t pgoff;
+
+	pgoff = linear_page_index(vmf->vma, ALIGN_DOWN(vmf->address, fault_size));
+
+	for (i = 0; i < fault_size / PAGE_SIZE; i++) {
+		struct page *page;
+
+		page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+		if (page->mapping)
+			continue;
+		page->mapping = f_mapping;
+		page->index = pgoff + i;
+	}
+}
+
+static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
+				 unsigned long fault_size,
+				 struct address_space *f_mapping)
+{
+	struct page *head;
+
+	head = pfn_to_page(pfn_t_to_pfn(pfn));
+	head = compound_head(head);
+	if (head->mapping)
+		return;
+
+	head->mapping = f_mapping;
+	head->index = linear_page_index(vmf->vma,
+			ALIGN_DOWN(vmf->address, fault_size));
+}
+
 static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
@@ -225,8 +261,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 	}
 
 	if (rc == VM_FAULT_NOPAGE) {
-		unsigned long i;
-		pgoff_t pgoff;
+		struct dev_pagemap *pgmap = dev_dax->pgmap;
 
 		/*
 		 * In the device-dax case the only possibility for a
@@ -234,17 +269,10 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		 * mapped. No need to consider the zero page, or racing
 		 * conflicting mappings.
 		 */
-		pgoff = linear_page_index(vmf->vma,
-				ALIGN_DOWN(vmf->address, fault_size));
-		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
-			struct page *page;
-
-			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
-			if (page->mapping)
-				continue;
-			page->mapping = filp->f_mapping;
-			page->index = pgoff + i;
-		}
+		if (pgmap_geometry(pgmap) > PAGE_SIZE)
+			set_compound_mapping(vmf, pfn, fault_size, filp->f_mapping);
+		else
+			set_page_mapping(vmf, pfn, fault_size, filp->f_mapping);
 	}
 	dax_read_unlock(id);
 
@@ -426,6 +454,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 	}
 
 	pgmap->type = MEMORY_DEVICE_GENERIC;
+	if (dev_dax->align > PAGE_SIZE)
+		pgmap->geometry = dev_dax->align;
 	dev_dax->pgmap = pgmap;
 
 	addr = devm_memremap_pages(dev, pgmap);
-- 
2.17.1


* [PATCH v2 13/14] mm/gup: grab head page refcount once for group of subpages
  2021-06-17 18:44 [PATCH v2 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (11 preceding siblings ...)
  2021-06-17 18:45 ` [PATCH v2 12/14] device-dax: compound pagemap support Joao Martins
@ 2021-06-17 18:45 ` Joao Martins
  2021-06-17 18:45 ` [PATCH v2 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry Joao Martins
  13 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-06-17 18:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Use try_grab_compound_head() for device-dax GUP when configured with a
compound pagemap.

Rather than incrementing the refcount for each page, do one atomic
addition for all the pages to be pinned.
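
The idea can be sketched in userspace with C11 atomics. This is a
simplified model, not the kernel implementation: struct fake_head and both
helpers are hypothetical, and the per-page loop bumps a single counter
where the previous kernel path took a reference on each struct page:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a page carrying a refcount. */
struct fake_head {
	atomic_int refcount;
};

/* Per-page scheme: one atomic operation for every subpage. */
static int grab_each(struct fake_head *head, int nr)
{
	int ops = 0;

	for (int i = 0; i < nr; i++, ops++)
		atomic_fetch_add(&head->refcount, 1);
	return ops;	/* number of atomic operations issued */
}

/* Batched scheme: a single atomic addition covering all @nr subpages. */
static int grab_batched(struct fake_head *head, int nr)
{
	assert(nr > 0);
	atomic_fetch_add(&head->refcount, nr);
	return 1;	/* number of atomic operations issued */
}
```

Both helpers leave the counter at the same value; the batched variant gets
there with one atomic operation instead of one per 4K subpage (512 for a
2M page).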

Performance measured by gup_benchmark improves considerably for
get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:

 $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
(get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
(pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
[altmap]
(get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
(pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms

 $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
(get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
(pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
[altmap with -m 127004]
(get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
(pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/gup.c | 53 +++++++++++++++++++++++++++++++++--------------------
 1 file changed, 33 insertions(+), 20 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 42b8b1fa6521..9baaa1c0b7f3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2234,31 +2234,55 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 }
 #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
 
+
+static int record_subpages(struct page *page, unsigned long addr,
+			   unsigned long end, struct page **pages)
+{
+	int nr;
+
+	for (nr = 0; addr != end; addr += PAGE_SIZE)
+		pages[nr++] = page++;
+
+	return nr;
+}
+
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 			     unsigned long end, unsigned int flags,
 			     struct page **pages, int *nr)
 {
-	int nr_start = *nr;
+	int refs, nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
 
 	do {
-		struct page *page = pfn_to_page(pfn);
+		struct page *pinned_head, *head, *page = pfn_to_page(pfn);
+		unsigned long next;
 
 		pgmap = get_dev_pagemap(pfn, pgmap);
 		if (unlikely(!pgmap)) {
 			undo_dev_pagemap(nr, nr_start, flags, pages);
 			return 0;
 		}
-		SetPageReferenced(page);
-		pages[*nr] = page;
-		if (unlikely(!try_grab_page(page, flags))) {
-			undo_dev_pagemap(nr, nr_start, flags, pages);
+
+		head = compound_head(page);
+		/* @end is assumed to be limited to at most one compound page */
+		next = PageCompound(head) ? end : addr + PAGE_SIZE;
+		refs = record_subpages(page, addr, next, pages + *nr);
+
+		SetPageReferenced(head);
+		pinned_head = try_grab_compound_head(head, refs, flags);
+		if (!pinned_head) {
+			if (PageCompound(head)) {
+				ClearPageReferenced(head);
+				put_dev_pagemap(pgmap);
+			} else {
+				undo_dev_pagemap(nr, nr_start, flags, pages);
+			}
 			return 0;
 		}
-		(*nr)++;
-		pfn++;
-	} while (addr += PAGE_SIZE, addr != end);
+		*nr += refs;
+		pfn += refs;
+	} while (addr += (refs << PAGE_SHIFT), addr != end);
 
 	if (pgmap)
 		put_dev_pagemap(pgmap);
@@ -2318,17 +2342,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
-static int record_subpages(struct page *page, unsigned long addr,
-			   unsigned long end, struct page **pages)
-{
-	int nr;
-
-	for (nr = 0; addr != end; addr += PAGE_SIZE)
-		pages[nr++] = page++;
-
-	return nr;
-}
-
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
-- 
2.17.1


* [PATCH v2 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry
  2021-06-17 18:44 [PATCH v2 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (12 preceding siblings ...)
  2021-06-17 18:45 ` [PATCH v2 13/14] mm/gup: grab head page refcount once for group of subpages Joao Martins
@ 2021-06-17 18:45 ` Joao Martins
  13 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-06-17 18:45 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Currently, for compound PUD mappings, the implementation consumes 40MB
per TB but it can be optimized to 16MB per TB with the approach
detailed below.

Right now base pages are used to populate the PUD tail pages, picking
the address of the last page of the subsection that precedes the memmap
being initialized.  This is done when a given memmap address isn't
aligned to the pgmap @geometry (which is safe to do because @ranges are
guaranteed to be aligned to @geometry).

For pagemaps with an alignment which spans several sections, this means
that PMD pages are unnecessarily allocated just to reuse the same tail
pages.  Effectively, on x86 a PUD can span 8 sections (depending on
config), and a page is allocated for each PMD in order to reuse the
tail vmemmap across the rest of the PTEs. In short, the PMDs covering
the tail vmemmap areas all contain the same PFNs. So instead of doing
it this way, populate a new PMD on the second section of the compound
page (the tail vmemmap PMD), and have the following sections reuse
that previously populated PMD, which only contains tail pages.

With this scheme, for a 1GB pagemap aligned area, the first PMD
(section) would contain the head page and 32767 tail pages, whereas the
second PMD contains the full 32768 tail pages.  The latter gets its
PMD reused across the remaining section mappings of the same pagemap.

Besides allocating fewer pagetable entries and keeping parity with
hugepages in the directmap (as done by vmemmap_populate_hugepages()),
this further increases savings per compound page. Rather than
requiring 8 PMD page allocations, only 2 are needed (plus two base
pages allocated for the head and tail areas of the first PMD). 2M
pages still require using base pages, though.
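
The 40MB and 16MB per TB figures can be reproduced with a short sketch.
It assumes 4K pages and the x86_64 layout described in this patch, where a
1G compound page's vmemmap spans 8 PMD entries; the helper names are
hypothetical:

```c
#include <assert.h>

#define PAGE_SIZE   4096UL	/* assumption: x86_64 base page size */
#define PMDS_PER_1G 8UL		/* PMD entries covering one 1G page's vmemmap */

/* Before: one page-table page per PMD entry plus head and tail pages. */
static unsigned long overhead_before(void)
{
	return (PMDS_PER_1G + 2) * PAGE_SIZE;
}

/* After: PMD 0, one shared tail PMD page, plus head and tail pages. */
static unsigned long overhead_after(void)
{
	return (2 + 2) * PAGE_SIZE;
}

/* Scale a per-1G overhead in bytes to MB per TB (1024 x 1G pages). */
static unsigned long per_tb_mb(unsigned long per_1g_bytes)
{
	assert(per_1g_bytes % PAGE_SIZE == 0);
	return per_1g_bytes * 1024 / (1024 * 1024);
}
```

Under these assumptions the per-1G overhead drops from 40960 to 16384
bytes, that is from 40MB to 16MB per TB.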

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 Documentation/vm/compound_pagemaps.rst | 109 +++++++++++++++++++++++++
 include/linux/mm.h                     |   3 +-
 mm/sparse-vmemmap.c                    |  74 ++++++++++++++---
 3 files changed, 174 insertions(+), 12 deletions(-)

diff --git a/Documentation/vm/compound_pagemaps.rst b/Documentation/vm/compound_pagemaps.rst
index c81123327eea..a6603b7165f7 100644
--- a/Documentation/vm/compound_pagemaps.rst
+++ b/Documentation/vm/compound_pagemaps.rst
@@ -189,3 +189,112 @@ at a later stage when we populate the sections.
 It only uses 3 page structs for storing all information as opposed
 to 4 on HugeTLB pages. This does not affect memory savings between both.
 
+Additionally, it further extends the tail page deduplication with 1GB
+device-dax compound pages.
+
+E.g.: A 1G device-dax page on x86_64 consists of 4096 page frames, split
+across 8 PMD page frames, with the first PMD having 2 PTE page frames.
+This represents a total of 40960 bytes per 1GB page.
+
+Here is how things look after the previously described tail page deduplication
+technique.
+
+   device-dax      page frames   struct pages(4096 pages)     page frame(2 pages)
+ +-----------+ -> +----------+ --> +-----------+   mapping to   +-------------+
+ |           |    |    0     |     |     0     | -------------> |      0      |
+ |           |    +----------+     +-----------+                +-------------+
+ |           |                     |     1     | -------------> |      1      |
+ |           |                     +-----------+                +-------------+
+ |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^ ^
+ |           |                     +-----------+                   | | | | | |
+ |           |                     |     3     | ------------------+ | | | | |
+ |           |                     +-----------+                     | | | | |
+ |           |                     |     4     | --------------------+ | | | |
+ |   PMD 0   |                     +-----------+                       | | | |
+ |           |                     |     5     | ----------------------+ | | |
+ |           |                     +-----------+                         | | |
+ |           |                     |     ..    | ------------------------+ | |
+ |           |                     +-----------+                           | |
+ |           |                     |     511   | --------------------------+ |
+ |           |                     +-----------+                             |
+ |           |                                                               |
+ |           |                                                               |
+ |           |                                                               |
+ +-----------+     page frames                                               |
+ +-----------+ -> +----------+ --> +-----------+    mapping to               |
+ |           |    |  1 .. 7  |     |    512    | ----------------------------+
+ |           |    +----------+     +-----------+                             |
+ |           |                     |    ..     | ----------------------------+
+ |           |                     +-----------+                             |
+ |           |                     |    ..     | ----------------------------+
+ |           |                     +-----------+                             |
+ |           |                     |    ..     | ----------------------------+
+ |           |                     +-----------+                             |
+ |           |                     |    ..     | ----------------------------+
+ |    PMD    |                     +-----------+                             |
+ |  1 .. 7   |                     |    ..     | ----------------------------+
+ |           |                     +-----------+                             |
+ |           |                     |    ..     | ----------------------------+
+ |           |                     +-----------+                             |
+ |           |                     |    4095   | ----------------------------+
+ +-----------+                     +-----------+
+
+Page frames of PMD 1 through 7 are allocated and mapped to the same PTE page frame
+that stores tail pages. As we can see in the diagram, PMDs 1 through 7
+all look the same. Therefore we can map PMD 2 through 7 to the PMD 1 page frame.
+This allows freeing 6 vmemmap pages per 1GB page, decreasing the overhead per
+1GB page from 40960 bytes to 16384 bytes.
+
+Here is how things look after PMD tail page deduplication.
+
+   device-dax      page frames   struct pages(4096 pages)     page frame(2 pages)
+ +-----------+ -> +----------+ --> +-----------+   mapping to   +-------------+
+ |           |    |    0     |     |     0     | -------------> |      0      |
+ |           |    +----------+     +-----------+                +-------------+
+ |           |                     |     1     | -------------> |      1      |
+ |           |                     +-----------+                +-------------+
+ |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^ ^
+ |           |                     +-----------+                   | | | | | |
+ |           |                     |     3     | ------------------+ | | | | |
+ |           |                     +-----------+                     | | | | |
+ |           |                     |     4     | --------------------+ | | | |
+ |   PMD 0   |                     +-----------+                       | | | |
+ |           |                     |     5     | ----------------------+ | | |
+ |           |                     +-----------+                         | | |
+ |           |                     |     ..    | ------------------------+ | |
+ |           |                     +-----------+                           | |
+ |           |                     |     511   | --------------------------+ |
+ |           |                     +-----------+                             |
+ |           |                                                               |
+ |           |                                                               |
+ |           |                                                               |
+ +-----------+     page frames                                               |
+ +-----------+ -> +----------+ --> +-----------+    mapping to               |
+ |           |    |    1     |     |    512    | ----------------------------+
+ |           |    +----------+     +-----------+                             |
+ |           |     ^ ^ ^ ^ ^ ^     |    ..     | ----------------------------+
+ |           |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    ..     | ----------------------------+
+ |           |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    ..     | ----------------------------+
+ |           |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    ..     | ----------------------------+
+ |   PMD 1   |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    ..     | ----------------------------+
+ |           |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    ..     | ----------------------------+
+ |           |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    4095   | ----------------------------+
+ +-----------+     | | | | | |     +-----------+
+ |   PMD 2   | ----+ | | | | |
+ +-----------+       | | | | |
+ |   PMD 3   | ------+ | | | |
+ +-----------+         | | | |
+ |   PMD 4   | --------+ | | |
+ +-----------+           | | |
+ |   PMD 5   | ----------+ | |
+ +-----------+             | |
+ |   PMD 6   | ------------+ |
+ +-----------+               |
+ |   PMD 7   | --------------+
+ +-----------+
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f1454525f4a8..3f3a5c308939 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3088,7 +3088,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
-pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
+pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+			    struct page *block);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
 			    struct vmem_altmap *altmap, struct page *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index aacc6148aec3..2eba2da31b91 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -537,13 +537,22 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
 	return p;
 }
 
-pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
+pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+				       struct page *block)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
-		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
-		if (!p)
-			return NULL;
+		void *p;
+
+		if (!block) {
+			p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
+			if (!p)
+				return NULL;
+		} else {
+			/* See comment in vmemmap_pte_populate(). */
+			get_page(block);
+			p = page_to_virt(block);
+		}
 		pmd_populate_kernel(&init_mm, pmd, p);
 	}
 	return pmd;
@@ -585,15 +594,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-static int __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap,
-					      struct page *reuse, struct page **page)
+static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
+						  struct vmem_altmap *altmap,
+						  struct page *reuse, pmd_t **ptr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte;
 
 	pgd = vmemmap_pgd_populate(addr, node);
 	if (!pgd)
@@ -604,9 +612,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pud = vmemmap_pud_populate(p4d, addr, node);
 	if (!pud)
 		return -ENOMEM;
-	pmd = vmemmap_pmd_populate(pud, addr, node);
+	pmd = vmemmap_pmd_populate(pud, addr, node, reuse);
 	if (!pmd)
 		return -ENOMEM;
+	if (ptr)
+		*ptr = pmd;
+	return 0;
+}
+
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+					      struct vmem_altmap *altmap,
+					      struct page *reuse, struct page **page)
+{
+	pmd_t *pmd;
+	pte_t *pte;
+
+	if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
+		return -ENOMEM;
+
 	pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
 	if (!pte)
 		return -ENOMEM;
@@ -650,6 +673,20 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
 	return vmemmap_populate_address(addr, node, NULL, NULL, page);
 }
 
+static int __meminit vmemmap_populate_pmd_range(unsigned long start,
+						unsigned long end,
+						int node, struct page *page)
+{
+	unsigned long addr = start;
+
+	for (; addr < end; addr += PMD_SIZE) {
+		if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 						     unsigned long start,
 						     unsigned long end, int node,
@@ -670,6 +707,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 	offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
 	if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
 	    pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
+		pmd_t *pmdp;
 		pte_t *ptep;
 
 		addr = start - PAGE_SIZE;
@@ -681,11 +719,25 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 		 * the previous struct pages are mapped when trying to lookup
 		 * the last tail page.
 		 */
-		ptep = pte_offset_kernel(pmd_off_k(addr), addr);
-		if (!ptep)
+		pmdp = pmd_off_k(addr);
+		if (!pmdp)
+			return -ENOMEM;
+
+		/*
+		 * Reuse the tail pages vmemmap pmd page
+		 * See layout diagram in Documentation/vm/compound_pagemaps.rst
+		 */
+		if (offset % pgmap_geometry(pgmap) > PFN_PHYS(PAGES_PER_SECTION))
+			return vmemmap_populate_pmd_range(start, end, node,
+							  pmd_page(*pmdp));
+
+		/* See comment above when pmd_off_k() is called. */
+		ptep = pte_offset_kernel(pmdp, addr);
+		if (pte_none(*ptep))
 			return -ENOMEM;
 
 		/*
+		 * Populate the tail pages vmemmap pmd page.
 		 * Reuse the page that was populated in the prior iteration
 		 * with just tail struct pages.
 		 */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-06-17 18:44 ` [PATCH v2 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
@ 2021-06-20 23:56   ` HORIGUCHI NAOYA(堀口 直也)
  2021-06-21 13:50     ` Joao Martins
  0 siblings, 1 reply; 23+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-06-20 23:56 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jason Gunthorpe, John Hubbard, Jane Chu, Muchun Song,
	Mike Kravetz, Andrew Morton, Jonathan Corbet, nvdimm, linux-doc

On Thu, Jun 17, 2021 at 07:44:54PM +0100, Joao Martins wrote:
> memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
> dax_lock_page()).  For a pagemap with compound pages, fetch the
> compound_head in case a tail page memory failure is being handled.
> 
> Currently this is a nop, but with the advent of compound pages in
> dev_pagemap it allows memory_failure_dev_pagemap() to keep working.
> 
> Reported-by: Jane Chu <jane.chu@oracle.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Looks good to me.

Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

> ---
>  mm/memory-failure.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index e684b3d5c6a6..f1be578e488f 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1519,6 +1519,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  		goto out;
>  	}
>  
> +	/*
> +	 * Pages instantiated by device-dax (not filesystem-dax)
> +	 * may be compound pages.
> +	 */
> +	page = compound_head(page);
> +
>  	/*
>  	 * Prevent the inode from being freed while we are interrogating
>  	 * the address_space, typically this would be handled by
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [External] [PATCH v2 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm
  2021-06-17 18:45 ` [PATCH v2 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
@ 2021-06-21 13:12   ` Muchun Song
  2021-06-21 13:42     ` Joao Martins
  0 siblings, 1 reply; 23+ messages in thread
From: Muchun Song @ 2021-06-21 13:12 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Dave Jiang, Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe,
	John Hubbard, Jane Chu, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, nvdimm, linux-doc

On Fri, Jun 18, 2021 at 2:46 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> In preparation for device-dax using the hugetlbfs compound page tail
> deduplication technique, move the comment block explanation into a
> common place in Documentation/vm.
>
> Cc: Muchun Song <songmuchun@bytedance.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  Documentation/vm/compound_pagemaps.rst | 170 +++++++++++++++++++++++++
>  Documentation/vm/index.rst             |   1 +
>  mm/hugetlb_vmemmap.c                   | 162 +----------------------
>  3 files changed, 172 insertions(+), 161 deletions(-)
>  create mode 100644 Documentation/vm/compound_pagemaps.rst

IMHO, how about naming it vmemmap_remap.rst? page_frags.rst seems to
tell people it's about the page mapping, not its vmemmap mapping.

Thanks.

>
> diff --git a/Documentation/vm/compound_pagemaps.rst b/Documentation/vm/compound_pagemaps.rst
> new file mode 100644
> index 000000000000..6b1af50e8201
> --- /dev/null
> +++ b/Documentation/vm/compound_pagemaps.rst
> @@ -0,0 +1,170 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _compound_pagemaps:
> +
> +==================================
> +Free some vmemmap pages of HugeTLB
> +==================================
> +
> +The struct page structures (page structs) are used to describe a physical
> +page frame. By default, there is a one-to-one mapping from a page frame to
> +its corresponding page struct.
> +
> +HugeTLB pages consist of multiple base-page-size pages and are supported by
> +many architectures. See hugetlbpage.rst in the Documentation directory for
> +more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB
> +are currently supported. Since the base page size on x86 is 4KB, a 2MB
> +HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
> +262144 base pages. For each base page, there is a corresponding page struct.
> +
> +Within the HugeTLB subsystem, only the first 4 page structs are used to
> +contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
> +this upper limit. The only 'useful' information in the remaining page structs
> +is the compound_head field, and this field is the same for all tail pages.
> +
> +By removing redundant page structs for HugeTLB pages, memory can be returned
> +to the buddy allocator for other uses.
> +
> +Different architectures support different HugeTLB page sizes. For example,
> +the following table lists the HugeTLB page sizes supported by the x86 and
> +arm64 architectures. Because arm64 supports 4k, 16k, and 64k base pages
> +and also supports contiguous entries, it supports many sizes of HugeTLB
> +page.
> +
> ++--------------+-----------+-----------------------------------------------+
> +| Architecture | Page Size |                HugeTLB Page Size              |
> ++--------------+-----------+-----------+-----------+-----------+-----------+
> +|    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
> ++--------------+-----------+-----------+-----------+-----------+-----------+
> +|              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
> +|              +-----------+-----------+-----------+-----------+-----------+
> +|    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
> +|              +-----------+-----------+-----------+-----------+-----------+
> +|              |   64KB    |    2MB    |  512MB    |    16GB   |           |
> ++--------------+-----------+-----------+-----------+-----------+-----------+
> +
> +When the system boots up, every HugeTLB page has more than one struct page
> +struct, whose total size is (unit: pages):
> +
> +   struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> +
> +Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
> +of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
> +relationship.
> +
> +   HugeTLB_Size = n * PAGE_SIZE
> +
> +Then,
> +
> +   struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> +               = n * sizeof(struct page) / PAGE_SIZE
> +
> +We can use huge mapping at the pud/pmd level for the HugeTLB page.
> +
> +For the HugeTLB page of the pmd level mapping, then
> +
> +   struct_size = n * sizeof(struct page) / PAGE_SIZE
> +               = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
> +               = sizeof(struct page) / sizeof(pte_t)
> +               = 64 / 8
> +               = 8 (pages)
> +
> +Where n is how many pte entries one page can contain. So the value of
> +n is (PAGE_SIZE / sizeof(pte_t)).
> +
> +This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
> +is 8. This optimization is also applicable only when the size of struct page
> +is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
> +x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, its
> +struct page structs occupy 8 page frames, whose size depends on the size of
> +the base page.
> +
> +For the HugeTLB page of the pud level mapping, then
> +
> +   struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
> +               = PAGE_SIZE / 8 * 8 (pages)
> +               = PAGE_SIZE (pages)
> +
> +Where the struct_size(pmd) is the size of the struct page structs of a
> +HugeTLB page of the pmd level mapping.
> +
> +E.g.: the struct page structs of a 2MB HugeTLB page on x86_64 occupy 8 page
> +frames, while those of a 1GB HugeTLB page occupy 4096.
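
A rough userspace check of the struct_size arithmetic quoted above, offered
as a sketch rather than kernel code. It assumes the x86-64 values used in the
text (4KB base pages, 64-byte struct page); vmemmap_pages() and the macro
names are made up for this example and do not exist in the kernel.

```c
#include <stddef.h>

/* Assumed x86-64 values taken from the quoted text, not kernel headers. */
#define BASE_PAGE_SIZE   4096UL       /* 4KB base page */
#define STRUCT_PAGE_SIZE 64UL         /* sizeof(struct page) in the common case */
#define HUGETLB_2M       (2UL << 20)  /* pmd-level HugeTLB page */
#define HUGETLB_1G       (1UL << 30)  /* pud-level HugeTLB page */

/*
 * struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
 *
 * Returns the number of base pages needed to hold the struct page
 * structs that describe one HugeTLB page of the given size.
 */
static unsigned long vmemmap_pages(unsigned long hugetlb_size)
{
	unsigned long n = hugetlb_size / BASE_PAGE_SIZE; /* base pages per HugeTLB page */

	return n * STRUCT_PAGE_SIZE / BASE_PAGE_SIZE;
}
```

With these assumptions, vmemmap_pages(HUGETLB_2M) evaluates to 8 and
vmemmap_pages(HUGETLB_1G) to 4096, matching the figures in the text.
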
> +
> +Next, we take the pmd level mapping of the HugeTLB page as an example to
> +show the internal implementation of this optimization. There are 8 pages of
> +struct page structs associated with a HugeTLB page which is pmd mapped.
> +
> +Here is how things look before optimization.
> +
> +    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
> + +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> + |           |                     |     0     | -------------> |     0     |
> + |           |                     +-----------+                +-----------+
> + |           |                     |     1     | -------------> |     1     |
> + |           |                     +-----------+                +-----------+
> + |           |                     |     2     | -------------> |     2     |
> + |           |                     +-----------+                +-----------+
> + |           |                     |     3     | -------------> |     3     |
> + |           |                     +-----------+                +-----------+
> + |           |                     |     4     | -------------> |     4     |
> + |    PMD    |                     +-----------+                +-----------+
> + |   level   |                     |     5     | -------------> |     5     |
> + |  mapping  |                     +-----------+                +-----------+
> + |           |                     |     6     | -------------> |     6     |
> + |           |                     +-----------+                +-----------+
> + |           |                     |     7     | -------------> |     7     |
> + |           |                     +-----------+                +-----------+
> + |           |
> + |           |
> + |           |
> + +-----------+
> +
> +The value of page->compound_head is the same for all tail pages. The first
> +page of page structs (page 0) associated with the HugeTLB page contains the 4
> +page structs necessary to describe the HugeTLB. The only use of the remaining
> +pages of page structs (page 1 to page 7) is to point to page->compound_head.
> +Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
> +will be used for each HugeTLB page. This will allow us to free the remaining
> +6 pages to the buddy allocator.
> +
> +Here is how things look after remapping.
> +
> +    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
> + +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> + |           |                     |     0     | -------------> |     0     |
> + |           |                     +-----------+                +-----------+
> + |           |                     |     1     | -------------> |     1     |
> + |           |                     +-----------+                +-----------+
> + |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
> + |           |                     +-----------+                   | | | | |
> + |           |                     |     3     | ------------------+ | | | |
> + |           |                     +-----------+                     | | | |
> + |           |                     |     4     | --------------------+ | | |
> + |    PMD    |                     +-----------+                       | | |
> + |   level   |                     |     5     | ----------------------+ | |
> + |  mapping  |                     +-----------+                         | |
> + |           |                     |     6     | ------------------------+ |
> + |           |                     +-----------+                           |
> + |           |                     |     7     | --------------------------+
> + |           |                     +-----------+
> + |           |
> + |           |
> + |           |
> + +-----------+
> +
> +When a HugeTLB page is freed to the buddy system, we should allocate 6 pages
> +for vmemmap pages and restore the previous mapping relationship.
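
The remapping described above can be modelled in userspace as below. This is
only a toy model under stated assumptions: backing[] stands in for the vmemmap
PTEs, and remap_tail_pages() and the two macros are hypothetical names, not
the hugetlb_vmemmap.c implementation.

```c
#include <stddef.h>

/* Assumed figures from the quoted text for a pmd-mapped HugeTLB page. */
#define VMEMMAP_PAGES 8  /* pages of struct page structs per HugeTLB page */
#define RESERVED      2  /* page 0 (head structs) and page 1 (shared tail page) */

/*
 * backing[i] is the page frame that backs vmemmap page i.
 * Point every vmemmap page from index RESERVED onwards at the frame
 * backing the last reserved page, and return how many frames became
 * unused as a result (the ones that could be given back).
 */
static size_t remap_tail_pages(size_t backing[], size_t nr)
{
	size_t freed = 0;

	for (size_t i = RESERVED; i < nr; i++) {
		if (backing[i] != backing[RESERVED - 1]) {
			backing[i] = backing[RESERVED - 1];
			freed++;
		}
	}
	return freed;
}
```

Starting from the identity mapping of VMEMMAP_PAGES entries, one call leaves
entries 2 to 7 pointing at frame 1 and reports 6 frames that could go back to
the buddy allocator. Restoring is the reverse: allocate 6 frames and point
each entry at its own frame again.
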
> +
> +The HugeTLB page of the pud level mapping is similar to the former.
> +We can also use this approach to free (PAGE_SIZE - 2) vmemmap pages.
> +
> +Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
> +(e.g. aarch64) provide a contiguous bit in the translation table entries
> +that hints to the MMU that the entry is one of a contiguous set of
> +entries that can be cached in a single TLB entry.
> +
> +The contiguous bit is used to increase the mapping size at the pmd and pte
> +(last) level. So this type of HugeTLB page can be optimized only when the
> +size of its struct page structs is greater than 2 pages.
> +
> diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> index eff5fbd492d0..19f981a73a54 100644
> --- a/Documentation/vm/index.rst
> +++ b/Documentation/vm/index.rst
> @@ -31,6 +31,7 @@ descriptions of data structures and algorithms.
>     active_mm
>     arch_pgtable_helpers
>     balance
> +   compound_pagemaps
>     cleancache
>     free_page_reporting
>     frontswap
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index c540c21e26f5..69d1f0a90e02 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -6,167 +6,7 @@
>   *
>   *     Author: Muchun Song <songmuchun@bytedance.com>
>   *
> - * The struct page structures (page structs) are used to describe a physical
> - * page frame. By default, there is a one-to-one mapping from a page frame to
> - * it's corresponding page struct.
> - *
> - * HugeTLB pages consist of multiple base page size pages and is supported by
> - * many architectures. See hugetlbpage.rst in the Documentation directory for
> - * more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB
> - * are currently supported. Since the base page size on x86 is 4KB, a 2MB
> - * HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
> - * 4096 base pages. For each base page, there is a corresponding page struct.
> - *
> - * Within the HugeTLB subsystem, only the first 4 page structs are used to
> - * contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
> - * this upper limit. The only 'useful' information in the remaining page structs
> - * is the compound_head field, and this field is the same for all tail pages.
> - *
> - * By removing redundant page structs for HugeTLB pages, memory can be returned
> - * to the buddy allocator for other uses.
> - *
> - * Different architectures support different HugeTLB pages. For example, the
> - * following table is the HugeTLB page size supported by x86 and arm64
> - * architectures. Because arm64 supports 4k, 16k, and 64k base pages and
> - * supports contiguous entries, so it supports many kinds of sizes of HugeTLB
> - * page.
> - *
> - * +--------------+-----------+-----------------------------------------------+
> - * | Architecture | Page Size |                HugeTLB Page Size              |
> - * +--------------+-----------+-----------+-----------+-----------+-----------+
> - * |    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
> - * +--------------+-----------+-----------+-----------+-----------+-----------+
> - * |              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
> - * |              +-----------+-----------+-----------+-----------+-----------+
> - * |    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
> - * |              +-----------+-----------+-----------+-----------+-----------+
> - * |              |   64KB    |    2MB    |  512MB    |    16GB   |           |
> - * +--------------+-----------+-----------+-----------+-----------+-----------+
> - *
> - * When the system boot up, every HugeTLB page has more than one struct page
> - * structs which size is (unit: pages):
> - *
> - *    struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> - *
> - * Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
> - * of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
> - * relationship.
> - *
> - *    HugeTLB_Size = n * PAGE_SIZE
> - *
> - * Then,
> - *
> - *    struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> - *                = n * sizeof(struct page) / PAGE_SIZE
> - *
> - * We can use huge mapping at the pud/pmd level for the HugeTLB page.
> - *
> - * For the HugeTLB page of the pmd level mapping, then
> - *
> - *    struct_size = n * sizeof(struct page) / PAGE_SIZE
> - *                = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
> - *                = sizeof(struct page) / sizeof(pte_t)
> - *                = 64 / 8
> - *                = 8 (pages)
> - *
> - * Where n is how many pte entries which one page can contains. So the value of
> - * n is (PAGE_SIZE / sizeof(pte_t)).
> - *
> - * This optimization only supports 64-bit system, so the value of sizeof(pte_t)
> - * is 8. And this optimization also applicable only when the size of struct page
> - * is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
> - * x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
> - * size of struct page structs of it is 8 page frames which size depends on the
> - * size of the base page.
> - *
> - * For the HugeTLB page of the pud level mapping, then
> - *
> - *    struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
> - *                = PAGE_SIZE / 8 * 8 (pages)
> - *                = PAGE_SIZE (pages)
> - *
> - * Where the struct_size(pmd) is the size of the struct page structs of a
> - * HugeTLB page of the pmd level mapping.
> - *
> - * E.g.: A 2MB HugeTLB page on x86_64 consists in 8 page frames while 1GB
> - * HugeTLB page consists in 4096.
> - *
> - * Next, we take the pmd level mapping of the HugeTLB page as an example to
> - * show the internal implementation of this optimization. There are 8 pages
> - * struct page structs associated with a HugeTLB page which is pmd mapped.
> - *
> - * Here is how things look before optimization.
> - *
> - *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
> - * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> - * |           |                     |     0     | -------------> |     0     |
> - * |           |                     +-----------+                +-----------+
> - * |           |                     |     1     | -------------> |     1     |
> - * |           |                     +-----------+                +-----------+
> - * |           |                     |     2     | -------------> |     2     |
> - * |           |                     +-----------+                +-----------+
> - * |           |                     |     3     | -------------> |     3     |
> - * |           |                     +-----------+                +-----------+
> - * |           |                     |     4     | -------------> |     4     |
> - * |    PMD    |                     +-----------+                +-----------+
> - * |   level   |                     |     5     | -------------> |     5     |
> - * |  mapping  |                     +-----------+                +-----------+
> - * |           |                     |     6     | -------------> |     6     |
> - * |           |                     +-----------+                +-----------+
> - * |           |                     |     7     | -------------> |     7     |
> - * |           |                     +-----------+                +-----------+
> - * |           |
> - * |           |
> - * |           |
> - * +-----------+
> - *
> - * The value of page->compound_head is the same for all tail pages. The first
> - * page of page structs (page 0) associated with the HugeTLB page contains the 4
> - * page structs necessary to describe the HugeTLB. The only use of the remaining
> - * pages of page structs (page 1 to page 7) is to point to page->compound_head.
> - * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
> - * will be used for each HugeTLB page. This will allow us to free the remaining
> - * 6 pages to the buddy allocator.
> - *
> - * Here is how things look after remapping.
> - *
> - *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
> - * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> - * |           |                     |     0     | -------------> |     0     |
> - * |           |                     +-----------+                +-----------+
> - * |           |                     |     1     | -------------> |     1     |
> - * |           |                     +-----------+                +-----------+
> - * |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
> - * |           |                     +-----------+                   | | | | |
> - * |           |                     |     3     | ------------------+ | | | |
> - * |           |                     +-----------+                     | | | |
> - * |           |                     |     4     | --------------------+ | | |
> - * |    PMD    |                     +-----------+                       | | |
> - * |   level   |                     |     5     | ----------------------+ | |
> - * |  mapping  |                     +-----------+                         | |
> - * |           |                     |     6     | ------------------------+ |
> - * |           |                     +-----------+                           |
> - * |           |                     |     7     | --------------------------+
> - * |           |                     +-----------+
> - * |           |
> - * |           |
> - * |           |
> - * +-----------+
> - *
> - * When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
> - * vmemmap pages and restore the previous mapping relationship.
> - *
> - * For the HugeTLB page of the pud level mapping. It is similar to the former.
> - * We also can use this approach to free (PAGE_SIZE - 2) vmemmap pages.
> - *
> - * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
> - * (e.g. aarch64) provides a contiguous bit in the translation table entries
> - * that hints to the MMU to indicate that it is one of a contiguous set of
> - * entries that can be cached in a single TLB entry.
> - *
> - * The contiguous bit is used to increase the mapping size at the pmd and pte
> - * (last) level. So this type of HugeTLB page can be optimized only when its
> - * size of the struct page structs is greater than 2 pages.
> + * See Documentation/vm/compound_pagemaps.rst
>   */
>  #define pr_fmt(fmt)    "HugeTLB: " fmt
>
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [External] [PATCH v2 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm
  2021-06-21 13:12   ` [External] " Muchun Song
@ 2021-06-21 13:42     ` Joao Martins
  2021-07-13  0:14       ` Mike Kravetz
  0 siblings, 1 reply; 23+ messages in thread
From: Joao Martins @ 2021-06-21 13:42 UTC (permalink / raw)
  To: Muchun Song
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Dave Jiang, Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe,
	John Hubbard, Jane Chu, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, nvdimm, linux-doc

On 6/21/21 2:12 PM, Muchun Song wrote:
> On Fri, Jun 18, 2021 at 2:46 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> In preparation for device-dax using the hugetlbfs compound page tail
>> deduplication technique, move the comment block explanation into a
>> common place in Documentation/vm.
>>
>> Cc: Muchun Song <songmuchun@bytedance.com>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  Documentation/vm/compound_pagemaps.rst | 170 +++++++++++++++++++++++++
>>  Documentation/vm/index.rst             |   1 +
>>  mm/hugetlb_vmemmap.c                   | 162 +----------------------
>>  3 files changed, 172 insertions(+), 161 deletions(-)
>>  create mode 100644 Documentation/vm/compound_pagemaps.rst
> 
> IMHO, how about naming it vmemmap_remap.rst? page_frags.rst seems to
> tell people it's about the page mapping, not its vmemmap mapping.
> 

Good point.

FWIW, I wanted to avoid the use of the word 'remap' solely because that might be
implementation specific e.g. hugetlbfs remaps struct pages, whereas device-dax will
populate struct pages already with the tail dedup.

Me using 'compound_pagemaps' was short for 'compound struct page map' or 'compound vmemmap'.

Maybe one other alternative is 'tail_dedup.rst' or 'metadata_dedup.rst'? That's probably
a more generic description of what is really being done.

Regardless, I am also good with 'vmemmap_remap.rst' if that's what folks prefer.


>>
>> diff --git a/Documentation/vm/compound_pagemaps.rst b/Documentation/vm/compound_pagemaps.rst
>> new file mode 100644
>> index 000000000000..6b1af50e8201
>> --- /dev/null
>> +++ b/Documentation/vm/compound_pagemaps.rst
>> @@ -0,0 +1,170 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +.. _compound_pagemaps:
>> +
>> +==================================
>> +Free some vmemmap pages of HugeTLB
>> +==================================
>> +
>> +The struct page structures (page structs) are used to describe a physical
>> +page frame. By default, there is a one-to-one mapping from a page frame to
>> +its corresponding page struct.
>> +
>> +HugeTLB pages consist of multiple base-page-size pages and are supported by
>> +many architectures. See hugetlbpage.rst in the Documentation directory for
>> +more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB
>> +are currently supported. Since the base page size on x86 is 4KB, a 2MB
>> +HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
>> +262144 base pages. For each base page, there is a corresponding page struct.
>> +
>> +Within the HugeTLB subsystem, only the first 4 page structs are used to
>> +contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
>> +this upper limit. The only 'useful' information in the remaining page structs
>> +is the compound_head field, and this field is the same for all tail pages.
>> +
>> +By removing redundant page structs for HugeTLB pages, memory can be returned
>> +to the buddy allocator for other uses.
>> +
>> +Different architectures support different HugeTLB page sizes. For example,
>> +the following table lists the HugeTLB page sizes supported by the x86 and
>> +arm64 architectures. Because arm64 supports 4k, 16k, and 64k base pages
>> +and also supports contiguous entries, it supports many sizes of HugeTLB
>> +page.
>> +
>> ++--------------+-----------+-----------------------------------------------+
>> +| Architecture | Page Size |                HugeTLB Page Size              |
>> ++--------------+-----------+-----------+-----------+-----------+-----------+
>> +|    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
>> ++--------------+-----------+-----------+-----------+-----------+-----------+
>> +|              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
>> +|              +-----------+-----------+-----------+-----------+-----------+
>> +|    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
>> +|              +-----------+-----------+-----------+-----------+-----------+
>> +|              |   64KB    |    2MB    |  512MB    |    16GB   |           |
>> ++--------------+-----------+-----------+-----------+-----------+-----------+
>> +
>> +When the system boots up, every HugeTLB page has more than one struct page
>> +struct, and their total size is (unit: pages):
>> +
>> +   struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
>> +
>> +Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
>> +of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
>> +relationship.
>> +
>> +   HugeTLB_Size = n * PAGE_SIZE
>> +
>> +Then,
>> +
>> +   struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
>> +               = n * sizeof(struct page) / PAGE_SIZE
>> +
>> +We can use huge mapping at the pud/pmd level for the HugeTLB page.
>> +
>> +For a HugeTLB page mapped at the pmd level:
>> +
>> +   struct_size = n * sizeof(struct page) / PAGE_SIZE
>> +               = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
>> +               = sizeof(struct page) / sizeof(pte_t)
>> +               = 64 / 8
>> +               = 8 (pages)
>> +
>> +Where n is the number of pte entries that one page can contain. So the value
>> +of n is (PAGE_SIZE / sizeof(pte_t)).
>> +
>> +This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
>> +is 8. The optimization is also applicable only when the size of struct page
>> +is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
>> +x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, its
>> +struct page structs occupy 8 page frames, whose size depends on the size of
>> +the base page.
>> +
>> +For a HugeTLB page mapped at the pud level:
>> +
>> +   struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
>> +               = PAGE_SIZE / 8 * 8 (pages)
>> +               = PAGE_SIZE (pages)
>> +
>> +Where the struct_size(pmd) is the size of the struct page structs of a
>> +HugeTLB page of the pmd level mapping.
>> +
>> +E.g.: the struct page structs of a 2MB HugeTLB page on x86_64 occupy 8 page
>> +frames, while those of a 1GB HugeTLB page occupy 4096.
>> +
>> +Next, we take the pmd level mapping of the HugeTLB page as an example to
>> +show the internal implementation of this optimization. There are 8 pages of
>> +struct page structs associated with a HugeTLB page which is pmd mapped.
>> +
>> +Here is how things look before optimization.
>> +
>> +    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
>> + +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
>> + |           |                     |     0     | -------------> |     0     |
>> + |           |                     +-----------+                +-----------+
>> + |           |                     |     1     | -------------> |     1     |
>> + |           |                     +-----------+                +-----------+
>> + |           |                     |     2     | -------------> |     2     |
>> + |           |                     +-----------+                +-----------+
>> + |           |                     |     3     | -------------> |     3     |
>> + |           |                     +-----------+                +-----------+
>> + |           |                     |     4     | -------------> |     4     |
>> + |    PMD    |                     +-----------+                +-----------+
>> + |   level   |                     |     5     | -------------> |     5     |
>> + |  mapping  |                     +-----------+                +-----------+
>> + |           |                     |     6     | -------------> |     6     |
>> + |           |                     +-----------+                +-----------+
>> + |           |                     |     7     | -------------> |     7     |
>> + |           |                     +-----------+                +-----------+
>> + |           |
>> + |           |
>> + |           |
>> + +-----------+
>> +
>> +The value of page->compound_head is the same for all tail pages. The first
>> +page of page structs (page 0) associated with the HugeTLB page contains the 4
>> +page structs necessary to describe the HugeTLB. The only use of the remaining
>> +pages of page structs (page 1 to page 7) is to point to page->compound_head.
>> +Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
>> +will be used for each HugeTLB page. This will allow us to free the remaining
>> +6 pages to the buddy allocator.
>> +
>> +Here is how things look after remapping.
>> +
>> +    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
>> + +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
>> + |           |                     |     0     | -------------> |     0     |
>> + |           |                     +-----------+                +-----------+
>> + |           |                     |     1     | -------------> |     1     |
>> + |           |                     +-----------+                +-----------+
>> + |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
>> + |           |                     +-----------+                   | | | | |
>> + |           |                     |     3     | ------------------+ | | | |
>> + |           |                     +-----------+                     | | | |
>> + |           |                     |     4     | --------------------+ | | |
>> + |    PMD    |                     +-----------+                       | | |
>> + |   level   |                     |     5     | ----------------------+ | |
>> + |  mapping  |                     +-----------+                         | |
>> + |           |                     |     6     | ------------------------+ |
>> + |           |                     +-----------+                           |
>> + |           |                     |     7     | --------------------------+
>> + |           |                     +-----------+
>> + |           |
>> + |           |
>> + |           |
>> + +-----------+
>> +
>> +When a HugeTLB page is freed to the buddy system, we should allocate 6 pages
>> +for vmemmap pages and restore the previous mapping relationship.
>> +
>> +A HugeTLB page mapped at the pud level is handled similarly. We can also use
>> +this approach to free (PAGE_SIZE - 2) vmemmap pages.
>> +
>> +Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
>> +(e.g. aarch64) provide a contiguous bit in the translation table entries
>> +that hints to the MMU that the entry is one of a contiguous set of
>> +entries that can be cached in a single TLB entry.
>> +
>> +The contiguous bit is used to increase the mapping size at the pmd and pte
>> +(last) level. So this type of HugeTLB page can be optimized only when the
>> +size of its struct page structs is greater than 2 pages.
>> +
>> diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
>> index eff5fbd492d0..19f981a73a54 100644
>> --- a/Documentation/vm/index.rst
>> +++ b/Documentation/vm/index.rst
>> @@ -31,6 +31,7 @@ descriptions of data structures and algorithms.
>>     active_mm
>>     arch_pgtable_helpers
>>     balance
>> +   compound_pagemaps
>>     cleancache
>>     free_page_reporting
>>     frontswap
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index c540c21e26f5..69d1f0a90e02 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -6,167 +6,7 @@
>>   *
>>   *     Author: Muchun Song <songmuchun@bytedance.com>
>>   *
>> - * The struct page structures (page structs) are used to describe a physical
>> - * page frame. By default, there is a one-to-one mapping from a page frame to
>> - * it's corresponding page struct.
>> - *
>> - * HugeTLB pages consist of multiple base page size pages and is supported by
>> - * many architectures. See hugetlbpage.rst in the Documentation directory for
>> - * more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB
>> - * are currently supported. Since the base page size on x86 is 4KB, a 2MB
>> - * HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
>> - * 4096 base pages. For each base page, there is a corresponding page struct.
>> - *
>> - * Within the HugeTLB subsystem, only the first 4 page structs are used to
>> - * contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
>> - * this upper limit. The only 'useful' information in the remaining page structs
>> - * is the compound_head field, and this field is the same for all tail pages.
>> - *
>> - * By removing redundant page structs for HugeTLB pages, memory can be returned
>> - * to the buddy allocator for other uses.
>> - *
>> - * Different architectures support different HugeTLB pages. For example, the
>> - * following table is the HugeTLB page size supported by x86 and arm64
>> - * architectures. Because arm64 supports 4k, 16k, and 64k base pages and
>> - * supports contiguous entries, so it supports many kinds of sizes of HugeTLB
>> - * page.
>> - *
>> - * +--------------+-----------+-----------------------------------------------+
>> - * | Architecture | Page Size |                HugeTLB Page Size              |
>> - * +--------------+-----------+-----------+-----------+-----------+-----------+
>> - * |    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
>> - * +--------------+-----------+-----------+-----------+-----------+-----------+
>> - * |              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
>> - * |              +-----------+-----------+-----------+-----------+-----------+
>> - * |    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
>> - * |              +-----------+-----------+-----------+-----------+-----------+
>> - * |              |   64KB    |    2MB    |  512MB    |    16GB   |           |
>> - * +--------------+-----------+-----------+-----------+-----------+-----------+
>> - *
>> - * When the system boot up, every HugeTLB page has more than one struct page
>> - * structs which size is (unit: pages):
>> - *
>> - *    struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
>> - *
>> - * Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
>> - * of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
>> - * relationship.
>> - *
>> - *    HugeTLB_Size = n * PAGE_SIZE
>> - *
>> - * Then,
>> - *
>> - *    struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
>> - *                = n * sizeof(struct page) / PAGE_SIZE
>> - *
>> - * We can use huge mapping at the pud/pmd level for the HugeTLB page.
>> - *
>> - * For the HugeTLB page of the pmd level mapping, then
>> - *
>> - *    struct_size = n * sizeof(struct page) / PAGE_SIZE
>> - *                = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
>> - *                = sizeof(struct page) / sizeof(pte_t)
>> - *                = 64 / 8
>> - *                = 8 (pages)
>> - *
>> - * Where n is how many pte entries which one page can contains. So the value of
>> - * n is (PAGE_SIZE / sizeof(pte_t)).
>> - *
>> - * This optimization only supports 64-bit system, so the value of sizeof(pte_t)
>> - * is 8. And this optimization also applicable only when the size of struct page
>> - * is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
>> - * x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
>> - * size of struct page structs of it is 8 page frames which size depends on the
>> - * size of the base page.
>> - *
>> - * For the HugeTLB page of the pud level mapping, then
>> - *
>> - *    struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
>> - *                = PAGE_SIZE / 8 * 8 (pages)
>> - *                = PAGE_SIZE (pages)
>> - *
>> - * Where the struct_size(pmd) is the size of the struct page structs of a
>> - * HugeTLB page of the pmd level mapping.
>> - *
>> - * E.g.: A 2MB HugeTLB page on x86_64 consists in 8 page frames while 1GB
>> - * HugeTLB page consists in 4096.
>> - *
>> - * Next, we take the pmd level mapping of the HugeTLB page as an example to
>> - * show the internal implementation of this optimization. There are 8 pages
>> - * struct page structs associated with a HugeTLB page which is pmd mapped.
>> - *
>> - * Here is how things look before optimization.
>> - *
>> - *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
>> - * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
>> - * |           |                     |     0     | -------------> |     0     |
>> - * |           |                     +-----------+                +-----------+
>> - * |           |                     |     1     | -------------> |     1     |
>> - * |           |                     +-----------+                +-----------+
>> - * |           |                     |     2     | -------------> |     2     |
>> - * |           |                     +-----------+                +-----------+
>> - * |           |                     |     3     | -------------> |     3     |
>> - * |           |                     +-----------+                +-----------+
>> - * |           |                     |     4     | -------------> |     4     |
>> - * |    PMD    |                     +-----------+                +-----------+
>> - * |   level   |                     |     5     | -------------> |     5     |
>> - * |  mapping  |                     +-----------+                +-----------+
>> - * |           |                     |     6     | -------------> |     6     |
>> - * |           |                     +-----------+                +-----------+
>> - * |           |                     |     7     | -------------> |     7     |
>> - * |           |                     +-----------+                +-----------+
>> - * |           |
>> - * |           |
>> - * |           |
>> - * +-----------+
>> - *
>> - * The value of page->compound_head is the same for all tail pages. The first
>> - * page of page structs (page 0) associated with the HugeTLB page contains the 4
>> - * page structs necessary to describe the HugeTLB. The only use of the remaining
>> - * pages of page structs (page 1 to page 7) is to point to page->compound_head.
>> - * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
>> - * will be used for each HugeTLB page. This will allow us to free the remaining
>> - * 6 pages to the buddy allocator.
>> - *
>> - * Here is how things look after remapping.
>> - *
>> - *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
>> - * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
>> - * |           |                     |     0     | -------------> |     0     |
>> - * |           |                     +-----------+                +-----------+
>> - * |           |                     |     1     | -------------> |     1     |
>> - * |           |                     +-----------+                +-----------+
>> - * |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
>> - * |           |                     +-----------+                   | | | | |
>> - * |           |                     |     3     | ------------------+ | | | |
>> - * |           |                     +-----------+                     | | | |
>> - * |           |                     |     4     | --------------------+ | | |
>> - * |    PMD    |                     +-----------+                       | | |
>> - * |   level   |                     |     5     | ----------------------+ | |
>> - * |  mapping  |                     +-----------+                         | |
>> - * |           |                     |     6     | ------------------------+ |
>> - * |           |                     +-----------+                           |
>> - * |           |                     |     7     | --------------------------+
>> - * |           |                     +-----------+
>> - * |           |
>> - * |           |
>> - * |           |
>> - * +-----------+
>> - *
>> - * When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
>> - * vmemmap pages and restore the previous mapping relationship.
>> - *
>> - * For the HugeTLB page of the pud level mapping. It is similar to the former.
>> - * We also can use this approach to free (PAGE_SIZE - 2) vmemmap pages.
>> - *
>> - * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
>> - * (e.g. aarch64) provides a contiguous bit in the translation table entries
>> - * that hints to the MMU to indicate that it is one of a contiguous set of
>> - * entries that can be cached in a single TLB entry.
>> - *
>> - * The contiguous bit is used to increase the mapping size at the pmd and pte
>> - * (last) level. So this type of HugeTLB page can be optimized only when its
>> - * size of the struct page structs is greater than 2 pages.
>> + * See Documentation/vm/compound_pagemaps.rst
>>   */
>>  #define pr_fmt(fmt)    "HugeTLB: " fmt
>>
>> --
>> 2.17.1
>>
> 

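As a rough check of the arithmetic in the quoted document, the struct_size
formula can be sketched in plain userspace C. This is only an illustration:
the helper names and the SKETCH_STRUCT_PAGE_SIZE constant are made up for the
example, and the 64-byte struct page is the value the quoted text assumes,
not something queried from a running kernel.

```c
#include <assert.h>

/* Assumption taken from the quoted text: sizeof(struct page) == 64. */
#define SKETCH_STRUCT_PAGE_SIZE 64UL

/*
 * Number of base pages occupied by the struct pages that describe one
 * huge page:
 *   struct_size = huge_size / page_size * sizeof(struct page) / page_size
 */
static unsigned long vmemmap_pages(unsigned long huge_size,
				   unsigned long page_size)
{
	unsigned long nr_base_pages;

	/* The huge page size is always a multiple of the base page size. */
	assert(page_size != 0 && huge_size % page_size == 0);

	nr_base_pages = huge_size / page_size;
	return nr_base_pages * SKETCH_STRUCT_PAGE_SIZE / page_size;
}

/*
 * vmemmap pages that could be freed if only the first two pages of
 * struct pages are kept, as in the remapping described in the text.
 */
static unsigned long vmemmap_pages_freeable(unsigned long huge_size,
					    unsigned long page_size)
{
	unsigned long total = vmemmap_pages(huge_size, page_size);

	return total > 2 ? total - 2 : 0;
}
```

With a 4KB base page, vmemmap_pages(2UL << 20, 4096) evaluates to 8 and
vmemmap_pages(1UL << 30, 4096) to 4096, so 6 and 4094 pages respectively
would be freeable under these assumptions.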

* Re: [PATCH v2 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-06-20 23:56   ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-06-21 13:50     ` Joao Martins
  0 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-06-21 13:50 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang, Matthew Wilcox,
	Jason Gunthorpe, John Hubbard, Jane Chu, Muchun Song,
	Mike Kravetz, Andrew Morton, Jonathan Corbet, nvdimm, linux-doc

On 6/21/21 12:56 AM, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Thu, Jun 17, 2021 at 07:44:54PM +0100, Joao Martins wrote:
>> memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
>> dax_lock_page()).  For pagemap with compound pages fetch the
>> compound_head in case a tail page memory failure is being handled.
>>
>> Currently this is a nop, but with the advent of compound pages in
>> dev_pagemap it allows memory_failure_dev_pagemap() to keep working.
>>
>> Reported-by: Jane Chu <jane.chu@oracle.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> 
> Looks good to me.
> 
> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
Thanks!
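
For readers following along, resolving a tail page to its head can be modelled
in userspace roughly as below. This is a simplified sketch, not the kernel
code: struct sketch_page and the sketch_* helpers are hypothetical names, and
the only thing modelled is the convention that a tail page stores the head
pointer with bit 0 set in its compound_head field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for struct page. */
struct sketch_page {
	/* Bit 0 set: this is a tail page and the rest is the head pointer. */
	uintptr_t compound_head;
};

/* Mark @tail as a tail page of the compound page starting at @head. */
static void sketch_set_compound_head(struct sketch_page *tail,
				     struct sketch_page *head)
{
	tail->compound_head = (uintptr_t)head + 1;
}

/* Return the head page for @page, or @page itself if it is not a tail. */
static struct sketch_page *sketch_compound_head(struct sketch_page *page)
{
	uintptr_t head = page->compound_head;

	if (head & 1)
		return (struct sketch_page *)(head - 1);
	return page;
}
```

A handler that receives a pfn inside a compound page would call the lookup
first and then operate on the returned head; for a base page the lookup
returns the page unchanged, which is why the change is a nop today.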


* Re: [PATCH v2 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts
  2021-06-17 18:44 ` [PATCH v2 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
@ 2021-07-13  0:02   ` Mike Kravetz
  2021-07-13  1:11     ` Joao Martins
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Kravetz @ 2021-07-13  0:02 UTC (permalink / raw)
  To: Joao Martins, linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Andrew Morton, Jonathan Corbet, nvdimm, linux-doc

On 6/17/21 11:44 AM, Joao Martins wrote:
> Split the utility function prep_compound_page() into head and tail
> counterparts, and use them accordingly.
> 
> This is in preparation for sharing the storage for / deduplicating
> compound page metadata.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/page_alloc.c | 32 +++++++++++++++++++++-----------
>  1 file changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8836e54721ae..95967ce55829 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -741,24 +741,34 @@ void free_compound_page(struct page *page)
>  	free_the_page(page, compound_order(page));
>  }
>  
> +static void prep_compound_head(struct page *page, unsigned int order)
> +{
> +	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
> +	set_compound_order(page, order);
> +	atomic_set(compound_mapcount_ptr(page), -1);
> +	if (hpage_pincount_available(page))
> +		atomic_set(compound_pincount_ptr(page), 0);
> +}
> +
> +static void prep_compound_tail(struct page *head, int tail_idx)
> +{
> +	struct page *p = head + tail_idx;
> +
> +	set_page_count(p, 0);

When you rebase, you should notice this has been removed from
prep_compound_page as all tail pages should have zero ref count.

> +	p->mapping = TAIL_MAPPING;
> +	set_compound_head(p, head);
> +}
> +
>  void prep_compound_page(struct page *page, unsigned int order)
>  {
>  	int i;
>  	int nr_pages = 1 << order;
>  
>  	__SetPageHead(page);
> -	for (i = 1; i < nr_pages; i++) {
> -		struct page *p = page + i;
> -		set_page_count(p, 0);
> -		p->mapping = TAIL_MAPPING;
> -		set_compound_head(p, page);
> -	}
> +	for (i = 1; i < nr_pages; i++)
> +		prep_compound_tail(page, i);
>  
> -	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
> -	set_compound_order(page, order);
> -	atomic_set(compound_mapcount_ptr(page), -1);
> -	if (hpage_pincount_available(page))
> -		atomic_set(compound_pincount_ptr(page), 0);
> +	prep_compound_head(page, order);
>  }
>  
>  #ifdef CONFIG_DEBUG_PAGEALLOC
> 

I'll need something like this for demote hugetlb page functionality
when the pages being demoted have been optimized for minimal vmemmap
usage.

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz


* Re: [External] [PATCH v2 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm
  2021-06-21 13:42     ` Joao Martins
@ 2021-07-13  0:14       ` Mike Kravetz
  2021-07-13  1:11         ` Joao Martins
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Kravetz @ 2021-07-13  0:14 UTC (permalink / raw)
  To: Joao Martins, Muchun Song
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Dave Jiang, Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe,
	John Hubbard, Jane Chu, Andrew Morton, Jonathan Corbet, nvdimm,
	linux-doc

On 6/21/21 6:42 AM, Joao Martins wrote:
> On 6/21/21 2:12 PM, Muchun Song wrote:
>> On Fri, Jun 18, 2021 at 2:46 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>
>>> In preparation for device-dax using the hugetlbfs compound page tail
>>> deduplication technique, move the comment block explanation into a
>>> common place in Documentation/vm.
>>>
>>> Cc: Muchun Song <songmuchun@bytedance.com>
>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>>  Documentation/vm/compound_pagemaps.rst | 170 +++++++++++++++++++++++++
>>>  Documentation/vm/index.rst             |   1 +
>>>  mm/hugetlb_vmemmap.c                   | 162 +----------------------
>>>  3 files changed, 172 insertions(+), 161 deletions(-)
>>>  create mode 100644 Documentation/vm/compound_pagemaps.rst
>>
>> IMHO, how about the name of vmemmap_remap.rst? page_frags.rst seems
>> to tell people it's about the page mapping not its vmemmap mapping.
>>
> 
> Good point.
> 
> FWIW, I wanted to avoid the use of the word 'remap' solely because that might be
> implementation specific e.g. hugetlbfs remaps struct pages, whereas device-dax will
> populate struct pages already with the tail dedup.
> 
> Me using 'compound_pagemaps' was short for 'compound struct page map' or 'compound vmemmap'.
> 
> Maybe one other alternative is 'tail_dedup.rst' or 'metadata_dedup.rst' ? That's probably
> more generic to what really is being done.
> 
> Regardless, I am also good with 'vmemmap_remap.rst' if that's what folks prefer.
> 

How about vmemmap_dedup?

I do think it is a good idea to move this to a common documentation file
if Device DAX is going to use the same technique.
-- 
Mike Kravetz


* Re: [PATCH v2 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts
  2021-07-13  0:02   ` Mike Kravetz
@ 2021-07-13  1:11     ` Joao Martins
  0 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-07-13  1:11 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Andrew Morton, Jonathan Corbet, nvdimm, linux-doc



On 7/13/21 1:02 AM, Mike Kravetz wrote:
> On 6/17/21 11:44 AM, Joao Martins wrote:
>> Split the utility function prep_compound_page() into head and tail
>> counterparts, and use them accordingly.
>>
>> This is in preparation for sharing the storage for / deduplicating
>> compound page metadata.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/page_alloc.c | 32 +++++++++++++++++++++-----------
>>  1 file changed, 21 insertions(+), 11 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 8836e54721ae..95967ce55829 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -741,24 +741,34 @@ void free_compound_page(struct page *page)
>>  	free_the_page(page, compound_order(page));
>>  }
>>  
>> +static void prep_compound_head(struct page *page, unsigned int order)
>> +{
>> +	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
>> +	set_compound_order(page, order);
>> +	atomic_set(compound_mapcount_ptr(page), -1);
>> +	if (hpage_pincount_available(page))
>> +		atomic_set(compound_pincount_ptr(page), 0);
>> +}
>> +
>> +static void prep_compound_tail(struct page *head, int tail_idx)
>> +{
>> +	struct page *p = head + tail_idx;
>> +
>> +	set_page_count(p, 0);
> 
> When you rebase, you should notice this has been removed from
> prep_compound_page as all tail pages should have zero ref count.
> 
/me nods

>> +	p->mapping = TAIL_MAPPING;
>> +	set_compound_head(p, head);
>> +}
>> +
>>  void prep_compound_page(struct page *page, unsigned int order)
>>  {
>>  	int i;
>>  	int nr_pages = 1 << order;
>>  
>>  	__SetPageHead(page);
>> -	for (i = 1; i < nr_pages; i++) {
>> -		struct page *p = page + i;
>> -		set_page_count(p, 0);
>> -		p->mapping = TAIL_MAPPING;
>> -		set_compound_head(p, page);
>> -	}
>> +	for (i = 1; i < nr_pages; i++)
>> +		prep_compound_tail(page, i);
>>  
>> -	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
>> -	set_compound_order(page, order);
>> -	atomic_set(compound_mapcount_ptr(page), -1);
>> -	if (hpage_pincount_available(page))
>> -		atomic_set(compound_pincount_ptr(page), 0);
>> +	prep_compound_head(page, order);
>>  }
>>  
>>  #ifdef CONFIG_DEBUG_PAGEALLOC
>>
> 
> I'll need something like this for demote hugetlb page functionality
> when the pages being demoted have been optimized for minimal vmemmap
> usage.
> 
> Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
Thanks!
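
To make the shape of the split concrete, here is a userspace sketch of the
head/tail separation, with the tail refcount reset left out as noted in the
review. Everything in it is illustrative: struct demo_page, its fields and
the demo_* helpers are invented for the example and do not match the real
struct page layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified page descriptor for illustration only. */
struct demo_page {
	uintptr_t compound_head;	/* head pointer + 1 on tail pages */
	unsigned int order;		/* meaningful on the head only */
	int is_head;
};

/* Initialize the metadata that lives once per compound page. */
static void demo_prep_head(struct demo_page *page, unsigned int order)
{
	page->order = order;
}

/* Initialize one tail page so that it points back at @head. */
static void demo_prep_tail(struct demo_page *head, int tail_idx)
{
	struct demo_page *p = head + tail_idx;

	p->compound_head = (uintptr_t)head + 1;
}

/* Same flow as the refactored function: flag head, tails, then head. */
static void demo_prep_compound(struct demo_page *page, unsigned int order)
{
	int i;
	int nr_pages = 1 << order;

	page->is_head = 1;
	for (i = 1; i < nr_pages; i++)
		demo_prep_tail(page, i);

	demo_prep_head(page, order);
}
```

The benefit of having two helpers is that a caller which shares or reuses
tail metadata can invoke the head and tail steps separately instead of
always walking every tail page.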


* Re: [External] [PATCH v2 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm
  2021-07-13  0:14       ` Mike Kravetz
@ 2021-07-13  1:11         ` Joao Martins
  0 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-07-13  1:11 UTC (permalink / raw)
  To: Mike Kravetz, Muchun Song
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Dave Jiang, Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe,
	John Hubbard, Jane Chu, Andrew Morton, Jonathan Corbet, nvdimm,
	linux-doc



On 7/13/21 1:14 AM, Mike Kravetz wrote:
> On 6/21/21 6:42 AM, Joao Martins wrote:
>> On 6/21/21 2:12 PM, Muchun Song wrote:
>>> On Fri, Jun 18, 2021 at 2:46 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>
>>>> In preparation for device-dax using the hugetlbfs compound page tail
>>>> deduplication technique, move the comment block explanation into a
>>>> common place in Documentation/vm.
>>>>
>>>> Cc: Muchun Song <songmuchun@bytedance.com>
>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> ---
>>>>  Documentation/vm/compound_pagemaps.rst | 170 +++++++++++++++++++++++++
>>>>  Documentation/vm/index.rst             |   1 +
>>>>  mm/hugetlb_vmemmap.c                   | 162 +----------------------
>>>>  3 files changed, 172 insertions(+), 161 deletions(-)
>>>>  create mode 100644 Documentation/vm/compound_pagemaps.rst
>>>
>>> IMHO, how about the name of vmemmap_remap.rst? page_frags.rst seems
>>> to tell people it's about the page mapping not its vmemmap mapping.
>>>
>>
>> Good point.
>>
>> FWIW, I wanted to avoid the use of the word 'remap' solely because that might be
>> implementation specific e.g. hugetlbfs remaps struct pages, whereas device-dax will
>> populate struct pages already with the tail dedup.
>>
>> Me using 'compound_pagemaps' was short for 'compound struct page map' or 'compound vmemmap'.
>>
>> Maybe one other alternative is 'tail_dedup.rst' or 'metadata_dedup.rst' ? That's probably
>> more generic to what really is being done.
>>
>> Regardless, I am also good with 'vmemmap_remap.rst' if that's what folks prefer.
>>
> 
> How about vmemmap_dedup?
> 
Sounds good to me, I'll rename it.

> I do think it is a good idea to move this to a common documentation file
> if Device DAX is going to use the same technique.
> 


end of thread, other threads:[~2021-07-13  1:11 UTC | newest]

Thread overview: 23+ messages
2021-06-17 18:44 [PATCH v2 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
2021-06-17 18:44 ` [PATCH v2 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
2021-06-20 23:56   ` HORIGUCHI NAOYA(堀口 直也)
2021-06-21 13:50     ` Joao Martins
2021-06-17 18:44 ` [PATCH v2 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
2021-07-13  0:02   ` Mike Kravetz
2021-07-13  1:11     ` Joao Martins
2021-06-17 18:44 ` [PATCH v2 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
2021-06-17 18:44 ` [PATCH v2 04/14] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
2021-06-17 18:44 ` [PATCH v2 05/14] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
2021-06-17 18:44 ` [PATCH v2 06/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
2021-06-17 18:45 ` [PATCH v2 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
2021-06-21 13:12   ` [External] " Muchun Song
2021-06-21 13:42     ` Joao Martins
2021-07-13  0:14       ` Mike Kravetz
2021-07-13  1:11         ` Joao Martins
2021-06-17 18:45 ` [PATCH v2 08/14] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
2021-06-17 18:45 ` [PATCH v2 09/14] mm/page_alloc: reuse tail struct pages for " Joao Martins
2021-06-17 18:45 ` [PATCH v2 10/14] device-dax: use ALIGN() for determining pgoff Joao Martins
2021-06-17 18:45 ` [PATCH v2 11/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
2021-06-17 18:45 ` [PATCH v2 12/14] device-dax: compound pagemap support Joao Martins
2021-06-17 18:45 ` [PATCH v2 13/14] mm/gup: grab head page refcount once for group of subpages Joao Martins
2021-06-17 18:45 ` [PATCH v2 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry Joao Martins
