NVDIMM Device and Persistent Memory development
* [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps
@ 2021-07-14 19:35 Joao Martins
  2021-07-14 19:35 ` [PATCH v3 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
                   ` (14 more replies)
  0 siblings, 15 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-14 19:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Changes since v2 [1]:
 * Collect Mike's Ack on patch 2 (Mike)
 * Collect Naoya's Reviewed-by on patch 1 (Naoya)
 * Rename compound_pagemaps.rst doc page (and its mentions) to vmemmap_dedup.rst (Mike, Muchun)
 * Rebased to next-20210714

Changes since v1 [0]:

 (New patches 7, 10, 11)
 * Remove occurrences of 'we' in the commit descriptions (now for real) [Dan]
 * Add comment on top of compound_head() for fsdax (Patch 1) [Dan]
 * Massage commit descriptions of cleanup/refactor patches to reflect
 that it's in preparation for bigger infra in sparse-vmemmap (Patch 2,3,5) [Dan]
 * Greatly improve all commit messages in terms of grammar/wording and clarity. [Dan]
 * Rename variable/helpers from dev_pagemap::align to @geometry, reflecting
 that it's not the same thing as dev_dax->align, Patch 4 [Dan]
 * Move compound page init logic into separate memmap_init_compound() helper, Patch 4 [Dan]
 * Simplify patch 9 as a result of having compound initialization differently [Dan]
 * Rename @pfn_align variable in memmap_init_zone_device to @pfns_per_compound [Dan]
 * Rename Subject of patch 6 [Dan]
 * Move hugetlb_vmemmap.c comment block to Documentation/vm Patch 7 [Dan]
 * Add some type-safety to @block and use 'struct page *' rather than
 void, Patch 8 [Dan]
 * Add some comments to less obvious parts on 1G compound page case, Patch 8 [Dan]
 * Remove the vmemmap lookup function in favour of
 pmd_off_k() + pte_offset_kernel(), given some guarantees on section onlining
 serialization, Patch 8
 * Add a comment to get_page() mentioning where/how it is freed, Patch 8 [Dan]
 * Add docs about device-dax usage of tail dedup technique in newly added
 compound_pagemaps.rst doc entry.
 * Add cleanup patch for device-dax for ensuring dev_dax::pgmap is always set [Dan]
 * Add cleanup patch for device-dax for using ALIGN() [Dan]
 * Store pinned head in separate @pinned_head variable and fix error case, patch 13 [Dan]
 * Add comment on difference of @next value for PageCompound(), patch 13 [Dan]
 * Move PUD compound page to be last patch [Dan]
 * Add vmemmap layout for PUD compound geometry in compound_pagemaps.rst doc, patch 14 [Dan]
 * Rebased to next-20210617 

[0] https://lore.kernel.org/linux-mm/20210325230938.30752-1-joao.m.martins@oracle.com/
[1] https://lore.kernel.org/linux-mm/20210617184507.3662-1-joao.m.martins@oracle.com/

Full changelog of previous versions at the bottom of cover letter.

---

This series attempts to minimize 'struct page' overhead by
pursuing an approach similar to Muchun Song's series "Free some vmemmap
pages of hugetlb page" [0] (now in mmotm), but applied to
devmap/ZONE_DEVICE.

[0] https://lore.kernel.org/linux-mm/20210308102807.59745-1-songmuchun@bytedance.com/

The link above describes it quite nicely, but the idea is to reuse tail
page vmemmap areas, in particular the areas which only describe tail pages.
A vmemmap page describes 64 struct pages, so the first vmemmap page for a
given ZONE_DEVICE compound page contains the head page and 63 tail pages. The
second vmemmap page contains only tail pages, and that's what gets reused
across the rest of the subsection/section. The bigger the page size, the
bigger the savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094
vmemmap pages).

This series also takes one step further for 1GB pages and *also* reuses PMD
pages which only contain tail pages, which allows keeping parity with the
current hugepage-based memmap. This further lets us more than halve the
overhead with 1GB pages (40M -> 16M per Tb).

In terms of savings, per 1Tb of memory, the struct page cost would go down
with compound pagemap:

* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
* with 1G pages we lose 16MB instead of 16G (0.0014% instead of 1.5% of total memory)

Along the way I've extended it past 'struct page' overhead, *trying* to
address a few performance issues we knew about for pmem, specifically in
{pin,get}_user_pages_fast() with device-dax vmas, which are really
slow even for the fast variants. THP is great with the -fast variants, but
everything except hugetlbfs performs rather poorly with non-fast gup. I have
deferred the __get_user_pages() improvements, though (to a follow-up series I
have stashed, as they are orthogonal to device-dax and THP suffers from the
same syndrome).

So to summarize what the series does:

Patch 1: Prepare hwpoisoning to work with dax compound pages.

Patches 2-4: Have memmap_init_zone_device() initialize its metadata as compound
pages. The current utility function prep_compound_page() is split into head
and tail counterparts, and those two helpers are used where appropriate to
take advantage of caches being warm after __init_single_page(). Since the RFC
this also further speeds up init time, from 190ms down to 80ms.

Patches 5-12, 14: Much like Muchun's series, reuse the PTE (and PMD) tail page
vmemmap areas across a given page size (namely the @align referred to by the
remaining memremap/dax code) and enable memremap to initialize the ZONE_DEVICE
pages as compound pages of a given @align order. The main difference, though,
is that contrary to the hugetlbfs series, there's no pre-existing vmemmap for
the area, because we are populating it as opposed to remapping it. IOW there
is no freeing of pages of an already initialized vmemmap as in the hugetlbfs
case, which simplifies the logic (besides not being arch-specific). After
these patches there's a quite visible improvement in pmem memmap region
bootstrap, given that we initialize fewer struct pages depending on the page
size, with DRAM-backed struct pages. altmap sees no difference in bootstrap.
Patch 14 comes last as it's an improvement, not mandated for the initial
functionality. Also, the very nice docs of hugetlb_vmemmap.c move into a
Documentation/vm/ entry.

    NVDIMM namespace bootstrap improves from ~268-358 ms to ~78-100ms / <1ms on
    128G NVDIMMs with 2M and 1G respectively.

Patch 13: Optimize grabbing page refcounts given that we are
working with compound pages, i.e. do 1 increment on the head
page for a given set of N subpages, as opposed to N individual writes.
{get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
improves considerably with DRAM-stored struct pages. It also *greatly*
improves pinning with altmap. Results with gup_test:

                                                   before     after
    (16G get_user_pages_fast 2M page size)         ~59 ms -> ~6.1 ms
    (16G pin_user_pages_fast 2M page size)         ~87 ms -> ~6.2 ms
    (16G get_user_pages_fast altmap 2M page size) ~494 ms -> ~9 ms
    (16G pin_user_pages_fast altmap 2M page size) ~494 ms -> ~10 ms

    altmap performance gets especially interesting when pinning a pmem dimm:

                                                   before     after
    (128G get_user_pages_fast 2M page size)         ~492 ms -> ~49 ms
    (128G pin_user_pages_fast 2M page size)         ~493 ms -> ~50 ms
    (128G get_user_pages_fast altmap 2M page size)  ~3.91 s -> ~70 ms
    (128G pin_user_pages_fast altmap 2M page size)  ~3.97 s -> ~74 ms

I have deferred the __get_user_pages() patch to outside this series
(https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.martins@oracle.com/),
as I found a simpler way to address it that is also applicable to
THP. I will submit that as a follow-up to this series.

Patches apply on top of linux-next tag next-20210714 (commit c0d438dbc0b7).

Comments and suggestions very much appreciated!

Older Changelog,

 RFC[1] -> v1:
 (New patches 1-3, 5-8 but the diffstat isn't that different)
 * Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
 * Fix/Massage commit messages to be more clear and remove the 'we' occurrences (Dan, John, Matthew)
 * Use pfn_align to be clear it's nr of pages for @align value (John, Dan)
 * Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
 * Remove the gup_device_compound_huge special path and have the same code
   work both ways while special casing when devmap page is compound (Jason, John)
 * Avoid usage of vmemmap_populate_basepages() and introduce a first class
   loop that doesn't care about passing an altmap for memmap reuse. (Dan)
 * Completely rework the vmemmap_populate_compound() to avoid the sparse_add_section
   hack into passing block across sparse_add_section calls. It's a lot easier to
   follow and more explicit in what it does.
 * Replace the vmemmap refactoring with adding a @pgmap argument and moving
   parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a result)
 * Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
 * Improve memmap_init_zone_device() to initialize compound pages when
   struct pages are cache-warm. That led to an even further speedup over
   the RFC series, from 190ms -> 80-120ms. Patches 2 and 3 are the new ones
   as a result (Dan)
 * Remove PGMAP_COMPOUND and use @align as the property to detect whether
   or not to reuse vmemmap areas (Dan)

[1] https://lore.kernel.org/linux-mm/20201208172901.17384-1-joao.m.martins@oracle.com/

Thanks,
	Joao

Joao Martins (14):
  memory-failure: fetch compound_head after pgmap_pfn_valid()
  mm/page_alloc: split prep_compound_page into head and tail subparts
  mm/page_alloc: refactor memmap_init_zone_device() page init
  mm/memremap: add ZONE_DEVICE support for compound pages
  mm/sparse-vmemmap: add a pgmap argument to section activation
  mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to
    helper
  mm/hugetlb_vmemmap: move comment block to Documentation/vm
  mm/sparse-vmemmap: populate compound pagemaps
  mm/page_alloc: reuse tail struct pages for compound pagemaps
  device-dax: use ALIGN() for determining pgoff
  device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  device-dax: compound pagemap support
  mm/gup: grab head page refcount once for group of subpages
  mm/sparse-vmemmap: improve memory savings for compound pud geometry

 Documentation/vm/index.rst         |   1 +
 Documentation/vm/vmemmap_dedup.rst | 300 +++++++++++++++++++++++++++++
 drivers/dax/device.c               |  58 ++++--
 include/linux/memory_hotplug.h     |   5 +-
 include/linux/memremap.h           |  17 ++
 include/linux/mm.h                 |   8 +-
 mm/gup.c                           |  53 +++--
 mm/hugetlb_vmemmap.c               | 162 +---------------
 mm/memory-failure.c                |   6 +
 mm/memory_hotplug.c                |   3 +-
 mm/memremap.c                      |   9 +-
 mm/page_alloc.c                    | 146 ++++++++++----
 mm/sparse-vmemmap.c                | 226 +++++++++++++++++++---
 mm/sparse.c                        |  24 ++-
 14 files changed, 742 insertions(+), 276 deletions(-)
 create mode 100644 Documentation/vm/vmemmap_dedup.rst

-- 
2.17.1


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v3 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
@ 2021-07-14 19:35 ` Joao Martins
  2021-07-15  0:17   ` Dan Williams
  2021-07-15  2:51   ` [External] " Muchun Song
  2021-07-14 19:35 ` [PATCH v3 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-14 19:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
dax_lock_page()).  For pagemaps with compound pages, fetch the
compound_head in case a tail page memory failure is being handled.

Currently this is a nop, but with the advent of compound pages in
dev_pagemap it allows memory_failure_dev_pagemap() to keep working.

Reported-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/memory-failure.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index eefd823deb67..c40ea28a4677 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1532,6 +1532,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		goto out;
 	}
 
+	/*
+	 * Pages instantiated by device-dax (not filesystem-dax)
+	 * may be compound pages.
+	 */
+	page = compound_head(page);
+
 	/*
 	 * Prevent the inode from being freed while we are interrogating
 	 * the address_space, typically this would be handled by
-- 
2.17.1



* [PATCH v3 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts
  2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  2021-07-14 19:35 ` [PATCH v3 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
@ 2021-07-14 19:35 ` Joao Martins
  2021-07-15  0:19   ` Dan Williams
  2021-07-15  2:53   ` [External] " Muchun Song
  2021-07-14 19:35 ` [PATCH v3 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
                   ` (12 subsequent siblings)
  14 siblings, 2 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-14 19:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Split the utility function prep_compound_page() into head and tail
counterparts, and use them accordingly.

This is in preparation for sharing the storage for / deduplicating
compound page metadata.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/page_alloc.c | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b97e17806be..68b5591a69fe 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -730,23 +730,33 @@ void free_compound_page(struct page *page)
 	free_the_page(page, compound_order(page));
 }
 
+static void prep_compound_head(struct page *page, unsigned int order)
+{
+	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
+	set_compound_order(page, order);
+	atomic_set(compound_mapcount_ptr(page), -1);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+}
+
+static void prep_compound_tail(struct page *head, int tail_idx)
+{
+	struct page *p = head + tail_idx;
+
+	p->mapping = TAIL_MAPPING;
+	set_compound_head(p, head);
+}
+
 void prep_compound_page(struct page *page, unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
 
 	__SetPageHead(page);
-	for (i = 1; i < nr_pages; i++) {
-		struct page *p = page + i;
-		p->mapping = TAIL_MAPPING;
-		set_compound_head(p, page);
-	}
+	for (i = 1; i < nr_pages; i++)
+		prep_compound_tail(page, i);
 
-	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
-	set_compound_order(page, order);
-	atomic_set(compound_mapcount_ptr(page), -1);
-	if (hpage_pincount_available(page))
-		atomic_set(compound_pincount_ptr(page), 0);
+	prep_compound_head(page, order);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
-- 
2.17.1



* [PATCH v3 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init
  2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  2021-07-14 19:35 ` [PATCH v3 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
  2021-07-14 19:35 ` [PATCH v3 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
@ 2021-07-14 19:35 ` Joao Martins
  2021-07-15  0:20   ` Dan Williams
  2021-07-14 19:35 ` [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-14 19:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Move struct page init to a helper function __init_zone_device_page().

This is in preparation for sharing the storage for / deduplicating
compound page metadata.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/page_alloc.c | 74 +++++++++++++++++++++++++++----------------------
 1 file changed, 41 insertions(+), 33 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 68b5591a69fe..79f3b38afeca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6557,6 +6557,46 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
 }
 
 #ifdef CONFIG_ZONE_DEVICE
+static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
+					  unsigned long zone_idx, int nid,
+					  struct dev_pagemap *pgmap)
+{
+
+	__init_single_page(page, pfn, zone_idx, nid);
+
+	/*
+	 * Mark page reserved as it will need to wait for onlining
+	 * phase for it to be fully associated with a zone.
+	 *
+	 * We can use the non-atomic __set_bit operation for setting
+	 * the flag as we are still initializing the pages.
+	 */
+	__SetPageReserved(page);
+
+	/*
+	 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
+	 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
+	 * ever freed or placed on a driver-private list.
+	 */
+	page->pgmap = pgmap;
+	page->zone_device_data = NULL;
+
+	/*
+	 * Mark the block movable so that blocks are reserved for
+	 * movable at startup. This will force kernel allocations
+	 * to reserve their blocks rather than leaking throughout
+	 * the address space during boot when many long-lived
+	 * kernel allocations are made.
+	 *
+	 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
+	 * because this is done early in section_activate()
+	 */
+	if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
+		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+		cond_resched();
+	}
+}
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6585,39 +6625,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
-		__init_single_page(page, pfn, zone_idx, nid);
-
-		/*
-		 * Mark page reserved as it will need to wait for onlining
-		 * phase for it to be fully associated with a zone.
-		 *
-		 * We can use the non-atomic __set_bit operation for setting
-		 * the flag as we are still initializing the pages.
-		 */
-		__SetPageReserved(page);
-
-		/*
-		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
-		 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
-		 * ever freed or placed on a driver-private list.
-		 */
-		page->pgmap = pgmap;
-		page->zone_device_data = NULL;
-
-		/*
-		 * Mark the block movable so that blocks are reserved for
-		 * movable at startup. This will force kernel allocations
-		 * to reserve their blocks rather than leaking throughout
-		 * the address space during boot when many long-lived
-		 * kernel allocations are made.
-		 *
-		 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
-		 * because this is done early in section_activate()
-		 */
-		if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
-			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-			cond_resched();
-		}
+		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1



* [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (2 preceding siblings ...)
  2021-07-14 19:35 ` [PATCH v3 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
@ 2021-07-14 19:35 ` Joao Martins
  2021-07-15  1:08   ` Dan Williams
  2021-07-15  6:48   ` Christoph Hellwig
  2021-07-14 19:35 ` [PATCH v3 05/14] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
                   ` (10 subsequent siblings)
  14 siblings, 2 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-14 19:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Add a new @geometry property for struct dev_pagemap which specifies that a
pagemap is composed of a set of compound pages of size @geometry, instead of
base pages. When a compound page geometry is requested, all but the first
page are initialised as tail pages instead of order-0 pages.

For certain ZONE_DEVICE users like device-dax which have a fixed page size,
this creates an opportunity to optimize GUP and GUP-fast walkers, treating
it the same way as THP or hugetlb pages.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/memremap.h | 17 +++++++++++++++++
 mm/memremap.c            |  8 ++++++--
 mm/page_alloc.c          | 34 +++++++++++++++++++++++++++++++++-
 3 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 119f130ef8f1..e5ab6d4525c1 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -99,6 +99,10 @@ struct dev_pagemap_ops {
  * @done: completion for @internal_ref
  * @type: memory type: see MEMORY_* in memory_hotplug.h
  * @flags: PGMAP_* flags to specify defailed behavior
+ * @geometry: structural definition of how the vmemmap metadata is populated.
+ *	A zero or PAGE_SIZE defaults to using base pages as the memmap metadata
+ *	representation. A bigger value but also multiple of PAGE_SIZE will set
+ *	up compound struct pages representative of the requested geometry size.
  * @ops: method table
  * @owner: an opaque pointer identifying the entity that manages this
  *	instance.  Used by various helpers to make sure that no
@@ -114,6 +118,7 @@ struct dev_pagemap {
 	struct completion done;
 	enum memory_type type;
 	unsigned int flags;
+	unsigned long geometry;
 	const struct dev_pagemap_ops *ops;
 	void *owner;
 	int nr_range;
@@ -130,6 +135,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
 	return NULL;
 }
 
+static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
+{
+	if (!pgmap || !pgmap->geometry)
+		return PAGE_SIZE;
+	return pgmap->geometry;
+}
+
+static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap)
+{
+	return PHYS_PFN(pgmap_geometry(pgmap));
+}
+
 #ifdef CONFIG_ZONE_DEVICE
 bool pfn_zone_device_reserved(unsigned long pfn);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
diff --git a/mm/memremap.c b/mm/memremap.c
index 805d761740c4..ffcb924eb6a5 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
 				PHYS_PFN(range->start),
 				PHYS_PFN(range_len(range)), pgmap);
-	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
-			- pfn_first(pgmap, range_id));
+	if (pgmap_geometry(pgmap) > PAGE_SIZE)
+		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
+			- pfn_first(pgmap, range_id)) / pgmap_pfn_geometry(pgmap));
+	else
+		percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
+				- pfn_first(pgmap, range_id));
 	return 0;
 
 err_add_memory:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 79f3b38afeca..188cb5f8c308 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6597,6 +6597,31 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	}
 }
 
+static void __ref memmap_init_compound(struct page *page, unsigned long pfn,
+					unsigned long zone_idx, int nid,
+					struct dev_pagemap *pgmap,
+					unsigned long nr_pages)
+{
+	unsigned int order_align = order_base_2(nr_pages);
+	unsigned long i;
+
+	__SetPageHead(page);
+
+	for (i = 1; i < nr_pages; i++) {
+		__init_zone_device_page(page + i, pfn + i, zone_idx,
+					nid, pgmap);
+		prep_compound_tail(page, i);
+
+		/*
+		 * The first and second tail pages need to be
+		 * initialized first, hence the head page is
+		 * prepared last.
+		 */
+		if (i == 2)
+			prep_compound_head(page, order_align);
+	}
+}
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6605,6 +6630,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	unsigned long pfn, end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
+	unsigned int pfns_per_compound = pgmap_pfn_geometry(pgmap);
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
@@ -6622,10 +6648,16 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		nr_pages = end_pfn - start_pfn;
 	}
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
 		struct page *page = pfn_to_page(pfn);
 
 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+
+		if (pfns_per_compound == 1)
+			continue;
+
+		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
+				     pfns_per_compound);
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1



* [PATCH v3 05/14] mm/sparse-vmemmap: add a pgmap argument to section activation
  2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (3 preceding siblings ...)
  2021-07-14 19:35 ` [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
@ 2021-07-14 19:35 ` Joao Martins
  2021-07-28  5:56   ` Dan Williams
  2021-07-14 19:35 ` [PATCH v3 06/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-14 19:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

In support of using compound pages for devmap mappings, plumb the pgmap
down to the vmemmap_populate implementation. Note that while altmap is
retrievable from pgmap the memory hotplug code passes altmap without
pgmap[*], so both need to be independently plumbed.

So in addition to @altmap, pass @pgmap to sparse section populate
functions namely:

	sparse_add_section
	  section_activate
	    populate_section_memmap
   	      __populate_section_memmap

Passing @pgmap allows __populate_section_memmap() to both fetch the
geometry for which memmap metadata is created, and lets sparse-vmemmap
fetch pgmap ranges to correlate to a given section and pick whether to
just reuse tail pages from past onlined sections.

[*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/memory_hotplug.h |  5 ++++-
 include/linux/mm.h             |  3 ++-
 mm/memory_hotplug.c            |  3 ++-
 mm/sparse-vmemmap.c            |  3 ++-
 mm/sparse.c                    | 24 +++++++++++++++---------
 5 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index a7fd2c3ccb77..9b1bca80224d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -14,6 +14,7 @@ struct mem_section;
 struct memory_block;
 struct resource;
 struct vmem_altmap;
+struct dev_pagemap;
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 struct page *pfn_to_online_page(unsigned long pfn);
@@ -60,6 +61,7 @@ typedef int __bitwise mhp_t;
 struct mhp_params {
 	struct vmem_altmap *altmap;
 	pgprot_t pgprot;
+	struct dev_pagemap *pgmap;
 };
 
 bool mhp_range_allowed(u64 start, u64 size, bool need_mapping);
@@ -333,7 +335,8 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
 				       unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern int sparse_add_section(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap);
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap);
 extern void sparse_remove_section(struct mem_section *ms,
 		unsigned long pfn, unsigned long nr_pages,
 		unsigned long map_offset, struct vmem_altmap *altmap);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..f244a9219ce4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3083,7 +3083,8 @@ int vmemmap_remap_alloc(unsigned long start, unsigned long end,
 
 void *sparse_buffer_alloc(unsigned long size);
 struct page * __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap);
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8cb75b26ea4f..c728a8ff38ad 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -268,7 +268,8 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages = min(end_pfn - pfn,
 				   SECTION_ALIGN_UP(pfn + 1) - pfn);
-		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap);
+		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap,
+					 params->pgmap);
 		if (err)
 			break;
 		cond_resched();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index bdce883f9286..80d3ba30d345 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -603,7 +603,8 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 }
 
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
diff --git a/mm/sparse.c b/mm/sparse.c
index 6326cdf36c4f..5310be6171f1 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -453,7 +453,8 @@ static unsigned long __init section_map_size(void)
 }
 
 struct page __init *__populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long size = section_map_size();
 	struct page *map = sparse_buffer_alloc(size);
@@ -552,7 +553,7 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 			break;
 
 		map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
-				nid, NULL);
+				nid, NULL, NULL);
 		if (!map) {
 			pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
 			       __func__, nid);
@@ -657,9 +658,10 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 static struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
-	return __populate_section_memmap(pfn, nr_pages, nid, altmap);
+	return __populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
 }
 
 static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
@@ -728,7 +730,8 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
 }
 #else
 struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	return kvmalloc_node(array_size(sizeof(struct page),
 					PAGES_PER_SECTION), GFP_KERNEL, nid);
@@ -851,7 +854,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 }
 
 static struct page * __meminit section_activate(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	struct mem_section *ms = __pfn_to_section(pfn);
 	struct mem_section_usage *usage = NULL;
@@ -883,7 +887,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
 	if (nr_pages < PAGES_PER_SECTION && early_section(ms))
 		return pfn_to_page(pfn);
 
-	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap);
+	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
 	if (!memmap) {
 		section_deactivate(pfn, nr_pages, altmap);
 		return ERR_PTR(-ENOMEM);
@@ -898,6 +902,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  * @start_pfn: start pfn of the memory range
  * @nr_pages: number of pfns to add in the section
  * @altmap: device page map
+ * @pgmap: device page map object that owns the section
  *
  * This is only intended for hotplug.
  *
@@ -911,7 +916,8 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  * * -ENOMEM	- Out of memory.
  */
 int __meminit sparse_add_section(int nid, unsigned long start_pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
 	struct mem_section *ms;
@@ -922,7 +928,7 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 	if (ret < 0)
 		return ret;
 
-	memmap = section_activate(nid, start_pfn, nr_pages, altmap);
+	memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
 	if (IS_ERR(memmap))
 		return PTR_ERR(memmap);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v3 06/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper
  2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (4 preceding siblings ...)
  2021-07-14 19:35 ` [PATCH v3 05/14] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
@ 2021-07-14 19:35 ` Joao Martins
  2021-07-28  6:04   ` Dan Williams
  2021-07-14 19:35 ` [PATCH v3 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-14 19:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

In preparation for describing a memmap with compound pages, move the
actual pte population logic into a separate function
vmemmap_populate_address() and have vmemmap_populate_basepages() walk
through all base pages it needs to populate.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/sparse-vmemmap.c | 46 ++++++++++++++++++++++++++++------------------
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 80d3ba30d345..76f4158f6301 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -570,33 +570,43 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
-					 int node, struct vmem_altmap *altmap)
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+					      struct vmem_altmap *altmap)
 {
-	unsigned long addr = start;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
+	pgd = vmemmap_pgd_populate(addr, node);
+	if (!pgd)
+		return -ENOMEM;
+	p4d = vmemmap_p4d_populate(pgd, addr, node);
+	if (!p4d)
+		return -ENOMEM;
+	pud = vmemmap_pud_populate(p4d, addr, node);
+	if (!pud)
+		return -ENOMEM;
+	pmd = vmemmap_pmd_populate(pud, addr, node);
+	if (!pmd)
+		return -ENOMEM;
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+	if (!pte)
+		return -ENOMEM;
+	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+
+	return 0;
+}
+
+int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
+					 int node, struct vmem_altmap *altmap)
+{
+	unsigned long addr = start;
+
 	for (; addr < end; addr += PAGE_SIZE) {
-		pgd = vmemmap_pgd_populate(addr, node);
-		if (!pgd)
-			return -ENOMEM;
-		p4d = vmemmap_p4d_populate(pgd, addr, node);
-		if (!p4d)
-			return -ENOMEM;
-		pud = vmemmap_pud_populate(p4d, addr, node);
-		if (!pud)
-			return -ENOMEM;
-		pmd = vmemmap_pmd_populate(pud, addr, node);
-		if (!pmd)
-			return -ENOMEM;
-		pte = vmemmap_pte_populate(pmd, addr, node, altmap);
-		if (!pte)
+		if (vmemmap_populate_address(addr, node, altmap))
 			return -ENOMEM;
-		vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
 	}
 
 	return 0;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v3 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm
  2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (5 preceding siblings ...)
  2021-07-14 19:35 ` [PATCH v3 06/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
@ 2021-07-14 19:35 ` Joao Martins
  2021-07-15  2:47   ` [External] " Muchun Song
  2021-07-28  6:09   ` Dan Williams
  2021-07-14 19:35 ` [PATCH v3 08/14] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
                   ` (7 subsequent siblings)
  14 siblings, 2 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-14 19:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

In preparation for device-dax to use the hugetlbfs compound page tail
deduplication technique, move the comment block explanation into a
common place in Documentation/vm.

Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 Documentation/vm/index.rst         |   1 +
 Documentation/vm/vmemmap_dedup.rst | 170 +++++++++++++++++++++++++++++
 mm/hugetlb_vmemmap.c               | 162 +--------------------------
 3 files changed, 172 insertions(+), 161 deletions(-)
 create mode 100644 Documentation/vm/vmemmap_dedup.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index eff5fbd492d0..edd690afd890 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -51,5 +51,6 @@ descriptions of data structures and algorithms.
    split_page_table_lock
    transhuge
    unevictable-lru
+   vmemmap_dedup
    z3fold
    zsmalloc
diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
new file mode 100644
index 000000000000..215ae2ef3bce
--- /dev/null
+++ b/Documentation/vm/vmemmap_dedup.rst
@@ -0,0 +1,170 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _vmemmap_dedup:
+
+==================================
+Free some vmemmap pages of HugeTLB
+==================================
+
+The struct page structures (page structs) are used to describe a physical
+page frame. By default, there is a one-to-one mapping from a page frame to
+it's corresponding page struct.
+
+HugeTLB pages consist of multiple base page size pages and is supported by
+many architectures. See hugetlbpage.rst in the Documentation directory for
+more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB
+are currently supported. Since the base page size on x86 is 4KB, a 2MB
+HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
+4096 base pages. For each base page, there is a corresponding page struct.
+
+Within the HugeTLB subsystem, only the first 4 page structs are used to
+contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
+this upper limit. The only 'useful' information in the remaining page structs
+is the compound_head field, and this field is the same for all tail pages.
+
+By removing redundant page structs for HugeTLB pages, memory can be returned
+to the buddy allocator for other uses.
+
+Different architectures support different HugeTLB pages. For example, the
+following table is the HugeTLB page size supported by x86 and arm64
+architectures. Because arm64 supports 4k, 16k, and 64k base pages and
+supports contiguous entries, so it supports many kinds of sizes of HugeTLB
+page.
+
++--------------+-----------+-----------------------------------------------+
+| Architecture | Page Size |                HugeTLB Page Size              |
++--------------+-----------+-----------+-----------+-----------+-----------+
+|    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
++--------------+-----------+-----------+-----------+-----------+-----------+
+|              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
+|              +-----------+-----------+-----------+-----------+-----------+
+|    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
+|              +-----------+-----------+-----------+-----------+-----------+
+|              |   64KB    |    2MB    |  512MB    |    16GB   |           |
++--------------+-----------+-----------+-----------+-----------+-----------+
+
+When the system boot up, every HugeTLB page has more than one struct page
+structs which size is (unit: pages):
+
+   struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
+
+Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
+of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
+relationship.
+
+   HugeTLB_Size = n * PAGE_SIZE
+
+Then,
+
+   struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
+               = n * sizeof(struct page) / PAGE_SIZE
+
+We can use huge mapping at the pud/pmd level for the HugeTLB page.
+
+For the HugeTLB page of the pmd level mapping, then
+
+   struct_size = n * sizeof(struct page) / PAGE_SIZE
+               = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
+               = sizeof(struct page) / sizeof(pte_t)
+               = 64 / 8
+               = 8 (pages)
+
+Where n is how many pte entries which one page can contains. So the value of
+n is (PAGE_SIZE / sizeof(pte_t)).
+
+This optimization only supports 64-bit system, so the value of sizeof(pte_t)
+is 8. And this optimization also applicable only when the size of struct page
+is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
+x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
+size of struct page structs of it is 8 page frames which size depends on the
+size of the base page.
+
+For the HugeTLB page of the pud level mapping, then
+
+   struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
+               = PAGE_SIZE / 8 * 8 (pages)
+               = PAGE_SIZE (pages)
+
+Where the struct_size(pmd) is the size of the struct page structs of a
+HugeTLB page of the pmd level mapping.
+
+E.g.: A 2MB HugeTLB page on x86_64 consists in 8 page frames while 1GB
+HugeTLB page consists in 4096.
+
+Next, we take the pmd level mapping of the HugeTLB page as an example to
+show the internal implementation of this optimization. There are 8 pages
+struct page structs associated with a HugeTLB page which is pmd mapped.
+
+Here is how things look before optimization.
+
+    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
+ +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ |           |                     |     0     | -------------> |     0     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     1     | -------------> |     1     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     2     | -------------> |     2     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     3     | -------------> |     3     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     4     | -------------> |     4     |
+ |    PMD    |                     +-----------+                +-----------+
+ |   level   |                     |     5     | -------------> |     5     |
+ |  mapping  |                     +-----------+                +-----------+
+ |           |                     |     6     | -------------> |     6     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     7     | -------------> |     7     |
+ |           |                     +-----------+                +-----------+
+ |           |
+ |           |
+ |           |
+ +-----------+
+
+The value of page->compound_head is the same for all tail pages. The first
+page of page structs (page 0) associated with the HugeTLB page contains the 4
+page structs necessary to describe the HugeTLB. The only use of the remaining
+pages of page structs (page 1 to page 7) is to point to page->compound_head.
+Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
+will be used for each HugeTLB page. This will allow us to free the remaining
+6 pages to the buddy allocator.
+
+Here is how things look after remapping.
+
+    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
+ +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ |           |                     |     0     | -------------> |     0     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     1     | -------------> |     1     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
+ |           |                     +-----------+                   | | | | |
+ |           |                     |     3     | ------------------+ | | | |
+ |           |                     +-----------+                     | | | |
+ |           |                     |     4     | --------------------+ | | |
+ |    PMD    |                     +-----------+                       | | |
+ |   level   |                     |     5     | ----------------------+ | |
+ |  mapping  |                     +-----------+                         | |
+ |           |                     |     6     | ------------------------+ |
+ |           |                     +-----------+                           |
+ |           |                     |     7     | --------------------------+
+ |           |                     +-----------+
+ |           |
+ |           |
+ |           |
+ +-----------+
+
+When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
+vmemmap pages and restore the previous mapping relationship.
+
+For the HugeTLB page of the pud level mapping. It is similar to the former.
+We also can use this approach to free (PAGE_SIZE - 2) vmemmap pages.
+
+Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
+(e.g. aarch64) provides a contiguous bit in the translation table entries
+that hints to the MMU to indicate that it is one of a contiguous set of
+entries that can be cached in a single TLB entry.
+
+The contiguous bit is used to increase the mapping size at the pmd and pte
+(last) level. So this type of HugeTLB page can be optimized only when its
+size of the struct page structs is greater than 2 pages.
+
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index c540c21e26f5..e2994e50ddee 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -6,167 +6,7 @@
  *
  *     Author: Muchun Song <songmuchun@bytedance.com>
  *
- * The struct page structures (page structs) are used to describe a physical
- * page frame. By default, there is a one-to-one mapping from a page frame to
- * it's corresponding page struct.
- *
- * HugeTLB pages consist of multiple base page size pages and is supported by
- * many architectures. See hugetlbpage.rst in the Documentation directory for
- * more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB
- * are currently supported. Since the base page size on x86 is 4KB, a 2MB
- * HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
- * 4096 base pages. For each base page, there is a corresponding page struct.
- *
- * Within the HugeTLB subsystem, only the first 4 page structs are used to
- * contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
- * this upper limit. The only 'useful' information in the remaining page structs
- * is the compound_head field, and this field is the same for all tail pages.
- *
- * By removing redundant page structs for HugeTLB pages, memory can be returned
- * to the buddy allocator for other uses.
- *
- * Different architectures support different HugeTLB pages. For example, the
- * following table is the HugeTLB page size supported by x86 and arm64
- * architectures. Because arm64 supports 4k, 16k, and 64k base pages and
- * supports contiguous entries, so it supports many kinds of sizes of HugeTLB
- * page.
- *
- * +--------------+-----------+-----------------------------------------------+
- * | Architecture | Page Size |                HugeTLB Page Size              |
- * +--------------+-----------+-----------+-----------+-----------+-----------+
- * |    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
- * +--------------+-----------+-----------+-----------+-----------+-----------+
- * |              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
- * |              +-----------+-----------+-----------+-----------+-----------+
- * |    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
- * |              +-----------+-----------+-----------+-----------+-----------+
- * |              |   64KB    |    2MB    |  512MB    |    16GB   |           |
- * +--------------+-----------+-----------+-----------+-----------+-----------+
- *
- * When the system boot up, every HugeTLB page has more than one struct page
- * structs which size is (unit: pages):
- *
- *    struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
- *
- * Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
- * of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
- * relationship.
- *
- *    HugeTLB_Size = n * PAGE_SIZE
- *
- * Then,
- *
- *    struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
- *                = n * sizeof(struct page) / PAGE_SIZE
- *
- * We can use huge mapping at the pud/pmd level for the HugeTLB page.
- *
- * For the HugeTLB page of the pmd level mapping, then
- *
- *    struct_size = n * sizeof(struct page) / PAGE_SIZE
- *                = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
- *                = sizeof(struct page) / sizeof(pte_t)
- *                = 64 / 8
- *                = 8 (pages)
- *
- * Where n is how many pte entries which one page can contains. So the value of
- * n is (PAGE_SIZE / sizeof(pte_t)).
- *
- * This optimization only supports 64-bit system, so the value of sizeof(pte_t)
- * is 8. And this optimization also applicable only when the size of struct page
- * is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
- * x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
- * size of struct page structs of it is 8 page frames which size depends on the
- * size of the base page.
- *
- * For the HugeTLB page of the pud level mapping, then
- *
- *    struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
- *                = PAGE_SIZE / 8 * 8 (pages)
- *                = PAGE_SIZE (pages)
- *
- * Where the struct_size(pmd) is the size of the struct page structs of a
- * HugeTLB page of the pmd level mapping.
- *
- * E.g.: A 2MB HugeTLB page on x86_64 consists in 8 page frames while 1GB
- * HugeTLB page consists in 4096.
- *
- * Next, we take the pmd level mapping of the HugeTLB page as an example to
- * show the internal implementation of this optimization. There are 8 pages
- * struct page structs associated with a HugeTLB page which is pmd mapped.
- *
- * Here is how things look before optimization.
- *
- *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
- * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
- * |           |                     |     0     | -------------> |     0     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     1     | -------------> |     1     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     2     | -------------> |     2     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     3     | -------------> |     3     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     4     | -------------> |     4     |
- * |    PMD    |                     +-----------+                +-----------+
- * |   level   |                     |     5     | -------------> |     5     |
- * |  mapping  |                     +-----------+                +-----------+
- * |           |                     |     6     | -------------> |     6     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     7     | -------------> |     7     |
- * |           |                     +-----------+                +-----------+
- * |           |
- * |           |
- * |           |
- * +-----------+
- *
- * The value of page->compound_head is the same for all tail pages. The first
- * page of page structs (page 0) associated with the HugeTLB page contains the 4
- * page structs necessary to describe the HugeTLB. The only use of the remaining
- * pages of page structs (page 1 to page 7) is to point to page->compound_head.
- * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
- * will be used for each HugeTLB page. This will allow us to free the remaining
- * 6 pages to the buddy allocator.
- *
- * Here is how things look after remapping.
- *
- *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
- * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
- * |           |                     |     0     | -------------> |     0     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     1     | -------------> |     1     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
- * |           |                     +-----------+                   | | | | |
- * |           |                     |     3     | ------------------+ | | | |
- * |           |                     +-----------+                     | | | |
- * |           |                     |     4     | --------------------+ | | |
- * |    PMD    |                     +-----------+                       | | |
- * |   level   |                     |     5     | ----------------------+ | |
- * |  mapping  |                     +-----------+                         | |
- * |           |                     |     6     | ------------------------+ |
- * |           |                     +-----------+                           |
- * |           |                     |     7     | --------------------------+
- * |           |                     +-----------+
- * |           |
- * |           |
- * |           |
- * +-----------+
- *
- * When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
- * vmemmap pages and restore the previous mapping relationship.
- *
- * For the HugeTLB page of the pud level mapping. It is similar to the former.
- * We also can use this approach to free (PAGE_SIZE - 2) vmemmap pages.
- *
- * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
- * (e.g. aarch64) provides a contiguous bit in the translation table entries
- * that hints to the MMU to indicate that it is one of a contiguous set of
- * entries that can be cached in a single TLB entry.
- *
- * The contiguous bit is used to increase the mapping size at the pmd and pte
- * (last) level. So this type of HugeTLB page can be optimized only when its
- * size of the struct page structs is greater than 2 pages.
+ * See Documentation/vm/vmemmap_dedup.rst
  */
 #define pr_fmt(fmt)	"HugeTLB: " fmt
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v3 08/14] mm/sparse-vmemmap: populate compound pagemaps
  2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (6 preceding siblings ...)
  2021-07-14 19:35 ` [PATCH v3 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
@ 2021-07-14 19:35 ` Joao Martins
  2021-07-28  6:55   ` Dan Williams
  2021-07-14 19:35 ` [PATCH v3 09/14] mm/page_alloc: reuse tail struct pages for " Joao Martins
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-14 19:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

A compound pagemap is a dev_pagemap with @align > PAGE_SIZE, meaning
that pages are mapped at a given huge page alignment and use compound
pages, as opposed to order-0 pages.

Take advantage of the fact that most tail pages look the same (except
the first two) to minimize struct page overhead. Allocate a separate
page for the vmemmap area which contains the head page, and a separate
one for the next 64 pages. The rest of the subsections then reuse this
tail vmemmap page to initialize the rest of the tail pages.

Sections are arch-dependent (e.g. on x86 they are 64M, 128M or 512M)
and when initializing a compound pagemap with a big enough @align
(e.g. 1G PUD) it will cross various sections. To be able to reuse tail
pages across sections belonging to the same gigantic page, fetch the
@range being mapped (nr_ranges + 1).  If the section being mapped is
not at offset 0 of the @align, then look up the PFN of the struct page
address that precedes it and use that to populate the entire
section.

On compound pagemaps with 2M align, this mechanism lets 6 pages be
saved out of the 8 PFNs necessary to set up the subsection's 512
struct pages being mapped. On a 1G compound pagemap it saves
4094 pages.

Altmap isn't supported yet, given various restrictions in the altmap
pfn allocator; thus fall back to the vmemmap_populate() already in use.
It is worth noting that altmap for devmap mappings was there to relieve
the pressure of inordinate amounts of memmap space to map terabytes of
pmem. With compound pages the motivation for altmaps for pmem is reduced.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 Documentation/vm/vmemmap_dedup.rst |  27 +++++-
 include/linux/mm.h                 |   2 +-
 mm/memremap.c                      |   1 +
 mm/sparse-vmemmap.c                | 133 +++++++++++++++++++++++++++--
 4 files changed, 151 insertions(+), 12 deletions(-)

diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
index 215ae2ef3bce..42830a667c2a 100644
--- a/Documentation/vm/vmemmap_dedup.rst
+++ b/Documentation/vm/vmemmap_dedup.rst
@@ -2,9 +2,12 @@
 
 .. _vmemmap_dedup:
 
-==================================
-Free some vmemmap pages of HugeTLB
-==================================
+=================================================
+Free some vmemmap pages of HugeTLB and Device DAX
+=================================================
+
+HugeTLB
+=======
 
 The struct page structures (page structs) are used to describe a physical
 page frame. By default, there is a one-to-one mapping from a page frame to
@@ -168,3 +171,21 @@ The contiguous bit is used to increase the mapping size at the pmd and pte
 (last) level. So this type of HugeTLB page can be optimized only when its
 size of the struct page structs is greater than 2 pages.
 
+Device DAX
+==========
+
+The device-dax interface uses the same tail deduplication technique explained
+in the previous chapter, except when used with the vmemmap in the device (altmap).
+
+The differences with HugeTLB are relatively minor.
+
+The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
+PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
+
+There's no remapping of vmemmap given that device-dax memory is not part of
+System RAM ranges initialized at boot, hence the tail deduplication happens
+at a later stage when we populate the sections.
+
+It only uses 3 page structs for storing all information as opposed
+to 4 on HugeTLB pages. This does not affect memory savings between both.
+
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f244a9219ce4..5e3e153ddd3d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3090,7 +3090,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-			    struct vmem_altmap *altmap);
+			    struct vmem_altmap *altmap, struct page *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
 struct vmem_altmap;
 void *vmemmap_alloc_block_buf(unsigned long size, int node,
diff --git a/mm/memremap.c b/mm/memremap.c
index ffcb924eb6a5..9198fdace903 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 {
 	struct mhp_params params = {
 		.altmap = pgmap_altmap(pgmap),
+		.pgmap = pgmap,
 		.pgprot = PAGE_KERNEL,
 	};
 	const int nr_range = pgmap->nr_range;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 76f4158f6301..a8de6c472999 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -495,16 +495,31 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 }
 
 pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-				       struct vmem_altmap *altmap)
+				       struct vmem_altmap *altmap,
+				       struct page *block)
 {
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte)) {
 		pte_t entry;
 		void *p;
 
-		p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
-		if (!p)
-			return NULL;
+		if (!block) {
+			p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+			if (!p)
+				return NULL;
+		} else {
+			/*
+			 * When a PTE/PMD entry is freed from the init_mm
+			 * there's a free_pages() call to this page allocated
+			 * above. Thus this get_page() is paired with the
+			 * put_page_testzero() on the freeing path.
+			 * This can only be called by certain ZONE_DEVICE paths,
+			 * and through vmemmap_populate_compound_pages() when
+			 * slab is available.
+			 */
+			get_page(block);
+			p = page_to_virt(block);
+		}
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
 		set_pte_at(&init_mm, addr, pte, entry);
 	}
@@ -571,7 +586,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 }
 
 static int __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap)
+					      struct vmem_altmap *altmap,
+					      struct page *reuse, struct page **page)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -591,10 +607,14 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pmd = vmemmap_pmd_populate(pud, addr, node);
 	if (!pmd)
 		return -ENOMEM;
-	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
 	if (!pte)
 		return -ENOMEM;
 	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+
+	if (page)
+		*page = pte_page(*pte);
+	return 0;
 }
 
 int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
@@ -603,7 +623,97 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 	unsigned long addr = start;
 
 	for (; addr < end; addr += PAGE_SIZE) {
-		if (vmemmap_populate_address(addr, node, altmap))
+		if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int __meminit vmemmap_populate_range(unsigned long start,
+					    unsigned long end,
+					    int node, struct page *page)
+{
+	unsigned long addr = start;
+
+	for (; addr < end; addr += PAGE_SIZE) {
+		if (vmemmap_populate_address(addr, node, NULL, page, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
+						  struct page **page)
+{
+	return vmemmap_populate_address(addr, node, NULL, NULL, page);
+}
+
+static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
+						     unsigned long start,
+						     unsigned long end, int node,
+						     struct dev_pagemap *pgmap)
+{
+	unsigned long offset, size, addr;
+
+	/*
+	 * For compound pages bigger than section size (e.g. x86 1G compound
+	 * pages with 2M subsection size) fill the rest of sections as tail
+	 * pages.
+	 *
+	 * Note that memremap_pages() resets @nr_range value and will increment
+	 * it after each successful range onlining. Thus the value of @nr_range
+	 * at section memmap populate corresponds to the in-progress range
+	 * being onlined here.
+	 */
+	offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
+	if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
+	    pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
+		pte_t *ptep;
+
+		addr = start - PAGE_SIZE;
+
+		/*
+		 * Sections are populated sequentially and in succession, meaning
+		 * this section being populated wouldn't start if the
+		 * preceding one wasn't successful. So there is a guarantee that
+		 * the previous struct pages are mapped when trying to lookup
+		 * the last tail page.
+		 */
+		ptep = pte_offset_kernel(pmd_off_k(addr), addr);
+		if (!ptep)
+			return -ENOMEM;
+
+		/*
+		 * Reuse the page that was populated in the prior iteration
+		 * with just tail struct pages.
+		 */
+		return vmemmap_populate_range(start, end, node,
+					      pte_page(*ptep));
+	}
+
+	size = min(end - start, pgmap_pfn_geometry(pgmap) * sizeof(struct page));
+	for (addr = start; addr < end; addr += size) {
+		unsigned long next = addr, last = addr + size;
+		struct page *block;
+
+		/* Populate the head page vmemmap page */
+		if (vmemmap_populate_page(addr, node, NULL))
+			return -ENOMEM;
+
+		/* Populate the tail pages vmemmap page */
+		block = NULL;
+		next = addr + PAGE_SIZE;
+		if (vmemmap_populate_page(next, node, &block))
+			return -ENOMEM;
+
+		/*
+		 * Reuse the previous page for the rest of tail pages
+		 * See layout diagram in Documentation/vm/vmemmap_dedup.rst
+		 */
+		next += PAGE_SIZE;
+		if (vmemmap_populate_range(next, last, node, block))
 			return -ENOMEM;
 	}
 
@@ -616,12 +726,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
+	unsigned int geometry = pgmap_geometry(pgmap);
+	int r;
 
 	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
 		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
 		return NULL;
 
-	if (vmemmap_populate(start, end, nid, altmap))
+	if (geometry > PAGE_SIZE && !altmap)
+		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
+	else
+		r = vmemmap_populate(start, end, nid, altmap);
+
+	if (r < 0)
 		return NULL;
 
 	return pfn_to_page(pfn);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 74+ messages in thread
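To make the savings from the populate scheme above concrete, here is a small standalone sketch (illustrative userspace code, not kernel code; it assumes x86_64's 4K PAGE_SIZE and a 64-byte struct page) counting how many unique vmemmap pages get populated per compound page: one page holding the head struct page plus one page of tail struct pages that every remaining PTE reuses, versus one vmemmap page per 64 struct pages without deduplication.

```c
#include <assert.h>

#define PAGE_SIZE		4096UL
#define STRUCT_PAGE_SIZE	64UL	/* stand-in for sizeof(struct page) */

/* vmemmap pages needed without deduplication: one per 64 struct pages */
static unsigned long vmemmap_pages_plain(unsigned long geometry)
{
	unsigned long nr_struct_pages = geometry / PAGE_SIZE;

	return nr_struct_pages * STRUCT_PAGE_SIZE / PAGE_SIZE;
}

/*
 * With the PTE-level reuse of vmemmap_populate_compound_pages():
 * one page for the head struct page, one page of tail struct pages,
 * and every other PTE maps that same tail page again.
 */
static unsigned long vmemmap_pages_deduped(unsigned long geometry)
{
	(void)geometry;
	return 2;
}
```

For a 2M compound page this is 2 unique pages instead of 8; for 1G, 2 instead of 4096 (pagetable pages aside).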

* [PATCH v3 09/14] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (7 preceding siblings ...)
  2021-07-14 19:35 ` [PATCH v3 08/14] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
@ 2021-07-14 19:35 ` Joao Martins
  2021-07-28  7:28   ` Dan Williams
  2021-07-14 19:35 ` [PATCH v3 10/14] device-dax: use ALIGN() for determining pgoff Joao Martins
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-14 19:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Currently memmap_init_zone_device() ends up initializing 32768 pages
when it only needs to initialize 128 given tail page reuse. That
number is worse with 1GB compound page geometries, 262144 instead of
128. Update memmap_init_zone_device() to skip redundant
initialization, detailed below.

When a pgmap @geometry is set, all pages are mapped at a given huge page
alignment and use compound pages to describe them, as opposed to one
struct page per 4K page.

With @geometry > PAGE_SIZE and when struct pages are stored in ram
(!altmap) most tail pages are reused. Consequently, the amount of unique
struct pages is a lot smaller than the total amount of struct pages
being mapped.

The altmap path is left alone since it does not support memory savings
based on compound pagemap geometries.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/page_alloc.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 188cb5f8c308..96975edac0a8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6600,11 +6600,23 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 static void __ref memmap_init_compound(struct page *page, unsigned long pfn,
 					unsigned long zone_idx, int nid,
 					struct dev_pagemap *pgmap,
+					struct vmem_altmap *altmap,
 					unsigned long nr_pages)
 {
 	unsigned int order_align = order_base_2(nr_pages);
 	unsigned long i;
 
+	/*
+	 * With compound page geometry and when struct pages are stored in ram
+	 * (!altmap) most tail pages are reused. Consequently, the amount of
+	 * unique struct pages to initialize is a lot smaller than the total
+	 * amount of struct pages being mapped.
+	 * See vmemmap_populate_compound_pages().
+	 */
+	if (!altmap)
+		nr_pages = min_t(unsigned long, nr_pages,
+				 2 * (PAGE_SIZE/sizeof(struct page)));
+
 	__SetPageHead(page);
 
 	for (i = 1; i < nr_pages; i++) {
@@ -6657,7 +6669,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 			continue;
 
 		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
-				     pfns_per_compound);
+				     altmap, pfns_per_compound);
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 74+ messages in thread
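The clamp this patch applies can be modeled in isolation (a toy sketch, not kernel code; it assumes 4K pages and a 64-byte struct page, so one vmemmap page holds 64 struct pages and the two unique vmemmap pages hold 128):

```c
#include <assert.h>

#define PAGE_SIZE		4096UL
#define STRUCT_PAGE_SIZE	64UL	/* stand-in for sizeof(struct page) */

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Mirrors the !altmap clamp in memmap_init_compound(): only the struct
 * pages backed by the two unique vmemmap pages (head page plus first
 * tail page) need initialization, since the rest alias the tail page.
 */
static unsigned long pages_to_init(unsigned long nr_pages, int has_altmap)
{
	if (!has_altmap)
		nr_pages = min_ul(nr_pages, 2 * (PAGE_SIZE / STRUCT_PAGE_SIZE));
	return nr_pages;
}
```

This reproduces the numbers in the commit message: 32768 pages (2M geometry) and 262144 pages (1G geometry) both drop to 128 initializations.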

* [PATCH v3 10/14] device-dax: use ALIGN() for determining pgoff
  2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (8 preceding siblings ...)
  2021-07-14 19:35 ` [PATCH v3 09/14] mm/page_alloc: reuse tail struct pages for " Joao Martins
@ 2021-07-14 19:35 ` Joao Martins
  2021-07-28  7:29   ` Dan Williams
  2021-07-14 19:35 ` [PATCH v3 11/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-14 19:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Rather than calculating @pgoff manually, switch to ALIGN() instead.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index dd8222a42808..0b82159b3564 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -234,8 +234,8 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		 * mapped. No need to consider the zero page, or racing
 		 * conflicting mappings.
 		 */
-		pgoff = linear_page_index(vmf->vma, vmf->address
-				& ~(fault_size - 1));
+		pgoff = linear_page_index(vmf->vma,
+				ALIGN(vmf->address, fault_size));
 		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
 			struct page *page;
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 74+ messages in thread
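One subtlety worth keeping in mind when reviewing this conversion: the kernel's ALIGN() rounds up, whereas the open-coded `addr & ~(fault_size - 1)` rounds down (what ALIGN_DOWN() does); the two only coincide when the address is already aligned to fault_size. A minimal userspace sketch of the two helpers (simplified; power-of-two alignment assumed):

```c
#include <assert.h>

/* Simplified kernel-style helpers, power-of-two alignment assumed */
#define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))	/* rounds up */
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))		/* rounds down */
```

For example, with a 2M fault_size, an address of 0x201000 rounds up to 0x400000 with ALIGN() but down to 0x200000 with the mask.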

* [PATCH v3 11/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (9 preceding siblings ...)
  2021-07-14 19:35 ` [PATCH v3 10/14] device-dax: use ALIGN() for determining pgoff Joao Martins
@ 2021-07-14 19:35 ` Joao Martins
  2021-07-28  7:30   ` Dan Williams
  2021-07-14 19:35 ` [PATCH v3 12/14] device-dax: compound pagemap support Joao Martins
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-14 19:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Right now, only static dax regions have a valid @pgmap pointer in their
struct dev_dax. The dynamic dax case, however, does not.

In preparation for device-dax compound pagemap support, make sure that
dev_dax pgmap field is set after it has been allocated and initialized.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/device.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 0b82159b3564..6e348b5f9d45 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -426,6 +426,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 	}
 
 	pgmap->type = MEMORY_DEVICE_GENERIC;
+	dev_dax->pgmap = pgmap;
+
 	addr = devm_memremap_pages(dev, pgmap);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v3 12/14] device-dax: compound pagemap support
  2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (10 preceding siblings ...)
  2021-07-14 19:35 ` [PATCH v3 11/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
@ 2021-07-14 19:35 ` Joao Martins
  2021-07-14 23:36   ` Dan Williams
  2021-07-14 19:35 ` [PATCH v3 13/14] mm/gup: grab head page refcount once for group of subpages Joao Martins
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-14 19:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Use the newly added compound pagemap facility which maps the assigned dax
ranges as compound pages at a page size of @align. Currently, this means
that region/namespace bootstrap takes considerably less time, given that
considerably fewer pages are initialized.

On setups with 128G NVDIMMs the initialization with DRAM stored struct
pages improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less
than 1 ms with 1G pages.

dax devices are created with a fixed @align (huge page size) which is
also enforced at mmap() of the device. Faults consequently happen at the
@align specified at creation time, and do not change throughout the dax
device lifetime. An MCE poisons a whole dax huge page, and splits occur
at the configured page size.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/device.c | 56 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 43 insertions(+), 13 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 6e348b5f9d45..149627c922cc 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -192,6 +192,42 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 }
 #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
+static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
+			     unsigned long fault_size,
+			     struct address_space *f_mapping)
+{
+	unsigned long i;
+	pgoff_t pgoff;
+
+	pgoff = linear_page_index(vmf->vma, ALIGN(vmf->address, fault_size));
+
+	for (i = 0; i < fault_size / PAGE_SIZE; i++) {
+		struct page *page;
+
+		page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+		if (page->mapping)
+			continue;
+		page->mapping = f_mapping;
+		page->index = pgoff + i;
+	}
+}
+
+static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
+				 unsigned long fault_size,
+				 struct address_space *f_mapping)
+{
+	struct page *head;
+
+	head = pfn_to_page(pfn_t_to_pfn(pfn));
+	head = compound_head(head);
+	if (head->mapping)
+		return;
+
+	head->mapping = f_mapping;
+	head->index = linear_page_index(vmf->vma,
+			ALIGN(vmf->address, fault_size));
+}
+
 static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
@@ -225,8 +261,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 	}
 
 	if (rc == VM_FAULT_NOPAGE) {
-		unsigned long i;
-		pgoff_t pgoff;
+		struct dev_pagemap *pgmap = dev_dax->pgmap;
 
 		/*
 		 * In the device-dax case the only possibility for a
@@ -234,17 +269,10 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		 * mapped. No need to consider the zero page, or racing
 		 * conflicting mappings.
 		 */
-		pgoff = linear_page_index(vmf->vma,
-				ALIGN(vmf->address, fault_size));
-		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
-			struct page *page;
-
-			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
-			if (page->mapping)
-				continue;
-			page->mapping = filp->f_mapping;
-			page->index = pgoff + i;
-		}
+		if (pgmap_geometry(pgmap) > PAGE_SIZE)
+			set_compound_mapping(vmf, pfn, fault_size, filp->f_mapping);
+		else
+			set_page_mapping(vmf, pfn, fault_size, filp->f_mapping);
 	}
 	dax_read_unlock(id);
 
@@ -426,6 +454,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 	}
 
 	pgmap->type = MEMORY_DEVICE_GENERIC;
+	if (dev_dax->align > PAGE_SIZE)
+		pgmap->geometry = dev_dax->align;
 	dev_dax->pgmap = pgmap;
 
 	addr = devm_memremap_pages(dev, pgmap);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 74+ messages in thread
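The head-only bookkeeping that set_compound_mapping() performs can be modeled in isolation (a toy sketch with made-up types, not the kernel's struct page): instead of writing mapping/index into every one of the fault_size/PAGE_SIZE struct pages, only the compound head is touched.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct page, illustration only */
struct toy_page {
	void *mapping;
	unsigned long index;
	struct toy_page *head;	/* stand-in for compound_head() */
};

/* Per-page path: one store per 4K page in the fault range */
static unsigned long toy_set_page_mapping(struct toy_page *pages,
					  unsigned long nr, void *mapping)
{
	unsigned long i, stores = 0;

	for (i = 0; i < nr; i++) {
		if (pages[i].mapping)
			continue;
		pages[i].mapping = mapping;
		pages[i].index = i;
		stores++;
	}
	return stores;
}

/* Compound path: a single store on the head page */
static unsigned long toy_set_compound_mapping(struct toy_page *pages,
					      unsigned long nr, void *mapping)
{
	struct toy_page *head = pages[0].head;

	(void)nr;
	if (head->mapping)
		return 0;
	head->mapping = mapping;
	head->index = 0;
	return 1;
}
```

For a 2M fault (512 base pages), the per-page path performs 512 stores where the compound path performs one.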

* [PATCH v3 13/14] mm/gup: grab head page refcount once for group of subpages
  2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (11 preceding siblings ...)
  2021-07-14 19:35 ` [PATCH v3 12/14] device-dax: compound pagemap support Joao Martins
@ 2021-07-14 19:35 ` Joao Martins
  2021-07-28 19:55   ` Dan Williams
  2021-07-14 19:35 ` [PATCH v3 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry Joao Martins
  2021-07-14 21:48 ` [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Andrew Morton
  14 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-14 19:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Use try_grab_compound_head() for device-dax GUP when configured with a
compound pagemap.

Rather than incrementing the refcount for each page, do one atomic
addition for all the pages to be pinned.

Performance measured by gup_benchmark improves considerably for
get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:

 $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
(get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
(pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
[altmap]
(get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
(pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms

 $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
(get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
(pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
[altmap with -m 127004]
(get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
(pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/gup.c | 53 +++++++++++++++++++++++++++++++++--------------------
 1 file changed, 33 insertions(+), 20 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 42b8b1fa6521..9baaa1c0b7f3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2234,31 +2234,55 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 }
 #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
 
+
+static int record_subpages(struct page *page, unsigned long addr,
+			   unsigned long end, struct page **pages)
+{
+	int nr;
+
+	for (nr = 0; addr != end; addr += PAGE_SIZE)
+		pages[nr++] = page++;
+
+	return nr;
+}
+
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 			     unsigned long end, unsigned int flags,
 			     struct page **pages, int *nr)
 {
-	int nr_start = *nr;
+	int refs, nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
 
 	do {
-		struct page *page = pfn_to_page(pfn);
+		struct page *pinned_head, *head, *page = pfn_to_page(pfn);
+		unsigned long next;
 
 		pgmap = get_dev_pagemap(pfn, pgmap);
 		if (unlikely(!pgmap)) {
 			undo_dev_pagemap(nr, nr_start, flags, pages);
 			return 0;
 		}
-		SetPageReferenced(page);
-		pages[*nr] = page;
-		if (unlikely(!try_grab_page(page, flags))) {
-			undo_dev_pagemap(nr, nr_start, flags, pages);
+
+		head = compound_head(page);
+		/* @end is assumed to be limited to at most one compound page */
+		next = PageCompound(head) ? end : addr + PAGE_SIZE;
+		refs = record_subpages(page, addr, next, pages + *nr);
+
+		SetPageReferenced(head);
+		pinned_head = try_grab_compound_head(head, refs, flags);
+		if (!pinned_head) {
+			if (PageCompound(head)) {
+				ClearPageReferenced(head);
+				put_dev_pagemap(pgmap);
+			} else {
+				undo_dev_pagemap(nr, nr_start, flags, pages);
+			}
 			return 0;
 		}
-		(*nr)++;
-		pfn++;
-	} while (addr += PAGE_SIZE, addr != end);
+		*nr += refs;
+		pfn += refs;
+	} while (addr += (refs << PAGE_SHIFT), addr != end);
 
 	if (pgmap)
 		put_dev_pagemap(pgmap);
@@ -2318,17 +2342,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
-static int record_subpages(struct page *page, unsigned long addr,
-			   unsigned long end, struct page **pages)
-{
-	int nr;
-
-	for (nr = 0; addr != end; addr += PAGE_SIZE)
-		pages[nr++] = page++;
-
-	return nr;
-}
-
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 74+ messages in thread
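The refcount batching above amounts to replacing N per-page atomic increments with a single atomic add of N. A standalone sketch of the bookkeeping (toy types and a plain counter in place of the kernel's atomic page refcount; it assumes the whole [addr, end) range falls within one compound page):

```c
#include <assert.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* Record one entry per base page in the range, as record_subpages() does */
static int record_subpages(int pfn, unsigned long addr,
			   unsigned long end, int *pfns)
{
	int nr;

	for (nr = 0; addr != end; addr += PAGE_SIZE)
		pfns[nr++] = pfn++;

	return nr;
}

/*
 * One add of @refs replaces @refs separate increments; in the kernel
 * this is an atomic_add() on the compound head's refcount.
 */
static int grab_refs_batched(int *refcount, int refs)
{
	*refcount += refs;
	return refs;
}
```

For a 2M range the loop records 512 pfns and the head's refcount is bumped once by 512, matching the gup_benchmark gains reported above.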

* [PATCH v3 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry
  2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (12 preceding siblings ...)
  2021-07-14 19:35 ` [PATCH v3 13/14] mm/gup: grab head page refcount once for group of subpages Joao Martins
@ 2021-07-14 19:35 ` Joao Martins
  2021-07-28 20:03   ` Dan Williams
  2021-07-14 21:48 ` [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Andrew Morton
  14 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-14 19:35 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc, Joao Martins

Currently, for compound PUD mappings, the implementation consumes 40MB
per TB but it can be optimized to 16MB per TB with the approach
detailed below.

Right now basepages are used to populate the PUD tail pages, and it
picks the address of the previous page of the subsection that precedes
the memmap being initialized.  This is done when a given memmap
address isn't aligned to the pgmap @geometry (which is safe to do because
@ranges are guaranteed to be aligned to @geometry).

For pagemaps with an align which spans various sections, this means
that PMD pages are unnecessarily allocated for reusing the same tail
pages.  Effectively, on x86 a PUD can span 8 sections (depending on
config), and a PMD page is allocated in each of them just to reuse
the same tail vmemmap across the rest of the PTEs; the PMDs covering
the tail vmemmap areas all map the same PFN. So instead of doing it
this way, populate a new PMD on the second section of the compound
page (the tail vmemmap PMD), and then have the following sections
reuse that previously populated PMD, which contains only tail pages.

After this scheme, for a 1GB pagemap-aligned area, the first PMD
(section) would contain the head page and 32767 tail pages, while the
second PMD contains the full 32768 tail pages.  The latter page gets
its PMD reused across future section mappings of the same pagemap.

Besides allocating fewer pagetable entries and keeping parity with
hugepages in the directmap (as done by vmemmap_populate_hugepages()),
this further increases the savings per compound page. Rather than
requiring 8 PMD page allocations, only 2 are needed (plus two base
pages allocated for the head and tail areas of the first PMD). 2M
pages still require using base pages, though.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 Documentation/vm/vmemmap_dedup.rst | 109 +++++++++++++++++++++++++++++
 include/linux/mm.h                 |   3 +-
 mm/sparse-vmemmap.c                |  74 +++++++++++++++++---
 3 files changed, 174 insertions(+), 12 deletions(-)

diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
index 42830a667c2a..96d9f5f0a497 100644
--- a/Documentation/vm/vmemmap_dedup.rst
+++ b/Documentation/vm/vmemmap_dedup.rst
@@ -189,3 +189,112 @@ at a later stage when we populate the sections.
 It only uses 3 page structs for storing all information as opposed
 to 4 on HugeTLB pages. This does not affect the memory savings between the two.
 
+Additionally, it further extends the tail page deduplication to 1GB
+device-dax compound pages.
+
+E.g.: A 1G device-dax page on x86_64 consists of 4096 page frames, split
+across 8 PMD page frames, with the first PMD having 2 PTE page frames.
+This represents a total of 40960 bytes per 1GB page.
+
+Here is how things look after the previously described tail page deduplication
+technique.
+
+   device-dax      page frames   struct pages(4096 pages)     page frame(2 pages)
+ +-----------+ -> +----------+ --> +-----------+   mapping to   +-------------+
+ |           |    |    0     |     |     0     | -------------> |      0      |
+ |           |    +----------+     +-----------+                +-------------+
+ |           |                     |     1     | -------------> |      1      |
+ |           |                     +-----------+                +-------------+
+ |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^ ^
+ |           |                     +-----------+                   | | | | | |
+ |           |                     |     3     | ------------------+ | | | | |
+ |           |                     +-----------+                     | | | | |
+ |           |                     |     4     | --------------------+ | | | |
+ |   PMD 0   |                     +-----------+                       | | | |
+ |           |                     |     5     | ----------------------+ | | |
+ |           |                     +-----------+                         | | |
+ |           |                     |     ..    | ------------------------+ | |
+ |           |                     +-----------+                           | |
+ |           |                     |     511   | --------------------------+ |
+ |           |                     +-----------+                             |
+ |           |                                                               |
+ |           |                                                               |
+ |           |                                                               |
+ +-----------+     page frames                                               |
+ +-----------+ -> +----------+ --> +-----------+    mapping to               |
+ |           |    |  1 .. 7  |     |    512    | ----------------------------+
+ |           |    +----------+     +-----------+                             |
+ |           |                     |    ..     | ----------------------------+
+ |           |                     +-----------+                             |
+ |           |                     |    ..     | ----------------------------+
+ |           |                     +-----------+                             |
+ |           |                     |    ..     | ----------------------------+
+ |           |                     +-----------+                             |
+ |           |                     |    ..     | ----------------------------+
+ |    PMD    |                     +-----------+                             |
+ |  1 .. 7   |                     |    ..     | ----------------------------+
+ |           |                     +-----------+                             |
+ |           |                     |    ..     | ----------------------------+
+ |           |                     +-----------+                             |
+ |           |                     |    4095   | ----------------------------+
+ +-----------+                     +-----------+
+
+Page frames of PMD 1 through 7 are allocated and mapped to the same PTE page
+frame that stores tail pages. As we can see in the diagram, PMDs 1 through 7
+all look the same. Therefore we can map PMDs 2 through 7 to PMD 1's page frame.
+This allows freeing 6 vmemmap pages per 1GB page, decreasing the overhead per
+1GB page from 40960 bytes to 16384 bytes.
+
+Here is how things look after PMD tail page deduplication.
+
+   device-dax      page frames   struct pages(4096 pages)     page frame(2 pages)
+ +-----------+ -> +----------+ --> +-----------+   mapping to   +-------------+
+ |           |    |    0     |     |     0     | -------------> |      0      |
+ |           |    +----------+     +-----------+                +-------------+
+ |           |                     |     1     | -------------> |      1      |
+ |           |                     +-----------+                +-------------+
+ |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^ ^
+ |           |                     +-----------+                   | | | | | |
+ |           |                     |     3     | ------------------+ | | | | |
+ |           |                     +-----------+                     | | | | |
+ |           |                     |     4     | --------------------+ | | | |
+ |   PMD 0   |                     +-----------+                       | | | |
+ |           |                     |     5     | ----------------------+ | | |
+ |           |                     +-----------+                         | | |
+ |           |                     |     ..    | ------------------------+ | |
+ |           |                     +-----------+                           | |
+ |           |                     |     511   | --------------------------+ |
+ |           |                     +-----------+                             |
+ |           |                                                               |
+ |           |                                                               |
+ |           |                                                               |
+ +-----------+     page frames                                               |
+ +-----------+ -> +----------+ --> +-----------+    mapping to               |
+ |           |    |    1     |     |    512    | ----------------------------+
+ |           |    +----------+     +-----------+                             |
+ |           |     ^ ^ ^ ^ ^ ^     |    ..     | ----------------------------+
+ |           |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    ..     | ----------------------------+
+ |           |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    ..     | ----------------------------+
+ |           |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    ..     | ----------------------------+
+ |   PMD 1   |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    ..     | ----------------------------+
+ |           |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    ..     | ----------------------------+
+ |           |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    4095   | ----------------------------+
+ +-----------+     | | | | | |     +-----------+
+ |   PMD 2   | ----+ | | | | |
+ +-----------+       | | | | |
+ |   PMD 3   | ------+ | | | |
+ +-----------+         | | | |
+ |   PMD 4   | --------+ | | |
+ +-----------+           | | |
+ |   PMD 5   | ----------+ | |
+ +-----------+             | |
+ |   PMD 6   | ------------+ |
+ +-----------+               |
+ |   PMD 7   | --------------+
+ +-----------+
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5e3e153ddd3d..e9dc3e2de7be 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3088,7 +3088,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
-pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
+pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+			    struct page *block);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
 			    struct vmem_altmap *altmap, struct page *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index a8de6c472999..68041ca9a797 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -537,13 +537,22 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
 	return p;
 }
 
-pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
+pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+				       struct page *block)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
-		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
-		if (!p)
-			return NULL;
+		void *p;
+
+		if (!block) {
+			p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
+			if (!p)
+				return NULL;
+		} else {
+			/* See comment in vmemmap_pte_populate(). */
+			get_page(block);
+			p = page_to_virt(block);
+		}
 		pmd_populate_kernel(&init_mm, pmd, p);
 	}
 	return pmd;
@@ -585,15 +594,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-static int __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap,
-					      struct page *reuse, struct page **page)
+static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
+						  struct vmem_altmap *altmap,
+						  struct page *reuse, pmd_t **ptr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte;
 
 	pgd = vmemmap_pgd_populate(addr, node);
 	if (!pgd)
@@ -604,9 +612,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pud = vmemmap_pud_populate(p4d, addr, node);
 	if (!pud)
 		return -ENOMEM;
-	pmd = vmemmap_pmd_populate(pud, addr, node);
+	pmd = vmemmap_pmd_populate(pud, addr, node, reuse);
 	if (!pmd)
 		return -ENOMEM;
+	if (ptr)
+		*ptr = pmd;
+	return 0;
+}
+
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+					      struct vmem_altmap *altmap,
+					      struct page *reuse, struct page **page)
+{
+	pmd_t *pmd;
+	pte_t *pte;
+
+	if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
+		return -ENOMEM;
+
 	pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
 	if (!pte)
 		return -ENOMEM;
@@ -650,6 +673,20 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
 	return vmemmap_populate_address(addr, node, NULL, NULL, page);
 }
 
+static int __meminit vmemmap_populate_pmd_range(unsigned long start,
+						unsigned long end,
+						int node, struct page *page)
+{
+	unsigned long addr = start;
+
+	for (; addr < end; addr += PMD_SIZE) {
+		if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 						     unsigned long start,
 						     unsigned long end, int node,
@@ -670,6 +707,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 	offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
 	if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
 	    pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
+		pmd_t *pmdp;
 		pte_t *ptep;
 
 		addr = start - PAGE_SIZE;
@@ -681,11 +719,25 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 		 * the previous struct pages are mapped when trying to lookup
 		 * the last tail page.
 		 */
-		ptep = pte_offset_kernel(pmd_off_k(addr), addr);
-		if (!ptep)
+		pmdp = pmd_off_k(addr);
+		if (!pmdp)
+			return -ENOMEM;
+
+		/*
+		 * Reuse the tail pages vmemmap pmd page
+		 * See layout diagram in Documentation/vm/vmemmap_dedup.rst
+		 */
+		if (offset % pgmap_geometry(pgmap) > PFN_PHYS(PAGES_PER_SECTION))
+			return vmemmap_populate_pmd_range(start, end, node,
+							  pmd_page(*pmdp));
+
+		/* See comment above when pmd_off_k() is called. */
+		ptep = pte_offset_kernel(pmdp, addr);
+		if (pte_none(*ptep))
 			return -ENOMEM;
 
 		/*
+		 * Populate the tail pages vmemmap pmd page.
 		 * Reuse the page that was populated in the prior iteration
 		 * with just tail struct pages.
 		 */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps
  2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (13 preceding siblings ...)
  2021-07-14 19:35 ` [PATCH v3 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry Joao Martins
@ 2021-07-14 21:48 ` Andrew Morton
  2021-07-14 23:47   ` Dan Williams
  2021-07-22  2:24   ` Matthew Wilcox
  14 siblings, 2 replies; 74+ messages in thread
From: Andrew Morton @ 2021-07-14 21:48 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe, John Hubbard,
	Jane Chu, Muchun Song, Mike Kravetz, Jonathan Corbet, nvdimm,
	linux-doc

On Wed, 14 Jul 2021 20:35:28 +0100 Joao Martins <joao.m.martins@oracle.com> wrote:

> This series attempts to minimize 'struct page' overhead by
> pursuing a similar approach as Muchun Song's series "Free some vmemmap
> pages of hugetlb page" [0], but applied to devmap/ZONE_DEVICE, which is
> now in mmotm.
> 
> [0] https://lore.kernel.org/linux-mm/20210308102807.59745-1-songmuchun@bytedance.com/

[0] is now in mainline.

This patch series looks like it'll clash significantly with the folio
work and it is pretty thinly reviewed, so I think I'll take a pass for
now.  Matthew, thoughts?

* Re: [PATCH v3 12/14] device-dax: compound pagemap support
  2021-07-14 19:35 ` [PATCH v3 12/14] device-dax: compound pagemap support Joao Martins
@ 2021-07-14 23:36   ` Dan Williams
  2021-07-15 12:00     ` Joao Martins
  0 siblings, 1 reply; 74+ messages in thread
From: Dan Williams @ 2021-07-14 23:36 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Use the newly added compound pagemap facility which maps the assigned dax
> ranges as compound pages at a page size of @align. Currently, this means
> that region/namespace bootstrap takes considerably less time, given that
> considerably fewer struct pages are initialized.
>
> On setups with 128G NVDIMMs, initialization with DRAM-stored struct
> pages improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less
> than 1 ms with 1G pages.
>
> dax devices are created with a fixed @align (huge page size), which is
> also enforced at mmap() of the device. Faults consequently happen at the
> @align specified at creation time, and that does not change throughout the
> dax device's lifetime. An MCE poisons a whole dax huge page, and splits
> occur at the configured page size.
>

Hi Joao,

With this patch I'm hitting the following with the 'device-dax' test [1].

kernel BUG at include/linux/mm.h:748!
invalid opcode: 0000 [#1] SMP NOPTI
CPU: 29 PID: 1509 Comm: device-dax Tainted: G        W  OE     5.14.0-rc1+ #720
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:memunmap_pages+0x2f5/0x390
Code: 00 00 00 31 d2 48 8d 70 01 48 29 fe 48 c1 ef 0c 48 c1 ee 0c e8
1c 30 fa ff e9 c5 fe ff ff 48 c7 c6 00 4a 58 87 e8 eb d1 f6 ff <0f> 0b
48 8b 7b 30 31 f6 e8 7e aa 2b 00 e9 2d fd ff ff 48 8d 7b 48
RSP: 0018:ffff9d33c240bbf0 EFLAGS: 00010246
RAX: 000000000000003e RBX: ffff8a44446eb700 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff8a46b3b58af0 RDI: ffff8a46b3b58af0
RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000ffffbfff
R10: ffff8a46b32a0000 R11: ffff8a46b32a0000 R12: 0000000000104201
R13: ffff8a44446eb700 R14: 0000000000000004 R15: ffff8a44474954d8
FS:  00007fd048a81fc0(0000) GS:ffff8a46b3b40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561ee7399000 CR3: 0000000206c70004 CR4: 0000000000170ee0
Call Trace:
 devres_release_all+0xb8/0x100
 __device_release_driver+0x190/0x240
 device_release_driver+0x26/0x40
 bus_remove_device+0xef/0x160
 device_del+0x18c/0x3e0
 unregister_dev_dax+0x62/0x90
 devres_release_all+0xb8/0x100
 __device_release_driver+0x190/0x240
 device_driver_detach+0x3e/0xa0
 unbind_store+0x113/0x130

[1]: https://github.com/pmem/ndctl/blob/master/test/device-dax.c

* Re: [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps
  2021-07-14 21:48 ` [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Andrew Morton
@ 2021-07-14 23:47   ` Dan Williams
  2021-07-22  2:24   ` Matthew Wilcox
  1 sibling, 0 replies; 74+ messages in thread
From: Dan Williams @ 2021-07-14 23:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joao Martins, Linux MM, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe, John Hubbard,
	Jane Chu, Muchun Song, Mike Kravetz, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 14, 2021 at 2:48 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 14 Jul 2021 20:35:28 +0100 Joao Martins <joao.m.martins@oracle.com> wrote:
>
> > This series attempts to minimize 'struct page' overhead by
> > pursuing a similar approach as Muchun Song's series "Free some vmemmap
> > pages of hugetlb page" [0], but applied to devmap/ZONE_DEVICE, which is
> > now in mmotm.
> >
> > [0] https://lore.kernel.org/linux-mm/20210308102807.59745-1-songmuchun@bytedance.com/
>
> [0] is now in mainline.
>
> This patch series looks like it'll clash significantly with the folio
> work and it is pretty thinly reviewed,

Sorry about that, I've promised Joao some final reviewed-by tags and
testing for a while, and the gears are turning now.

> so I think I'll take a pass for now.  Matthew, thoughts?

I'll defer to Matthew about folio collision, but I did not think this
compound page geometry setup for memremap_pages() would collide with
folios that want to clarify passing multi-order pages around the
kernel.

Joao is solving a long-standing criticism of memremap_pages() usage
for PMEM, where the page metadata is too large to fit in RAM and the
page array in PMEM is noticeably slower to pin for frequent
pin_user_pages() events.

memremap_pages() is a good first candidate for this solution given
its pages never get handled by the page allocator. If anything, it
allows folios to seep deeper into the DAX code as it knocks down the
"base-pages only" assumption of those paths.

* Re: [PATCH v3 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-07-14 19:35 ` [PATCH v3 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
@ 2021-07-15  0:17   ` Dan Williams
  2021-07-15  2:51   ` [External] " Muchun Song
  1 sibling, 0 replies; 74+ messages in thread
From: Dan Williams @ 2021-07-15  0:17 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
> dax_lock_page()).  For pagemaps with compound pages, fetch the
> compound_head in case a tail page memory failure is being handled.
>
> Currently this is a nop, but with the advent of compound pages in
> dev_pagemap it allows memory_failure_dev_pagemap() to keep working.
>
> Reported-by: Jane Chu <jane.chu@oracle.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

* Re: [PATCH v3 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts
  2021-07-14 19:35 ` [PATCH v3 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
@ 2021-07-15  0:19   ` Dan Williams
  2021-07-15  2:53   ` [External] " Muchun Song
  1 sibling, 0 replies; 74+ messages in thread
From: Dan Williams @ 2021-07-15  0:19 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Split the utility function prep_compound_page() into head and tail
> counterparts, and use them accordingly.
>
> This is in preparation for sharing the storage for / deduplicating
> compound page metadata.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Acked-by: Mike Kravetz <mike.kravetz@oracle.com>

Looks good,

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

* Re: [PATCH v3 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init
  2021-07-14 19:35 ` [PATCH v3 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
@ 2021-07-15  0:20   ` Dan Williams
  0 siblings, 0 replies; 74+ messages in thread
From: Dan Williams @ 2021-07-15  0:20 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Move struct page init to a helper function, __init_zone_device_page().
>
> This is in preparation for sharing the storage for / deduplicating
> compound page metadata.
>

Looks good,

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

* Re: [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-07-14 19:35 ` [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
@ 2021-07-15  1:08   ` Dan Williams
  2021-07-15 12:52     ` Joao Martins
  2021-07-15 12:59     ` Christoph Hellwig
  2021-07-15  6:48   ` Christoph Hellwig
  1 sibling, 2 replies; 74+ messages in thread
From: Dan Williams @ 2021-07-15  1:08 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Add a new align property for struct dev_pagemap which specifies that a

s/align/@geometry/

> pagemap is composed of a set of compound pages of size @align,

s/@align/@geometry/

> instead of
> base pages. When a compound page geometry is requested, all but the first
> page are initialised as tail pages instead of order-0 pages.
>
> For certain ZONE_DEVICE users like device-dax which have a fixed page size,
> this creates an opportunity to optimize GUP and GUP-fast walkers, treating
> it the same way as THP or hugetlb pages.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/linux/memremap.h | 17 +++++++++++++++++
>  mm/memremap.c            |  8 ++++++--
>  mm/page_alloc.c          | 34 +++++++++++++++++++++++++++++++++-
>  3 files changed, 56 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 119f130ef8f1..e5ab6d4525c1 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -99,6 +99,10 @@ struct dev_pagemap_ops {
>   * @done: completion for @internal_ref
>   * @type: memory type: see MEMORY_* in memory_hotplug.h
>   * @flags: PGMAP_* flags to specify detailed behavior
> + * @geometry: structural definition of how the vmemmap metadata is populated.
> + *     A zero or PAGE_SIZE defaults to using base pages as the memmap metadata
> + *     representation. A bigger value but also multiple of PAGE_SIZE will set
> + *     up compound struct pages representative of the requested geometry size.
>   * @ops: method table
>   * @owner: an opaque pointer identifying the entity that manages this
>   *     instance.  Used by various helpers to make sure that no
> @@ -114,6 +118,7 @@ struct dev_pagemap {
>         struct completion done;
>         enum memory_type type;
>         unsigned int flags;
> +       unsigned long geometry;
>         const struct dev_pagemap_ops *ops;
>         void *owner;
>         int nr_range;
> @@ -130,6 +135,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
>         return NULL;
>  }
>
> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
> +{
> +       if (!pgmap || !pgmap->geometry)
> +               return PAGE_SIZE;
> +       return pgmap->geometry;
> +}
> +
> +static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap)
> +{
> +       return PHYS_PFN(pgmap_geometry(pgmap));
> +}

Are both needed? Maybe just have ->geometry natively be in nr_pages
units directly, because pgmap_pfn_geometry() makes it confusing
whether it's a geometry of the pfn or the geometry of the pgmap.
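
A minimal userspace sketch of that alternative — all names here are hypothetical, not the kernel API — storing the geometry directly in units of pages so callers never mix bytes and pfns:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of keeping ->geometry in pages (1 == base pages), so a
 * single accessor suffices and there is no pfn/byte ambiguity. */
struct pgmap_model {
	unsigned long geometry;	/* compound size in pages; 0 or 1 == base pages */
};

static unsigned long pgmap_geometry_pages(const struct pgmap_model *pgmap)
{
	if (pgmap && pgmap->geometry)
		return pgmap->geometry;
	return 1;	/* base pages */
}
```

With this convention, the ref-count accounting later in the patch needs no PHYS_PFN() conversion: it can divide the pfn count by pgmap_geometry_pages() directly.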

> +
>  #ifdef CONFIG_ZONE_DEVICE
>  bool pfn_zone_device_reserved(unsigned long pfn);
>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 805d761740c4..ffcb924eb6a5 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>         memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>                                 PHYS_PFN(range->start),
>                                 PHYS_PFN(range_len(range)), pgmap);
> -       percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> -                       - pfn_first(pgmap, range_id));
> +       if (pgmap_geometry(pgmap) > PAGE_SIZE)

This would become

if (pgmap_geometry(pgmap) > 1)

> +               percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> +                       - pfn_first(pgmap, range_id)) / pgmap_pfn_geometry(pgmap));

...and this would be pgmap_geometry()

> +       else
> +               percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> +                               - pfn_first(pgmap, range_id));
>         return 0;
>
>  err_add_memory:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 79f3b38afeca..188cb5f8c308 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6597,6 +6597,31 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>         }
>  }
>
> +static void __ref memmap_init_compound(struct page *page, unsigned long pfn,

I'd feel better if @page was renamed @head... more below:

> +                                       unsigned long zone_idx, int nid,
> +                                       struct dev_pagemap *pgmap,
> +                                       unsigned long nr_pages)
> +{
> +       unsigned int order_align = order_base_2(nr_pages);
> +       unsigned long i;
> +
> +       __SetPageHead(page);
> +
> +       for (i = 1; i < nr_pages; i++) {

The switch of loop styles is jarring. I.e. the switch from
memmap_init_zone_device() that is using pfn, end_pfn, and a local
'struct page *' variable to this helper using pfn + i and a mix of
helpers (__init_zone_device_page,  prep_compound_tail) that have
different expectations of head page + tail_idx and current page.

I.e. this reads more obviously correct to me, but maybe I'm just in
the wrong headspace:

        for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
                struct page *page = pfn_to_page(pfn);

                __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
                prep_compound_tail(head, pfn - head_pfn);

> +               __init_zone_device_page(page + i, pfn + i, zone_idx,
> +                                       nid, pgmap);
> +               prep_compound_tail(page, i);
> +
> +               /*
> +                * The first and second tail pages need to be
> +                * initialized first, hence the head page is
> +                * prepared last.

I'd change this comment to say why rather than restate what can be
gleaned from the code. It's actually not clear to me why this order is
necessary.

> +                */
> +               if (i == 2)
> +                       prep_compound_head(page, order_align);
> +       }
> +}
> +
>  void __ref memmap_init_zone_device(struct zone *zone,
>                                    unsigned long start_pfn,
>                                    unsigned long nr_pages,
> @@ -6605,6 +6630,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
>         unsigned long pfn, end_pfn = start_pfn + nr_pages;
>         struct pglist_data *pgdat = zone->zone_pgdat;
>         struct vmem_altmap *altmap = pgmap_altmap(pgmap);
> +       unsigned int pfns_per_compound = pgmap_pfn_geometry(pgmap);
>         unsigned long zone_idx = zone_idx(zone);
>         unsigned long start = jiffies;
>         int nid = pgdat->node_id;
> @@ -6622,10 +6648,16 @@ void __ref memmap_init_zone_device(struct zone *zone,
>                 nr_pages = end_pfn - start_pfn;
>         }
>
> -       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> +       for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
>                 struct page *page = pfn_to_page(pfn);
>
>                 __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
> +
> +               if (pfns_per_compound == 1)
> +                       continue;
> +
> +               memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
> +                                    pfns_per_compound);

I otherwise don't see anything broken with this patch, so feel free to include:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

...on the resend with the fixups.

* Re: [External] [PATCH v3 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm
  2021-07-14 19:35 ` [PATCH v3 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
@ 2021-07-15  2:47   ` Muchun Song
  2021-07-15 13:16     ` Joao Martins
  2021-07-28  6:09   ` Dan Williams
  1 sibling, 1 reply; 74+ messages in thread
From: Muchun Song @ 2021-07-15  2:47 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Dave Jiang, Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe,
	John Hubbard, Jane Chu, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, nvdimm, linux-doc

On Thu, Jul 15, 2021 at 3:36 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> In preparation for device-dax to use the hugetlbfs compound page tail
> deduplication technique, move the comment block explanation into a
> common place in Documentation/vm.
>
> Cc: Muchun Song <songmuchun@bytedance.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

LGTM.

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

* Re: [External] [PATCH v3 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-07-14 19:35 ` [PATCH v3 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
  2021-07-15  0:17   ` Dan Williams
@ 2021-07-15  2:51   ` Muchun Song
  2021-07-15  6:40     ` Christoph Hellwig
  2021-07-15 13:17     ` Joao Martins
  1 sibling, 2 replies; 74+ messages in thread
From: Muchun Song @ 2021-07-15  2:51 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Dave Jiang, Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe,
	John Hubbard, Jane Chu, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, nvdimm, linux-doc

On Thu, Jul 15, 2021 at 3:36 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
> dax_lock_page()).  For pagemaps with compound pages, fetch the
> compound_head in case a tail page memory failure is being handled.
>
> Currently this is a nop, but with the advent of compound pages in
> dev_pagemap it allows memory_failure_dev_pagemap() to keep working.
>
> Reported-by: Jane Chu <jane.chu@oracle.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

* Re: [External] [PATCH v3 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts
  2021-07-14 19:35 ` [PATCH v3 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
  2021-07-15  0:19   ` Dan Williams
@ 2021-07-15  2:53   ` Muchun Song
  2021-07-15 13:17     ` Joao Martins
  1 sibling, 1 reply; 74+ messages in thread
From: Muchun Song @ 2021-07-15  2:53 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Dave Jiang, Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe,
	John Hubbard, Jane Chu, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, nvdimm, linux-doc

On Thu, Jul 15, 2021 at 3:36 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Split the utility function prep_compound_page() into head and tail
> counterparts, and use them accordingly.
>
> This is in preparation for sharing the storage for / deduplicating
> compound page metadata.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Acked-by: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

* Re: [External] [PATCH v3 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-07-15  2:51   ` [External] " Muchun Song
@ 2021-07-15  6:40     ` Christoph Hellwig
  2021-07-15  9:19       ` Muchun Song
  2021-07-15 13:17     ` Joao Martins
  1 sibling, 1 reply; 74+ messages in thread
From: Christoph Hellwig @ 2021-07-15  6:40 UTC (permalink / raw)
  To: Muchun Song
  Cc: Joao Martins, Linux Memory Management List, Dan Williams,
	Vishal Verma, Dave Jiang, Naoya Horiguchi, Matthew Wilcox,
	Jason Gunthorpe, John Hubbard, Jane Chu, Mike Kravetz,
	Andrew Morton, Jonathan Corbet, nvdimm, linux-doc

Can you please fix up your mailer to not mess with the subject?
That makes the thread completely unreadable.

* Re: [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-07-14 19:35 ` [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
  2021-07-15  1:08   ` Dan Williams
@ 2021-07-15  6:48   ` Christoph Hellwig
  2021-07-15 13:15     ` Joao Martins
  1 sibling, 1 reply; 74+ messages in thread
From: Christoph Hellwig @ 2021-07-15  6:48 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe, John Hubbard,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, nvdimm, linux-doc

> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
> +{
> +	if (!pgmap || !pgmap->geometry)
> +		return PAGE_SIZE;
> +	return pgmap->geometry;

Nit, but avoiding all the negations would make this a little easier to
read:

	if (pgmap && pgmap->geometry)
		return pgmap->geometry;
	return PAGE_SIZE;

> +	if (pgmap_geometry(pgmap) > PAGE_SIZE)
> +		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> +			- pfn_first(pgmap, range_id)) / pgmap_pfn_geometry(pgmap));
> +	else
> +		percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> +				- pfn_first(pgmap, range_id));

This is a horrible, unreadable mess, which is trivially fixed by a
strategically used local variable:

	refs = pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id);
	if (pgmap_geometry(pgmap) > PAGE_SIZE)
		refs /= pgmap_pfn_geometry(pgmap);
	percpu_ref_get_many(pgmap->ref, refs);
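
The arithmetic behind that suggestion can be checked in isolation. A standalone userspace sketch (PAGE_SIZE stubbed, function name hypothetical) of the percpu-ref accounting: one reference per base page, or one per compound page when a larger geometry is configured:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* Number of percpu references to take for a pfn range, given the pgmap
 * geometry in bytes. Matches the refactored logic above. */
static unsigned long range_refs(unsigned long pfn_first, unsigned long pfn_end,
				unsigned long geometry_bytes)
{
	unsigned long refs = pfn_end - pfn_first;

	if (geometry_bytes > PAGE_SIZE)
		refs /= geometry_bytes / PAGE_SIZE;	/* pfns per compound page */
	return refs;
}
```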


* Re: [PATCH v3 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-07-15  6:40     ` Christoph Hellwig
@ 2021-07-15  9:19       ` Muchun Song
  0 siblings, 0 replies; 74+ messages in thread
From: Muchun Song @ 2021-07-15  9:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Joao Martins, Linux Memory Management List, Dan Williams,
	Vishal Verma, Dave Jiang, Naoya Horiguchi, Matthew Wilcox,
	Jason Gunthorpe, John Hubbard, Jane Chu, Mike Kravetz,
	Andrew Morton, Jonathan Corbet, nvdimm, linux-doc

On Thu, Jul 15, 2021 at 2:42 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> Can you please fix up your mailer to not mess with the subject?
> That makes the thread completely unreadable.

Sorry. I didn't realize it before. Thanks for your reminder.

* Re: [PATCH v3 12/14] device-dax: compound pagemap support
  2021-07-14 23:36   ` Dan Williams
@ 2021-07-15 12:00     ` Joao Martins
  2021-07-27 23:51       ` Dan Williams
  0 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-15 12:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On 7/15/21 12:36 AM, Dan Williams wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Use the newly added compound pagemap facility which maps the assigned dax
>> ranges as compound pages at a page size of @align. Currently, this means,
>> that region/namespace bootstrap would take considerably less, given that
>> you would initialize considerably less pages.
>>
>> On setups with 128G NVDIMMs the initialization with DRAM stored struct
>> pages improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less
>> than a 1msec with 1G pages.
>>
>> dax devices are created with a fixed @align (huge page size) which is
>> enforced through as well at mmap() of the device. Faults, consequently
>> happen too at the specified @align specified at the creation, and those
>> don't change through out dax device lifetime. MCEs poisons a whole dax
>> huge page, as well as splits occurring at the configured page size.
>>
> 
> Hi Joao,
> 
> With this patch I'm hitting the following with the 'device-dax' test [1].
> 
Ugh, I can reproduce it too -- apologies for the oversight.

This patch is not the culprit; the flaw is earlier in the series, specifically in the fourth patch.

The fourth patch needs the change below, due to the existing elevated page refcount
at zone device memmap init. The put_page() called here in memunmap_pages():

for (i = 0; i < pgmap->nr_ranges; i++)
	for_each_device_pfn(pfn, pgmap, i)
		put_page(pfn_to_page(pfn));

... on a zone_device compound memmap would otherwise always decrease the head page
refcount by the @geometry number of pfns (leading to the aforementioned splat you reported).

diff --git a/mm/memremap.c b/mm/memremap.c
index b0e7b8cf3047..79a883af788e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -102,15 +102,15 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
        return (range->start + range_len(range)) >> PAGE_SHIFT;
 }

-static unsigned long pfn_next(unsigned long pfn)
+static unsigned long pfn_next(struct dev_pagemap *pgmap, unsigned long pfn)
 {
        if (pfn % 1024 == 0)
                cond_resched();
-       return pfn + 1;
+       return pfn + pgmap_pfn_geometry(pgmap);
 }

 #define for_each_device_pfn(pfn, map, i) \
-       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
+       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(map, pfn))

 static void dev_pagemap_kill(struct dev_pagemap *pgmap)
 {

It could also get the hunk below, but it is sort of redundant provided we won't touch
tail page refcounts throughout the devmap pages' lifetime. Setting the tail pages'
refcount to zero was in the pre-v5.14 series, but it got removed under the assumption that
the memmap comes from the page allocator (where tail pages already have a zeroed refcount).

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 96975edac0a8..469a7aa5cf38 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6623,6 +6623,7 @@ static void __ref memmap_init_compound(struct page *page, unsigned
long pfn,
                __init_zone_device_page(page + i, pfn + i, zone_idx,
                                        nid, pgmap);
                prep_compound_tail(page, i);
+               set_page_count(page + i, 0);

                /*
                 * The first and second tail pages need to

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-07-15  1:08   ` Dan Williams
@ 2021-07-15 12:52     ` Joao Martins
  2021-07-15 13:06       ` Joao Martins
                         ` (2 more replies)
  2021-07-15 12:59     ` Christoph Hellwig
  1 sibling, 3 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-15 12:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List



On 7/15/21 2:08 AM, Dan Williams wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Add a new align property for struct dev_pagemap which specifies that a
> 
> s/align/@geometry/
> 
Yeap, updated.

>> pagemap is composed of a set of compound pages of size @align,
> 
> s/@align/@geometry/
> 
Yeap, updated.

>> instead of
>> base pages. When a compound page geometry is requested, all but the first
>> page are initialised as tail pages instead of order-0 pages.
>>
>> For certain ZONE_DEVICE users like device-dax which have a fixed page size,
>> this creates an opportunity to optimize GUP and GUP-fast walkers, treating
>> it the same way as THP or hugetlb pages.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  include/linux/memremap.h | 17 +++++++++++++++++
>>  mm/memremap.c            |  8 ++++++--
>>  mm/page_alloc.c          | 34 +++++++++++++++++++++++++++++++++-
>>  3 files changed, 56 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index 119f130ef8f1..e5ab6d4525c1 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -99,6 +99,10 @@ struct dev_pagemap_ops {
>>   * @done: completion for @internal_ref
>>   * @type: memory type: see MEMORY_* in memory_hotplug.h
>>   * @flags: PGMAP_* flags to specify defailed behavior
>> + * @geometry: structural definition of how the vmemmap metadata is populated.
>> + *     A zero or PAGE_SIZE defaults to using base pages as the memmap metadata
>> + *     representation. A bigger value but also multiple of PAGE_SIZE will set
>> + *     up compound struct pages representative of the requested geometry size.
>>   * @ops: method table
>>   * @owner: an opaque pointer identifying the entity that manages this
>>   *     instance.  Used by various helpers to make sure that no
>> @@ -114,6 +118,7 @@ struct dev_pagemap {
>>         struct completion done;
>>         enum memory_type type;
>>         unsigned int flags;
>> +       unsigned long geometry;
>>         const struct dev_pagemap_ops *ops;
>>         void *owner;
>>         int nr_range;
>> @@ -130,6 +135,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
>>         return NULL;
>>  }
>>
>> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
>> +{
>> +       if (!pgmap || !pgmap->geometry)
>> +               return PAGE_SIZE;
>> +       return pgmap->geometry;
>> +}
>> +
>> +static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap)
>> +{
>> +       return PHYS_PFN(pgmap_geometry(pgmap));
>> +}
> 
> Are both needed? Maybe just have ->geometry natively be in nr_pages
> units directly, because pgmap_pfn_geometry() makes it confusing
> whether it's a geometry of the pfn or the geometry of the pgmap.
> 
I use pgmap_geometry() largely when manipulating the memmap in sparse-vmemmap code, as
that deals with addresses/offsets/subsection sizes, and pgmap_pfn_geometry() for code that
deals with PFN initialization. For this patch I could remove the confusion.

And actually, maybe I can just store the pgmap_geometry() value in bytes locally in
vmemmap_populate_compound_pages() and remove this extra helper.

>> +
>>  #ifdef CONFIG_ZONE_DEVICE
>>  bool pfn_zone_device_reserved(unsigned long pfn);
>>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 805d761740c4..ffcb924eb6a5 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>>         memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>>                                 PHYS_PFN(range->start),
>>                                 PHYS_PFN(range_len(range)), pgmap);
>> -       percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>> -                       - pfn_first(pgmap, range_id));
>> +       if (pgmap_geometry(pgmap) > PAGE_SIZE)
> 
> This would become
> 
> if (pgmap_geometry(pgmap) > 1)
> 
>> +               percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
>> +                       - pfn_first(pgmap, range_id)) / pgmap_pfn_geometry(pgmap));
> 
> ...and this would be pgmap_geometry()
> 
>> +       else
>> +               percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>> +                               - pfn_first(pgmap, range_id));
>>         return 0;
>>
Let me adjust accordingly.

>>  err_add_memory:
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 79f3b38afeca..188cb5f8c308 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6597,6 +6597,31 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>>         }
>>  }
>>
>> +static void __ref memmap_init_compound(struct page *page, unsigned long pfn,
> 
> I'd feel better if @page was renamed @head... more below:
> 
Oh yeah -- definitely more readable.

>> +                                       unsigned long zone_idx, int nid,
>> +                                       struct dev_pagemap *pgmap,
>> +                                       unsigned long nr_pages)
>> +{
>> +       unsigned int order_align = order_base_2(nr_pages);
>> +       unsigned long i;
>> +
>> +       __SetPageHead(page);
>> +
>> +       for (i = 1; i < nr_pages; i++) {
> 
> The switch of loop styles is jarring. I.e. the switch from
> memmap_init_zone_device() that is using pfn, end_pfn, and a local
> 'struct page *' variable to this helper using pfn + i and a mix of
> helpers (__init_zone_device_page,  prep_compound_tail) that have
> different expectations of head page + tail_idx and current page.
> 
> I.e. this reads more obviously correct to me, but maybe I'm just in
> the wrong headspace:
> 
>         for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
>                 struct page *page = pfn_to_page(pfn);
> 
>                 __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
>                 prep_compound_tail(head, pfn - head_pfn);
> 
Personally -- and I am dubious given I have been staring at this code -- I find what I
wrote a little better, as it more closely follows what compound page initialization does.
It's easier for me to read that I am initializing a number of tail pages and a head page
(for a known geometry size).

Additionally, it's unnecessary (and a tiny bit inefficient?) to keep doing
pfn_to_page(pfn): ZONE_DEVICE requires SPARSEMEM_VMEMMAP, so the page pointers are all
contiguous, and for any given PFN we can avoid dereferencing vmemmap vaddrs back and
forth. That's the second reason I pass a page and iterate over its tails based on a
head page pointer. But I was in two minds when writing this, so if there's no added
inefficiency I can rewrite it like the above.

>> +               __init_zone_device_page(page + i, pfn + i, zone_idx,
>> +                                       nid, pgmap);
>> +               prep_compound_tail(page, i);
>> +
>> +               /*
>> +                * The first and second tail pages need to
>> +                * initialized first, hence the head page is
>> +                * prepared last.
> 
> I'd change this comment to say why rather than restate what can be
> gleaned from the code. It's actually not clear to me why this order is
> necessary.
> 
So the first tail page stores mapcount_ptr and compound order, and the
second tail page stores pincount_ptr. prep_compound_head() does this:

	set_compound_order(page, order);
	atomic_set(compound_mapcount_ptr(page), -1);
	if (hpage_pincount_available(page))
		atomic_set(compound_pincount_ptr(page), 0);

So we need those tail pages initialized prior to initializing the head.

I can expand the comment above to make it clear why the first and second tail pages need
to be initialized first.

>> +                */
>> +               if (i == 2)
>> +                       prep_compound_head(page, order_align);
>> +       }
>> +}
>> +
>>  void __ref memmap_init_zone_device(struct zone *zone,
>>                                    unsigned long start_pfn,
>>                                    unsigned long nr_pages,
>> @@ -6605,6 +6630,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>         unsigned long pfn, end_pfn = start_pfn + nr_pages;
>>         struct pglist_data *pgdat = zone->zone_pgdat;
>>         struct vmem_altmap *altmap = pgmap_altmap(pgmap);
>> +       unsigned int pfns_per_compound = pgmap_pfn_geometry(pgmap);
>>         unsigned long zone_idx = zone_idx(zone);
>>         unsigned long start = jiffies;
>>         int nid = pgdat->node_id;
>> @@ -6622,10 +6648,16 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>                 nr_pages = end_pfn - start_pfn;
>>         }
>>
>> -       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>> +       for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
>>                 struct page *page = pfn_to_page(pfn);
>>
>>                 __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
>> +
>> +               if (pfns_per_compound == 1)
>> +                       continue;
>> +
>> +               memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
>> +                                    pfns_per_compound);
> 
> I otherwise don't see anything broken with this patch, so feel free to include:
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> 
> ...on the resend with the fixups.
> 
Thanks.

I will wait to see whether you still want to retain the tag, given the changes implied
by fixing the failure you reported.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-07-15  1:08   ` Dan Williams
  2021-07-15 12:52     ` Joao Martins
@ 2021-07-15 12:59     ` Christoph Hellwig
  2021-07-15 13:15       ` Joao Martins
  1 sibling, 1 reply; 74+ messages in thread
From: Christoph Hellwig @ 2021-07-15 12:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: Joao Martins, Linux MM, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe, John Hubbard,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 14, 2021 at 06:08:14PM -0700, Dan Williams wrote:
> > +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
> > +{
> > +       if (!pgmap || !pgmap->geometry)
> > +               return PAGE_SIZE;
> > +       return pgmap->geometry;
> > +}
> > +
> > +static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap)
> > +{
> > +       return PHYS_PFN(pgmap_geometry(pgmap));
> > +}
> 
> Are both needed? Maybe just have ->geometry natively be in nr_pages
> units directly, because pgmap_pfn_geometry() makes it confusing
> whether it's a geometry of the pfn or the geometry of the pgmap.

Actually - do we need non-power of two sizes here?  Otherwise a shift
for the pfns would be really nice as that simplifies a lot of the
calculations.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-07-15 12:52     ` Joao Martins
@ 2021-07-15 13:06       ` Joao Martins
  2021-07-15 19:48       ` Dan Williams
  2021-07-22  0:38       ` Jane Chu
  2 siblings, 0 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-15 13:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On 7/15/21 1:52 PM, Joao Martins wrote:
> On 7/15/21 2:08 AM, Dan Williams wrote:
>> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>> instead of
>>> base pages. When a compound page geometry is requested, all but the first
>>> page are initialised as tail pages instead of order-0 pages.
>>>
>>> For certain ZONE_DEVICE users like device-dax which have a fixed page size,
>>> this creates an opportunity to optimize GUP and GUP-fast walkers, treating
>>> it the same way as THP or hugetlb pages.
>>>
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>>  include/linux/memremap.h | 17 +++++++++++++++++
>>>  mm/memremap.c            |  8 ++++++--
>>>  mm/page_alloc.c          | 34 +++++++++++++++++++++++++++++++++-
>>>  3 files changed, 56 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>> index 119f130ef8f1..e5ab6d4525c1 100644
>>> --- a/include/linux/memremap.h
>>> +++ b/include/linux/memremap.h
>>> @@ -99,6 +99,10 @@ struct dev_pagemap_ops {
>>>   * @done: completion for @internal_ref
>>>   * @type: memory type: see MEMORY_* in memory_hotplug.h
>>>   * @flags: PGMAP_* flags to specify defailed behavior
>>> + * @geometry: structural definition of how the vmemmap metadata is populated.
>>> + *     A zero or PAGE_SIZE defaults to using base pages as the memmap metadata
>>> + *     representation. A bigger value but also multiple of PAGE_SIZE will set
>>> + *     up compound struct pages representative of the requested geometry size.
>>>   * @ops: method table
>>>   * @owner: an opaque pointer identifying the entity that manages this
>>>   *     instance.  Used by various helpers to make sure that no
>>> @@ -114,6 +118,7 @@ struct dev_pagemap {
>>>         struct completion done;
>>>         enum memory_type type;
>>>         unsigned int flags;
>>> +       unsigned long geometry;
>>>         const struct dev_pagemap_ops *ops;
>>>         void *owner;
>>>         int nr_range;
>>> @@ -130,6 +135,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
>>>         return NULL;
>>>  }
>>>
>>> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
>>> +{
>>> +       if (!pgmap || !pgmap->geometry)
>>> +               return PAGE_SIZE;
>>> +       return pgmap->geometry;
>>> +}
>>> +
>>> +static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap)
>>> +{
>>> +       return PHYS_PFN(pgmap_geometry(pgmap));
>>> +}
>>
>> Are both needed? Maybe just have ->geometry natively be in nr_pages
>> units directly, because pgmap_pfn_geometry() makes it confusing
>> whether it's a geometry of the pfn or the geometry of the pgmap.
>>
> I use pgmap_geometry() largelly when we manipulate memmap in sparse-vmemmap code, as we
> deal with addresses/offsets/subsection-size. While using pgmap_pfn_geometry for code that
> deals with PFN initialization. For this patch I could remove the confusion.
> 
> And actually maybe I can just store the pgmap_geometry() value in bytes locally in
> vmemmap_populate_compound_pages() and we can remove this extra helper.
> 
But one nice property of pgmap_geometry() is the @pgmap check and not requiring the driver
to initialize a pgmap->geometry property. A zeroed value would still support pgmap users
in the old case where there's no @geometry (or the user doesn't care). So departing from
this helper would mean that either memremap_pages() sets the right @geometry when a zeroed
value is passed in, or __populate_section_memmap() makes sure a pgmap is associated when
trying to figure out whether there's a geometry to consider in the section mapping.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-07-15  6:48   ` Christoph Hellwig
@ 2021-07-15 13:15     ` Joao Martins
  0 siblings, 0 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-15 13:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe, John Hubbard,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, nvdimm, linux-doc

On 7/15/21 7:48 AM, Christoph Hellwig wrote:
>> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
>> +{
>> +	if (!pgmap || !pgmap->geometry)
>> +		return PAGE_SIZE;
>> +	return pgmap->geometry;
> 
> Nit, but avoiding all the negations would make this a little easier to
> read:
> 
> 	if (pgmap && pgmap->geometry)
> 		return pgmap->geometry;
> 	return PAGE_SIZE
> 
Nicer indeed.

But this might be removed, should we follow Dan's suggestion of @geometry representing a
number of pages.


>> +	if (pgmap_geometry(pgmap) > PAGE_SIZE)
>> +		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
>> +			- pfn_first(pgmap, range_id)) / pgmap_pfn_geometry(pgmap));
>> +	else
>> +		percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>> +				- pfn_first(pgmap, range_id));
> 
> This is a horrible undreadable mess, which is trivially fixed by a
> strategically used local variable:
> 
> 	refs = pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id);
> 	if (pgmap_geometry(pgmap) > PAGE_SIZE)
> 		refs /= pgmap_pfn_geometry(pgmap);
> 	percpu_ref_get_many(pgmap->ref, refs);
> 
> 
Yeap, much readable, thanks for the suggestion.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-07-15 12:59     ` Christoph Hellwig
@ 2021-07-15 13:15       ` Joao Martins
  0 siblings, 0 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-15 13:15 UTC (permalink / raw)
  To: Christoph Hellwig, Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On 7/15/21 1:59 PM, Christoph Hellwig wrote:
> On Wed, Jul 14, 2021 at 06:08:14PM -0700, Dan Williams wrote:
>>> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
>>> +{
>>> +       if (!pgmap || !pgmap->geometry)
>>> +               return PAGE_SIZE;
>>> +       return pgmap->geometry;
>>> +}
>>> +
>>> +static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap)
>>> +{
>>> +       return PHYS_PFN(pgmap_geometry(pgmap));
>>> +}
>>
>> Are both needed? Maybe just have ->geometry natively be in nr_pages
>> units directly, because pgmap_pfn_geometry() makes it confusing
>> whether it's a geometry of the pfn or the geometry of the pgmap.
> 
> Actually - do we need non-power of two sizes here?  Otherwise a shift
> for the pfns would be really nice as that simplifies a lot of the
> calculations.
> 
AIUI, it's only powers of two: PAGE_SIZE (1 in nr-of-pages units), PMD_SIZE (512) and PUD_SIZE (262144).

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [External] [PATCH v3 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm
  2021-07-15  2:47   ` [External] " Muchun Song
@ 2021-07-15 13:16     ` Joao Martins
  0 siblings, 0 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-15 13:16 UTC (permalink / raw)
  To: Muchun Song
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Dave Jiang, Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe,
	John Hubbard, Jane Chu, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, nvdimm, linux-doc

On 7/15/21 3:47 AM, Muchun Song wrote:
> On Thu, Jul 15, 2021 at 3:36 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> In preparation for device-dax for using hugetlbfs compound page tail
>> deduplication technique, move the comment block explanation into a
>> common place in Documentation/vm.
>>
>> Cc: Muchun Song <songmuchun@bytedance.com>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> 
> LGTM.
> 
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> 
Thanks!

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts
  2021-07-15  2:53   ` [External] " Muchun Song
@ 2021-07-15 13:17     ` Joao Martins
  0 siblings, 0 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-15 13:17 UTC (permalink / raw)
  To: Muchun Song
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Dave Jiang, Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe,
	John Hubbard, Jane Chu, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, nvdimm, linux-doc



On 7/15/21 3:53 AM, Muchun Song wrote:
> On Thu, Jul 15, 2021 at 3:36 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Split the utility function prep_compound_page() into head and tail
>> counterparts, and use them accordingly.
>>
>> This is in preparation for sharing the storage for / deduplicating
>> compound page metadata.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> 
Thanks!

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-07-15  2:51   ` [External] " Muchun Song
  2021-07-15  6:40     ` Christoph Hellwig
@ 2021-07-15 13:17     ` Joao Martins
  1 sibling, 0 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-15 13:17 UTC (permalink / raw)
  To: Muchun Song
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Dave Jiang, Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe,
	John Hubbard, Jane Chu, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, nvdimm, linux-doc



On 7/15/21 3:51 AM, Muchun Song wrote:
> On Thu, Jul 15, 2021 at 3:36 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
>> dax_lock_page()).  For pagemap with compound pages fetch the
>> compound_head in case a tail page memory failure is being handled.
>>
>> Currently this is a nop, but in the advent of compound pages in
>> dev_pagemap it allows memory_failure_dev_pagemap() to keep working.
>>
>> Reported-by: Jane Chu <jane.chu@oracle.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>
> 
Thanks!

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-07-15 12:52     ` Joao Martins
  2021-07-15 13:06       ` Joao Martins
@ 2021-07-15 19:48       ` Dan Williams
  2021-07-30 16:13         ` Joao Martins
  2021-07-22  0:38       ` Jane Chu
  2 siblings, 1 reply; 74+ messages in thread
From: Dan Williams @ 2021-07-15 19:48 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Thu, Jul 15, 2021 at 5:52 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
>
>
> On 7/15/21 2:08 AM, Dan Williams wrote:
> > On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> Add a new align property for struct dev_pagemap which specifies that a
> >
> > s/align/@geometry/
> >
> Yeap, updated.
>
> >> pagemap is composed of a set of compound pages of size @align,
> >
> > s/@align/@geometry/
> >
> Yeap, updated.
>
> >> instead of
> >> base pages. When a compound page geometry is requested, all but the first
> >> page are initialised as tail pages instead of order-0 pages.
> >>
> >> For certain ZONE_DEVICE users like device-dax which have a fixed page size,
> >> this creates an opportunity to optimize GUP and GUP-fast walkers, treating
> >> it the same way as THP or hugetlb pages.
> >>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >>  include/linux/memremap.h | 17 +++++++++++++++++
> >>  mm/memremap.c            |  8 ++++++--
> >>  mm/page_alloc.c          | 34 +++++++++++++++++++++++++++++++++-
> >>  3 files changed, 56 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> >> index 119f130ef8f1..e5ab6d4525c1 100644
> >> --- a/include/linux/memremap.h
> >> +++ b/include/linux/memremap.h
> >> @@ -99,6 +99,10 @@ struct dev_pagemap_ops {
> >>   * @done: completion for @internal_ref
> >>   * @type: memory type: see MEMORY_* in memory_hotplug.h
> >>   * @flags: PGMAP_* flags to specify defailed behavior
> >> + * @geometry: structural definition of how the vmemmap metadata is populated.
> >> + *     A zero or PAGE_SIZE defaults to using base pages as the memmap metadata
> >> + *     representation. A bigger value but also multiple of PAGE_SIZE will set
> >> + *     up compound struct pages representative of the requested geometry size.
> >>   * @ops: method table
> >>   * @owner: an opaque pointer identifying the entity that manages this
> >>   *     instance.  Used by various helpers to make sure that no
> >> @@ -114,6 +118,7 @@ struct dev_pagemap {
> >>         struct completion done;
> >>         enum memory_type type;
> >>         unsigned int flags;
> >> +       unsigned long geometry;
> >>         const struct dev_pagemap_ops *ops;
> >>         void *owner;
> >>         int nr_range;
> >> @@ -130,6 +135,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
> >>         return NULL;
> >>  }
> >>
> >> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
> >> +{
> >> +       if (!pgmap || !pgmap->geometry)
> >> +               return PAGE_SIZE;
> >> +       return pgmap->geometry;
> >> +}
> >> +
> >> +static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap)
> >> +{
> >> +       return PHYS_PFN(pgmap_geometry(pgmap));
> >> +}
> >
> > Are both needed? Maybe just have ->geometry natively be in nr_pages
> > units directly, because pgmap_pfn_geometry() makes it confusing
> > whether it's a geometry of the pfn or the geometry of the pgmap.
> >
> I use pgmap_geometry() largelly when we manipulate memmap in sparse-vmemmap code, as we
> deal with addresses/offsets/subsection-size. While using pgmap_pfn_geometry for code that
> deals with PFN initialization. For this patch I could remove the confusion.
>
> And actually maybe I can just store the pgmap_geometry() value in bytes locally in
> vmemmap_populate_compound_pages() and we can remove this extra helper.
>
> >> +
> >>  #ifdef CONFIG_ZONE_DEVICE
> >>  bool pfn_zone_device_reserved(unsigned long pfn);
> >>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> >> diff --git a/mm/memremap.c b/mm/memremap.c
> >> index 805d761740c4..ffcb924eb6a5 100644
> >> --- a/mm/memremap.c
> >> +++ b/mm/memremap.c
> >> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> >>         memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> >>                                 PHYS_PFN(range->start),
> >>                                 PHYS_PFN(range_len(range)), pgmap);
> >> -       percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> >> -                       - pfn_first(pgmap, range_id));
> >> +       if (pgmap_geometry(pgmap) > PAGE_SIZE)
> >
> > This would become
> >
> > if (pgmap_geometry(pgmap) > 1)
> >
> >> +               percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> >> +                       - pfn_first(pgmap, range_id)) / pgmap_pfn_geometry(pgmap));
> >
> > ...and this would be pgmap_geometry()
> >
> >> +       else
> >> +               percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> >> +                               - pfn_first(pgmap, range_id));
> >>         return 0;
> >>
> Let me adjust accordingly.
>
> >>  err_add_memory:
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 79f3b38afeca..188cb5f8c308 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -6597,6 +6597,31 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
> >>         }
> >>  }
> >>
> >> +static void __ref memmap_init_compound(struct page *page, unsigned long pfn,
> >
> > I'd feel better if @page was renamed @head... more below:
> >
> Oh yeah -- definitely more readable.
>
> >> +                                       unsigned long zone_idx, int nid,
> >> +                                       struct dev_pagemap *pgmap,
> >> +                                       unsigned long nr_pages)
> >> +{
> >> +       unsigned int order_align = order_base_2(nr_pages);
> >> +       unsigned long i;
> >> +
> >> +       __SetPageHead(page);
> >> +
> >> +       for (i = 1; i < nr_pages; i++) {
> >
> > The switch of loop styles is jarring. I.e. the switch from
> > memmap_init_zone_device() that is using pfn, end_pfn, and a local
> > 'struct page *' variable to this helper using pfn + i and a mix of
> > helpers (__init_zone_device_page,  prep_compound_tail) that have
> > different expectations of head page + tail_idx and current page.
> >
> > I.e. this reads more obviously correct to me, but maybe I'm just in
> > the wrong headspace:
> >
> >         for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
> >                 struct page *page = pfn_to_page(pfn);
> >
> >                 __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
> >                 prep_compound_tail(head, pfn - head_pfn);
> >
> Personally -- and I am dubious given I have been staring at this code -- I find what I
> wrote a little better, as it follows more closely what compound page initialization does.
> It's easier for me to read that I am initializing a number of tail pages and a head page
> (for a known geometry size).
>
> Additionally, it's unnecessary (and a tiny bit inefficient?) to keep doing
> pfn_to_page(pfn), provided ZONE_DEVICE requires SPARSEMEM_VMEMMAP and so the page
> pointers are all contiguous; for any given PFN we can avoid dereferencing vmemmap vaddrs
> back and forth. That is the second reason I pass a page and iterate over its tails based
> on a head page pointer. But I was of two minds when writing this, so if there's no added
> inefficiency I can rewrite it like the above.

I mainly just don't want 2 different styles between
memmap_init_zone_device() and this helper. So if the argument is that
"it's inefficient to use pfn_to_page() here", then why does the caller
use pfn_to_page()? I won't argue too much for one way or the other;
I'm still biased towards my rewrite, but whatever you pick, just make
the style consistent.

>
> >> +               __init_zone_device_page(page + i, pfn + i, zone_idx,
> >> +                                       nid, pgmap);
> >> +               prep_compound_tail(page, i);
> >> +
> >> +               /*
> >> +                * The first and second tail pages need to
> >> +                * initialized first, hence the head page is
> >> +                * prepared last.
> >
> > I'd change this comment to say why rather than restate what can be
> > gleaned from the code. It's actually not clear to me why this order is
> > necessary.
> >
> So the first tail page stores mapcount_ptr and compound order, and the
> second tail page stores pincount_ptr. prep_compound_head() does this:
>
>         set_compound_order(page, order);
>         atomic_set(compound_mapcount_ptr(page), -1);
>         if (hpage_pincount_available(page))
>                 atomic_set(compound_pincount_ptr(page), 0);
>
> So we need those tail pages initialized first prior to initializing the head.
>
> I can expand the comment above to make it clear why we need first and second tail pages.

Thanks!

> >> +                */
> >> +               if (i == 2)
> >> +                       prep_compound_head(page, order_align);
> >> +       }
> >> +}
> >> +
> >>  void __ref memmap_init_zone_device(struct zone *zone,
> >>                                    unsigned long start_pfn,
> >>                                    unsigned long nr_pages,
> >> @@ -6605,6 +6630,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
> >>         unsigned long pfn, end_pfn = start_pfn + nr_pages;
> >>         struct pglist_data *pgdat = zone->zone_pgdat;
> >>         struct vmem_altmap *altmap = pgmap_altmap(pgmap);
> >> +       unsigned int pfns_per_compound = pgmap_pfn_geometry(pgmap);
> >>         unsigned long zone_idx = zone_idx(zone);
> >>         unsigned long start = jiffies;
> >>         int nid = pgdat->node_id;
> >> @@ -6622,10 +6648,16 @@ void __ref memmap_init_zone_device(struct zone *zone,
> >>                 nr_pages = end_pfn - start_pfn;
> >>         }
> >>
> >> -       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> >> +       for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
> >>                 struct page *page = pfn_to_page(pfn);
> >>
> >>                 __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
> >> +
> >> +               if (pfns_per_compound == 1)
> >> +                       continue;
> >> +
> >> +               memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
> >> +                                    pfns_per_compound);
> >
> > I otherwise don't see anything broken with this patch, so feel free to include:
> >
> > Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> >
> > ...on the resend with the fixups.
> >
> Thanks.
>
> I will wait whether you still want to retain the tag provided the implied changes
> fixing the failure you reported.

Yeah, tag is still valid.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-07-15 12:52     ` Joao Martins
  2021-07-15 13:06       ` Joao Martins
  2021-07-15 19:48       ` Dan Williams
@ 2021-07-22  0:38       ` Jane Chu
  2021-07-22 10:56         ` Joao Martins
  2 siblings, 1 reply; 74+ messages in thread
From: Jane Chu @ 2021-07-22  0:38 UTC (permalink / raw)
  To: Joao Martins, Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Muchun Song,
	Mike Kravetz, Andrew Morton, Jonathan Corbet, Linux NVDIMM,
	Linux Doc Mailing List


On 7/15/2021 5:52 AM, Joao Martins wrote:
>>> +               __init_zone_device_page(page + i, pfn + i, zone_idx,
>>> +                                       nid, pgmap);
>>> +               prep_compound_tail(page, i);
>>> +
>>> +               /*
>>> +                * The first and second tail pages need to
>>> +                * initialized first, hence the head page is
>>> +                * prepared last.
>> I'd change this comment to say why rather than restate what can be
>> gleaned from the code. It's actually not clear to me why this order is
>> necessary.
>>
> So the first tail page stores mapcount_ptr and compound order, and the
> second tail page stores pincount_ptr. prep_compound_head() does this:
> 
> 	set_compound_order(page, order);
> 	atomic_set(compound_mapcount_ptr(page), -1);
> 	if (hpage_pincount_available(page))
> 		atomic_set(compound_pincount_ptr(page), 0);
> 
> So we need those tail pages initialized first prior to initializing the head.
> 
> I can expand the comment above to make it clear why we need first and second tail pages.
> 

Perhaps just say
   The reason prep_compound_head() is called after the 1st and 2nd tail
   pages have been initialized is: so it overwrites some of the tail page
   fields set up by __init_zone_device_page(), rather than the other way
   around.
?

thanks,
-jane


* Re: [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps
  2021-07-14 21:48 ` [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Andrew Morton
  2021-07-14 23:47   ` Dan Williams
@ 2021-07-22  2:24   ` Matthew Wilcox
  2021-07-22 10:53     ` Joao Martins
  1 sibling, 1 reply; 74+ messages in thread
From: Matthew Wilcox @ 2021-07-22  2:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joao Martins, linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Jonathan Corbet, nvdimm, linux-doc

On Wed, Jul 14, 2021 at 02:48:30PM -0700, Andrew Morton wrote:
> On Wed, 14 Jul 2021 20:35:28 +0100 Joao Martins <joao.m.martins@oracle.com> wrote:
> 
> > This series, attempts at minimizing 'struct page' overhead by
> > pursuing a similar approach as Muchun Song series "Free some vmemmap
> > pages of hugetlb page"[0] but applied to devmap/ZONE_DEVICE which is now
> > in mmotm. 
> > 
> > [0] https://lore.kernel.org/linux-mm/20210308102807.59745-1-songmuchun@bytedance.com/
> 
> [0] is now in mainline.
> 
> This patch series looks like it'll clash significantly with the folio
> work and it is pretty thinly reviewed, so I think I'll take a pass for
> now.  Matthew, thoughts?

I had a look through it, and I don't see anything that looks like it'll
clash with the folio patches.  The folio work really touches the page
cache for now, and this seems mostly to touch the devmap paths.

It would be nice to convert the devmap code to folios too, but that
can wait.  The mess with page refcounts needs to be sorted out first.


* Re: [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps
  2021-07-22  2:24   ` Matthew Wilcox
@ 2021-07-22 10:53     ` Joao Martins
  2021-07-27 23:23       ` Dan Williams
  0 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-22 10:53 UTC (permalink / raw)
  To: Matthew Wilcox, Andrew Morton
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Jonathan Corbet, nvdimm, linux-doc

On 7/22/21 3:24 AM, Matthew Wilcox wrote:
> On Wed, Jul 14, 2021 at 02:48:30PM -0700, Andrew Morton wrote:
>> On Wed, 14 Jul 2021 20:35:28 +0100 Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>>> This series, attempts at minimizing 'struct page' overhead by
>>> pursuing a similar approach as Muchun Song series "Free some vmemmap
>>> pages of hugetlb page"[0] but applied to devmap/ZONE_DEVICE which is now
>>> in mmotm. 
>>>
>>> [0] https://lore.kernel.org/linux-mm/20210308102807.59745-1-songmuchun@bytedance.com/
>>
>> [0] is now in mainline.
>>
>> This patch series looks like it'll clash significantly with the folio
>> work and it is pretty thinly reviewed, so I think I'll take a pass for
>> now.  Matthew, thoughts?
> 
> I had a look through it, and I don't see anything that looks like it'll
> clash with the folio patches.  

FWIW, I had tried this last week, and this series applies cleanly on top of your 130+
patch series for Folios.


> The folio work really touches the page
> cache for now, and this seems mostly to touch the devmap paths.
> 
/me nods -- it really is about devmap infra for usage in device-dax for persistent memory.

Perhaps I should do s/pagemaps/devmap/ throughout the series to avoid confusion.

> It would be nice to convert the devmap code to folios too, but that
> can wait.  The mess with page refcounts needs to be sorted out first.
> 
I suppose you refer to fixing the current zone-device elevated page ref count?

https://lore.kernel.org/linux-mm/20210717192135.9030-3-alex.sierra@amd.com/


* Re: [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-07-22  0:38       ` Jane Chu
@ 2021-07-22 10:56         ` Joao Martins
  0 siblings, 0 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-22 10:56 UTC (permalink / raw)
  To: Jane Chu, Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Muchun Song,
	Mike Kravetz, Andrew Morton, Jonathan Corbet, Linux NVDIMM,
	Linux Doc Mailing List



On 7/22/21 1:38 AM, Jane Chu wrote:
> 
> On 7/15/2021 5:52 AM, Joao Martins wrote:
>>>> +               __init_zone_device_page(page + i, pfn + i, zone_idx,
>>>> +                                       nid, pgmap);
>>>> +               prep_compound_tail(page, i);
>>>> +
>>>> +               /*
>>>> +                * The first and second tail pages need to
>>>> +                * initialized first, hence the head page is
>>>> +                * prepared last.
>>> I'd change this comment to say why rather than restate what can be
>>> gleaned from the code. It's actually not clear to me why this order is
>>> necessary.
>>>
>> So the first tail page stores mapcount_ptr and compound order, and the
>> second tail page stores pincount_ptr. prep_compound_head() does this:
>>
>> 	set_compound_order(page, order);
>> 	atomic_set(compound_mapcount_ptr(page), -1);
>> 	if (hpage_pincount_available(page))
>> 		atomic_set(compound_pincount_ptr(page), 0);
>>
>> So we need those tail pages initialized first prior to initializing the head.
>>
>> I can expand the comment above to make it clear why we need first and second tail pages.
>>
> 
> Perhaps just say
>    The reason prep_compound_head() is called after the 1st and 2nd tail
>    pages have been initialized is: so it overwrites some of the tail page
>    fields set up by __init_zone_device_page(), rather than the other way
>    around.
> ?

Yeap, something along those lines is what I was thinking. Perhaps explicitly mention
the struct page fields that the 1st and 2nd tail pages store, to avoid the reader
thinking they were arbitrarily picked.

	Joao


* Re: [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps
  2021-07-22 10:53     ` Joao Martins
@ 2021-07-27 23:23       ` Dan Williams
  2021-08-02 10:40         ` Joao Martins
  0 siblings, 1 reply; 74+ messages in thread
From: Dan Williams @ 2021-07-27 23:23 UTC (permalink / raw)
  To: Joao Martins
  Cc: Matthew Wilcox, Andrew Morton, Linux MM, Vishal Verma,
	Dave Jiang, Naoya Horiguchi, Jason Gunthorpe, John Hubbard,
	Jane Chu, Muchun Song, Mike Kravetz, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Thu, Jul 22, 2021 at 3:54 AM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
> > The folio work really touches the page
> > cache for now, and this seems mostly to touch the devmap paths.
> >
> /me nods -- it really is about devmap infra for usage in device-dax for persistent memory.
>
> Perhaps I should do s/pagemaps/devmap/ throughout the series to avoid confusion.

I also like "devmap" as a more accurate name. It matches the PFN_DEV
and PFN_MAP flags that decorate DAX capable pfn_t instances. It also
happens to match a recommendation I gave to Ira for his support for
supervisor protection keys with devmap pfns.


* Re: [PATCH v3 12/14] device-dax: compound pagemap support
  2021-07-15 12:00     ` Joao Martins
@ 2021-07-27 23:51       ` Dan Williams
  2021-07-28  9:36         ` Joao Martins
  0 siblings, 1 reply; 74+ messages in thread
From: Dan Williams @ 2021-07-27 23:51 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Thu, Jul 15, 2021 at 5:01 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 7/15/21 12:36 AM, Dan Williams wrote:
> > On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> Use the newly added compound pagemap facility which maps the assigned dax
> >> ranges as compound pages at a page size of @align. Currently, this means,
> >> that region/namespace bootstrap would take considerably less, given that
> >> you would initialize considerably less pages.
> >>
> >> On setups with 128G NVDIMMs the initialization with DRAM stored struct
> >> pages improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less
> >> than a 1msec with 1G pages.
> >>
> >> dax devices are created with a fixed @align (huge page size) which is
> >> enforced through as well at mmap() of the device. Faults, consequently
> >> happen too at the specified @align specified at the creation, and those
> >> don't change through out dax device lifetime. MCEs poisons a whole dax
> >> huge page, as well as splits occurring at the configured page size.
> >>
> >
> > Hi Joao,
> >
> > With this patch I'm hitting the following with the 'device-dax' test [1].
> >
> Ugh, I can reproduce it too -- apologies for the oversight.

No worries.

>
> This patch is not the culprit, the flaw is early in the series, specifically the fourth patch.
>
> It needs the change below in the fourth patch, due to the existing elevated page ref
> count at zone-device memmap init. put_page(), called here in memunmap_pages():
>
> for (i = 0; i < pgmap->nr_ranges; i++)
>         for_each_device_pfn(pfn, pgmap, i)
>                 put_page(pfn_to_page(pfn));
>
> ... on a zone_device compound memmap would otherwise always decrease the head page
> refcount by the @geometry pfn amount (leading to the aforementioned splat you reported).
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index b0e7b8cf3047..79a883af788e 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -102,15 +102,15 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
>         return (range->start + range_len(range)) >> PAGE_SHIFT;
>  }
>
> -static unsigned long pfn_next(unsigned long pfn)
> +static unsigned long pfn_next(struct dev_pagemap *pgmap, unsigned long pfn)
>  {
>         if (pfn % 1024 == 0)
>                 cond_resched();
> -       return pfn + 1;
> +       return pfn + pgmap_pfn_geometry(pgmap);

The cond_resched() would need to be fixed up too to something like:

if (pfn % (1024 << pgmap_geometry_order(pgmap)))
    cond_resched();

...because the goal is to take a break every 1024 iterations, not
every 1024 pfns.

>  }
>
>  #define for_each_device_pfn(pfn, map, i) \
> -       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
> +       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(map, pfn))
>
>  static void dev_pagemap_kill(struct dev_pagemap *pgmap)
>  {
>
> It could also get this hunk below, but it is sort of redundant provided we won't touch
> tail page refcounts throughout the devmap pages' lifetime. This setting of tail page
> refcounts to zero was in the pre-v5.14 series, but it got removed under the assumption
> that they come from the page allocator (where tail page refcounts are already zeroed).

Wait, devmap pages never see the page allocator?

>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 96975edac0a8..469a7aa5cf38 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6623,6 +6623,7 @@ static void __ref memmap_init_compound(struct page *page, unsigned
> long pfn,
>                 __init_zone_device_page(page + i, pfn + i, zone_idx,
>                                         nid, pgmap);
>                 prep_compound_tail(page, i);
> +               set_page_count(page + i, 0);

Looks good to me, and perhaps add a check for an elevated tail page refcount at
teardown as a sanity check that the tail pages were never pinned
directly?

>
>                 /*
>                  * The first and second tail pages need to


* Re: [PATCH v3 05/14] mm/sparse-vmemmap: add a pgmap argument to section activation
  2021-07-14 19:35 ` [PATCH v3 05/14] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
@ 2021-07-28  5:56   ` Dan Williams
  2021-07-28  9:43     ` Joao Martins
  0 siblings, 1 reply; 74+ messages in thread
From: Dan Williams @ 2021-07-28  5:56 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> In support of using compound pages for devmap mappings, plumb the pgmap
> down to the vmemmap_populate implementation. Note that while altmap is
> retrievable from pgmap the memory hotplug code passes altmap without
> pgmap[*], so both need to be independently plumbed.
>
> So in addition to @altmap, pass @pgmap to sparse section populate
> functions namely:
>
>         sparse_add_section
>           section_activate
>             populate_section_memmap
>               __populate_section_memmap
>
> Passing @pgmap allows __populate_section_memmap() to both fetch the
> geometry in which memmap metadata is created for and also to let
> sparse-vmemmap fetch pgmap ranges to co-relate to a given section and pick
> whether to just reuse tail pages from past onlined sections.

Looks good to me, just one quibble below:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

>
> [*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/linux/memory_hotplug.h |  5 ++++-
>  include/linux/mm.h             |  3 ++-
>  mm/memory_hotplug.c            |  3 ++-
>  mm/sparse-vmemmap.c            |  3 ++-
>  mm/sparse.c                    | 24 +++++++++++++++---------
>  5 files changed, 25 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index a7fd2c3ccb77..9b1bca80224d 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -14,6 +14,7 @@ struct mem_section;
>  struct memory_block;
>  struct resource;
>  struct vmem_altmap;
> +struct dev_pagemap;
>
>  #ifdef CONFIG_MEMORY_HOTPLUG
>  struct page *pfn_to_online_page(unsigned long pfn);
> @@ -60,6 +61,7 @@ typedef int __bitwise mhp_t;
>  struct mhp_params {
>         struct vmem_altmap *altmap;
>         pgprot_t pgprot;
> +       struct dev_pagemap *pgmap;
>  };
>
>  bool mhp_range_allowed(u64 start, u64 size, bool need_mapping);
> @@ -333,7 +335,8 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
>                                        unsigned long nr_pages);
>  extern bool is_memblock_offlined(struct memory_block *mem);
>  extern int sparse_add_section(int nid, unsigned long pfn,
> -               unsigned long nr_pages, struct vmem_altmap *altmap);
> +               unsigned long nr_pages, struct vmem_altmap *altmap,
> +               struct dev_pagemap *pgmap);
>  extern void sparse_remove_section(struct mem_section *ms,
>                 unsigned long pfn, unsigned long nr_pages,
>                 unsigned long map_offset, struct vmem_altmap *altmap);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7ca22e6e694a..f244a9219ce4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3083,7 +3083,8 @@ int vmemmap_remap_alloc(unsigned long start, unsigned long end,
>
>  void *sparse_buffer_alloc(unsigned long size);
>  struct page * __populate_section_memmap(unsigned long pfn,
> -               unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
> +               unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
> +               struct dev_pagemap *pgmap);
>  pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
>  p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 8cb75b26ea4f..c728a8ff38ad 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -268,7 +268,8 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
>                 /* Select all remaining pages up to the next section boundary */
>                 cur_nr_pages = min(end_pfn - pfn,
>                                    SECTION_ALIGN_UP(pfn + 1) - pfn);
> -               err = sparse_add_section(nid, pfn, cur_nr_pages, altmap);
> +               err = sparse_add_section(nid, pfn, cur_nr_pages, altmap,
> +                                        params->pgmap);
>                 if (err)
>                         break;
>                 cond_resched();
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index bdce883f9286..80d3ba30d345 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -603,7 +603,8 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>  }
>
>  struct page * __meminit __populate_section_memmap(unsigned long pfn,
> -               unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
> +               unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
> +               struct dev_pagemap *pgmap)
>  {
>         unsigned long start = (unsigned long) pfn_to_page(pfn);
>         unsigned long end = start + nr_pages * sizeof(struct page);
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 6326cdf36c4f..5310be6171f1 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -453,7 +453,8 @@ static unsigned long __init section_map_size(void)
>  }
>
>  struct page __init *__populate_section_memmap(unsigned long pfn,
> -               unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
> +               unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
> +               struct dev_pagemap *pgmap)
>  {
>         unsigned long size = section_map_size();
>         struct page *map = sparse_buffer_alloc(size);
> @@ -552,7 +553,7 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
>                         break;
>
>                 map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
> -                               nid, NULL);
> +                               nid, NULL, NULL);
>                 if (!map) {
>                         pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
>                                __func__, nid);
> @@ -657,9 +658,10 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
>
>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
>  static struct page * __meminit populate_section_memmap(unsigned long pfn,
> -               unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
> +               unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
> +               struct dev_pagemap *pgmap)
>  {
> -       return __populate_section_memmap(pfn, nr_pages, nid, altmap);
> +       return __populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
>  }
>
>  static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
> @@ -728,7 +730,8 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
>  }
>  #else
>  struct page * __meminit populate_section_memmap(unsigned long pfn,
> -               unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
> +               unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
> +               struct dev_pagemap *pgmap)
>  {
>         return kvmalloc_node(array_size(sizeof(struct page),
>                                         PAGES_PER_SECTION), GFP_KERNEL, nid);
> @@ -851,7 +854,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>  }
>
>  static struct page * __meminit section_activate(int nid, unsigned long pfn,
> -               unsigned long nr_pages, struct vmem_altmap *altmap)
> +               unsigned long nr_pages, struct vmem_altmap *altmap,
> +               struct dev_pagemap *pgmap)
>  {
>         struct mem_section *ms = __pfn_to_section(pfn);
>         struct mem_section_usage *usage = NULL;
> @@ -883,7 +887,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
>         if (nr_pages < PAGES_PER_SECTION && early_section(ms))
>                 return pfn_to_page(pfn);
>
> -       memmap = populate_section_memmap(pfn, nr_pages, nid, altmap);
> +       memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
>         if (!memmap) {
>                 section_deactivate(pfn, nr_pages, altmap);
>                 return ERR_PTR(-ENOMEM);
> @@ -898,6 +902,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
>   * @start_pfn: start pfn of the memory range
>   * @nr_pages: number of pfns to add in the section
>   * @altmap: device page map
> + * @pgmap: device page map object that owns the section

Since this patch is touching the kdoc, might as well fix it up
properly for @altmap, and perhaps an alternate note for @pgmap:

@altmap: alternate pfns to allocate the memmap backing store
@pgmap: alternate compound page geometry for devmap mappings


>   *
>   * This is only intended for hotplug.
>   *
> @@ -911,7 +916,8 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
>   * * -ENOMEM   - Out of memory.
>   */
>  int __meminit sparse_add_section(int nid, unsigned long start_pfn,
> -               unsigned long nr_pages, struct vmem_altmap *altmap)
> +               unsigned long nr_pages, struct vmem_altmap *altmap,
> +               struct dev_pagemap *pgmap)
>  {
>         unsigned long section_nr = pfn_to_section_nr(start_pfn);
>         struct mem_section *ms;
> @@ -922,7 +928,7 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
>         if (ret < 0)
>                 return ret;
>
> -       memmap = section_activate(nid, start_pfn, nr_pages, altmap);
> +       memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
>         if (IS_ERR(memmap))
>                 return PTR_ERR(memmap);
>
> --
> 2.17.1
>


* Re: [PATCH v3 06/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper
  2021-07-14 19:35 ` [PATCH v3 06/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
@ 2021-07-28  6:04   ` Dan Williams
  2021-07-28 10:48     ` Joao Martins
  0 siblings, 1 reply; 74+ messages in thread
From: Dan Williams @ 2021-07-28  6:04 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> In preparation for describing a memmap with compound pages, move the
> actual pte population logic into a separate function
> vmemmap_populate_address() and have vmemmap_populate_basepages() walk
> through all base pages it needs to populate.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/sparse-vmemmap.c | 44 ++++++++++++++++++++++++++------------------
>  1 file changed, 26 insertions(+), 18 deletions(-)
>
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 80d3ba30d345..76f4158f6301 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -570,33 +570,41 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>         return pgd;
>  }
>
> -int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> -                                        int node, struct vmem_altmap *altmap)
> +static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> +                                             struct vmem_altmap *altmap)
>  {
> -       unsigned long addr = start;
>         pgd_t *pgd;
>         p4d_t *p4d;
>         pud_t *pud;
>         pmd_t *pmd;
>         pte_t *pte;
>
> +       pgd = vmemmap_pgd_populate(addr, node);
> +       if (!pgd)
> +               return -ENOMEM;
> +       p4d = vmemmap_p4d_populate(pgd, addr, node);
> +       if (!p4d)
> +               return -ENOMEM;
> +       pud = vmemmap_pud_populate(p4d, addr, node);
> +       if (!pud)
> +               return -ENOMEM;
> +       pmd = vmemmap_pmd_populate(pud, addr, node);
> +       if (!pmd)
> +               return -ENOMEM;
> +       pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> +       if (!pte)
> +               return -ENOMEM;
> +       vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);

Missing a return here:

mm/sparse-vmemmap.c:598:1: error: control reaches end of non-void
function [-Werror=return-type]

Yes, it's fixed up in a later patch, but might as well not leave the
bisect breakage lying around, and the kbuild robot would gripe about
this eventually as well.


> +}
> +
> +int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> +                                        int node, struct vmem_altmap *altmap)
> +{
> +       unsigned long addr = start;
> +
>         for (; addr < end; addr += PAGE_SIZE) {
> -               pgd = vmemmap_pgd_populate(addr, node);
> -               if (!pgd)
> -                       return -ENOMEM;
> -               p4d = vmemmap_p4d_populate(pgd, addr, node);
> -               if (!p4d)
> -                       return -ENOMEM;
> -               pud = vmemmap_pud_populate(p4d, addr, node);
> -               if (!pud)
> -                       return -ENOMEM;
> -               pmd = vmemmap_pmd_populate(pud, addr, node);
> -               if (!pmd)
> -                       return -ENOMEM;
> -               pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> -               if (!pte)
> +               if (vmemmap_populate_address(addr, node, altmap))
>                         return -ENOMEM;

I'd prefer:

rc = vmemmap_populate_address(addr, node, altmap);
if (rc)
    return rc;

...in case future refactoring adds different error codes to pass up.


> -               vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
>         }
>
>         return 0;
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm
  2021-07-14 19:35 ` [PATCH v3 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
  2021-07-15  2:47   ` [External] " Muchun Song
@ 2021-07-28  6:09   ` Dan Williams
  1 sibling, 0 replies; 74+ messages in thread
From: Dan Williams @ 2021-07-28  6:09 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> In preparation for device-dax for using hugetlbfs compound page tail
> deduplication technique, move the comment block explanation into a
> common place in Documentation/vm.
>
> Cc: Muchun Song <songmuchun@bytedance.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Looks good,

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

> ---
>  Documentation/vm/index.rst         |   1 +
>  Documentation/vm/vmemmap_dedup.rst | 170 +++++++++++++++++++++++++++++
>  mm/hugetlb_vmemmap.c               | 162 +--------------------------
>  3 files changed, 172 insertions(+), 161 deletions(-)
>  create mode 100644 Documentation/vm/vmemmap_dedup.rst
>
> diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> index eff5fbd492d0..edd690afd890 100644
> --- a/Documentation/vm/index.rst
> +++ b/Documentation/vm/index.rst
> @@ -51,5 +51,6 @@ descriptions of data structures and algorithms.
>     split_page_table_lock
>     transhuge
>     unevictable-lru
> +   vmemmap_dedup
>     z3fold
>     zsmalloc
> diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
> new file mode 100644
> index 000000000000..215ae2ef3bce
> --- /dev/null
> +++ b/Documentation/vm/vmemmap_dedup.rst
> @@ -0,0 +1,170 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _vmemmap_dedup:
> +
> +==================================
> +Free some vmemmap pages of HugeTLB
> +==================================
> +
> +The struct page structures (page structs) are used to describe a physical
> +page frame. By default, there is a one-to-one mapping from a page frame to
> +its corresponding page struct.
> +
> +HugeTLB pages consist of multiple base page size pages and are supported by
> +many architectures. See hugetlbpage.rst in the Documentation directory for
> +more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB
> +are currently supported. Since the base page size on x86 is 4KB, a 2MB
> +HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
> +4096 base pages. For each base page, there is a corresponding page struct.
> +
> +Within the HugeTLB subsystem, only the first 4 page structs are used to
> +contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
> +this upper limit. The only 'useful' information in the remaining page structs
> +is the compound_head field, and this field is the same for all tail pages.
> +
> +By removing redundant page structs for HugeTLB pages, memory can be returned
> +to the buddy allocator for other uses.
> +
> +Different architectures support different HugeTLB page sizes. For example,
> +the following table lists the HugeTLB page sizes supported by the x86 and
> +arm64 architectures. Because arm64 supports 4k, 16k, and 64k base pages
> +as well as contiguous entries, it supports many different HugeTLB page
> +sizes.
> +
> ++--------------+-----------+-----------------------------------------------+
> +| Architecture | Page Size |                HugeTLB Page Size              |
> ++--------------+-----------+-----------+-----------+-----------+-----------+
> +|    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
> ++--------------+-----------+-----------+-----------+-----------+-----------+
> +|              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
> +|              +-----------+-----------+-----------+-----------+-----------+
> +|    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
> +|              +-----------+-----------+-----------+-----------+-----------+
> +|              |   64KB    |    2MB    |  512MB    |    16GB   |           |
> ++--------------+-----------+-----------+-----------+-----------+-----------+
> +
> +When the system boots up, the struct page structs of every HugeTLB page
> +span more than one page; their total size is (unit: pages):
> +
> +   struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> +
> +Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
> +of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
> +relationship.
> +
> +   HugeTLB_Size = n * PAGE_SIZE
> +
> +Then,
> +
> +   struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> +               = n * sizeof(struct page) / PAGE_SIZE
> +
> +We can use huge mapping at the pud/pmd level for the HugeTLB page.
> +
> +For the HugeTLB page of the pmd level mapping, then
> +
> +   struct_size = n * sizeof(struct page) / PAGE_SIZE
> +               = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
> +               = sizeof(struct page) / sizeof(pte_t)
> +               = 64 / 8
> +               = 8 (pages)
> +
> +Where n is the number of pte entries that one page can contain. So the value
> +of n is (PAGE_SIZE / sizeof(pte_t)).
> +
> +This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
> +is 8. It is also applicable only when the size of struct page
> +is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
> +x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
> +struct page structs for it occupy 8 page frames, whose size depends on the
> +size of the base page.
> +
> +For the HugeTLB page of the pud level mapping, then
> +
> +   struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
> +               = PAGE_SIZE / 8 * 8 (pages)
> +               = PAGE_SIZE (pages)
> +
> +Where the struct_size(pmd) is the size of the struct page structs of a
> +HugeTLB page of the pmd level mapping.
> +
> +E.g.: A 2MB HugeTLB page on x86_64 consists of 8 page frames while a 1GB
> +HugeTLB page consists of 4096.
> +
> +Next, we take the pmd level mapping of the HugeTLB page as an example to
> +show the internal implementation of this optimization. There are 8 pages
> +of struct page structs associated with a pmd-mapped HugeTLB page.
> +
> +Here is how things look before optimization.
> +
> +    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
> + +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> + |           |                     |     0     | -------------> |     0     |
> + |           |                     +-----------+                +-----------+
> + |           |                     |     1     | -------------> |     1     |
> + |           |                     +-----------+                +-----------+
> + |           |                     |     2     | -------------> |     2     |
> + |           |                     +-----------+                +-----------+
> + |           |                     |     3     | -------------> |     3     |
> + |           |                     +-----------+                +-----------+
> + |           |                     |     4     | -------------> |     4     |
> + |    PMD    |                     +-----------+                +-----------+
> + |   level   |                     |     5     | -------------> |     5     |
> + |  mapping  |                     +-----------+                +-----------+
> + |           |                     |     6     | -------------> |     6     |
> + |           |                     +-----------+                +-----------+
> + |           |                     |     7     | -------------> |     7     |
> + |           |                     +-----------+                +-----------+
> + |           |
> + |           |
> + |           |
> + +-----------+
> +
> +The value of page->compound_head is the same for all tail pages. The first
> +page of page structs (page 0) associated with the HugeTLB page contains the 4
> +page structs necessary to describe the HugeTLB. The only use of the remaining
> +pages of page structs (page 1 to page 7) is to point to page->compound_head.
> +Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
> +will be used for each HugeTLB page. This will allow us to free the remaining
> +6 pages to the buddy allocator.
> +
> +Here is how things look after remapping.
> +
> +    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
> + +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> + |           |                     |     0     | -------------> |     0     |
> + |           |                     +-----------+                +-----------+
> + |           |                     |     1     | -------------> |     1     |
> + |           |                     +-----------+                +-----------+
> + |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
> + |           |                     +-----------+                   | | | | |
> + |           |                     |     3     | ------------------+ | | | |
> + |           |                     +-----------+                     | | | |
> + |           |                     |     4     | --------------------+ | | |
> + |    PMD    |                     +-----------+                       | | |
> + |   level   |                     |     5     | ----------------------+ | |
> + |  mapping  |                     +-----------+                         | |
> + |           |                     |     6     | ------------------------+ |
> + |           |                     +-----------+                           |
> + |           |                     |     7     | --------------------------+
> + |           |                     +-----------+
> + |           |
> + |           |
> + |           |
> + +-----------+
> +
> +When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
> +vmemmap pages and restore the previous mapping relationship.
> +
> +The HugeTLB page of the pud level mapping is similar to the former. The
> +same approach can be used to free (PAGE_SIZE - 2) vmemmap pages.
> +
> +Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
> +(e.g. aarch64) provide a contiguous bit in the translation table entries
> +that hints to the MMU that the entry is one of a contiguous set of
> +entries that can be cached in a single TLB entry.
> +
> +The contiguous bit is used to increase the mapping size at the pmd and pte
> +(last) level. So this type of HugeTLB page can be optimized only when the
> +size of its struct page structs is greater than 2 pages.
> +
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index c540c21e26f5..e2994e50ddee 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -6,167 +6,7 @@
>   *
>   *     Author: Muchun Song <songmuchun@bytedance.com>
>   *
> - * The struct page structures (page structs) are used to describe a physical
> - * page frame. By default, there is a one-to-one mapping from a page frame to
> - * it's corresponding page struct.
> - *
> - * HugeTLB pages consist of multiple base page size pages and is supported by
> - * many architectures. See hugetlbpage.rst in the Documentation directory for
> - * more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB
> - * are currently supported. Since the base page size on x86 is 4KB, a 2MB
> - * HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
> - * 4096 base pages. For each base page, there is a corresponding page struct.
> - *
> - * Within the HugeTLB subsystem, only the first 4 page structs are used to
> - * contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
> - * this upper limit. The only 'useful' information in the remaining page structs
> - * is the compound_head field, and this field is the same for all tail pages.
> - *
> - * By removing redundant page structs for HugeTLB pages, memory can be returned
> - * to the buddy allocator for other uses.
> - *
> - * Different architectures support different HugeTLB pages. For example, the
> - * following table is the HugeTLB page size supported by x86 and arm64
> - * architectures. Because arm64 supports 4k, 16k, and 64k base pages and
> - * supports contiguous entries, so it supports many kinds of sizes of HugeTLB
> - * page.
> - *
> - * +--------------+-----------+-----------------------------------------------+
> - * | Architecture | Page Size |                HugeTLB Page Size              |
> - * +--------------+-----------+-----------+-----------+-----------+-----------+
> - * |    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
> - * +--------------+-----------+-----------+-----------+-----------+-----------+
> - * |              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
> - * |              +-----------+-----------+-----------+-----------+-----------+
> - * |    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
> - * |              +-----------+-----------+-----------+-----------+-----------+
> - * |              |   64KB    |    2MB    |  512MB    |    16GB   |           |
> - * +--------------+-----------+-----------+-----------+-----------+-----------+
> - *
> - * When the system boot up, every HugeTLB page has more than one struct page
> - * structs which size is (unit: pages):
> - *
> - *    struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> - *
> - * Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
> - * of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
> - * relationship.
> - *
> - *    HugeTLB_Size = n * PAGE_SIZE
> - *
> - * Then,
> - *
> - *    struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
> - *                = n * sizeof(struct page) / PAGE_SIZE
> - *
> - * We can use huge mapping at the pud/pmd level for the HugeTLB page.
> - *
> - * For the HugeTLB page of the pmd level mapping, then
> - *
> - *    struct_size = n * sizeof(struct page) / PAGE_SIZE
> - *                = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
> - *                = sizeof(struct page) / sizeof(pte_t)
> - *                = 64 / 8
> - *                = 8 (pages)
> - *
> - * Where n is how many pte entries which one page can contains. So the value of
> - * n is (PAGE_SIZE / sizeof(pte_t)).
> - *
> - * This optimization only supports 64-bit system, so the value of sizeof(pte_t)
> - * is 8. And this optimization also applicable only when the size of struct page
> - * is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
> - * x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
> - * size of struct page structs of it is 8 page frames which size depends on the
> - * size of the base page.
> - *
> - * For the HugeTLB page of the pud level mapping, then
> - *
> - *    struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
> - *                = PAGE_SIZE / 8 * 8 (pages)
> - *                = PAGE_SIZE (pages)
> - *
> - * Where the struct_size(pmd) is the size of the struct page structs of a
> - * HugeTLB page of the pmd level mapping.
> - *
> - * E.g.: A 2MB HugeTLB page on x86_64 consists in 8 page frames while 1GB
> - * HugeTLB page consists in 4096.
> - *
> - * Next, we take the pmd level mapping of the HugeTLB page as an example to
> - * show the internal implementation of this optimization. There are 8 pages
> - * struct page structs associated with a HugeTLB page which is pmd mapped.
> - *
> - * Here is how things look before optimization.
> - *
> - *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
> - * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> - * |           |                     |     0     | -------------> |     0     |
> - * |           |                     +-----------+                +-----------+
> - * |           |                     |     1     | -------------> |     1     |
> - * |           |                     +-----------+                +-----------+
> - * |           |                     |     2     | -------------> |     2     |
> - * |           |                     +-----------+                +-----------+
> - * |           |                     |     3     | -------------> |     3     |
> - * |           |                     +-----------+                +-----------+
> - * |           |                     |     4     | -------------> |     4     |
> - * |    PMD    |                     +-----------+                +-----------+
> - * |   level   |                     |     5     | -------------> |     5     |
> - * |  mapping  |                     +-----------+                +-----------+
> - * |           |                     |     6     | -------------> |     6     |
> - * |           |                     +-----------+                +-----------+
> - * |           |                     |     7     | -------------> |     7     |
> - * |           |                     +-----------+                +-----------+
> - * |           |
> - * |           |
> - * |           |
> - * +-----------+
> - *
> - * The value of page->compound_head is the same for all tail pages. The first
> - * page of page structs (page 0) associated with the HugeTLB page contains the 4
> - * page structs necessary to describe the HugeTLB. The only use of the remaining
> - * pages of page structs (page 1 to page 7) is to point to page->compound_head.
> - * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
> - * will be used for each HugeTLB page. This will allow us to free the remaining
> - * 6 pages to the buddy allocator.
> - *
> - * Here is how things look after remapping.
> - *
> - *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
> - * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> - * |           |                     |     0     | -------------> |     0     |
> - * |           |                     +-----------+                +-----------+
> - * |           |                     |     1     | -------------> |     1     |
> - * |           |                     +-----------+                +-----------+
> - * |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
> - * |           |                     +-----------+                   | | | | |
> - * |           |                     |     3     | ------------------+ | | | |
> - * |           |                     +-----------+                     | | | |
> - * |           |                     |     4     | --------------------+ | | |
> - * |    PMD    |                     +-----------+                       | | |
> - * |   level   |                     |     5     | ----------------------+ | |
> - * |  mapping  |                     +-----------+                         | |
> - * |           |                     |     6     | ------------------------+ |
> - * |           |                     +-----------+                           |
> - * |           |                     |     7     | --------------------------+
> - * |           |                     +-----------+
> - * |           |
> - * |           |
> - * |           |
> - * +-----------+
> - *
> - * When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
> - * vmemmap pages and restore the previous mapping relationship.
> - *
> - * For the HugeTLB page of the pud level mapping. It is similar to the former.
> - * We also can use this approach to free (PAGE_SIZE - 2) vmemmap pages.
> - *
> - * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
> - * (e.g. aarch64) provides a contiguous bit in the translation table entries
> - * that hints to the MMU to indicate that it is one of a contiguous set of
> - * entries that can be cached in a single TLB entry.
> - *
> - * The contiguous bit is used to increase the mapping size at the pmd and pte
> - * (last) level. So this type of HugeTLB page can be optimized only when its
> - * size of the struct page structs is greater than 2 pages.
> + * See Documentation/vm/vmemmap_dedup.rst
>   */
>  #define pr_fmt(fmt)    "HugeTLB: " fmt
>
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 08/14] mm/sparse-vmemmap: populate compound pagemaps
  2021-07-14 19:35 ` [PATCH v3 08/14] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
@ 2021-07-28  6:55   ` Dan Williams
  2021-07-28 15:35     ` Joao Martins
  0 siblings, 1 reply; 74+ messages in thread
From: Dan Williams @ 2021-07-28  6:55 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> A compound pagemap is a dev_pagemap with @align > PAGE_SIZE and it

Maybe s/compound pagemap/compound devmap/ per the other planned usage
of "devmap" in the implementation?

> means that pages are mapped at a given huge page alignment and use
> compound pages as opposed to order-0 pages.
>
> Take advantage of the fact that most tail pages look the same (except
> the first two) to minimize struct page overhead. Allocate a separate
> page for the vmemmap area which contains the head page and separate for
> the next 64 pages. The rest of the subsections then reuse this tail
> vmemmap page to initialize the rest of the tail pages.
>
> Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
> when initializing compound pagemap with big enough @align (e.g. 1G

s/@align/@geometry/?

> PUD) it will cross various sections.

s/will cross various/may cross multiple/

> To be able to reuse tail pages
> across sections belonging to the same gigantic page, fetch the
> @range being mapped (nr_ranges + 1).  If the section being mapped is
> not offset 0 of the @align, then lookup the PFN of the struct page
> address that precedes it and use that to populate the entire
> section.

This sounds like code being read aloud. I would just say something like:

"The vmemmap code needs to consult @pgmap so that multiple sections
that all map the same tail data can refer back to the first copy of
that data for a given gigantic page."

>
> On compound pagemaps with 2M align, this mechanism lets 6 pages be
> saved out of the 8 necessary PFNs necessary to set the subsection's
> 512 struct pages being mapped. On a 1G compound pagemap it saves
> 4094 pages.
>
> Altmap isn't supported yet, given various restrictions in altmap pfn
> allocator, thus fallback to the already in use vmemmap_populate().  It
> is worth noting that altmap for devmap mappings was there to relieve the
> pressure of inordinate amounts of memmap space to map terabytes of pmem.
> With compound pages the motivation for altmaps for pmem gets reduced.

Looks good just some minor comments / typo fixes, and some requests
for a few more helper functions.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  Documentation/vm/vmemmap_dedup.rst |  27 +++++-
>  include/linux/mm.h                 |   2 +-
>  mm/memremap.c                      |   1 +
>  mm/sparse-vmemmap.c                | 133 +++++++++++++++++++++++++++--
>  4 files changed, 151 insertions(+), 12 deletions(-)
>
> diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
> index 215ae2ef3bce..42830a667c2a 100644
> --- a/Documentation/vm/vmemmap_dedup.rst
> +++ b/Documentation/vm/vmemmap_dedup.rst
> @@ -2,9 +2,12 @@
>
>  .. _vmemmap_dedup:
>
> -==================================
> -Free some vmemmap pages of HugeTLB
> -==================================
> +=================================================
> +Free some vmemmap pages of HugeTLB and Device DAX

How about "A vmemmap diet for HugeTLB and Device DAX"

...because in the HugeTLB case it is dynamically remapping and freeing
the pages after the fact, while Device-DAX is avoiding the allocation
in the first instance.

> +=================================================
> +
> +HugeTLB
> +=======
>
>  The struct page structures (page structs) are used to describe a physical
>  page frame. By default, there is a one-to-one mapping from a page frame to
> @@ -168,3 +171,21 @@ The contiguous bit is used to increase the mapping size at the pmd and pte
>  (last) level. So this type of HugeTLB page can be optimized only when its
>  size of the struct page structs is greater than 2 pages.
>
> +Device DAX
> +==========
> +
> +The device-dax interface uses the same tail deduplication technique explained
> +in the previous chapter, except when used with the vmemmap in the device (altmap).
> +
> +The differences with HugeTLB are relatively minor.
> +
> +The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
> +PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
> +
> +There's no remapping of vmemmap given that device-dax memory is not part of
> +System RAM ranges initialized at boot, hence the tail deduplication happens
> +at a later stage when we populate the sections.
> +
> +It only uses 3 page structs for storing all information, as opposed to 4
> +on HugeTLB pages; the resulting memory savings are the same in both cases.
> +
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f244a9219ce4..5e3e153ddd3d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3090,7 +3090,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>  pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> -                           struct vmem_altmap *altmap);
> +                           struct vmem_altmap *altmap, struct page *block);
>  void *vmemmap_alloc_block(unsigned long size, int node);
>  struct vmem_altmap;
>  void *vmemmap_alloc_block_buf(unsigned long size, int node,
> diff --git a/mm/memremap.c b/mm/memremap.c
> index ffcb924eb6a5..9198fdace903 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>  {
>         struct mhp_params params = {
>                 .altmap = pgmap_altmap(pgmap),
> +               .pgmap = pgmap,
>                 .pgprot = PAGE_KERNEL,
>         };
>         const int nr_range = pgmap->nr_range;
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 76f4158f6301..a8de6c472999 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -495,16 +495,31 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
>  }
>
>  pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> -                                      struct vmem_altmap *altmap)
> +                                      struct vmem_altmap *altmap,
> +                                      struct page *block)
>  {
>         pte_t *pte = pte_offset_kernel(pmd, addr);
>         if (pte_none(*pte)) {
>                 pte_t entry;
>                 void *p;
>
> -               p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> -               if (!p)
> -                       return NULL;
> +               if (!block) {
> +                       p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> +                       if (!p)
> +                               return NULL;
> +               } else {
> +                       /*
> +                        * When a PTE/PMD entry is freed from the init_mm
> +                        * there's a free_pages() call to this page allocated
> +                        * above. Thus this get_page() is paired with the
> +                        * put_page_testzero() on the freeing path.
> +                        * This can only be called by certain ZONE_DEVICE paths,
> +                        * and through vmemmap_populate_compound_pages() when
> +                        * slab is available.
> +                        */
> +                       get_page(block);
> +                       p = page_to_virt(block);
> +               }
>                 entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
>                 set_pte_at(&init_mm, addr, pte, entry);
>         }
> @@ -571,7 +586,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>  }
>
>  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> -                                             struct vmem_altmap *altmap)
> +                                             struct vmem_altmap *altmap,
> +                                             struct page *reuse, struct page **page)
>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
> @@ -591,10 +607,14 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>         pmd = vmemmap_pmd_populate(pud, addr, node);
>         if (!pmd)
>                 return -ENOMEM;
> -       pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> +       pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
>         if (!pte)
>                 return -ENOMEM;
>         vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
> +
> +       if (page)
> +               *page = pte_page(*pte);
> +       return 0;
>  }
>
>  int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> @@ -603,7 +623,97 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>         unsigned long addr = start;
>
>         for (; addr < end; addr += PAGE_SIZE) {
> -               if (vmemmap_populate_address(addr, node, altmap))
> +               if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static int __meminit vmemmap_populate_range(unsigned long start,
> +                                           unsigned long end,
> +                                           int node, struct page *page)
> +{
> +       unsigned long addr = start;
> +
> +       for (; addr < end; addr += PAGE_SIZE) {
> +               if (vmemmap_populate_address(addr, node, NULL, page, NULL))
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
> +                                                 struct page **page)
> +{
> +       return vmemmap_populate_address(addr, node, NULL, NULL, page);
> +}
> +
> +static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
> +                                                    unsigned long start,
> +                                                    unsigned long end, int node,
> +                                                    struct dev_pagemap *pgmap)
> +{
> +       unsigned long offset, size, addr;
> +
> +       /*
> +        * For compound pages bigger than section size (e.g. x86 1G compound
> +        * pages with 2M subsection size) fill the rest of sections as tail
> +        * pages.
> +        *
> +        * Note that memremap_pages() resets @nr_range value and will increment
> +        * it after each range successful onlining. Thus the value or @nr_range
> +        * at section memmap populate corresponds to the in-progress range
> +        * being onlined here.
> +        */
> +       offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
> +       if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
> +           pgmap_geometry(pgmap) > SUBSECTION_SIZE) {

How about moving the last 3 lines plus the comment to a helper so this
becomes something like:

if (compound_section_index(start_pfn, pgmap))

...where it is clear that for the Nth section in a compound page, where
N > 0, it can look up the page data to reuse.


> +               pte_t *ptep;
> +
> +               addr = start - PAGE_SIZE;
> +
> +               /*
> +                * Sections are populated sequently and in sucession meaning
> +                * this section being populated wouldn't start if the
> +                * preceding one wasn't successful. So there is a guarantee that
> +                * the previous struct pages are mapped when trying to lookup
> +                * the last tail page.

I think you can cut this down to:

"Assuming sections are populated sequentially, the previous section's
page data can be reused."

...and maybe this can be a helper like:

compound_section_tail_page()?


> +                * the last tail page.
> +                */

> +               ptep = pte_offset_kernel(pmd_off_k(addr), addr);
> +               if (!ptep)
> +                       return -ENOMEM;
> +
> +               /*
> +                * Reuse the page that was populated in the prior iteration
> +                * with just tail struct pages.
> +                */
> +               return vmemmap_populate_range(start, end, node,
> +                                             pte_page(*ptep));
> +       }
> +
> +       size = min(end - start, pgmap_pfn_geometry(pgmap) * sizeof(struct page));
> +       for (addr = start; addr < end; addr += size) {
> +               unsigned long next = addr, last = addr + size;
> +               struct page *block;
> +
> +               /* Populate the head page vmemmap page */
> +               if (vmemmap_populate_page(addr, node, NULL))
> +                       return -ENOMEM;
> +
> +               /* Populate the tail pages vmemmap page */
> +               block = NULL;
> +               next = addr + PAGE_SIZE;
> +               if (vmemmap_populate_page(next, node, &block))
> +                       return -ENOMEM;
> +
> +               /*
> +                * Reuse the previous page for the rest of tail pages
> +                * See layout diagram in Documentation/vm/vmemmap_dedup.rst
> +                */
> +               next += PAGE_SIZE;
> +               if (vmemmap_populate_range(next, last, node, block))
>                         return -ENOMEM;
>         }
>
> @@ -616,12 +726,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>  {
>         unsigned long start = (unsigned long) pfn_to_page(pfn);
>         unsigned long end = start + nr_pages * sizeof(struct page);
> +       unsigned int geometry = pgmap_geometry(pgmap);
> +       int r;
>
>         if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
>                 !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>                 return NULL;
>
> -       if (vmemmap_populate(start, end, nid, altmap))
> +       if (geometry > PAGE_SIZE && !altmap)
> +               r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
> +       else
> +               r = vmemmap_populate(start, end, nid, altmap);
> +
> +       if (r < 0)
>                 return NULL;
>
>         return pfn_to_page(pfn);
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 09/14] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-07-14 19:35 ` [PATCH v3 09/14] mm/page_alloc: reuse tail struct pages for " Joao Martins
@ 2021-07-28  7:28   ` Dan Williams
  2021-07-28 15:56     ` Joao Martins
  0 siblings, 1 reply; 74+ messages in thread
From: Dan Williams @ 2021-07-28  7:28 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Currently memmap_init_zone_device() ends up initializing 32768 pages
> when it only needs to initialize 128 given tail page reuse. That
> number is worse with 1GB compound page geometries, 262144 instead of
> 128. Update memmap_init_zone_device() to skip redundant
> initialization, detailed below.
>
> When a pgmap @geometry is set, all pages are mapped at a given huge page
> alignment and use compound pages to describe them as opposed to a
> struct per 4K.
>
> With @geometry > PAGE_SIZE and when struct pages are stored in RAM
> (!altmap), most tail pages are reused. Consequently, the number of unique
> struct pages is a lot smaller than the total number of struct pages
> being mapped.
>
> The altmap path is left alone since it does not support memory savings
> based on compound pagemap geometries.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/page_alloc.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 188cb5f8c308..96975edac0a8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6600,11 +6600,23 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>  static void __ref memmap_init_compound(struct page *page, unsigned long pfn,
>                                         unsigned long zone_idx, int nid,
>                                         struct dev_pagemap *pgmap,
> +                                       struct vmem_altmap *altmap,
>                                         unsigned long nr_pages)
>  {
>         unsigned int order_align = order_base_2(nr_pages);
>         unsigned long i;
>
> +       /*
> +        * With compound page geometry and when struct pages are stored in ram
> +        * (!altmap) most tail pages are reused. Consequently, the amount of
> +        * unique struct pages to initialize is a lot smaller that the total
> +        * amount of struct pages being mapped.
> +        * See vmemmap_populate_compound_pages().
> +        */
> +       if (!altmap)
> +               nr_pages = min_t(unsigned long, nr_pages,

What's the scenario where nr_pages is < 128? Shouldn't alignment
already be guaranteed?

> +                                2 * (PAGE_SIZE/sizeof(struct page)));


> +
>         __SetPageHead(page);
>
>         for (i = 1; i < nr_pages; i++) {
> @@ -6657,7 +6669,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
>                         continue;
>
>                 memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
> -                                    pfns_per_compound);
> +                                    altmap, pfns_per_compound);

This feels odd, memmap_init_compound() doesn't really care about
altmap, what do you think about explicitly calculating the parameters
that memmap_init_compound() needs and passing them in?

Not a strong requirement to change, but take another look and let me know.



>         }
>
>         pr_info("%s initialised %lu pages in %ums\n", __func__,
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 10/14] device-dax: use ALIGN() for determining pgoff
  2021-07-14 19:35 ` [PATCH v3 10/14] device-dax: use ALIGN() for determining pgoff Joao Martins
@ 2021-07-28  7:29   ` Dan Williams
  2021-07-28 15:56     ` Joao Martins
  0 siblings, 1 reply; 74+ messages in thread
From: Dan Williams @ 2021-07-28  7:29 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Rather than calculating @pgoff manually, switch to ALIGN() instead.

Looks good,

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/dax/device.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index dd8222a42808..0b82159b3564 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -234,8 +234,8 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>                  * mapped. No need to consider the zero page, or racing
>                  * conflicting mappings.
>                  */
> -               pgoff = linear_page_index(vmf->vma, vmf->address
> -                               & ~(fault_size - 1));
> +               pgoff = linear_page_index(vmf->vma,
> +                               ALIGN(vmf->address, fault_size));
>                 for (i = 0; i < fault_size / PAGE_SIZE; i++) {
>                         struct page *page;
>
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 11/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  2021-07-14 19:35 ` [PATCH v3 11/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
@ 2021-07-28  7:30   ` Dan Williams
  2021-07-28 15:56     ` Joao Martins
  0 siblings, 1 reply; 74+ messages in thread
From: Dan Williams @ 2021-07-28  7:30 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Right now, only static dax regions have a valid @pgmap pointer in their
> struct dev_dax. Dynamic dax regions, however, do not.
>
> In preparation for device-dax compound pagemap support, make sure that
> dev_dax pgmap field is set after it has been allocated and initialized.

I think this is ok to fold into the patch that needs it.

>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/dax/device.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 0b82159b3564..6e348b5f9d45 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -426,6 +426,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>         }
>
>         pgmap->type = MEMORY_DEVICE_GENERIC;
> +       dev_dax->pgmap = pgmap;
> +
>         addr = devm_memremap_pages(dev, pgmap);
>         if (IS_ERR(addr))
>                 return PTR_ERR(addr);
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 12/14] device-dax: compound pagemap support
  2021-07-27 23:51       ` Dan Williams
@ 2021-07-28  9:36         ` Joao Martins
  2021-07-28 18:51           ` Dan Williams
  0 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-28  9:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On 7/28/21 12:51 AM, Dan Williams wrote:
> On Thu, Jul 15, 2021 at 5:01 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 7/15/21 12:36 AM, Dan Williams wrote:
>>> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>> This patch is not the culprit; the flaw is earlier in the series, specifically in the fourth patch.
>>
>> The fourth patch needs the change below, due to the existing elevated page ref
>> count at zone device memmap init. put_page() called here in memunmap_pages():
>>
>> for (i = 0; i < pgmap->nr_ranges; i++)
>>         for_each_device_pfn(pfn, pgmap, i)
>>                 put_page(pfn_to_page(pfn));
>>
>> ... on a zone_device compound memmap would otherwise always decrease the head page
>> refcount by the @geometry pfn count (leading to the splat you reported).
>>
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index b0e7b8cf3047..79a883af788e 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -102,15 +102,15 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
>>         return (range->start + range_len(range)) >> PAGE_SHIFT;
>>  }
>>
>> -static unsigned long pfn_next(unsigned long pfn)
>> +static unsigned long pfn_next(struct dev_pagemap *pgmap, unsigned long pfn)
>>  {
>>         if (pfn % 1024 == 0)
>>                 cond_resched();
>> -       return pfn + 1;
>> +       return pfn + pgmap_pfn_geometry(pgmap);
> 
> The cond_resched() would need to be fixed up too to something like:
> 
> if (pfn % (1024 << pgmap_geometry_order(pgmap)))
>     cond_resched();
> 
> ...because the goal is to take a break every 1024 iterations, not
> every 1024 pfns.
> 

Ah, good point.

>>  }
>>
>>  #define for_each_device_pfn(pfn, map, i) \
>> -       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
>> +       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(map, pfn))
>>
>>  static void dev_pagemap_kill(struct dev_pagemap *pgmap)
>>  {
>>
>> It could also get the hunk below, but that is somewhat redundant provided we won't touch
>> the tail page refcount throughout the devmap pages' lifetime. Setting the tail pages'
>> refcount to zero was in the pre-v5.14 series, but it got removed under the assumption that
>> they come from the page allocator (where tail pages are already zeroed in refcount).
> 
> Wait, devmap pages never see the page allocator?
> 
By "where tail pages are already zeroed in refcount" I actually meant 'freshly allocated
pages', referring to commit 7118fc2906e2 ("hugetlb: address ref count racing in
prep_compound_gigantic_page"), which removed set_page_count() because setting the page
ref count to zero was redundant.

Albeit devmap pages don't come from the page allocator; they live in a separate zone and
aren't part of the regular page pools (e.g. accessible via alloc_pages()), as you are
aware. Unless of course we reassign them via dax_kmem, but then the struct pages would be
mapped as regular pages without any devmap involvement.

>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 96975edac0a8..469a7aa5cf38 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6623,6 +6623,7 @@ static void __ref memmap_init_compound(struct page *page, unsigned
>> long pfn,
>>                 __init_zone_device_page(page + i, pfn + i, zone_idx,
>>                                         nid, pgmap);
>>                 prep_compound_tail(page, i);
>> +               set_page_count(page + i, 0);
> 
> Looks good to me, and perhaps add a check for an elevated tail page
> refcount at teardown as a sanity check that the tail pages were never
> pinned directly?
> 
Sorry, I didn't completely follow.

Did you mean to set the tail page refcount back to 1 at teardown if it was kept at 0 (e.g.
in memunmap_pages() after put_page()), or to verify that the refcount is indeed still zero
after the put_page() in memunmap_pages()?

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 05/14] mm/sparse-vmemmap: add a pgmap argument to section activation
  2021-07-28  5:56   ` Dan Williams
@ 2021-07-28  9:43     ` Joao Martins
  0 siblings, 0 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-28  9:43 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List



On 7/28/21 6:56 AM, Dan Williams wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> In support of using compound pages for devmap mappings, plumb the pgmap
>> down to the vmemmap_populate implementation. Note that while altmap is
>> retrievable from pgmap the memory hotplug code passes altmap without
>> pgmap[*], so both need to be independently plumbed.
>>
>> So in addition to @altmap, pass @pgmap to sparse section populate
>> functions namely:
>>
>>         sparse_add_section
>>           section_activate
>>             populate_section_memmap
>>               __populate_section_memmap
>>
>> Passing @pgmap allows __populate_section_memmap() to both fetch the
>> geometry in which memmap metadata is created for and also to let
>> sparse-vmemmap fetch pgmap ranges to co-relate to a given section and pick
>> whether to just reuse tail pages from past onlined sections.
> 
> Looks good to me, just one quibble below:
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> 
Thank you!
>>
>> [*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  include/linux/memory_hotplug.h |  5 ++++-
>>  include/linux/mm.h             |  3 ++-
>>  mm/memory_hotplug.c            |  3 ++-
>>  mm/sparse-vmemmap.c            |  3 ++-
>>  mm/sparse.c                    | 24 +++++++++++++++---------
>>  5 files changed, 25 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>> index a7fd2c3ccb77..9b1bca80224d 100644
>> --- a/include/linux/memory_hotplug.h
>> +++ b/include/linux/memory_hotplug.h
>> @@ -14,6 +14,7 @@ struct mem_section;
>>  struct memory_block;
>>  struct resource;
>>  struct vmem_altmap;
>> +struct dev_pagemap;
>>
>>  #ifdef CONFIG_MEMORY_HOTPLUG
>>  struct page *pfn_to_online_page(unsigned long pfn);
>> @@ -60,6 +61,7 @@ typedef int __bitwise mhp_t;
>>  struct mhp_params {
>>         struct vmem_altmap *altmap;
>>         pgprot_t pgprot;
>> +       struct dev_pagemap *pgmap;
>>  };
>>
>>  bool mhp_range_allowed(u64 start, u64 size, bool need_mapping);
>> @@ -333,7 +335,8 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
>>                                        unsigned long nr_pages);
>>  extern bool is_memblock_offlined(struct memory_block *mem);
>>  extern int sparse_add_section(int nid, unsigned long pfn,
>> -               unsigned long nr_pages, struct vmem_altmap *altmap);
>> +               unsigned long nr_pages, struct vmem_altmap *altmap,
>> +               struct dev_pagemap *pgmap);
>>  extern void sparse_remove_section(struct mem_section *ms,
>>                 unsigned long pfn, unsigned long nr_pages,
>>                 unsigned long map_offset, struct vmem_altmap *altmap);
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 7ca22e6e694a..f244a9219ce4 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3083,7 +3083,8 @@ int vmemmap_remap_alloc(unsigned long start, unsigned long end,
>>
>>  void *sparse_buffer_alloc(unsigned long size);
>>  struct page * __populate_section_memmap(unsigned long pfn,
>> -               unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
>> +               unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
>> +               struct dev_pagemap *pgmap);
>>  pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
>>  p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 8cb75b26ea4f..c728a8ff38ad 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -268,7 +268,8 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
>>                 /* Select all remaining pages up to the next section boundary */
>>                 cur_nr_pages = min(end_pfn - pfn,
>>                                    SECTION_ALIGN_UP(pfn + 1) - pfn);
>> -               err = sparse_add_section(nid, pfn, cur_nr_pages, altmap);
>> +               err = sparse_add_section(nid, pfn, cur_nr_pages, altmap,
>> +                                        params->pgmap);
>>                 if (err)
>>                         break;
>>                 cond_resched();
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index bdce883f9286..80d3ba30d345 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -603,7 +603,8 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>>  }
>>
>>  struct page * __meminit __populate_section_memmap(unsigned long pfn,
>> -               unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
>> +               unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
>> +               struct dev_pagemap *pgmap)
>>  {
>>         unsigned long start = (unsigned long) pfn_to_page(pfn);
>>         unsigned long end = start + nr_pages * sizeof(struct page);
>> diff --git a/mm/sparse.c b/mm/sparse.c
>> index 6326cdf36c4f..5310be6171f1 100644
>> --- a/mm/sparse.c
>> +++ b/mm/sparse.c
>> @@ -453,7 +453,8 @@ static unsigned long __init section_map_size(void)
>>  }
>>
>>  struct page __init *__populate_section_memmap(unsigned long pfn,
>> -               unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
>> +               unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
>> +               struct dev_pagemap *pgmap)
>>  {
>>         unsigned long size = section_map_size();
>>         struct page *map = sparse_buffer_alloc(size);
>> @@ -552,7 +553,7 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
>>                         break;
>>
>>                 map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
>> -                               nid, NULL);
>> +                               nid, NULL, NULL);
>>                 if (!map) {
>>                         pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
>>                                __func__, nid);
>> @@ -657,9 +658,10 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
>>
>>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
>>  static struct page * __meminit populate_section_memmap(unsigned long pfn,
>> -               unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
>> +               unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
>> +               struct dev_pagemap *pgmap)
>>  {
>> -       return __populate_section_memmap(pfn, nr_pages, nid, altmap);
>> +       return __populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
>>  }
>>
>>  static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
>> @@ -728,7 +730,8 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
>>  }
>>  #else
>>  struct page * __meminit populate_section_memmap(unsigned long pfn,
>> -               unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
>> +               unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
>> +               struct dev_pagemap *pgmap)
>>  {
>>         return kvmalloc_node(array_size(sizeof(struct page),
>>                                         PAGES_PER_SECTION), GFP_KERNEL, nid);
>> @@ -851,7 +854,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>>  }
>>
>>  static struct page * __meminit section_activate(int nid, unsigned long pfn,
>> -               unsigned long nr_pages, struct vmem_altmap *altmap)
>> +               unsigned long nr_pages, struct vmem_altmap *altmap,
>> +               struct dev_pagemap *pgmap)
>>  {
>>         struct mem_section *ms = __pfn_to_section(pfn);
>>         struct mem_section_usage *usage = NULL;
>> @@ -883,7 +887,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
>>         if (nr_pages < PAGES_PER_SECTION && early_section(ms))
>>                 return pfn_to_page(pfn);
>>
>> -       memmap = populate_section_memmap(pfn, nr_pages, nid, altmap);
>> +       memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
>>         if (!memmap) {
>>                 section_deactivate(pfn, nr_pages, altmap);
>>                 return ERR_PTR(-ENOMEM);
>> @@ -898,6 +902,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
>>   * @start_pfn: start pfn of the memory range
>>   * @nr_pages: number of pfns to add in the section
>>   * @altmap: device page map
>> + * @pgmap: device page map object that owns the section
> 
> Since this patch is touching the kdoc, might as well fix it up
> properly for @altmap, and perhaps an alternate note for @pgmap:
> 
> @altmap: alternate pfns to allocate the memmap backing store
> @pgmap: alternate compound page geometry for devmap mappings
> 
Ah, indeed. I fixed it up and also added this to the commit message:

"While at it, fix the kdoc for @altmap for sparse_add_section()."

> 
>>   *
>>   * This is only intended for hotplug.
>>   *
>> @@ -911,7 +916,8 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
>>   * * -ENOMEM   - Out of memory.
>>   */
>>  int __meminit sparse_add_section(int nid, unsigned long start_pfn,
>> -               unsigned long nr_pages, struct vmem_altmap *altmap)
>> +               unsigned long nr_pages, struct vmem_altmap *altmap,
>> +               struct dev_pagemap *pgmap)
>>  {
>>         unsigned long section_nr = pfn_to_section_nr(start_pfn);
>>         struct mem_section *ms;
>> @@ -922,7 +928,7 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
>>         if (ret < 0)
>>                 return ret;
>>
>> -       memmap = section_activate(nid, start_pfn, nr_pages, altmap);
>> +       memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
>>         if (IS_ERR(memmap))
>>                 return PTR_ERR(memmap);
>>
>> --
>> 2.17.1
>>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 06/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper
  2021-07-28  6:04   ` Dan Williams
@ 2021-07-28 10:48     ` Joao Martins
  0 siblings, 0 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-28 10:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List



On 7/28/21 7:04 AM, Dan Williams wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> In preparation for describing a memmap with compound pages, move the
>> actual pte population logic into a separate function
>> vmemmap_populate_address() and have vmemmap_populate_basepages() walk
>> through all base pages it needs to populate.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/sparse-vmemmap.c | 44 ++++++++++++++++++++++++++------------------
>>  1 file changed, 26 insertions(+), 18 deletions(-)
>>
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index 80d3ba30d345..76f4158f6301 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -570,33 +570,41 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>>         return pgd;
>>  }
>>
>> -int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>> -                                        int node, struct vmem_altmap *altmap)
>> +static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> +                                             struct vmem_altmap *altmap)
>>  {
>> -       unsigned long addr = start;
>>         pgd_t *pgd;
>>         p4d_t *p4d;
>>         pud_t *pud;
>>         pmd_t *pmd;
>>         pte_t *pte;
>>
>> +       pgd = vmemmap_pgd_populate(addr, node);
>> +       if (!pgd)
>> +               return -ENOMEM;
>> +       p4d = vmemmap_p4d_populate(pgd, addr, node);
>> +       if (!p4d)
>> +               return -ENOMEM;
>> +       pud = vmemmap_pud_populate(p4d, addr, node);
>> +       if (!pud)
>> +               return -ENOMEM;
>> +       pmd = vmemmap_pmd_populate(pud, addr, node);
>> +       if (!pmd)
>> +               return -ENOMEM;
>> +       pte = vmemmap_pte_populate(pmd, addr, node, altmap);
>> +       if (!pte)
>> +               return -ENOMEM;
>> +       vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
> 
> Missing a return here:
> 
> mm/sparse-vmemmap.c:598:1: error: control reaches end of non-void
> function [-Werror=return-type]
> 
> Yes, it's fixed up in a later patch 

That fixup definitely needs to be moved here.

>, but might as well not leave the
> bisect breakage lying around, and the kbuild robot would gripe about
> this eventually as well.
> 
Yeap. Fixed, thanks for noticing.

> 
>> +}
>> +
>> +int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>> +                                        int node, struct vmem_altmap *altmap)
>> +{
>> +       unsigned long addr = start;
>> +
>>         for (; addr < end; addr += PAGE_SIZE) {
>> -               pgd = vmemmap_pgd_populate(addr, node);
>> -               if (!pgd)
>> -                       return -ENOMEM;
>> -               p4d = vmemmap_p4d_populate(pgd, addr, node);
>> -               if (!p4d)
>> -                       return -ENOMEM;
>> -               pud = vmemmap_pud_populate(p4d, addr, node);
>> -               if (!pud)
>> -                       return -ENOMEM;
>> -               pmd = vmemmap_pmd_populate(pud, addr, node);
>> -               if (!pmd)
>> -                       return -ENOMEM;
>> -               pte = vmemmap_pte_populate(pmd, addr, node, altmap);
>> -               if (!pte)
>> +               if (vmemmap_populate_address(addr, node, altmap))
>>                         return -ENOMEM;
> 
> I'd prefer:
> 
> rc = vmemmap_populate_address(addr, node, altmap);
> if (rc)
>     return rc;
> 
> ...in case future refactoring adds different error codes to pass up.
> 
Fixed.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 08/14] mm/sparse-vmemmap: populate compound pagemaps
  2021-07-28  6:55   ` Dan Williams
@ 2021-07-28 15:35     ` Joao Martins
  2021-07-28 18:03       ` Dan Williams
  0 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-28 15:35 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List


On 7/28/21 7:55 AM, Dan Williams wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> A compound pagemap is a dev_pagemap with @align > PAGE_SIZE and it
> 
> Maybe s/compound pagemap/compound devmap/ per the other planned usage
> of "devmap" in the implementation?
> 
Yeap. I am replacing pagemap with devmap -- hopefully better done than
the s/align/geometry/ rename, which still has some leftovers in this series.

>> means that pages are mapped at a given huge page alignment and
>> utilizes compound pages as opposed to order-0 pages.
>>
>> Take advantage of the fact that most tail pages look the same (except
>> the first two) to minimize struct page overhead. Allocate a separate
>> page for the vmemmap area which contains the head page and separate for
>> the next 64 pages. The rest of the subsections then reuse this tail
>> vmemmap page to initialize the rest of the tail pages.
>>
>> Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
>> when initializing compound pagemap with big enough @align (e.g. 1G
> 
> s/@align/@geometry/?
> 
Yeap (and the previous mention too in the hunk before this one).

>> PUD) it will cross various sections.
> 
> s/will cross various/may cross multiple/
> 
OK

>> To be able to reuse tail pages
>> across sections belonging to the same gigantic page, fetch the
>> @range being mapped (nr_ranges + 1).  If the section being mapped is
>> not offset 0 of the @align, then lookup the PFN of the struct page
>> address that precedes it and use that to populate the entire
>> section.
> 
> This sounds like code being read aloud. I would just say something like:
> 
> "The vmemmap code needs to consult @pgmap so that multiple sections
> that all map the same tail data can refer back to the first copy of
> that data for a given gigantic page."
> 
Fixed.

>>
>> On compound pagemaps with 2M align, this mechanism lets 6 pages be
>> saved out of the 8 PFNs necessary to map the subsection's
>> 512 struct pages. On a 1G compound pagemap it saves
>> 4094 pages.
>>
>> Altmap isn't supported yet, given various restrictions in the altmap pfn
>> allocator, so fall back to the already in use vmemmap_populate().  It
>> is worth noting that altmap for devmap mappings was there to relieve the
>> pressure of inordinate amounts of memmap space to map terabytes of pmem.
>> With compound pages the motivation for altmaps for pmem gets reduced.
> 
> Looks good just some minor comments / typo fixes, and some requests
> for a few more helper functions.
> 
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  Documentation/vm/vmemmap_dedup.rst |  27 +++++-
>>  include/linux/mm.h                 |   2 +-
>>  mm/memremap.c                      |   1 +
>>  mm/sparse-vmemmap.c                | 133 +++++++++++++++++++++++++++--
>>  4 files changed, 151 insertions(+), 12 deletions(-)
>>
>> diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
>> index 215ae2ef3bce..42830a667c2a 100644
>> --- a/Documentation/vm/vmemmap_dedup.rst
>> +++ b/Documentation/vm/vmemmap_dedup.rst
>> @@ -2,9 +2,12 @@
>>
>>  .. _vmemmap_dedup:
>>
>> -==================================
>> -Free some vmemmap pages of HugeTLB
>> -==================================
>> +=================================================
>> +Free some vmemmap pages of HugeTLB and Device DAX
> 
> How about "A vmemmap diet for HugeTLB and Device DAX"
> 
> ...because in the HugeTLB case it is dynamically remapping and freeing
> the pages after the fact, while Device-DAX is avoiding the allocation
> in the first instance.
> 
Yeap. Better title indeed, fixed.

[...]

>> +static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>> +                                                    unsigned long start,
>> +                                                    unsigned long end, int node,
>> +                                                    struct dev_pagemap *pgmap)
>> +{
>> +       unsigned long offset, size, addr;
>> +
>> +       /*
>> +        * For compound pages bigger than section size (e.g. x86 1G compound
>> +        * pages with 2M subsection size) fill the rest of sections as tail
>> +        * pages.
>> +        *
>> +        * Note that memremap_pages() resets @nr_range value and will increment
>> +        * it after each successful range onlining. Thus the value of @nr_range
>> +        * at section memmap populate corresponds to the in-progress range
>> +        * being onlined here.
>> +        */
>> +       offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>> +       if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
>> +           pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
> 
> How about moving the last 3 lines plus the comment to a helper so this
> becomes something like:
> 
> if (compound_section_index(start_pfn, pgmap))
> 
> ...where it is clear that for the Nth section in a compound page where
> N is > 0, it can lookup the page data to reuse.
> 
Definitely more readable.

Here's what I have so far (already with the change
of pgmap_geometry() to return the number of pages):

+/*
+ * For compound pages bigger than section size (e.g. x86 1G compound
+ * pages with 2M subsection size) fill the rest of sections as tail
+ * pages.
+ *
+ * Note that memremap_pages() resets @nr_range value and will increment
+ * it after each successful range onlining. Thus the value of @nr_range
+ * at section memmap populate corresponds to the in-progress range
+ * being onlined here.
+ */
+static bool compound_section_index(unsigned long start_pfn,
+                                  struct dev_pagemap *pgmap)
+{
+       unsigned long geometry_size = pgmap_geometry(pgmap) << PAGE_SHIFT;
+       unsigned long offset = PFN_PHYS(start_pfn) -
+               pgmap->ranges[pgmap->nr_range].start;
+
+       return !IS_ALIGNED(offset, geometry_size) &&
+               geometry_size > SUBSECTION_SIZE;
+}
+
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
                                                     unsigned long start,
                                                     unsigned long end, int node,
                                                     struct dev_pagemap *pgmap)
 {
-       unsigned long geometry_size = pgmap_geometry(pgmap) << PAGE_SHIFT;
        unsigned long offset, size, addr;

-       /*
-        * For compound pages bigger than section size (e.g. x86 1G compound
-        * pages with 2M subsection size) fill the rest of sections as tail
-        * pages.
-        *
-        * Note that memremap_pages() resets @nr_range value and will increment
-        * it after each range successful onlining. Thus the value or @nr_range
-        * at section memmap populate corresponds to the in-progress range
-        * being onlined here.
-        */
-       offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
-       if (!IS_ALIGNED(offset, geometry_size) &&
-           geometry_size > SUBSECTION_SIZE) {
+       if (compound_section_index(start_pfn, pgmap)) {
                pte_t *ptep;

                addr = start - PAGE_SIZE;


> 
>> +               pte_t *ptep;
>> +
>> +               addr = start - PAGE_SIZE;
>> +
>> +               /*
>> +                * Sections are populated sequentially and in succession, meaning
>> +                * this section being populated wouldn't start if the
>> +                * preceding one wasn't successful. So there is a guarantee that
>> +                * the previous struct pages are mapped when trying to lookup
>> +                * the last tail page.
> 
> I think you can cut this down to:
> 
> "Assuming sections are populated sequentially, the previous section's
> page data can be reused."
> 
OK.

> ...and maybe this can be a helper like:
> 
> compound_section_tail_page()?
> 
It makes this patch more readable.

Albeit doing this means we might need a compound_section_tail_huge_page (...)

> 
>> +                * the last tail page.
> 
>> +               ptep = pte_offset_kernel(pmd_off_k(addr), addr);
>> +               if (!ptep)
>> +                       return -ENOMEM;
>> +
>> +               /*
>> +                * Reuse the page that was populated in the prior iteration
>> +                * with just tail struct pages.
>> +                */
>> +               return vmemmap_populate_range(start, end, node,
>> +                                             pte_page(*ptep));
>> +       }

The last patch separates the above check and uses the PMD (and the @offset) to reuse the
PMD of the compound_section_tail_page(). So this might mean that we introduce
in the last patch some sort of compound_section_tail_huge_page() for the pmd page.
So far, the second change doesn't seem to translate into an obvious improvement in readability.

Pasted below; here's compound_section_tail_page() [...]

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d7419b5d54d7..31f94802c095 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -673,6 +673,23 @@ static bool __meminit compound_section_index(unsigned long start_pfn,
                geometry_size > SUBSECTION_SIZE;
 }

+static struct page * __meminit compound_section_tail_page(unsigned long addr)
+{
+       pte_t *ptep;
+
+       addr -= PAGE_SIZE;
+
+       /*
+        * Assuming sections are populated sequentially, the previous section's
+        * page data can be reused.
+        */
+       ptep = pte_offset_kernel(pmd_off_k(addr), addr);
+       if (!ptep)
+               return NULL;
+
+       return pte_page(*ptep);
+}
+
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
                                                     unsigned long start,
                                                     unsigned long end, int node,
@@ -681,27 +698,17 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long
start_pfn,
        unsigned long offset, size, addr;

        if (compound_section_index(start_pfn, pgmap)) {
-               pte_t *ptep;
-
-               addr = start - PAGE_SIZE;
+               struct page *page;

-               /*
-                * Sections are populated sequently and in sucession meaning
-                * this section being populated wouldn't start if the
-                * preceding one wasn't successful. So there is a guarantee that
-                * the previous struct pages are mapped when trying to lookup
-                * the last tail page.
-                */
-               ptep = pte_offset_kernel(pmd_off_k(addr), addr);
-               if (!ptep)
+               page = compound_section_tail_page(start);
+               if (!page)
                        return -ENOMEM;

                /*
                 * Reuse the page that was populated in the prior iteration
                 * with just tail struct pages.
                 */
-               return vmemmap_populate_range(start, end, node,
-                                             pte_page(*ptep));
+               return vmemmap_populate_range(start, end, node, page);
        }

        size = min(end - start, pgmap_geometry(pgmap) * sizeof(struct page));




[...] And here's compound_section_tail_huge_page() (for the last patch in the series):


@@ -690,6 +727,33 @@ static struct page * __meminit compound_section_tail_page(unsigned
long addr)
        return pte_page(*ptep);
 }

+static struct page * __meminit compound_section_tail_huge_page(unsigned long addr,
+                               unsigned long offset, struct dev_pagemap *pgmap)
+{
+       unsigned long geometry_size = pgmap_geometry(pgmap) << PAGE_SHIFT;
+       pmd_t *pmdp;
+
+       addr -= PAGE_SIZE;
+
+       /*
+        * Assuming sections are populated sequentially, the previous section's
+        * page data can be reused.
+        */
+       pmdp = pmd_off_k(addr);
+       if (!pmdp)
+               return ERR_PTR(-ENOMEM);
+
+       /*
+        * Reuse the tail pages vmemmap pmd page
+        * See layout diagram in Documentation/vm/vmemmap_dedup.rst
+        */
+       if (offset % geometry_size > PFN_PHYS(PAGES_PER_SECTION))
+               return pmd_page(*pmdp);
+
+       /* No reusable PMD; fall back to PTE tail page */
+       return NULL;
+}
+
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
                                                     unsigned long start,
                                                     unsigned long end, int node,
@@ -697,14 +761,22 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long
start_pfn,
 {
        unsigned long offset, size, addr;

-       if (compound_section_index(start_pfn, pgmap)) {
-               struct page *page;
+       if (compound_section_index(start_pfn, pgmap, &offset)) {
+               struct page *page, *hpage;
+
+               hpage = compound_section_tail_huge_page(addr, offset);
+               if (IS_ERR(hpage))
+                       return -ENOMEM;
+               else if (hpage)
+                       return vmemmap_populate_pmd_range(start, end, node,
+                                                         hpage);

                page = compound_section_tail_page(start);
                if (!page)
                        return -ENOMEM;

                /*
+                * Populate the tail pages vmemmap pmd page.
                 * Reuse the page that was populated in the prior iteration
                 * with just tail struct pages.
                 */


* Re: [PATCH v3 09/14] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-07-28  7:28   ` Dan Williams
@ 2021-07-28 15:56     ` Joao Martins
  2021-07-28 16:08       ` Dan Williams
  0 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-28 15:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On 7/28/21 8:28 AM, Dan Williams wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> +       /*
>> +        * With compound page geometry and when struct pages are stored in ram
>> +        * (!altmap) most tail pages are reused. Consequently, the amount of
>> +        * unique struct pages to initialize is a lot smaller than the total
>> +        * amount of struct pages being mapped.
>> +        * See vmemmap_populate_compound_pages().
>> +        */
>> +       if (!altmap)
>> +               nr_pages = min_t(unsigned long, nr_pages,
> 
> What's the scenario where nr_pages is < 128? Shouldn't alignment
> already be guaranteed?
> 
Oh yeah, that's right.

>> +                                2 * (PAGE_SIZE/sizeof(struct page)));
> 
> 
>> +
>>         __SetPageHead(page);
>>
>>         for (i = 1; i < nr_pages; i++) {
>> @@ -6657,7 +6669,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>                         continue;
>>
>>                 memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
>> -                                    pfns_per_compound);
>> +                                    altmap, pfns_per_compound);
> 
> This feels odd, memmap_init_compound() doesn't really care about
> altmap, what do you think about explicitly calculating the parameters
> that memmap_init_compound() needs and passing them in?
> 
> Not a strong requirement to change, but take another look and let me know.
> 

Yeah, memmap_init_compound() indeed doesn't care about @altmap itself -- but a previous
comment was to abstract this away in memmap_init_compound(), given the mix of complexity in
memmap_init_zone_device() between the PAGE_SIZE geometry case and the compound case:

https://lore.kernel.org/linux-mm/CAPcyv4gtSqfmuAaX9cs63OvLkf-h4B_5fPiEnM9p9cqLZztXpg@mail.gmail.com/

Before, this was called @ntails and I hid that calculation in memmap_init_compound().

But I can move this back to the caller:

memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
	(!altmap ? 2 * (PAGE_SIZE/sizeof(struct page)) : pfns_per_compound);

Or with another helper like:

#define compound_nr_pages(__altmap, __nr_pages) \
		(!__altmap ? 2 * (PAGE_SIZE/sizeof(struct page)) : __nr_pages)
			
memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
		     compound_nr_pages(altmap, pfns_per_compound));


* Re: [PATCH v3 10/14] device-dax: use ALIGN() for determining pgoff
  2021-07-28  7:29   ` Dan Williams
@ 2021-07-28 15:56     ` Joao Martins
  0 siblings, 0 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-28 15:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List



On 7/28/21 8:29 AM, Dan Williams wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Rather than calculating @pgoff manually, switch to ALIGN() instead.
> 
> Looks good,
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> 
Thanks!
>>
>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  drivers/dax/device.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>> index dd8222a42808..0b82159b3564 100644
>> --- a/drivers/dax/device.c
>> +++ b/drivers/dax/device.c
>> @@ -234,8 +234,8 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>>                  * mapped. No need to consider the zero page, or racing
>>                  * conflicting mappings.
>>                  */
>> -               pgoff = linear_page_index(vmf->vma, vmf->address
>> -                               & ~(fault_size - 1));
>> +               pgoff = linear_page_index(vmf->vma,
>> +                               ALIGN(vmf->address, fault_size));
>>                 for (i = 0; i < fault_size / PAGE_SIZE; i++) {
>>                         struct page *page;
>>
>> --
>> 2.17.1
>>


* Re: [PATCH v3 11/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  2021-07-28  7:30   ` Dan Williams
@ 2021-07-28 15:56     ` Joao Martins
  0 siblings, 0 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-28 15:56 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List



On 7/28/21 8:30 AM, Dan Williams wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Right now, only static dax regions have a valid @pgmap pointer in their
>> struct dev_dax. The dynamic dax case, however, does not.
>>
>> In preparation for device-dax compound pagemap support, make sure that
>> dev_dax pgmap field is set after it has been allocated and initialized.
> 
> I think this is ok to fold into the patch that needs it.

OK, I've squashed that in.


* Re: [PATCH v3 09/14] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-07-28 15:56     ` Joao Martins
@ 2021-07-28 16:08       ` Dan Williams
  2021-07-28 16:12         ` Joao Martins
  0 siblings, 1 reply; 74+ messages in thread
From: Dan Williams @ 2021-07-28 16:08 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 28, 2021 at 8:56 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 7/28/21 8:28 AM, Dan Williams wrote:
> > On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> +       /*
> >> +        * With compound page geometry and when struct pages are stored in ram
> >> +        * (!altmap) most tail pages are reused. Consequently, the amount of
> >> +        * unique struct pages to initialize is a lot smaller than the total
> >> +        * amount of struct pages being mapped.
> >> +        * See vmemmap_populate_compound_pages().
> >> +        */
> >> +       if (!altmap)
> >> +               nr_pages = min_t(unsigned long, nr_pages,
> >
> > What's the scenario where nr_pages is < 128? Shouldn't alignment
> > already be guaranteed?
> >
> Oh yeah, that's right.
>
> >> +                                2 * (PAGE_SIZE/sizeof(struct page)));
> >
> >
> >> +
> >>         __SetPageHead(page);
> >>
> >>         for (i = 1; i < nr_pages; i++) {
> >> @@ -6657,7 +6669,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
> >>                         continue;
> >>
> >>                 memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
> >> -                                    pfns_per_compound);
> >> +                                    altmap, pfns_per_compound);
> >
> > This feels odd, memmap_init_compound() doesn't really care about
> > altmap, what do you think about explicitly calculating the parameters
> > that memmap_init_compound() needs and passing them in?
> >
> > Not a strong requirement to change, but take another look and let me know.
> >
>
> Yeah, memmap_init_compound() indeed doesn't care about @altmap itself -- but a previous
> comment was to abstract this away in memmap_init_compound() given the mix of complexity in
> memmap_init_zone_device() PAGE_SIZE geometry case and the compound case:
>
> https://lore.kernel.org/linux-mm/CAPcyv4gtSqfmuAaX9cs63OvLkf-h4B_5fPiEnM9p9cqLZztXpg@mail.gmail.com/
>
> Before, this was called @ntails and I hid that calculation in memmap_init_compound().
>
> But I can move this back to the caller:
>
> memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
>         (!altmap ? 2 * (PAGE_SIZE/sizeof(struct page)) : pfns_per_compound);
>
> Or with another helper like:
>
> #define compound_nr_pages(__altmap, __nr_pages) \
>                 (!__altmap ? 2 * (PAGE_SIZE/sizeof(struct page)) : __nr_pages)
>
> memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
>                      compound_nr_pages(altmap, pfns_per_compound));

I like the helper, but I'd go further to make it a function with a
comment that it is a paired / mild layering violation with explicit
knowledge of how the sparse_vmemmap() internals handle compound pages
in the presence of an altmap. I.e. if someone later goes to add altmap
support, leave them a breadcrumb that they need to update both
locations.


* Re: [PATCH v3 09/14] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-07-28 16:08       ` Dan Williams
@ 2021-07-28 16:12         ` Joao Martins
  0 siblings, 0 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-28 16:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List



On 7/28/21 5:08 PM, Dan Williams wrote:
> On Wed, Jul 28, 2021 at 8:56 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> On 7/28/21 8:28 AM, Dan Williams wrote:
>>> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>
>>>> +       /*
>>>> +        * With compound page geometry and when struct pages are stored in ram
>>>> +        * (!altmap) most tail pages are reused. Consequently, the amount of
>>>> +        * unique struct pages to initialize is a lot smaller than the total
>>>> +        * amount of struct pages being mapped.
>>>> +        * See vmemmap_populate_compound_pages().
>>>> +        */
>>>> +       if (!altmap)
>>>> +               nr_pages = min_t(unsigned long, nr_pages,
>>>
>>> What's the scenario where nr_pages is < 128? Shouldn't alignment
>>> already be guaranteed?
>>>
>> Oh yeah, that's right.
>>
>>>> +                                2 * (PAGE_SIZE/sizeof(struct page)));
>>>
>>>
>>>> +
>>>>         __SetPageHead(page);
>>>>
>>>>         for (i = 1; i < nr_pages; i++) {
>>>> @@ -6657,7 +6669,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>>>                         continue;
>>>>
>>>>                 memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
>>>> -                                    pfns_per_compound);
>>>> +                                    altmap, pfns_per_compound);
>>>
>>> This feels odd, memmap_init_compound() doesn't really care about
>>> altmap, what do you think about explicitly calculating the parameters
>>> that memmap_init_compound() needs and passing them in?
>>>
>>> Not a strong requirement to change, but take another look and let me know.
>>>
>>
>> Yeah, memmap_init_compound() indeed doesn't care about @altmap itself -- but a previous
>> comment was to abstract this away in memmap_init_compound() given the mix of complexity in
>> memmap_init_zone_device() PAGE_SIZE geometry case and the compound case:
>>
>> https://lore.kernel.org/linux-mm/CAPcyv4gtSqfmuAaX9cs63OvLkf-h4B_5fPiEnM9p9cqLZztXpg@mail.gmail.com/
>>
>> Before, this was called @ntails and I hid that calculation in memmap_init_compound().
>>
>> But I can move this back to the caller:
>>
>> memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
>>         (!altmap ? 2 * (PAGE_SIZE/sizeof(struct page)) : pfns_per_compound);
>>
>> Or with another helper like:
>>
>> #define compound_nr_pages(__altmap, __nr_pages) \
>>                 (!__altmap ? 2 * (PAGE_SIZE/sizeof(struct page)) : __nr_pages)
>>
>> memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
>>                      compound_nr_pages(altmap, pfns_per_compound));
> 
> I like the helper, but I'd go further to make it a function with a
> comment that it is a paired / mild layering violation with explicit
> knowledge of how the sparse_vmemmap() internals handle compound pages
> in the presence of an altmap. I.e. if someone later goes to add altmap
> support, leave them a breadcrumb that they need to update both
> locations.
> 
OK, got it.


* Re: [PATCH v3 08/14] mm/sparse-vmemmap: populate compound pagemaps
  2021-07-28 15:35     ` Joao Martins
@ 2021-07-28 18:03       ` Dan Williams
  2021-07-28 18:54         ` Joao Martins
  0 siblings, 1 reply; 74+ messages in thread
From: Dan Williams @ 2021-07-28 18:03 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 28, 2021 at 8:36 AM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
> +/*
> + * For compound pages bigger than section size (e.g. x86 1G compound
> + * pages with 2M subsection size) fill the rest of sections as tail
> + * pages.
> + *
> + * Note that memremap_pages() resets @nr_range value and will increment
> + * it after each successful range onlining. Thus the value of @nr_range
> + * at section memmap populate corresponds to the in-progress range
> + * being onlined here.
> + */
> +static bool compound_section_index(unsigned long start_pfn,

Oh, I was thinking this would return the actual Nth index number for
the section within the compound page. A bool is ok too, but then the
function name would be something like:

reuse_compound_section()

...right?


[..]
> [...] And here's compound_section_tail_huge_page() (for the last patch in the series):
>
>
> @@ -690,6 +727,33 @@ static struct page * __meminit compound_section_tail_page(unsigned
> long addr)
>         return pte_page(*ptep);
>  }
>
> +static struct page * __meminit compound_section_tail_huge_page(unsigned long addr,
> +                               unsigned long offset, struct dev_pagemap *pgmap)
> +{
> +       unsigned long geometry_size = pgmap_geometry(pgmap) << PAGE_SHIFT;
> +       pmd_t *pmdp;
> +
> +       addr -= PAGE_SIZE;
> +
> +       /*
> +        * Assuming sections are populated sequentially, the previous section's
> +        * page data can be reused.
> +        */
> +       pmdp = pmd_off_k(addr);
> +       if (!pmdp)
> +               return ERR_PTR(-ENOMEM);
> +
> +       /*
> +        * Reuse the tail pages vmemmap pmd page
> +        * See layout diagram in Documentation/vm/vmemmap_dedup.rst
> +        */
> +       if (offset % geometry_size > PFN_PHYS(PAGES_PER_SECTION))
> +               return pmd_page(*pmdp);
> +
> +       /* No reusable PMD fallback to PTE tail page*/
> +       return NULL;
> +}
> +
>  static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>                                                      unsigned long start,
>                                                      unsigned long end, int node,
> @@ -697,14 +761,22 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long
> start_pfn,
>  {
>         unsigned long offset, size, addr;
>
> -       if (compound_section_index(start_pfn, pgmap)) {
> -               struct page *page;
> +       if (compound_section_index(start_pfn, pgmap, &offset)) {
> +               struct page *page, *hpage;
> +
> +               hpage = compound_section_tail_huge_page(addr, offset);
> +               if (IS_ERR(hpage))
> +                       return -ENOMEM;
> +               else if (hpage)

No need for "else" after return... other than that these helpers and
this arrangement looks good to me.


* Re: [PATCH v3 12/14] device-dax: compound pagemap support
  2021-07-28  9:36         ` Joao Martins
@ 2021-07-28 18:51           ` Dan Williams
  2021-07-28 18:59             ` Joao Martins
  0 siblings, 1 reply; 74+ messages in thread
From: Dan Williams @ 2021-07-28 18:51 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 28, 2021 at 2:36 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 7/28/21 12:51 AM, Dan Williams wrote:
> > On Thu, Jul 15, 2021 at 5:01 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >> On 7/15/21 12:36 AM, Dan Williams wrote:
> >>> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >> This patch is not the culprit, the flaw is early in the series, specifically the fourth patch.
> >>
> >> It needs this chunk below change on the fourth patch due to the existing elevated page ref
> >> count at zone device memmap init. put_page() called here in memunmap_pages():
> >>
> >> for (i = 0; i < pgmap->nr_ranges; i++)
> >>         for_each_device_pfn(pfn, pgmap, i)
> >>                 put_page(pfn_to_page(pfn));
> >>
> >> ... on a zone_device compound memmap would otherwise always decrease head page refcount by
> >> @geometry pfn amount (leading to the aforementioned splat you reported).
> >>
> >> diff --git a/mm/memremap.c b/mm/memremap.c
> >> index b0e7b8cf3047..79a883af788e 100644
> >> --- a/mm/memremap.c
> >> +++ b/mm/memremap.c
> >> @@ -102,15 +102,15 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
> >>         return (range->start + range_len(range)) >> PAGE_SHIFT;
> >>  }
> >>
> >> -static unsigned long pfn_next(unsigned long pfn)
> >> +static unsigned long pfn_next(struct dev_pagemap *pgmap, unsigned long pfn)
> >>  {
> >>         if (pfn % 1024 == 0)
> >>                 cond_resched();
> >> -       return pfn + 1;
> >> +       return pfn + pgmap_pfn_geometry(pgmap);
> >
> > The cond_resched() would need to be fixed up too to something like:
> >
> > if (pfn % (1024 << pgmap_geometry_order(pgmap)))
> >     cond_resched();
> >
> > ...because the goal is to take a break every 1024 iterations, not
> > every 1024 pfns.
> >
>
> Ah, good point.
>
> >>  }
> >>
> >>  #define for_each_device_pfn(pfn, map, i) \
> >> -       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
> >> +       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(map, pfn))
> >>
> >>  static void dev_pagemap_kill(struct dev_pagemap *pgmap)
> >>  {
> >>
> >> It could also get the hunk below, but it is sort of redundant provided we won't touch
> >> the tail page refcount throughout the devmap pages' lifetime. This setting of the tail
> >> pages' refcount to zero was in the pre-v5.14 series, but it got removed under the
> >> assumption that it comes from the page allocator (where tail pages are already zeroed
> >> in refcount).
> >
> > Wait, devmap pages never see the page allocator?
> >
> By "where tail pages are already zeroed in refcount" I actually meant 'freshly
> allocated pages', and I was referring to commit 7118fc2906e2 ("hugetlb: address
> ref count racing in prep_compound_gigantic_page"), which removed set_page_count()
> because setting the page ref count to zero was redundant.

Ah, maybe include that reference in the changelog?

>
> Albeit devmap pages don't come from the page allocator (you know, separate zone)
> and these pages aren't part of the regular page pools (e.g. accessible via
> alloc_pages()), as you are aware. Unless, of course, we reassign them via dax_kmem,
> but then the struct pages would be mapped the regular way, without any devmap stuff.

Got it. I think with the back reference to that commit (7118fc2906e2)
it resolves my confusion.

>
> >>
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 96975edac0a8..469a7aa5cf38 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -6623,6 +6623,7 @@ static void __ref memmap_init_compound(struct page *page, unsigned
> >> long pfn,
> >>                 __init_zone_device_page(page + i, pfn + i, zone_idx,
> >>                                         nid, pgmap);
> >>                 prep_compound_tail(page, i);
> >> +               set_page_count(page + i, 0);
> >
> > Looks good to me, and perhaps add a check for an elevated tail page refcount
> > at teardown, as a sanity check that the tail pages were never pinned
> > directly?
> >
> Sorry, I didn't follow completely.
>
> Did you mean to set the tail page refcount back to 1 at teardown if it was kept at 0
> (e.g. memunmap_pages() after put_page()), or to check that the refcount is indeed
> kept at zero after the put_page() in memunmap_pages()?

The latter, i.e. would it be worth it to check that a tail page did
not get accidentally pinned instead of a head page? I'm also ok to
leave out that sanity checking for now.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 08/14] mm/sparse-vmemmap: populate compound pagemaps
  2021-07-28 18:03       ` Dan Williams
@ 2021-07-28 18:54         ` Joao Martins
  2021-07-28 20:04           ` Joao Martins
  0 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-28 18:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List



On 7/28/21 7:03 PM, Dan Williams wrote:
> On Wed, Jul 28, 2021 at 8:36 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> [..]
>> +/*
>> + * For compound pages bigger than section size (e.g. x86 1G compound
>> + * pages with 2M subsection size) fill the rest of sections as tail
>> + * pages.
>> + *
>> + * Note that memremap_pages() resets the @nr_range value and will increment
>> + * it after each range's successful onlining. Thus the value of @nr_range
>> + * at section memmap populate time corresponds to the in-progress range
>> + * being onlined here.
>> + */
>> +static bool compound_section_index(unsigned long start_pfn,
> 
> Oh, I was thinking this would return the actual Nth index number for
> the section within the compound page. 
> A bool is ok too, but then the
> function name would be something like:
> 
> reuse_compound_section()
> 
> ...right?
> 
Yes.

> 
> [..]
>> [...] And here's compound_section_tail_huge_page() (for the last patch in the series):
>>
>>
>> @@ -690,6 +727,33 @@ static struct page * __meminit compound_section_tail_page(unsigned
>> long addr)
>>         return pte_page(*ptep);
>>  }
>>
>> +static struct page * __meminit compound_section_tail_huge_page(unsigned long addr,
>> +                               unsigned long offset, struct dev_pagemap *pgmap)
>> +{
>> +       unsigned long geometry_size = pgmap_geometry(pgmap) << PAGE_SHIFT;
>> +       pmd_t *pmdp;
>> +
>> +       addr -= PAGE_SIZE;
>> +
>> +       /*
>> +        * Assuming sections are populated sequentially, the previous section's
>> +        * page data can be reused.
>> +        */
>> +       pmdp = pmd_off_k(addr);
>> +       if (!pmdp)
>> +               return ERR_PTR(-ENOMEM);
>> +
>> +       /*
>> +        * Reuse the tail pages vmemmap pmd page
>> +        * See layout diagram in Documentation/vm/vmemmap_dedup.rst
>> +        */
>> +       if (offset % geometry_size > PFN_PHYS(PAGES_PER_SECTION))
>> +               return pmd_page(*pmdp);
>> +
Maybe I can just do this here:

if (PHYS_PFN(offset) % pgmap_geometry(pgmap) > PAGES_PER_SECTION)

and thus drop the geometry_size variable.

>> +       /* No reusable PMD, fall back to PTE tail page */
>> +       return NULL;
>> +}
>> +
>>  static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>>                                                      unsigned long start,
>>                                                      unsigned long end, int node,
>> @@ -697,14 +761,22 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long
>> start_pfn,
>>  {
>>         unsigned long offset, size, addr;
>>
>> -       if (compound_section_index(start_pfn, pgmap)) {
>> -               struct page *page;
>> +       if (compound_section_index(start_pfn, pgmap, &offset)) {
>> +               struct page *page, *hpage;
>> +
>> +               hpage = compound_section_tail_huge_page(addr, offset, pgmap);
>> +               if (IS_ERR(hpage))
>> +                       return -ENOMEM;
>> +               else if (hpage)
> 
> No need for "else" after return... other than that these helpers and
> this arrangement looks good to me.
> 
OK.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 12/14] device-dax: compound pagemap support
  2021-07-28 18:51           ` Dan Williams
@ 2021-07-28 18:59             ` Joao Martins
  2021-07-28 19:03               ` Dan Williams
  0 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-28 18:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List



On 7/28/21 7:51 PM, Dan Williams wrote:
> On Wed, Jul 28, 2021 at 2:36 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> On 7/28/21 12:51 AM, Dan Williams wrote:
>>> On Thu, Jul 15, 2021 at 5:01 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> On 7/15/21 12:36 AM, Dan Williams wrote:
>>>>> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> This patch is not the culprit; the flaw is earlier in the series, specifically the fourth patch.
>>>>
>>>> It needs the chunk below applied to the fourth patch, due to the elevated page ref
>>>> count set at zone device memmap init. put_page(), called here in memunmap_pages():
>>>>
>>>> for (i = 0; i < pgmap->nr_ranges; i++)
>>>>         for_each_device_pfn(pfn, pgmap, i)
>>>>                 put_page(pfn_to_page(pfn));
>>>>
>>>> ... on a zone_device compound memmap would otherwise always decrease the head
>>>> page refcount by the @geometry pfn amount (leading to the aforementioned splat
>>>> you reported).
>>>>
>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>> index b0e7b8cf3047..79a883af788e 100644
>>>> --- a/mm/memremap.c
>>>> +++ b/mm/memremap.c
>>>> @@ -102,15 +102,15 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
>>>>         return (range->start + range_len(range)) >> PAGE_SHIFT;
>>>>  }
>>>>
>>>> -static unsigned long pfn_next(unsigned long pfn)
>>>> +static unsigned long pfn_next(struct dev_pagemap *pgmap, unsigned long pfn)
>>>>  {
>>>>         if (pfn % 1024 == 0)
>>>>                 cond_resched();
>>>> -       return pfn + 1;
>>>> +       return pfn + pgmap_pfn_geometry(pgmap);
>>>
>>> The cond_resched() would need to be fixed up too to something like:
>>>
>>> if (pfn % (1024 << pgmap_geometry_order(pgmap)) == 0)
>>>     cond_resched();
>>>
>>> ...because the goal is to take a break every 1024 iterations, not
>>> every 1024 pfns.
>>>
>>
>> Ah, good point.
>>
>>>>  }
>>>>
>>>>  #define for_each_device_pfn(pfn, map, i) \
>>>> -       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
>>>> +       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(map, pfn))
>>>>
>>>>  static void dev_pagemap_kill(struct dev_pagemap *pgmap)
>>>>  {
>>>>
>>>> It could also get the hunk below, but it is sort of redundant provided we won't touch
>>>> the tail page refcount throughout the devmap pages' lifetime. This setting of the tail
>>>> pages' refcount to zero was in the pre-v5.14 series, but it got removed under the
>>>> assumption that it comes from the page allocator (where tail pages are already zeroed
>>>> in refcount).
>>>
>>> Wait, devmap pages never see the page allocator?
>>>
>> By "where tail pages are already zeroed in refcount" I actually meant 'freshly
>> allocated pages', and I was referring to commit 7118fc2906e2 ("hugetlb: address
>> ref count racing in prep_compound_gigantic_page"), which removed set_page_count()
>> because setting the page ref count to zero was redundant.
> 
> Ah, maybe include that reference in the changelog?
> 
Yeap, will do.

>>
>> Albeit devmap pages don't come from the page allocator (you know, separate zone)
>> and these pages aren't part of the regular page pools (e.g. accessible via
>> alloc_pages()), as you are aware. Unless, of course, we reassign them via dax_kmem,
>> but then the struct pages would be mapped the regular way, without any devmap stuff.
> 
> Got it. I think with the back reference to that commit (7118fc2906e2)
> it resolves my confusion.
> 
>>
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 96975edac0a8..469a7aa5cf38 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -6623,6 +6623,7 @@ static void __ref memmap_init_compound(struct page *page, unsigned
>>>> long pfn,
>>>>                 __init_zone_device_page(page + i, pfn + i, zone_idx,
>>>>                                         nid, pgmap);
>>>>                 prep_compound_tail(page, i);
>>>> +               set_page_count(page + i, 0);
>>>
>>> Looks good to me, and perhaps add a check for an elevated tail page refcount
>>> at teardown, as a sanity check that the tail pages were never pinned
>>> directly?
>>>
>> Sorry, I didn't follow completely.
>>
>> Did you mean to set the tail page refcount back to 1 at teardown if it was kept at 0
>> (e.g. memunmap_pages() after put_page()), or to check that the refcount is indeed
>> kept at zero after the put_page() in memunmap_pages()?
> 
> The latter, i.e. would it be worth it to check that a tail page did
> not get accidentally pinned instead of a head page? I'm also ok to
> leave out that sanity checking for now.
> 
What makes me not worry too much about the sanity checking is that this put_page is
supposed to disappear here:

https://lore.kernel.org/linux-mm/20210717192135.9030-3-alex.sierra@amd.com/

.. in fact, none of the hunks here:

https://lore.kernel.org/linux-mm/f7217b61-c845-eaed-501e-c9e7067a6b87@oracle.com/

would matter, as there would no longer be an elevated page refcount to
deal with.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 12/14] device-dax: compound pagemap support
  2021-07-28 18:59             ` Joao Martins
@ 2021-07-28 19:03               ` Dan Williams
  0 siblings, 0 replies; 74+ messages in thread
From: Dan Williams @ 2021-07-28 19:03 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 28, 2021 at 11:59 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
>
>
> On 7/28/21 7:51 PM, Dan Williams wrote:
> > On Wed, Jul 28, 2021 at 2:36 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> On 7/28/21 12:51 AM, Dan Williams wrote:
> >>> On Thu, Jul 15, 2021 at 5:01 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>> On 7/15/21 12:36 AM, Dan Williams wrote:
> >>>>> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>> This patch is not the culprit; the flaw is earlier in the series, specifically the fourth patch.
> >>>>
> >>>> It needs the chunk below applied to the fourth patch, due to the elevated page ref
> >>>> count set at zone device memmap init. put_page(), called here in memunmap_pages():
> >>>>
> >>>> for (i = 0; i < pgmap->nr_ranges; i++)
> >>>>         for_each_device_pfn(pfn, pgmap, i)
> >>>>                 put_page(pfn_to_page(pfn));
> >>>>
> >>>> ... on a zone_device compound memmap would otherwise always decrease the head
> >>>> page refcount by the @geometry pfn amount (leading to the aforementioned splat
> >>>> you reported).
> >>>>
> >>>> diff --git a/mm/memremap.c b/mm/memremap.c
> >>>> index b0e7b8cf3047..79a883af788e 100644
> >>>> --- a/mm/memremap.c
> >>>> +++ b/mm/memremap.c
> >>>> @@ -102,15 +102,15 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
> >>>>         return (range->start + range_len(range)) >> PAGE_SHIFT;
> >>>>  }
> >>>>
> >>>> -static unsigned long pfn_next(unsigned long pfn)
> >>>> +static unsigned long pfn_next(struct dev_pagemap *pgmap, unsigned long pfn)
> >>>>  {
> >>>>         if (pfn % 1024 == 0)
> >>>>                 cond_resched();
> >>>> -       return pfn + 1;
> >>>> +       return pfn + pgmap_pfn_geometry(pgmap);
> >>>
> >>> The cond_resched() would need to be fixed up too to something like:
> >>>
> >>> if (pfn % (1024 << pgmap_geometry_order(pgmap)) == 0)
> >>>     cond_resched();
> >>>
> >>> ...because the goal is to take a break every 1024 iterations, not
> >>> every 1024 pfns.
> >>>
> >>
> >> Ah, good point.
> >>
> >>>>  }
> >>>>
> >>>>  #define for_each_device_pfn(pfn, map, i) \
> >>>> -       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
> >>>> +       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(map, pfn))
> >>>>
> >>>>  static void dev_pagemap_kill(struct dev_pagemap *pgmap)
> >>>>  {
> >>>>
> >>>> It could also get the hunk below, but it is sort of redundant provided we won't touch
> >>>> the tail page refcount throughout the devmap pages' lifetime. This setting of the tail
> >>>> pages' refcount to zero was in the pre-v5.14 series, but it got removed under the
> >>>> assumption that it comes from the page allocator (where tail pages are already zeroed
> >>>> in refcount).
> >>>
> >>> Wait, devmap pages never see the page allocator?
> >>>
> >> By "where tail pages are already zeroed in refcount" I actually meant 'freshly
> >> allocated pages', and I was referring to commit 7118fc2906e2 ("hugetlb: address
> >> ref count racing in prep_compound_gigantic_page"), which removed set_page_count()
> >> because setting the page ref count to zero was redundant.
> >
> > Ah, maybe include that reference in the changelog?
> >
> Yeap, will do.
>
> >>
> >> Albeit devmap pages don't come from the page allocator (you know, separate zone)
> >> and these pages aren't part of the regular page pools (e.g. accessible via
> >> alloc_pages()), as you are aware. Unless, of course, we reassign them via dax_kmem,
> >> but then the struct pages would be mapped the regular way, without any devmap stuff.
> >
> > Got it. I think with the back reference to that commit (7118fc2906e2)
> > it resolves my confusion.
> >
> >>
> >>>>
> >>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>> index 96975edac0a8..469a7aa5cf38 100644
> >>>> --- a/mm/page_alloc.c
> >>>> +++ b/mm/page_alloc.c
> >>>> @@ -6623,6 +6623,7 @@ static void __ref memmap_init_compound(struct page *page, unsigned
> >>>> long pfn,
> >>>>                 __init_zone_device_page(page + i, pfn + i, zone_idx,
> >>>>                                         nid, pgmap);
> >>>>                 prep_compound_tail(page, i);
> >>>> +               set_page_count(page + i, 0);
> >>>
> >>> Looks good to me, and perhaps add a check for an elevated tail page refcount
> >>> at teardown, as a sanity check that the tail pages were never pinned
> >>> directly?
> >>>
> >> Sorry, I didn't follow completely.
> >>
> >> Did you mean to set the tail page refcount back to 1 at teardown if it was kept at 0
> >> (e.g. memunmap_pages() after put_page()), or to check that the refcount is indeed
> >> kept at zero after the put_page() in memunmap_pages()?
> >
> > The latter, i.e. would it be worth it to check that a tail page did
> > not get accidentally pinned instead of a head page? I'm also ok to
> > leave out that sanity checking for now.
> >
> What makes me not worry too much about the sanity checking is that this put_page is
> supposed to disappear here:
>
> https://lore.kernel.org/linux-mm/20210717192135.9030-3-alex.sierra@amd.com/
>
> .. in fact, none of the hunks here:
>
> https://lore.kernel.org/linux-mm/f7217b61-c845-eaed-501e-c9e7067a6b87@oracle.com/
>
> would matter, as there would no longer be an elevated page refcount to
> deal with.

Ah good point. It's past time to take care of that... if only that
patch kit had been Cc'd to the DAX maintainer...

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 13/14] mm/gup: grab head page refcount once for group of subpages
  2021-07-14 19:35 ` [PATCH v3 13/14] mm/gup: grab head page refcount once for group of subpages Joao Martins
@ 2021-07-28 19:55   ` Dan Williams
  2021-07-28 20:07     ` Joao Martins
  0 siblings, 1 reply; 74+ messages in thread
From: Dan Williams @ 2021-07-28 19:55 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Use try_grab_compound_head() for device-dax GUP when configured with a
> compound pagemap.
>
> Rather than incrementing the refcount for each page, do one atomic
> addition for all the pages to be pinned.
>
> Performance measured by gup_benchmark improves considerably for
> get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:
>
>  $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
> (get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
> (pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
> [altmap]
> (get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
> (pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms
>
>  $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
> (get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
> (pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
> [altmap with -m 127004]
> (get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
> (pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/gup.c | 53 +++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 33 insertions(+), 20 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 42b8b1fa6521..9baaa1c0b7f3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2234,31 +2234,55 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  }
>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>
> +
> +static int record_subpages(struct page *page, unsigned long addr,
> +                          unsigned long end, struct page **pages)
> +{
> +       int nr;
> +
> +       for (nr = 0; addr != end; addr += PAGE_SIZE)
> +               pages[nr++] = page++;
> +
> +       return nr;
> +}
> +
>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>                              unsigned long end, unsigned int flags,
>                              struct page **pages, int *nr)
>  {
> -       int nr_start = *nr;
> +       int refs, nr_start = *nr;
>         struct dev_pagemap *pgmap = NULL;
>
>         do {
> -               struct page *page = pfn_to_page(pfn);
> +               struct page *pinned_head, *head, *page = pfn_to_page(pfn);
> +               unsigned long next;
>
>                 pgmap = get_dev_pagemap(pfn, pgmap);
>                 if (unlikely(!pgmap)) {
>                         undo_dev_pagemap(nr, nr_start, flags, pages);
>                         return 0;
>                 }
> -               SetPageReferenced(page);
> -               pages[*nr] = page;
> -               if (unlikely(!try_grab_page(page, flags))) {
> -                       undo_dev_pagemap(nr, nr_start, flags, pages);
> +
> +               head = compound_head(page);
> +               /* @end is assumed to be limited at most one compound page */
> +               next = PageCompound(head) ? end : addr + PAGE_SIZE;

Please no ternary operator for this check, but otherwise this patch
looks good to me.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>


> +               refs = record_subpages(page, addr, next, pages + *nr);
> +
> +               SetPageReferenced(head);
> +               pinned_head = try_grab_compound_head(head, refs, flags);
> +               if (!pinned_head) {
> +                       if (PageCompound(head)) {
> +                               ClearPageReferenced(head);
> +                               put_dev_pagemap(pgmap);
> +                       } else {
> +                               undo_dev_pagemap(nr, nr_start, flags, pages);
> +                       }
>                         return 0;
>                 }
> -               (*nr)++;
> -               pfn++;
> -       } while (addr += PAGE_SIZE, addr != end);
> +               *nr += refs;
> +               pfn += refs;
> +       } while (addr += (refs << PAGE_SHIFT), addr != end);
>
>         if (pgmap)
>                 put_dev_pagemap(pgmap);
> @@ -2318,17 +2342,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
>  }
>  #endif
>
> -static int record_subpages(struct page *page, unsigned long addr,
> -                          unsigned long end, struct page **pages)
> -{
> -       int nr;
> -
> -       for (nr = 0; addr != end; addr += PAGE_SIZE)
> -               pages[nr++] = page++;
> -
> -       return nr;
> -}
> -
>  #ifdef CONFIG_ARCH_HAS_HUGEPD
>  static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
>                                       unsigned long sz)
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry
  2021-07-14 19:35 ` [PATCH v3 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry Joao Martins
@ 2021-07-28 20:03   ` Dan Williams
  2021-07-28 20:08     ` Joao Martins
  0 siblings, 1 reply; 74+ messages in thread
From: Dan Williams @ 2021-07-28 20:03 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Currently, for compound PUD mappings, the implementation consumes 40MB
> per TB but it can be optimized to 16MB per TB with the approach
> detailed below.
>
> Right now base pages are used to populate the PUD tail pages, picking
> the address of the previous page of the subsection that precedes the
> memmap being initialized.  This is done when a given memmap
> address isn't aligned to the pgmap @geometry (which is safe to do because
> @ranges are guaranteed to be aligned to @geometry).
>
> For pagemaps with a geometry which spans multiple sections, this means
> that PMD pages are unnecessarily allocated for reusing the same tail
> pages.  Effectively, on x86 a PUD can span 8 sections (depending on
> config), and a page is allocated for each PMD just to reuse the tail
> vmemmap across the rest of the PTEs. In short, the PMDs covering the
> tail vmemmap areas all contain the same PFN. So instead of doing it
> this way, populate a new PMD on the second section of the compound
> page (the tail vmemmap PMD), and then have the following sections
> utilize the previously populated PMD, which contains only tail pages.
>
> After this scheme, for a 1GB pagemap-aligned area, the first PMD
> (section) would contain the head page and 32767 tail pages, while the
> second PMD contains the full 32768 tail pages.  The latter gets its
> PMD reused across future section mappings of the same pagemap.
>
> Besides allocating fewer page table entries and keeping parity with
> hugepages in the directmap (as done by vmemmap_populate_hugepages()),
> this further increases the savings per compound page. Rather than
> requiring 8 PMD page allocations, only 2 are needed (plus two base
> pages allocated for the head and tail areas of the first PMD). 2M
> pages still require using base pages, though.

This looks good to me now, modulo the tail_page helper discussed
previously. Thanks for the diagram, makes it clearer what's happening.

I don't see any red flags that would prevent a reviewed-by when you
send the next spin.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  Documentation/vm/vmemmap_dedup.rst | 109 +++++++++++++++++++++++++++++
>  include/linux/mm.h                 |   3 +-
>  mm/sparse-vmemmap.c                |  74 +++++++++++++++++---
>  3 files changed, 174 insertions(+), 12 deletions(-)
>
> diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
> index 42830a667c2a..96d9f5f0a497 100644
> --- a/Documentation/vm/vmemmap_dedup.rst
> +++ b/Documentation/vm/vmemmap_dedup.rst
> @@ -189,3 +189,112 @@ at a later stage when we populate the sections.
>  It only use 3 page structs for storing all information as opposed
>  to 4 on HugeTLB pages. This does not affect memory savings between both.
>
> +Additionally, it further extends the tail page deduplication with 1GB
> +device-dax compound pages.
> +
> > +E.g.: the vmemmap of a 1G device-dax page on x86_64 consists of 4096 page frames,
> > +split across 8 PMD page frames, with the first PMD having 2 PTE page frames.
> > +In total this represents 40960 bytes per 1GB page.
> +
> +Here is how things look after the previously described tail page deduplication
> +technique.
> +
> +   device-dax      page frames   struct pages(4096 pages)     page frame(2 pages)
> + +-----------+ -> +----------+ --> +-----------+   mapping to   +-------------+
> + |           |    |    0     |     |     0     | -------------> |      0      |
> + |           |    +----------+     +-----------+                +-------------+
> + |           |                     |     1     | -------------> |      1      |
> + |           |                     +-----------+                +-------------+
> + |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^ ^
> + |           |                     +-----------+                   | | | | | |
> + |           |                     |     3     | ------------------+ | | | | |
> + |           |                     +-----------+                     | | | | |
> + |           |                     |     4     | --------------------+ | | | |
> + |   PMD 0   |                     +-----------+                       | | | |
> + |           |                     |     5     | ----------------------+ | | |
> + |           |                     +-----------+                         | | |
> + |           |                     |     ..    | ------------------------+ | |
> + |           |                     +-----------+                           | |
> + |           |                     |     511   | --------------------------+ |
> + |           |                     +-----------+                             |
> + |           |                                                               |
> + |           |                                                               |
> + |           |                                                               |
> + +-----------+     page frames                                               |
> + +-----------+ -> +----------+ --> +-----------+    mapping to               |
> + |           |    |  1 .. 7  |     |    512    | ----------------------------+
> + |           |    +----------+     +-----------+                             |
> + |           |                     |    ..     | ----------------------------+
> + |           |                     +-----------+                             |
> + |           |                     |    ..     | ----------------------------+
> + |           |                     +-----------+                             |
> + |           |                     |    ..     | ----------------------------+
> + |           |                     +-----------+                             |
> + |           |                     |    ..     | ----------------------------+
> + |    PMD    |                     +-----------+                             |
> + |  1 .. 7   |                     |    ..     | ----------------------------+
> + |           |                     +-----------+                             |
> + |           |                     |    ..     | ----------------------------+
> + |           |                     +-----------+                             |
> + |           |                     |    4095   | ----------------------------+
> + +-----------+                     +-----------+
> +
> > +Page frames of PMDs 1 through 7 are allocated and mapped to the same PTE page frame
> > +that stores tail pages. As we can see in the diagram, PMDs 1 through 7 all look
> > +the same. Therefore we can map PMDs 2 through 7 to PMD 1's page frame.
> > +This allows freeing 6 vmemmap pages per 1GB page, decreasing the overhead per
> +1GB page from 40960 bytes to 16384 bytes.
> +
> +Here is how things look after PMD tail page deduplication.
> +
> +   device-dax      page frames   struct pages(4096 pages)     page frame(2 pages)
> + +-----------+ -> +----------+ --> +-----------+   mapping to   +-------------+
> + |           |    |    0     |     |     0     | -------------> |      0      |
> + |           |    +----------+     +-----------+                +-------------+
> + |           |                     |     1     | -------------> |      1      |
> + |           |                     +-----------+                +-------------+
> + |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^ ^
> + |           |                     +-----------+                   | | | | | |
> + |           |                     |     3     | ------------------+ | | | | |
> + |           |                     +-----------+                     | | | | |
> + |           |                     |     4     | --------------------+ | | | |
> + |   PMD 0   |                     +-----------+                       | | | |
> + |           |                     |     5     | ----------------------+ | | |
> + |           |                     +-----------+                         | | |
> + |           |                     |     ..    | ------------------------+ | |
> + |           |                     +-----------+                           | |
> + |           |                     |     511   | --------------------------+ |
> + |           |                     +-----------+                             |
> + |           |                                                               |
> + |           |                                                               |
> + |           |                                                               |
> + +-----------+     page frames                                               |
> + +-----------+ -> +----------+ --> +-----------+    mapping to               |
> + |           |    |    1     |     |    512    | ----------------------------+
> + |           |    +----------+     +-----------+                             |
> + |           |     ^ ^ ^ ^ ^ ^     |    ..     | ----------------------------+
> + |           |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    ..     | ----------------------------+
> + |           |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    ..     | ----------------------------+
> + |           |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    ..     | ----------------------------+
> + |   PMD 1   |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    ..     | ----------------------------+
> + |           |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    ..     | ----------------------------+
> + |           |     | | | | | |     +-----------+                             |
> + |           |     | | | | | |     |    4095   | ----------------------------+
> + +-----------+     | | | | | |     +-----------+
> + |   PMD 2   | ----+ | | | | |
> + +-----------+       | | | | |
> + |   PMD 3   | ------+ | | | |
> + +-----------+         | | | |
> + |   PMD 4   | --------+ | | |
> + +-----------+           | | |
> + |   PMD 5   | ----------+ | |
> + +-----------+             | |
> + |   PMD 6   | ------------+ |
> + +-----------+               |
> + |   PMD 7   | --------------+
> + +-----------+
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5e3e153ddd3d..e9dc3e2de7be 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3088,7 +3088,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
>  pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
>  p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
> -pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
> +pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
> +                           struct page *block);
>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>                             struct vmem_altmap *altmap, struct page *block);
>  void *vmemmap_alloc_block(unsigned long size, int node);
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index a8de6c472999..68041ca9a797 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -537,13 +537,22 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
>         return p;
>  }
>
> -pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
> +pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
> +                                      struct page *block)
>  {
>         pmd_t *pmd = pmd_offset(pud, addr);
>         if (pmd_none(*pmd)) {
> -               void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
> -               if (!p)
> -                       return NULL;
> +               void *p;
> +
> +               if (!block) {
> +                       p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
> +                       if (!p)
> +                               return NULL;
> +               } else {
> +                       /* See comment in vmemmap_pte_populate(). */
> +                       get_page(block);
> +                       p = page_to_virt(block);
> +               }
>                 pmd_populate_kernel(&init_mm, pmd, p);
>         }
>         return pmd;
> @@ -585,15 +594,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>         return pgd;
>  }
>
> -static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> -                                             struct vmem_altmap *altmap,
> -                                             struct page *reuse, struct page **page)
> +static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
> +                                                 struct vmem_altmap *altmap,
> +                                                 struct page *reuse, pmd_t **ptr)
>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
>         pud_t *pud;
>         pmd_t *pmd;
> -       pte_t *pte;
>
>         pgd = vmemmap_pgd_populate(addr, node);
>         if (!pgd)
> @@ -604,9 +612,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>         pud = vmemmap_pud_populate(p4d, addr, node);
>         if (!pud)
>                 return -ENOMEM;
> -       pmd = vmemmap_pmd_populate(pud, addr, node);
> +       pmd = vmemmap_pmd_populate(pud, addr, node, reuse);
>         if (!pmd)
>                 return -ENOMEM;
> +       if (ptr)
> +               *ptr = pmd;
> +       return 0;
> +}
> +
> +static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> +                                             struct vmem_altmap *altmap,
> +                                             struct page *reuse, struct page **page)
> +{
> +       pmd_t *pmd;
> +       pte_t *pte;
> +
> +       if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
> +               return -ENOMEM;
> +
>         pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
>         if (!pte)
>                 return -ENOMEM;
> @@ -650,6 +673,20 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>         return vmemmap_populate_address(addr, node, NULL, NULL, page);
>  }
>
> +static int __meminit vmemmap_populate_pmd_range(unsigned long start,
> +                                               unsigned long end,
> +                                               int node, struct page *page)
> +{
> +       unsigned long addr = start;
> +
> +       for (; addr < end; addr += PMD_SIZE) {
> +               if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
>  static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>                                                      unsigned long start,
>                                                      unsigned long end, int node,
> @@ -670,6 +707,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>         offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>         if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
>             pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
> +               pmd_t *pmdp;
>                 pte_t *ptep;
>
>                 addr = start - PAGE_SIZE;
> @@ -681,11 +719,25 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>                  * the previous struct pages are mapped when trying to lookup
>                  * the last tail page.
>                  */
> -               ptep = pte_offset_kernel(pmd_off_k(addr), addr);
> -               if (!ptep)
> +               pmdp = pmd_off_k(addr);
> +               if (!pmdp)
> +                       return -ENOMEM;
> +
> +               /*
> +                * Reuse the tail pages vmemmap pmd page
> +                * See layout diagram in Documentation/vm/vmemmap_dedup.rst
> +                */
> +               if (offset % pgmap_geometry(pgmap) > PFN_PHYS(PAGES_PER_SECTION))
> +                       return vmemmap_populate_pmd_range(start, end, node,
> +                                                         pmd_page(*pmdp));
> +
> +               /* See comment above when pmd_off_k() is called. */
> +               ptep = pte_offset_kernel(pmdp, addr);
> +               if (pte_none(*ptep))
>                         return -ENOMEM;
>
>                 /*
> +                * Populate the tail pages vmemmap pmd page.
>                  * Reuse the page that was populated in the prior iteration
>                  * with just tail struct pages.
>                  */
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 08/14] mm/sparse-vmemmap: populate compound pagemaps
  2021-07-28 18:54         ` Joao Martins
@ 2021-07-28 20:04           ` Joao Martins
  0 siblings, 0 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-28 20:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List



On 7/28/21 7:54 PM, Joao Martins wrote:
> 
> 
> On 7/28/21 7:03 PM, Dan Williams wrote:
>> On Wed, Jul 28, 2021 at 8:36 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>> [..]
>>> +/*
>>> + * For compound pages bigger than section size (e.g. x86 1G compound
>>> + * pages with 2M subsection size) fill the rest of sections as tail
>>> + * pages.
>>> + *
>>> + * Note that memremap_pages() resets @nr_range value and will increment
>>> + * it after each successful range onlining. Thus the value of @nr_range
>>> + * at section memmap populate corresponds to the in-progress range
>>> + * being onlined here.
>>> + */
>>> +static bool compound_section_index(unsigned long start_pfn,
>>
>> Oh, I was thinking this would return the actual Nth index number for
>> the section within the compound page. 
>> A bool is ok too, but then the
>> function name would be something like:
>>
>> reuse_compound_section()
>>
>> ...right?
>>
> Yes.
> 
Additionally, I am shifting calculations to be PFN based to avoid needless conversions of
@geometry to bytes. So from this:

+static bool __meminit compound_section_index(unsigned long start_pfn,
+                                            struct dev_pagemap *pgmap)
+{
+       unsigned long geometry_size = pgmap_geometry(pgmap) << PAGE_SHIFT;
+       unsigned long offset = PFN_PHYS(start_pfn) -
+               pgmap->ranges[pgmap->nr_range].start;
+
+       return !IS_ALIGNED(offset, geometry_size) &&
+               geometry_size > SUBSECTION_SIZE;
+}

To this:

+static bool __meminit reuse_compound_section(unsigned long start_pfn,
+                                            struct dev_pagemap *pgmap)
+{
+       unsigned long geometry = pgmap_geometry(pgmap);
+       unsigned long offset = start_pfn -
+               PHYS_PFN(pgmap->ranges[pgmap->nr_range].start);
+
+       return !IS_ALIGNED(offset, geometry) && geometry > PAGES_PER_SUBSECTION;
+}


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 13/14] mm/gup: grab head page refcount once for group of subpages
  2021-07-28 19:55   ` Dan Williams
@ 2021-07-28 20:07     ` Joao Martins
  2021-07-28 20:23       ` Dan Williams
  0 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-07-28 20:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List



On 7/28/21 8:55 PM, Dan Williams wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Use try_grab_compound_head() for device-dax GUP when configured with a
>> compound pagemap.
>>
>> Rather than incrementing the refcount for each page, do one atomic
>> addition for all the pages to be pinned.
>>
>> Performance measured by gup_benchmark improves considerably for
>> get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:
>>
>>  $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
>> (get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
>> (pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
>> [altmap]
>> (get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
>> (pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms
>>
>>  $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
>> (get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
>> (pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
>> [altmap with -m 127004]
>> (get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
>> (pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/gup.c | 53 +++++++++++++++++++++++++++++++++--------------------
>>  1 file changed, 33 insertions(+), 20 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 42b8b1fa6521..9baaa1c0b7f3 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2234,31 +2234,55 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>  }
>>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>>
>> +
>> +static int record_subpages(struct page *page, unsigned long addr,
>> +                          unsigned long end, struct page **pages)
>> +{
>> +       int nr;
>> +
>> +       for (nr = 0; addr != end; addr += PAGE_SIZE)
>> +               pages[nr++] = page++;
>> +
>> +       return nr;
>> +}
>> +
>>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>                              unsigned long end, unsigned int flags,
>>                              struct page **pages, int *nr)
>>  {
>> -       int nr_start = *nr;
>> +       int refs, nr_start = *nr;
>>         struct dev_pagemap *pgmap = NULL;
>>
>>         do {
>> -               struct page *page = pfn_to_page(pfn);
>> +               struct page *pinned_head, *head, *page = pfn_to_page(pfn);
>> +               unsigned long next;
>>
>>                 pgmap = get_dev_pagemap(pfn, pgmap);
>>                 if (unlikely(!pgmap)) {
>>                         undo_dev_pagemap(nr, nr_start, flags, pages);
>>                         return 0;
>>                 }
>> -               SetPageReferenced(page);
>> -               pages[*nr] = page;
>> -               if (unlikely(!try_grab_page(page, flags))) {
>> -                       undo_dev_pagemap(nr, nr_start, flags, pages);
>> +
>> +               head = compound_head(page);
>> +               /* @end is assumed to be limited at most one compound page */
>> +               next = PageCompound(head) ? end : addr + PAGE_SIZE;
> 
> Please no ternary operator for this check, but otherwise this patch
> looks good to me.
> 
OK. I take it that you prefer this instead:

unsigned long next = addr + PAGE_SIZE;

[...]

/* @end is assumed to be limited at most one compound page */
if (PageCompound(head))
	next = end;

> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> 
Thanks!

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry
  2021-07-28 20:03   ` Dan Williams
@ 2021-07-28 20:08     ` Joao Martins
  0 siblings, 0 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-28 20:08 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List



On 7/28/21 9:03 PM, Dan Williams wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Currently, for compound PUD mappings, the implementation consumes 40MB
>> per TB but it can be optimized to 16MB per TB with the approach
>> detailed below.
>>
>> Right now basepages are used to populate the PUD tail pages, and it
>> picks the address of the previous page of the subsection that precedes
>> the memmap being initialized.  This is done when a given memmap
>> address isn't aligned to the pgmap @geometry (which is safe to do because
>> @ranges are guaranteed to be aligned to @geometry).
>>
>> For pagemaps with an align which spans various sections, this means
>> that PMD pages are unnecessarily allocated for reusing the same tail
>> pages.  Effectively, on x86 a PUD can span 8 sections (depending on
>> config), and a page is allocated for each PMD to reuse the tail
>> vmemmap across the rest of the PTEs. In short, effectively the
>> PMD-covered tail vmemmap areas all contain the same PFN. So instead
>> of doing it this way, populate a new PMD on the second section of the
>> compound page (tail vmemmap PMD), and then have the following sections
>> utilize the previously populated PMD, which only contains tail
>> pages.
>>
>> After this scheme, for a 1GB pagemap-aligned area, the first PMD
>> (section) would contain the head page and 32767 tail pages, where the
>> second PMD contains the full 32768 tail pages.  The latter page gets
>> its PMD reused across future section mappings of the same pagemap.
>>
>> Besides allocating fewer pagetable entries and keeping parity with
>> hugepages in the directmap (as done by vmemmap_populate_hugepages()),
>> this further increases savings per compound page. Rather than
>> requiring 8 PMD page allocations, only 2 are needed (plus two base
>> pages allocated for the head and tail areas of the first PMD). 2M
>> pages still require using base pages, though.
> 
> This looks good to me now, modulo the tail_page helper discussed
> previously. Thanks for the diagram, makes it clearer what's happening.
> 
> I don't see any red flags that would prevent a reviewed-by when you
> send the next spin.
> 
Cool, thanks!

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  Documentation/vm/vmemmap_dedup.rst | 109 +++++++++++++++++++++++++++++
>>  include/linux/mm.h                 |   3 +-
>>  mm/sparse-vmemmap.c                |  74 +++++++++++++++++---
>>  3 files changed, 174 insertions(+), 12 deletions(-)
>>
>> diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
>> index 42830a667c2a..96d9f5f0a497 100644
>> --- a/Documentation/vm/vmemmap_dedup.rst
>> +++ b/Documentation/vm/vmemmap_dedup.rst
>> @@ -189,3 +189,112 @@ at a later stage when we populate the sections.
>>  It only uses 3 page structs for storing all information as opposed
>>  to 4 on HugeTLB pages. This does not affect memory savings between the two.
>>
>> +Additionally, it further extends the tail page deduplication with 1GB
>> +device-dax compound pages.
>> +
>> +E.g.: A 1G device-dax page on x86_64 consists of 4096 page frames, split
>> +across 8 PMD page frames, with the first PMD having 2 PTE page frames.
>> +This represents a total of 40960 bytes per 1GB page.
>> +
>> +Here is how things look after the previously described tail page deduplication
>> +technique.
>> +
>> +   device-dax      page frames   struct pages(4096 pages)     page frame(2 pages)
>> + +-----------+ -> +----------+ --> +-----------+   mapping to   +-------------+
>> + |           |    |    0     |     |     0     | -------------> |      0      |
>> + |           |    +----------+     +-----------+                +-------------+
>> + |           |                     |     1     | -------------> |      1      |
>> + |           |                     +-----------+                +-------------+
>> + |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^ ^
>> + |           |                     +-----------+                   | | | | | |
>> + |           |                     |     3     | ------------------+ | | | | |
>> + |           |                     +-----------+                     | | | | |
>> + |           |                     |     4     | --------------------+ | | | |
>> + |   PMD 0   |                     +-----------+                       | | | |
>> + |           |                     |     5     | ----------------------+ | | |
>> + |           |                     +-----------+                         | | |
>> + |           |                     |     ..    | ------------------------+ | |
>> + |           |                     +-----------+                           | |
>> + |           |                     |     511   | --------------------------+ |
>> + |           |                     +-----------+                             |
>> + |           |                                                               |
>> + |           |                                                               |
>> + |           |                                                               |
>> + +-----------+     page frames                                               |
>> + +-----------+ -> +----------+ --> +-----------+    mapping to               |
>> + |           |    |  1 .. 7  |     |    512    | ----------------------------+
>> + |           |    +----------+     +-----------+                             |
>> + |           |                     |    ..     | ----------------------------+
>> + |           |                     +-----------+                             |
>> + |           |                     |    ..     | ----------------------------+
>> + |           |                     +-----------+                             |
>> + |           |                     |    ..     | ----------------------------+
>> + |           |                     +-----------+                             |
>> + |           |                     |    ..     | ----------------------------+
>> + |    PMD    |                     +-----------+                             |
>> + |  1 .. 7   |                     |    ..     | ----------------------------+
>> + |           |                     +-----------+                             |
>> + |           |                     |    ..     | ----------------------------+
>> + |           |                     +-----------+                             |
>> + |           |                     |    4095   | ----------------------------+
>> + +-----------+                     +-----------+
>> +
>> +Page frames of PMD 1 through 7 are allocated and mapped to the same PTE page frame
>> +that stores tail pages. As we can see in the diagram, PMDs 1 through 7
>> +all look the same. Therefore we can map PMD 2 through 7 to the PMD 1 page frame.
>> +This allows freeing 6 vmemmap pages per 1GB page, decreasing the overhead per
>> +1GB page from 40960 bytes to 16384 bytes.
>> +
>> +Here is how things look after PMD tail page deduplication.
>> +
>> +   device-dax      page frames   struct pages(4096 pages)     page frame(2 pages)
>> + +-----------+ -> +----------+ --> +-----------+   mapping to   +-------------+
>> + |           |    |    0     |     |     0     | -------------> |      0      |
>> + |           |    +----------+     +-----------+                +-------------+
>> + |           |                     |     1     | -------------> |      1      |
>> + |           |                     +-----------+                +-------------+
>> + |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^ ^
>> + |           |                     +-----------+                   | | | | | |
>> + |           |                     |     3     | ------------------+ | | | | |
>> + |           |                     +-----------+                     | | | | |
>> + |           |                     |     4     | --------------------+ | | | |
>> + |   PMD 0   |                     +-----------+                       | | | |
>> + |           |                     |     5     | ----------------------+ | | |
>> + |           |                     +-----------+                         | | |
>> + |           |                     |     ..    | ------------------------+ | |
>> + |           |                     +-----------+                           | |
>> + |           |                     |     511   | --------------------------+ |
>> + |           |                     +-----------+                             |
>> + |           |                                                               |
>> + |           |                                                               |
>> + |           |                                                               |
>> + +-----------+     page frames                                               |
>> + +-----------+ -> +----------+ --> +-----------+    mapping to               |
>> + |           |    |    1     |     |    512    | ----------------------------+
>> + |           |    +----------+     +-----------+                             |
>> + |           |     ^ ^ ^ ^ ^ ^     |    ..     | ----------------------------+
>> + |           |     | | | | | |     +-----------+                             |
>> + |           |     | | | | | |     |    ..     | ----------------------------+
>> + |           |     | | | | | |     +-----------+                             |
>> + |           |     | | | | | |     |    ..     | ----------------------------+
>> + |           |     | | | | | |     +-----------+                             |
>> + |           |     | | | | | |     |    ..     | ----------------------------+
>> + |   PMD 1   |     | | | | | |     +-----------+                             |
>> + |           |     | | | | | |     |    ..     | ----------------------------+
>> + |           |     | | | | | |     +-----------+                             |
>> + |           |     | | | | | |     |    ..     | ----------------------------+
>> + |           |     | | | | | |     +-----------+                             |
>> + |           |     | | | | | |     |    4095   | ----------------------------+
>> + +-----------+     | | | | | |     +-----------+
>> + |   PMD 2   | ----+ | | | | |
>> + +-----------+       | | | | |
>> + |   PMD 3   | ------+ | | | |
>> + +-----------+         | | | |
>> + |   PMD 4   | --------+ | | |
>> + +-----------+           | | |
>> + |   PMD 5   | ----------+ | |
>> + +-----------+             | |
>> + |   PMD 6   | ------------+ |
>> + +-----------+               |
>> + |   PMD 7   | --------------+
>> + +-----------+
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 5e3e153ddd3d..e9dc3e2de7be 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3088,7 +3088,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
>>  pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
>>  p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>> -pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>> +pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
>> +                           struct page *block);
>>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>>                             struct vmem_altmap *altmap, struct page *block);
>>  void *vmemmap_alloc_block(unsigned long size, int node);
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index a8de6c472999..68041ca9a797 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -537,13 +537,22 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
>>         return p;
>>  }
>>
>> -pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
>> +pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
>> +                                      struct page *block)
>>  {
>>         pmd_t *pmd = pmd_offset(pud, addr);
>>         if (pmd_none(*pmd)) {
>> -               void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>> -               if (!p)
>> -                       return NULL;
>> +               void *p;
>> +
>> +               if (!block) {
>> +                       p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>> +                       if (!p)
>> +                               return NULL;
>> +               } else {
>> +                       /* See comment in vmemmap_pte_populate(). */
>> +                       get_page(block);
>> +                       p = page_to_virt(block);
>> +               }
>>                 pmd_populate_kernel(&init_mm, pmd, p);
>>         }
>>         return pmd;
>> @@ -585,15 +594,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>>         return pgd;
>>  }
>>
>> -static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> -                                             struct vmem_altmap *altmap,
>> -                                             struct page *reuse, struct page **page)
>> +static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
>> +                                                 struct vmem_altmap *altmap,
>> +                                                 struct page *reuse, pmd_t **ptr)
>>  {
>>         pgd_t *pgd;
>>         p4d_t *p4d;
>>         pud_t *pud;
>>         pmd_t *pmd;
>> -       pte_t *pte;
>>
>>         pgd = vmemmap_pgd_populate(addr, node);
>>         if (!pgd)
>> @@ -604,9 +612,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>>         pud = vmemmap_pud_populate(p4d, addr, node);
>>         if (!pud)
>>                 return -ENOMEM;
>> -       pmd = vmemmap_pmd_populate(pud, addr, node);
>> +       pmd = vmemmap_pmd_populate(pud, addr, node, reuse);
>>         if (!pmd)
>>                 return -ENOMEM;
>> +       if (ptr)
>> +               *ptr = pmd;
>> +       return 0;
>> +}
>> +
>> +static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> +                                             struct vmem_altmap *altmap,
>> +                                             struct page *reuse, struct page **page)
>> +{
>> +       pmd_t *pmd;
>> +       pte_t *pte;
>> +
>> +       if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
>> +               return -ENOMEM;
>> +
>>         pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
>>         if (!pte)
>>                 return -ENOMEM;
>> @@ -650,6 +673,20 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>>         return vmemmap_populate_address(addr, node, NULL, NULL, page);
>>  }
>>
>> +static int __meminit vmemmap_populate_pmd_range(unsigned long start,
>> +                                               unsigned long end,
>> +                                               int node, struct page *page)
>> +{
>> +       unsigned long addr = start;
>> +
>> +       for (; addr < end; addr += PMD_SIZE) {
>> +               if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
>> +                       return -ENOMEM;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>>  static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>>                                                      unsigned long start,
>>                                                      unsigned long end, int node,
>> @@ -670,6 +707,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>>         offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>>         if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
>>             pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
>> +               pmd_t *pmdp;
>>                 pte_t *ptep;
>>
>>                 addr = start - PAGE_SIZE;
>> @@ -681,11 +719,25 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>>                  * the previous struct pages are mapped when trying to lookup
>>                  * the last tail page.
>>                  */
>> -               ptep = pte_offset_kernel(pmd_off_k(addr), addr);
>> -               if (!ptep)
>> +               pmdp = pmd_off_k(addr);
>> +               if (!pmdp)
>> +                       return -ENOMEM;
>> +
>> +               /*
>> +                * Reuse the tail pages vmemmap pmd page
>> +                * See layout diagram in Documentation/vm/vmemmap_dedup.rst
>> +                */
>> +               if (offset % pgmap_geometry(pgmap) > PFN_PHYS(PAGES_PER_SECTION))
>> +                       return vmemmap_populate_pmd_range(start, end, node,
>> +                                                         pmd_page(*pmdp));
>> +
>> +               /* See comment above when pmd_off_k() is called. */
>> +               ptep = pte_offset_kernel(pmdp, addr);
>> +               if (pte_none(*ptep))
>>                         return -ENOMEM;
>>
>>                 /*
>> +                * Populate the tail pages vmemmap pmd page.
>>                  * Reuse the page that was populated in the prior iteration
>>                  * with just tail struct pages.
>>                  */
>> --
>> 2.17.1
>>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v3 13/14] mm/gup: grab head page refcount once for group of subpages
  2021-07-28 20:07     ` Joao Martins
@ 2021-07-28 20:23       ` Dan Williams
  0 siblings, 0 replies; 74+ messages in thread
From: Dan Williams @ 2021-07-28 20:23 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Wed, Jul 28, 2021 at 1:08 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
>
>
> On 7/28/21 8:55 PM, Dan Williams wrote:
> > On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> Use try_grab_compound_head() for device-dax GUP when configured with a
> >> compound pagemap.
> >>
> >> Rather than incrementing the refcount for each page, do one atomic
> >> addition for all the pages to be pinned.
> >>
> >> Performance measured by gup_benchmark improves considerably
> >> get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:
> >>
> >>  $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
> >> (get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
> >> (pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
> >> [altmap]
> >> (get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
> >> (pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms
> >>
> >>  $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
> >> (get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
> >> (pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
> >> [altmap with -m 127004]
> >> (get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
> >> (pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms
> >>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >>  mm/gup.c | 53 +++++++++++++++++++++++++++++++++--------------------
> >>  1 file changed, 33 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index 42b8b1fa6521..9baaa1c0b7f3 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> >> @@ -2234,31 +2234,55 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >>  }
> >>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
> >>
> >> +
> >> +static int record_subpages(struct page *page, unsigned long addr,
> >> +                          unsigned long end, struct page **pages)
> >> +{
> >> +       int nr;
> >> +
> >> +       for (nr = 0; addr != end; addr += PAGE_SIZE)
> >> +               pages[nr++] = page++;
> >> +
> >> +       return nr;
> >> +}
> >> +
> >>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> >>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> >>                              unsigned long end, unsigned int flags,
> >>                              struct page **pages, int *nr)
> >>  {
> >> -       int nr_start = *nr;
> >> +       int refs, nr_start = *nr;
> >>         struct dev_pagemap *pgmap = NULL;
> >>
> >>         do {
> >> -               struct page *page = pfn_to_page(pfn);
> >> +               struct page *pinned_head, *head, *page = pfn_to_page(pfn);
> >> +               unsigned long next;
> >>
> >>                 pgmap = get_dev_pagemap(pfn, pgmap);
> >>                 if (unlikely(!pgmap)) {
> >>                         undo_dev_pagemap(nr, nr_start, flags, pages);
> >>                         return 0;
> >>                 }
> >> -               SetPageReferenced(page);
> >> -               pages[*nr] = page;
> >> -               if (unlikely(!try_grab_page(page, flags))) {
> >> -                       undo_dev_pagemap(nr, nr_start, flags, pages);
> >> +
> >> +               head = compound_head(page);
>> +               /* @end is assumed to be limited to at most one compound page */
> >> +               next = PageCompound(head) ? end : addr + PAGE_SIZE;
> >
> > Please no ternary operator for this check, but otherwise this patch
> > looks good to me.
> >
> OK. I take it that you prefer this instead:
>
> unsigned long next = addr + PAGE_SIZE;
>
> [...]
>
> /* @end is assumed to be limited to at most one compound page */
> if (PageCompound(head))
>         next = end;

Yup.


* Re: [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-07-15 19:48       ` Dan Williams
@ 2021-07-30 16:13         ` Joao Martins
  0 siblings, 0 replies; 74+ messages in thread
From: Joao Martins @ 2021-07-30 16:13 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On 7/15/21 8:48 PM, Dan Williams wrote:
> On Thu, Jul 15, 2021 at 5:52 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 7/15/21 2:08 AM, Dan Williams wrote:
>>> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> +                                       unsigned long zone_idx, int nid,
>>>> +                                       struct dev_pagemap *pgmap,
>>>> +                                       unsigned long nr_pages)
>>>> +{
>>>> +       unsigned int order_align = order_base_2(nr_pages);
>>>> +       unsigned long i;
>>>> +
>>>> +       __SetPageHead(page);
>>>> +
>>>> +       for (i = 1; i < nr_pages; i++) {
>>>
>>> The switch of loop styles is jarring. I.e. the switch from
>>> memmap_init_zone_device() that is using pfn, end_pfn, and a local
>>> 'struct page *' variable to this helper using pfn + i and a mix of
>>> helpers (__init_zone_device_page,  prep_compound_tail) that have
>>> different expectations of head page + tail_idx and current page.
>>>
>>> I.e. this reads more obviously correct to me, but maybe I'm just in
>>> the wrong headspace:
>>>
>>>         for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
>>>                 struct page *page = pfn_to_page(pfn);
>>>
>>>                 __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
>>>                 prep_compound_tail(head, pfn - head_pfn);
>>>
>> Personally -- and I am dubious given I have been staring at this code -- I find what
>> I wrote a little better, as it follows more closely what compound page initialization
>> does. It's easier for me to read that I am initializing a number of tail pages and a
>> head page (for a known geometry size).
>>
>> Additionally, it's unnecessary (and a tiny bit inefficient?) to keep doing
>> pfn_to_page(pfn): ZONE_DEVICE requires SPARSEMEM_VMEMMAP, so the page pointers are all
>> contiguous and for any given PFN we can avoid dereferencing vmemmap vaddrs back and
>> forth. That is the second reason I pass a page and iterate over its tails based on a
>> head page pointer. But I was of two minds when writing this, so if there's no added
>> inefficiency I can rewrite it like the above.
> 
> I mainly just don't want 2 different styles between
> memmap_init_zone_device() and this helper. So if the argument is that
> "it's inefficient to use pfn_to_page() here" then why does the caller
> use pfn_to_page()? I won't argue too much for one way or the other,
> I'm still biased towards my rewrite, but whatever you pick just make
> the style consistent.
> 

Meanwhile, it turns out my concerns didn't materialize: I am not seeing a
visible difference compared to the old numbers, so I switched to the style
you suggested above.


* Re: [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps
  2021-07-27 23:23       ` Dan Williams
@ 2021-08-02 10:40         ` Joao Martins
  2021-08-02 14:06           ` Dan Williams
  0 siblings, 1 reply; 74+ messages in thread
From: Joao Martins @ 2021-08-02 10:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: Matthew Wilcox, Andrew Morton, Linux MM, Vishal Verma,
	Dave Jiang, Naoya Horiguchi, Jason Gunthorpe, John Hubbard,
	Jane Chu, Muchun Song, Mike Kravetz, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List



On 7/28/21 12:23 AM, Dan Williams wrote:
> On Thu, Jul 22, 2021 at 3:54 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> [..]
>>> The folio work really touches the page
>>> cache for now, and this seems mostly to touch the devmap paths.
>>>
>> /me nods -- it really is about devmap infra for usage in device-dax for persistent memory.
>>
>> Perhaps I should do s/pagemaps/devmap/ throughout the series to avoid confusion.
> 
> I also like "devmap" as a more accurate name. It matches the PFN_DEV
> and PFN_MAP flags that decorate DAX capable pfn_t instances. It also
> happens to match a recommendation I gave to Ira for his support for
> supervisor protection keys with devmap pfns.
> 
/me nods

Additionally, I think I'll be reordering the patches for clearer/easier
bisection, i.e. first introducing compound pages for devmap, fixing the associated
issues wrt the slow pinning, and then introducing vmemmap deduplication for
devmap.

It should look like below after the reordering from first patch to last.
Let me know if you disagree.

memory-failure: fetch compound_head after pgmap_pfn_valid()
mm/page_alloc: split prep_compound_page into head and tail subparts
mm/page_alloc: refactor memmap_init_zone_device() page init
mm/memremap: add ZONE_DEVICE support for compound pages
device-dax: use ALIGN() for determining pgoff
device-dax: compound devmap support
mm/gup: grab head page refcount once for group of subpages
mm/sparse-vmemmap: add a pgmap argument to section activation
mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper
mm/hugetlb_vmemmap: move comment block to Documentation/vm
mm/sparse-vmemmap: populate compound devmaps
mm/page_alloc: reuse tail struct pages for compound devmaps
mm/sparse-vmemmap: improve memory savings for compound pud geometry


* Re: [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps
  2021-08-02 10:40         ` Joao Martins
@ 2021-08-02 14:06           ` Dan Williams
  0 siblings, 0 replies; 74+ messages in thread
From: Dan Williams @ 2021-08-02 14:06 UTC (permalink / raw)
  To: Joao Martins
  Cc: Matthew Wilcox, Andrew Morton, Linux MM, Vishal Verma,
	Dave Jiang, Naoya Horiguchi, Jason Gunthorpe, John Hubbard,
	Jane Chu, Muchun Song, Mike Kravetz, Jonathan Corbet,
	Linux NVDIMM, Linux Doc Mailing List

On Mon, Aug 2, 2021 at 3:41 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
>
>
> On 7/28/21 12:23 AM, Dan Williams wrote:
> > On Thu, Jul 22, 2021 at 3:54 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> > [..]
> >>> The folio work really touches the page
> >>> cache for now, and this seems mostly to touch the devmap paths.
> >>>
> >> /me nods -- it really is about devmap infra for usage in device-dax for persistent memory.
> >>
> >> Perhaps I should do s/pagemaps/devmap/ throughout the series to avoid confusion.
> >
> > I also like "devmap" as a more accurate name. It matches the PFN_DEV
> > and PFN_MAP flags that decorate DAX capable pfn_t instances. It also
> > happens to match a recommendation I gave to Ira for his support for
> > supervisor protection keys with devmap pfns.
> >
> /me nods
>
> Additionally, I think I'll be reordering the patches for clearer/easier
> bisection, i.e. first introducing compound pages for devmap, fixing the associated
> issues wrt the slow pinning, and then introducing vmemmap deduplication for
> devmap.
>
> It should look like below after the reordering from first patch to last.
> Let me know if you disagree.
>
> memory-failure: fetch compound_head after pgmap_pfn_valid()
> mm/page_alloc: split prep_compound_page into head and tail subparts
> mm/page_alloc: refactor memmap_init_zone_device() page init
> mm/memremap: add ZONE_DEVICE support for compound pages
> device-dax: use ALIGN() for determining pgoff
> device-dax: compound devmap support
> mm/gup: grab head page refcount once for group of subpages
> mm/sparse-vmemmap: add a pgmap argument to section activation
> mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper
> mm/hugetlb_vmemmap: move comment block to Documentation/vm
> mm/sparse-vmemmap: populate compound devmaps
> mm/page_alloc: reuse tail struct pages for compound devmaps
> mm/sparse-vmemmap: improve memory savings for compound pud geometry

LGTM.



Thread overview: 74+ messages
2021-07-14 19:35 [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
2021-07-14 19:35 ` [PATCH v3 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
2021-07-15  0:17   ` Dan Williams
2021-07-15  2:51   ` [External] " Muchun Song
2021-07-15  6:40     ` Christoph Hellwig
2021-07-15  9:19       ` Muchun Song
2021-07-15 13:17     ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
2021-07-15  0:19   ` Dan Williams
2021-07-15  2:53   ` [External] " Muchun Song
2021-07-15 13:17     ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
2021-07-15  0:20   ` Dan Williams
2021-07-14 19:35 ` [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
2021-07-15  1:08   ` Dan Williams
2021-07-15 12:52     ` Joao Martins
2021-07-15 13:06       ` Joao Martins
2021-07-15 19:48       ` Dan Williams
2021-07-30 16:13         ` Joao Martins
2021-07-22  0:38       ` Jane Chu
2021-07-22 10:56         ` Joao Martins
2021-07-15 12:59     ` Christoph Hellwig
2021-07-15 13:15       ` Joao Martins
2021-07-15  6:48   ` Christoph Hellwig
2021-07-15 13:15     ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 05/14] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
2021-07-28  5:56   ` Dan Williams
2021-07-28  9:43     ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 06/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
2021-07-28  6:04   ` Dan Williams
2021-07-28 10:48     ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 07/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
2021-07-15  2:47   ` [External] " Muchun Song
2021-07-15 13:16     ` Joao Martins
2021-07-28  6:09   ` Dan Williams
2021-07-14 19:35 ` [PATCH v3 08/14] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
2021-07-28  6:55   ` Dan Williams
2021-07-28 15:35     ` Joao Martins
2021-07-28 18:03       ` Dan Williams
2021-07-28 18:54         ` Joao Martins
2021-07-28 20:04           ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 09/14] mm/page_alloc: reuse tail struct pages for " Joao Martins
2021-07-28  7:28   ` Dan Williams
2021-07-28 15:56     ` Joao Martins
2021-07-28 16:08       ` Dan Williams
2021-07-28 16:12         ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 10/14] device-dax: use ALIGN() for determining pgoff Joao Martins
2021-07-28  7:29   ` Dan Williams
2021-07-28 15:56     ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 11/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
2021-07-28  7:30   ` Dan Williams
2021-07-28 15:56     ` Joao Martins
2021-07-14 19:35 ` [PATCH v3 12/14] device-dax: compound pagemap support Joao Martins
2021-07-14 23:36   ` Dan Williams
2021-07-15 12:00     ` Joao Martins
2021-07-27 23:51       ` Dan Williams
2021-07-28  9:36         ` Joao Martins
2021-07-28 18:51           ` Dan Williams
2021-07-28 18:59             ` Joao Martins
2021-07-28 19:03               ` Dan Williams
2021-07-14 19:35 ` [PATCH v3 13/14] mm/gup: grab head page refcount once for group of subpages Joao Martins
2021-07-28 19:55   ` Dan Williams
2021-07-28 20:07     ` Joao Martins
2021-07-28 20:23       ` Dan Williams
2021-07-14 19:35 ` [PATCH v3 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry Joao Martins
2021-07-28 20:03   ` Dan Williams
2021-07-28 20:08     ` Joao Martins
2021-07-14 21:48 ` [PATCH v3 00/14] mm, sparse-vmemmap: Introduce compound pagemaps Andrew Morton
2021-07-14 23:47   ` Dan Williams
2021-07-22  2:24   ` Matthew Wilcox
2021-07-22 10:53     ` Joao Martins
2021-07-27 23:23       ` Dan Williams
2021-08-02 10:40         ` Joao Martins
2021-08-02 14:06           ` Dan Williams
