linux-mm.kvack.org archive mirror
* [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps
@ 2021-03-25 23:09 Joao Martins
  2021-03-25 23:09 ` [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
                   ` (11 more replies)
  0 siblings, 12 replies; 64+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton, Joao Martins

Hey,

This series attempts to minimize 'struct page' overhead by
pursuing a similar approach to Muchun Song's series "Free some vmemmap
pages of hugetlb page"[0], but applied to devmap/ZONE_DEVICE.

[0] https://lore.kernel.org/linux-mm/20210308102807.59745-1-songmuchun@bytedance.com/

The link above describes it quite nicely, but the idea is to reuse tail
page vmemmap areas, in particular the areas which only describe tail pages.
A vmemmap page holds 64 struct pages, so the first vmemmap page for a given
ZONE_DEVICE compound page contains the head page and 63 tail pages. The second
vmemmap page contains only tail pages, and that's what gets reused across
the rest of the subsection/section. The bigger the page size, the bigger the
savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
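
To make those numbers concrete, here is a standalone arithmetic sketch
(userspace only, not kernel code; it assumes a 64-byte struct page and 4K
base/vmemmap pages, as on x86-64):

    #include <stdio.h>

    int main(void)
    {
            unsigned long base = 4096, sp = 64;  /* page / struct page size */
            unsigned long aligns[] = { 2UL << 20, 1UL << 30 };
            int i;

            for (i = 0; i < 2; i++) {
                    /* vmemmap pages needed to describe one compound page */
                    unsigned long nr = aligns[i] / base * sp / base;

                    /*
                     * Only the head vmemmap page and one tail vmemmap page
                     * stay unique; the remaining tail vmemmap pages are
                     * reused.
                     */
                    printf("%4luM align: %lu vmemmap pages, %lu saved\n",
                           aligns[i] >> 20, nr, nr - 2);
            }
            return 0;
    }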

This series also takes one step further for 1GB pages and *also* reuses PMD
pages which only contain tail pages, which allows keeping parity with the
current hugepage-based memmap. This further lets us more than halve the
overhead with 1GB pages (40M -> 16M per TB).

In terms of savings, per 1TB of memory, the struct page cost would go down
as follows with a compound pagemap (the arithmetic is spelled out below):

* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
* with 1G pages we lose 16MB instead of 16G (0.0014% instead of 1.5% of total memory)
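
For the record, the per-1TB numbers above can be derived as follows
(assuming a 64-byte struct page and 4K base pages):

    baseline:  1TB / 4K * 64 bytes         = 16G of struct pages
    2M pages:  (1TB / 2M) * 2 * 4K         =  4G   (2 unique vmemmap pages
                                                    per 2M page)
    1G pages:  (1TB / 1G) * (2 + 2) * 4K   = 16M   (2 unique vmemmap pages
                                                    plus 2 PMD pagetable pages
                                                    per 1G page, see patch 8)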

Along the way I've extended this past 'struct page' overhead, *trying* to
address a few performance issues we knew about for pmem, specifically that
{pin,get}_user_pages_fast() with device-dax VMAs is really slow even in the
fast variants. THP is great in the -fast variants, but everything except
hugetlbfs performs rather poorly in non-fast gup. I have deferred the
__get_user_pages() improvements to a follow-up series I have stashed, as they
are orthogonal to device-dax given that THP suffers from the same syndrome.

So to summarize what the series does:

Patch 1: Prepare hwpoisoning to work with dax compound pages.

Patches 2-4: Have memmap_init_zone_device() initialize its metadata as compound
pages. We split the current utility function prep_compound_page() into head
and tail counterparts and use those two helpers where appropriate, to take
advantage of caches being warm after __init_single_page(). Since the RFC this
also lets us further speed up init time from 190ms down to 80ms.

Patches 5-10: Much like Muchun's series, we reuse PTE (and PMD) tail page vmemmap
areas across a given page size (namely @align, as it is referred to by the
remaining memremap/dax code) and enable memremap to initialize the ZONE_DEVICE
pages as compound pages of a given @align order. The main difference, though,
is that contrary to the hugetlbfs series there is no pre-existing vmemmap for
the area, because we are populating it as opposed to remapping it. IOW there is
no freeing of pages of already-initialized vmemmap as in the hugetlbfs case,
which simplifies the logic (besides not being arch-specific). After these,
there is a quite visible improvement in region bootstrap of the pmem memmap,
given that we initialize fewer struct pages depending on the page size when
struct pages are DRAM backed. altmap sees no difference in bootstrap.

    NVDIMM namespace bootstrap improves from ~268-358 ms to ~78-100 ms / <1 ms on 128G NVDIMMs
    with 2M and 1G pages respectively.

Patch 11: Optimize grabbing page refcount changes given that we
are working with compound pages, i.e. we do 1 increment to the head
page for a given set of N subpages as opposed to N individual writes.
{get,pin}_user_pages_fast() for zone_device with a compound pagemap consequently
improves considerably with DRAM-stored struct pages. It also *greatly*
improves pinning with altmap. Results with gup_test:

                                                   before     after
    (16G get_user_pages_fast 2M page size)         ~59 ms -> ~6.1 ms
    (16G pin_user_pages_fast 2M page size)         ~87 ms -> ~6.2 ms
    (16G get_user_pages_fast altmap 2M page size) ~494 ms -> ~9 ms
    (16G pin_user_pages_fast altmap 2M page size) ~494 ms -> ~10 ms

    altmap performance gets especially interesting when pinning a pmem dimm:

                                                   before     after
    (128G get_user_pages_fast 2M page size)         ~492 ms -> ~49 ms
    (128G pin_user_pages_fast 2M page size)         ~493 ms -> ~50 ms
    (128G get_user_pages_fast altmap 2M page size)  ~3.91 sec -> ~70 ms
    (128G pin_user_pages_fast altmap 2M page size)  ~3.97 sec -> ~74 ms

The unpinning improvement patches are in mmotm/linux-next so removed from this
series.

I have deferred the __get_user_pages() patch to outside this series
(https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.martins@oracle.com/),
as I found a simpler way to address it which is also applicable to
THP. I will submit that as a follow-up to this series.

Patches apply on top of linux-next tag next-20210325 (commit b4f20b70784a).

Comments and suggestions very much appreciated!

Changelog,

 RFC -> v1:
 (New patches 1-3, 5-8, but the diffstat isn't that different)
 * Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
 * Fix/Massage commit messages to be clearer and remove the 'we' occurrences (Dan, John, Matthew)
 * Use pfn_align to make clear it's the nr of pages for the @align value (John, Dan)
 * Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
 * Remove the gup_device_compound_huge special path and have the same code
   work both ways, while special-casing when the devmap page is compound (Jason, John)
 * Avoid usage of vmemmap_populate_basepages() and introduce a first class
   loop that doesn't care about passing an altmap for memmap reuse. (Dan)
 * Completely rework vmemmap_populate_compound() to avoid the hack of passing
   a block across sparse_add_section() calls. It's a lot easier to
   follow and more explicit in what it does.
 * Replace the vmemmap refactoring with adding a @pgmap argument and moving
   parts of vmemmap_populate_basepages(). (Patches 5 and 6 are new as a result)
 * Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
 * Improve memmap_init_zone_device() to initialize compound pages while
   struct pages are cache-warm. That led to an even further speedup over the
   RFC series, from 190ms -> 80-120ms. Patches 2 and 3 are the new ones
   as a result (Dan)
 * Remove PGMAP_COMPOUND and use @align as the property to detect whether
   or not to reuse vmemmap areas (Dan)

Thanks,
	Joao

Joao Martins (11):
  memory-failure: fetch compound_head after pgmap_pfn_valid()
  mm/page_alloc: split prep_compound_page into head and tail subparts
  mm/page_alloc: refactor memmap_init_zone_device() page init
  mm/memremap: add ZONE_DEVICE support for compound pages
  mm/sparse-vmemmap: add a pgmap argument to section activation
  mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
  mm/sparse-vmemmap: populate compound pagemaps
  mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
  mm/page_alloc: reuse tail struct pages for compound pagemaps
  device-dax: compound pagemap support
  mm/gup: grab head page refcount once for group of subpages

 drivers/dax/device.c           |  58 +++++++--
 include/linux/memory_hotplug.h |   5 +-
 include/linux/memremap.h       |  13 ++
 include/linux/mm.h             |   8 +-
 mm/gup.c                       |  52 +++++---
 mm/memory-failure.c            |   2 +
 mm/memory_hotplug.c            |   3 +-
 mm/memremap.c                  |   9 +-
 mm/page_alloc.c                | 126 +++++++++++++------
 mm/sparse-vmemmap.c            | 221 +++++++++++++++++++++++++++++----
 mm/sparse.c                    |  24 ++--
 11 files changed, 406 insertions(+), 115 deletions(-)

-- 
2.17.1




* [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-04-24  0:12   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton, Joao Martins

memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
dax_lock_page()).  For pagemaps with compound pages, fetch the
compound_head in case we are handling a tail page memory failure.

Currently this is a nop, but with the advent of compound pages in
dev_pagemap it allows memory_failure_dev_pagemap() to keep working.

Reported-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/memory-failure.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..94240d772623 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1318,6 +1318,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		goto out;
 	}
 
+	page = compound_head(page);
+
 	/*
 	 * Prevent the inode from being freed while we are interrogating
 	 * the address_space, typically this would be handled by
-- 
2.17.1




* [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  2021-03-25 23:09 ` [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-04-24  0:16   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton, Joao Martins

Split the utility function prep_compound_page() into head and tail
counterparts, and use them accordingly.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/page_alloc.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c53fe4fa10bf..43dd98446b0b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -692,24 +692,34 @@ void free_compound_page(struct page *page)
 	__free_pages_ok(page, compound_order(page), FPI_NONE);
 }
 
+static void prep_compound_head(struct page *page, unsigned int order)
+{
+	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
+	set_compound_order(page, order);
+	atomic_set(compound_mapcount_ptr(page), -1);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+}
+
+static void prep_compound_tail(struct page *head, int tail_idx)
+{
+	struct page *p = head + tail_idx;
+
+	set_page_count(p, 0);
+	p->mapping = TAIL_MAPPING;
+	set_compound_head(p, head);
+}
+
 void prep_compound_page(struct page *page, unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
 
 	__SetPageHead(page);
-	for (i = 1; i < nr_pages; i++) {
-		struct page *p = page + i;
-		set_page_count(p, 0);
-		p->mapping = TAIL_MAPPING;
-		set_compound_head(p, page);
-	}
+	for (i = 1; i < nr_pages; i++)
+		prep_compound_tail(page, i);
 
-	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
-	set_compound_order(page, order);
-	atomic_set(compound_mapcount_ptr(page), -1);
-	if (hpage_pincount_available(page))
-		atomic_set(compound_pincount_ptr(page), 0);
+	prep_compound_head(page, order);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
-- 
2.17.1




* [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  2021-03-25 23:09 ` [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
  2021-03-25 23:09 ` [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-04-24  0:18   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton, Joao Martins

Move struct page init to a helper function __init_zone_device_page().

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/page_alloc.c | 74 +++++++++++++++++++++++++++----------------------
 1 file changed, 41 insertions(+), 33 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 43dd98446b0b..58974067bbd4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6237,6 +6237,46 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
 }
 
 #ifdef CONFIG_ZONE_DEVICE
+static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
+					  unsigned long zone_idx, int nid,
+					  struct dev_pagemap *pgmap)
+{
+
+	__init_single_page(page, pfn, zone_idx, nid);
+
+	/*
+	 * Mark page reserved as it will need to wait for onlining
+	 * phase for it to be fully associated with a zone.
+	 *
+	 * We can use the non-atomic __set_bit operation for setting
+	 * the flag as we are still initializing the pages.
+	 */
+	__SetPageReserved(page);
+
+	/*
+	 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
+	 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
+	 * ever freed or placed on a driver-private list.
+	 */
+	page->pgmap = pgmap;
+	page->zone_device_data = NULL;
+
+	/*
+	 * Mark the block movable so that blocks are reserved for
+	 * movable at startup. This will force kernel allocations
+	 * to reserve their blocks rather than leaking throughout
+	 * the address space during boot when many long-lived
+	 * kernel allocations are made.
+	 *
+	 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
+	 * because this is done early in section_activate()
+	 */
+	if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
+		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+		cond_resched();
+	}
+}
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6265,39 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
-		__init_single_page(page, pfn, zone_idx, nid);
-
-		/*
-		 * Mark page reserved as it will need to wait for onlining
-		 * phase for it to be fully associated with a zone.
-		 *
-		 * We can use the non-atomic __set_bit operation for setting
-		 * the flag as we are still initializing the pages.
-		 */
-		__SetPageReserved(page);
-
-		/*
-		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
-		 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
-		 * ever freed or placed on a driver-private list.
-		 */
-		page->pgmap = pgmap;
-		page->zone_device_data = NULL;
-
-		/*
-		 * Mark the block movable so that blocks are reserved for
-		 * movable at startup. This will force kernel allocations
-		 * to reserve their blocks rather than leaking throughout
-		 * the address space during boot when many long-lived
-		 * kernel allocations are made.
-		 *
-		 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
-		 * because this is done early in section_activate()
-		 */
-		if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
-			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-			cond_resched();
-		}
+		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1




* [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (2 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-05-05 18:44   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton, Joao Martins

Add a new align property for struct dev_pagemap which specifies that a
pagemap is composed of a set of compound pages of size @align, instead
of base pages. When these pages are initialised, most are initialised as
tail pages instead of order-0 pages.

For certain ZONE_DEVICE users like device-dax which have a fixed page
size, this creates an opportunity to optimize GUP and GUP-fast walkers,
treating it the same way as THP or hugetlb pages.
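
For instance, the device-dax patch later in this series opts in simply by
setting the new field before devm_memremap_pages() (excerpted here for
context):

	pgmap->type = MEMORY_DEVICE_GENERIC;
	if (dev_dax->align > PAGE_SIZE)
		pgmap->align = dev_dax->align;

	addr = devm_memremap_pages(dev, pgmap);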

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/memremap.h | 13 +++++++++++++
 mm/memremap.c            |  8 ++++++--
 mm/page_alloc.c          | 24 +++++++++++++++++++++++-
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index b46f63dcaed3..bb28d82dda5e 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -114,6 +114,7 @@ struct dev_pagemap {
 	struct completion done;
 	enum memory_type type;
 	unsigned int flags;
+	unsigned long align;
 	const struct dev_pagemap_ops *ops;
 	void *owner;
 	int nr_range;
@@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
 	return NULL;
 }
 
+static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
+{
+	if (!pgmap || !pgmap->align)
+		return PAGE_SIZE;
+	return pgmap->align;
+}
+
+static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
+{
+	return PHYS_PFN(pgmap_align(pgmap));
+}
+
 #ifdef CONFIG_ZONE_DEVICE
 bool pfn_zone_device_reserved(unsigned long pfn);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
diff --git a/mm/memremap.c b/mm/memremap.c
index 805d761740c4..d160853670c4 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
 				PHYS_PFN(range->start),
 				PHYS_PFN(range_len(range)), pgmap);
-	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
-			- pfn_first(pgmap, range_id));
+	if (pgmap_align(pgmap) > PAGE_SIZE)
+		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
+			- pfn_first(pgmap, range_id)) / pgmap_pfn_align(pgmap));
+	else
+		percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
+				- pfn_first(pgmap, range_id));
 	return 0;
 
 err_add_memory:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 58974067bbd4..3a77f9e43f3a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	unsigned long pfn, end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
+	unsigned int pfn_align = pgmap_pfn_align(pgmap);
+	unsigned int order_align = order_base_2(pfn_align);
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
@@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		nr_pages = end_pfn - start_pfn;
 	}
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
 		struct page *page = pfn_to_page(pfn);
+		unsigned long i;
 
 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+
+		if (pfn_align == 1)
+			continue;
+
+		__SetPageHead(page);
+
+		for (i = 1; i < pfn_align; i++) {
+			__init_zone_device_page(page + i, pfn + i, zone_idx,
+						nid, pgmap);
+			prep_compound_tail(page, i);
+
+			/*
+			 * The first and second tail pages need to be
+			 * initialized first, hence the head page is
+			 * prepared last.
+			 */
+			if (i == 2)
+				prep_compound_head(page, order_align);
+		}
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1




* [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (3 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-05-05 22:34   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages() Joao Martins
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton, Joao Martins

@altmap is stored in a dev_pagemap, but it will be repurposed for
memory hotplug to store the memmap in the hotplugged memory[*], reusing
the altmap infrastructure to that end. This is to say that @altmap can't
simply be replaced with a @pgmap, as altmap is going to cover more than
dev_pagemap-backed altmaps.

So in addition to @altmap, pass @pgmap to sparse section populate
functions namely:

	sparse_add_section
	  section_activate
	    populate_section_memmap
   	      __populate_section_memmap

Passing @pgmap allows __populate_section_memmap() to both fetch the align
that the memmap metadata is created for, and also lets sparse-vmemmap
fetch the pgmap ranges to correlate with a given section and decide whether
to just reuse tail pages from previously onlined sections.

[*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/memory_hotplug.h |  5 ++++-
 include/linux/mm.h             |  3 ++-
 mm/memory_hotplug.c            |  3 ++-
 mm/sparse-vmemmap.c            |  3 ++-
 mm/sparse.c                    | 24 +++++++++++++++---------
 5 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index a85d4b7d15c2..45532192934c 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -14,6 +14,7 @@ struct mem_section;
 struct memory_block;
 struct resource;
 struct vmem_altmap;
+struct dev_pagemap;
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 struct page *pfn_to_online_page(unsigned long pfn);
@@ -72,6 +73,7 @@ typedef int __bitwise mhp_t;
 struct mhp_params {
 	struct vmem_altmap *altmap;
 	pgprot_t pgprot;
+	struct dev_pagemap *pgmap;
 };
 
 bool mhp_range_allowed(u64 start, u64 size, bool need_mapping);
@@ -360,7 +362,8 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
 				       unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern int sparse_add_section(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap);
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap);
 extern void sparse_remove_section(struct mem_section *ms,
 		unsigned long pfn, unsigned long nr_pages,
 		unsigned long map_offset, struct vmem_altmap *altmap);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cb1e191da319..61474602c2b1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3033,7 +3033,8 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
 
 void *sparse_buffer_alloc(unsigned long size);
 struct page * __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap);
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e0b499a1fee4..2df3b2a7b4b5 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -384,7 +384,8 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages = min(end_pfn - pfn,
 				   SECTION_ALIGN_UP(pfn + 1) - pfn);
-		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap);
+		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap,
+					 params->pgmap);
 		if (err)
 			break;
 		cond_resched();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 16183d85a7d5..370728c206ee 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -249,7 +249,8 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 }
 
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
diff --git a/mm/sparse.c b/mm/sparse.c
index be66a62e22c3..c2abf1281a89 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -443,7 +443,8 @@ static unsigned long __init section_map_size(void)
 }
 
 struct page __init *__populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long size = section_map_size();
 	struct page *map = sparse_buffer_alloc(size);
@@ -542,7 +543,7 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 			break;
 
 		map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
-				nid, NULL);
+				nid, NULL, NULL);
 		if (!map) {
 			pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
 			       __func__, nid);
@@ -648,9 +649,10 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 static struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
-	return __populate_section_memmap(pfn, nr_pages, nid, altmap);
+	return __populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
 }
 
 static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
@@ -719,7 +721,8 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
 }
 #else
 struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	return kvmalloc_node(array_size(sizeof(struct page),
 					PAGES_PER_SECTION), GFP_KERNEL, nid);
@@ -842,7 +845,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 }
 
 static struct page * __meminit section_activate(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	struct mem_section *ms = __pfn_to_section(pfn);
 	struct mem_section_usage *usage = NULL;
@@ -874,7 +878,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
 	if (nr_pages < PAGES_PER_SECTION && early_section(ms))
 		return pfn_to_page(pfn);
 
-	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap);
+	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
 	if (!memmap) {
 		section_deactivate(pfn, nr_pages, altmap);
 		return ERR_PTR(-ENOMEM);
@@ -889,6 +893,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  * @start_pfn: start pfn of the memory range
  * @nr_pages: number of pfns to add in the section
  * @altmap: device page map
+ * @pgmap: device page map object that owns the section
  *
  * This is only intended for hotplug.
  *
@@ -902,7 +907,8 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  * * -ENOMEM	- Out of memory.
  */
 int __meminit sparse_add_section(int nid, unsigned long start_pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
 	struct mem_section *ms;
@@ -913,7 +919,7 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 	if (ret < 0)
 		return ret;
 
-	memmap = section_activate(nid, start_pfn, nr_pages, altmap);
+	memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
 	if (IS_ERR(memmap))
 		return PTR_ERR(memmap);
 
-- 
2.17.1




* [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (4 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-05-05 22:43   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton, Joao Martins

Move the actual pte population logic into a separate function
vmemmap_populate_address() and have vmemmap_populate_basepages()
walk through all base pages it needs to populate.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/sparse-vmemmap.c | 44 ++++++++++++++++++++++++++------------------
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 370728c206ee..8814876edcfa 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -216,33 +216,41 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
-					 int node, struct vmem_altmap *altmap)
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+					      struct vmem_altmap *altmap)
 {
-	unsigned long addr = start;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
+	pgd = vmemmap_pgd_populate(addr, node);
+	if (!pgd)
+		return -ENOMEM;
+	p4d = vmemmap_p4d_populate(pgd, addr, node);
+	if (!p4d)
+		return -ENOMEM;
+	pud = vmemmap_pud_populate(p4d, addr, node);
+	if (!pud)
+		return -ENOMEM;
+	pmd = vmemmap_pmd_populate(pud, addr, node);
+	if (!pmd)
+		return -ENOMEM;
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+	if (!pte)
+		return -ENOMEM;
+	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+}
+
+int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
+					 int node, struct vmem_altmap *altmap)
+{
+	unsigned long addr = start;
+
 	for (; addr < end; addr += PAGE_SIZE) {
-		pgd = vmemmap_pgd_populate(addr, node);
-		if (!pgd)
-			return -ENOMEM;
-		p4d = vmemmap_p4d_populate(pgd, addr, node);
-		if (!p4d)
-			return -ENOMEM;
-		pud = vmemmap_pud_populate(p4d, addr, node);
-		if (!pud)
-			return -ENOMEM;
-		pmd = vmemmap_pmd_populate(pud, addr, node);
-		if (!pmd)
-			return -ENOMEM;
-		pte = vmemmap_pte_populate(pmd, addr, node, altmap);
-		if (!pte)
+		if (vmemmap_populate_address(addr, node, altmap))
 			return -ENOMEM;
-		vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
 	}
 
 	return 0;
-- 
2.17.1




* [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (5 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages() Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-05-06  1:18   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD " Joao Martins
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton, Joao Martins

A compound pagemap is a dev_pagemap with @align > PAGE_SIZE, meaning
that pages are mapped at a given huge page alignment and use compound
pages as opposed to order-0 pages.

To minimize struct page overhead we take advantage of the fact that
most tail pages look the same (except the first two). We allocate a
separate page for the vmemmap area which contains the head page and a
separate one for the next 64 pages. The rest of the subsections then
reuse this tail vmemmap page to initialize the rest of the tail pages.

Sections are arch-dependent (e.g. on x86 a section is 64M, 128M or 512M)
and when initializing a compound pagemap with a big enough @align (e.g. a
1G PUD) it will cross multiple sections. To be able to reuse tail pages
across sections belonging to the same gigantic page, we fetch the
@range being mapped (nr_ranges + 1).  If the section being mapped is
not at offset 0 of the @align, then look up the PFN of the struct page
address that precedes it and use that to populate the entire
section.

On compound pagemaps with a 2M align, this mechanism saves 6 out of the 8
vmemmap pages necessary to describe the subsection's 512 struct pages
being mapped. On a 1G compound pagemap it saves 4094 pages.

Altmap isn't supported yet, given various restrictions of the altmap pfn
allocator, thus we fall back to the already in use vmemmap_populate().
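
For intuition, here is a userspace-only sketch (illustrative, not the kernel
code itself) of the resulting PTE plan for one 2M subsection, i.e. 8 vmemmap
PTEs covering 512 struct pages of 64 bytes each:

	#include <stdio.h>

	int main(void)
	{
		int i;

		/* 2M / 4K pages * 64 bytes / 4K = 8 vmemmap PTEs */
		for (i = 0; i < 8; i++)
			printf("pte[%d] -> %s\n", i,
			       i == 0 ? "head vmemmap page (unique)" :
			       i == 1 ? "tail vmemmap page (unique)" :
					"reuses the page behind pte[1]");
		return 0;
	}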

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/mm.h  |   2 +-
 mm/memremap.c       |   1 +
 mm/sparse-vmemmap.c | 139 ++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 131 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 61474602c2b1..49d717ae40ae 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3040,7 +3040,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-			    struct vmem_altmap *altmap);
+			    struct vmem_altmap *altmap, void *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
 struct vmem_altmap;
 void *vmemmap_alloc_block_buf(unsigned long size, int node,
diff --git a/mm/memremap.c b/mm/memremap.c
index d160853670c4..2e6bc0b1ff00 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 {
 	struct mhp_params params = {
 		.altmap = pgmap_altmap(pgmap),
+		.pgmap = pgmap,
 		.pgprot = PAGE_KERNEL,
 	};
 	const int nr_range = pgmap->nr_range;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 8814876edcfa..f57c5eada099 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 }
 
 pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-				       struct vmem_altmap *altmap)
+				       struct vmem_altmap *altmap, void *block)
 {
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte)) {
 		pte_t entry;
-		void *p;
-
-		p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
-		if (!p)
-			return NULL;
+		void *p = block;
+
+		if (!block) {
+			p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+			if (!p)
+				return NULL;
+		} else if (!altmap) {
+			get_page(virt_to_page(block));
+		}
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
 		set_pte_at(&init_mm, addr, pte, entry);
 	}
@@ -217,7 +221,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 }
 
 static int __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap)
+					      struct vmem_altmap *altmap,
+					      void *page, void **ptr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -237,10 +242,14 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pmd = vmemmap_pmd_populate(pud, addr, node);
 	if (!pmd)
 		return -ENOMEM;
-	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
 	if (!pte)
 		return -ENOMEM;
 	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+
+	if (ptr)
+		*ptr = __va(__pfn_to_phys(pte_pfn(*pte)));
+	return 0;
 }
 
 int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
@@ -249,7 +258,110 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 	unsigned long addr = start;
 
 	for (; addr < end; addr += PAGE_SIZE) {
-		if (vmemmap_populate_address(addr, node, altmap))
+		if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int __meminit vmemmap_populate_range(unsigned long start,
+					    unsigned long end,
+					    int node, void *page)
+{
+	unsigned long addr = start;
+
+	for (; addr < end; addr += PAGE_SIZE) {
+		if (vmemmap_populate_address(addr, node, NULL, page, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
+						  void **ptr)
+{
+	return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
+}
+
+static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = pgd_offset_k(addr);
+	if (pgd_none(*pgd))
+		return NULL;
+
+	p4d = p4d_offset(pgd, addr);
+	if (p4d_none(*p4d))
+		return NULL;
+
+	pud = pud_offset(p4d, addr);
+	if (pud_none(*pud))
+		return NULL;
+
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd))
+		return NULL;
+
+	pte = pte_offset_kernel(pmd, addr);
+	if (pte_none(*pte))
+		return NULL;
+
+	return pte;
+}
+
+static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
+						     unsigned long start,
+						     unsigned long end, int node,
+						     struct dev_pagemap *pgmap)
+{
+	unsigned long offset, size, addr;
+
+	/*
+	 * For compound pages bigger than section size (e.g. 1G) fill the rest
+	 * of sections as tail pages.
+	 *
+	 * Note that memremap_pages() resets @nr_range value and will increment
+	 * it after each successful range onlining. Thus the value of @nr_range
+	 * at section memmap populate corresponds to the in-progress range
+	 * being onlined that we care about.
+	 */
+	offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
+	if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
+	    pgmap_align(pgmap) > SUBSECTION_SIZE) {
+		pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
+
+		if (!ptep)
+			return -ENOMEM;
+
+		return vmemmap_populate_range(start, end, node,
+					      page_to_virt(pte_page(*ptep)));
+	}
+
+	size = min(end - start, pgmap_pfn_align(pgmap) * sizeof(struct page));
+	for (addr = start; addr < end; addr += size) {
+		unsigned long next = addr, last = addr + size;
+		void *block;
+
+		/* Populate the head page vmemmap page */
+		if (vmemmap_populate_page(addr, node, NULL))
+			return -ENOMEM;
+
+		/* Populate the tail pages vmemmap page */
+		block = NULL;
+		next = addr + PAGE_SIZE;
+		if (vmemmap_populate_page(next, node, &block))
+			return -ENOMEM;
+
+		/* Reuse the previous page for the rest of tail pages */
+		next += PAGE_SIZE;
+		if (vmemmap_populate_range(next, last, node, block))
 			return -ENOMEM;
 	}
 
@@ -262,12 +374,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
+	unsigned int align = pgmap_align(pgmap);
+	int r;
 
 	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
 		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
 		return NULL;
 
-	if (vmemmap_populate(start, end, nid, altmap))
+	if (align > PAGE_SIZE && !altmap)
+		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
+	else
+		r = vmemmap_populate(start, end, nid, altmap);
+
+	if (r < 0)
 		return NULL;
 
 	return pfn_to_page(pfn);
-- 
2.17.1




* [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (6 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-06-01 19:30   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for " Joao Martins
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton, Joao Martins

Right now basepages are used to populate the PUD tail pages, and the code
picks the address of the previous page of the subsection that precedes
the memmap we are initializing.  This is done when a given memmap
address isn't aligned to the pgmap @align (which is safe to do because
@ranges are guaranteed to be aligned to @align).

For pagemaps with an align which spans multiple sections, this means
that PMD pages are unnecessarily allocated for reusing the same tail
pages.  Effectively, on x86 a PUD can span 8 sections (depending on
config), and a page is allocated for each PMD just to reuse the tail
vmemmap across the rest of the PTEs. In short, the PMDs covering the
tail vmemmap areas all contain the same PFN. So instead of doing it
this way, populate a new PMD on the second section of the compound
page (the tail vmemmap PMD), and then have the following sections
utilize the preceding PMD we previously populated, which contains only
tail pages.

After this scheme, for a 1GB-aligned pagemap area, the first PMD
(section) contains the head page and 32767 tail pages, whereas the
second PMD contains the full 32768 tail pages.  The latter page gets
its PMD reused across future section mappings of the same pagemap.

Besides allocating fewer pagetable entries, and keeping parity with
hugepages in the directmap (as done by vmemmap_populate_hugepages()),
this further increases savings per compound page. For each PUD-aligned
pagemap we go from 40960 bytes down to 16384 bytes. Rather than
requiring 8 PMD page allocations we only need 2 (plus two base pages
allocated for the head and tail areas of the first PMD). 2M pages still
require using base pages, though.
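
For the record, the 40960 -> 16384 bytes figure breaks down as follows
(assuming 4K pages and 128M sections, i.e. 8 sections per 1G PUD):

	before: 2 base pages (head + tail vmemmap) + 8 PMD pages = 10 * 4K = 40960
	after:  2 base pages + 2 PMD pages (the first section's
	        PMD plus the tail-only PMD that later sections
	        reuse)                                           =  4 * 4K = 16384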

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/mm.h  |  3 +-
 mm/sparse-vmemmap.c | 79 ++++++++++++++++++++++++++++++++++-----------
 2 files changed, 63 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 49d717ae40ae..9c1a676d6b95 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3038,7 +3038,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
-pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
+pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+			    void *block);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
 			    struct vmem_altmap *altmap, void *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index f57c5eada099..291a8a32480a 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -172,13 +172,20 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
 	return p;
 }
 
-pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
+pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+				       void *block)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
-		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
-		if (!p)
-			return NULL;
+		void *p = block;
+
+		if (!block) {
+			p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
+			if (!p)
+				return NULL;
+		} else {
+			get_page(virt_to_page(block));
+		}
 		pmd_populate_kernel(&init_mm, pmd, p);
 	}
 	return pmd;
@@ -220,15 +227,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-static int __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap,
-					      void *page, void **ptr)
+static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
+						  struct vmem_altmap *altmap,
+						  void *page, pmd_t **ptr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte;
 
 	pgd = vmemmap_pgd_populate(addr, node);
 	if (!pgd)
@@ -239,9 +245,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pud = vmemmap_pud_populate(p4d, addr, node);
 	if (!pud)
 		return -ENOMEM;
-	pmd = vmemmap_pmd_populate(pud, addr, node);
+	pmd = vmemmap_pmd_populate(pud, addr, node, page);
 	if (!pmd)
 		return -ENOMEM;
+	if (ptr)
+		*ptr = pmd;
+	return 0;
+}
+
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+					      struct vmem_altmap *altmap,
+					      void *page, void **ptr)
+{
+	pmd_t *pmd;
+	pte_t *pte;
+
+	if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
+		return -ENOMEM;
+
 	pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
 	if (!pte)
 		return -ENOMEM;
@@ -285,13 +306,26 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
 	return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
 }
 
-static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
+static int __meminit vmemmap_populate_pmd_range(unsigned long start,
+						unsigned long end,
+						int node, void *page)
+{
+	unsigned long addr = start;
+
+	for (; addr < end; addr += PMD_SIZE) {
+		if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static pmd_t * __meminit vmemmap_lookup_address(unsigned long addr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte;
 
 	pgd = pgd_offset_k(addr);
 	if (pgd_none(*pgd))
@@ -309,11 +343,7 @@ static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
 	if (pmd_none(*pmd))
 		return NULL;
 
-	pte = pte_offset_kernel(pmd, addr);
-	if (pte_none(*pte))
-		return NULL;
-
-	return pte;
+	return pmd;
 }
 
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
@@ -335,9 +365,22 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 	offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
 	if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
 	    pgmap_align(pgmap) > SUBSECTION_SIZE) {
-		pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		addr = start - PAGE_SIZE;
+		pmdp = vmemmap_lookup_address(addr);
+		if (!pmdp)
+			return -ENOMEM;
+
+		/* Reuse the tail pages vmemmap pmd page */
+		if (offset % pgmap->align > PFN_PHYS(PAGES_PER_SECTION))
+			return vmemmap_populate_pmd_range(start, end, node,
+						page_to_virt(pmd_page(*pmdp)));
 
-		if (!ptep)
+		/* Populate the tail pages vmemmap pmd page */
+		ptep = pte_offset_kernel(pmdp, addr);
+		if (pte_none(*ptep))
 			return -ENOMEM;
 
 		return vmemmap_populate_range(start, end, node,
-- 
2.17.1




* [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (7 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD " Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-06-01 23:35   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 10/11] device-dax: compound pagemap support Joao Martins
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton, Joao Martins

When a pgmap @align is set, all pages are mapped at a given huge page
alignment and thus use compound pages to describe them, as opposed to a
struct page per 4K.

With @align > PAGE_SIZE and when struct pages are stored in RAM
(!altmap), most tail pages are reused. Consequently, the number of unique
struct pages is a lot smaller than the total number of struct pages
being mapped.

When struct pages are initialized in memmap_init_zone_device(), make
sure that only unique struct pages are initialized, i.e. the first 2
4K pages per @align, which means 128 struct pages instead of 512 for a
2M @align or 262144 for a 1G @align.

Keep the old behaviour with altmap, given that it doesn't support reuse
of tail vmemmap areas.
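
A quick arithmetic sketch of the unique struct page count (userspace only;
it assumes a 64-byte struct page and 4K pages):

	#include <stdio.h>

	int main(void)
	{
		unsigned long page = 4096, sp = 64;

		/* the first two vmemmap pages per @align stay unique */
		printf("unique per compound page: %lu struct pages\n",
		       2 * page / sp);
		printf("2M align without reuse:   %lu struct pages\n",
		       (2UL << 20) / page);
		printf("1G align without reuse:   %lu struct pages\n",
		       (1UL << 30) / page);
		return 0;
	}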

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/page_alloc.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3a77f9e43f3a..948dfad6754b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6277,6 +6277,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	}
 }
 
+#define MEMMAP_NR_PAGES	(2 * (PAGE_SIZE/sizeof(struct page)))
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6287,6 +6289,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
 	unsigned int pfn_align = pgmap_pfn_align(pgmap);
 	unsigned int order_align = order_base_2(pfn_align);
+	unsigned long ntails = min_t(unsigned long, pfn_align, MEMMAP_NR_PAGES);
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
@@ -6302,6 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	if (altmap) {
 		start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
 		nr_pages = end_pfn - start_pfn;
+		ntails = pfn_align;
 	}
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
@@ -6315,7 +6319,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 
 		__SetPageHead(page);
 
-		for (i = 1; i < pfn_align; i++) {
+		for (i = 1; i < ntails; i++) {
 			__init_zone_device_page(page + i, pfn + i, zone_idx,
 						nid, pgmap);
 			prep_compound_tail(page, i);
-- 
2.17.1




* [PATCH v1 10/11] device-dax: compound pagemap support
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (8 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for " Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-06-02  0:36   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages Joao Martins
  2021-04-01  9:38 ` [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  11 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton, Joao Martins

dax devices are created with a fixed @align (huge page size) which
is also enforced at mmap() of the device. Faults consequently happen
at the @align specified at creation time, and that doesn't change
throughout the dax device's lifetime. MCEs poison a whole dax huge
page, and splits also occur at the configured page size.

Use the newly added compound pagemap facility which maps the
assigned dax ranges as compound pages at a page size of @align.
Currently this means that region/namespace bootstrap takes
considerably less time, given that considerably fewer pages are
initialized.

On setups with 128G NVDIMMs the initialization with DRAM-stored struct pages
improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less than
1 msec with 1G pages.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/device.c | 58 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 45 insertions(+), 13 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index db92573c94e8..e3dcc4ad1727 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -192,6 +192,43 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 }
 #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
+static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
+			     unsigned long fault_size,
+			     struct address_space *f_mapping)
+{
+	unsigned long i;
+	pgoff_t pgoff;
+
+	pgoff = linear_page_index(vmf->vma, vmf->address
+			& ~(fault_size - 1));
+
+	for (i = 0; i < fault_size / PAGE_SIZE; i++) {
+		struct page *page;
+
+		page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+		if (page->mapping)
+			continue;
+		page->mapping = f_mapping;
+		page->index = pgoff + i;
+	}
+}
+
+static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
+				 unsigned long fault_size,
+				 struct address_space *f_mapping)
+{
+	struct page *head;
+
+	head = pfn_to_page(pfn_t_to_pfn(pfn));
+	head = compound_head(head);
+	if (head->mapping)
+		return;
+
+	head->mapping = f_mapping;
+	head->index = linear_page_index(vmf->vma, vmf->address
+			& ~(fault_size - 1));
+}
+
 static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
@@ -225,8 +262,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 	}
 
 	if (rc == VM_FAULT_NOPAGE) {
-		unsigned long i;
-		pgoff_t pgoff;
+		struct dev_pagemap *pgmap = pfn_t_to_page(pfn)->pgmap;
 
 		/*
 		 * In the device-dax case the only possibility for a
@@ -234,17 +270,10 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		 * mapped. No need to consider the zero page, or racing
 		 * conflicting mappings.
 		 */
-		pgoff = linear_page_index(vmf->vma, vmf->address
-				& ~(fault_size - 1));
-		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
-			struct page *page;
-
-			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
-			if (page->mapping)
-				continue;
-			page->mapping = filp->f_mapping;
-			page->index = pgoff + i;
-		}
+		if (pgmap->align > PAGE_SIZE)
+			set_compound_mapping(vmf, pfn, fault_size, filp->f_mapping);
+		else
+			set_page_mapping(vmf, pfn, fault_size, filp->f_mapping);
 	}
 	dax_read_unlock(id);
 
@@ -426,6 +455,9 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 	}
 
 	pgmap->type = MEMORY_DEVICE_GENERIC;
+	if (dev_dax->align > PAGE_SIZE)
+		pgmap->align = dev_dax->align;
+
 	addr = devm_memremap_pages(dev, pgmap);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (9 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 10/11] device-dax: compound pagemap support Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-06-02  1:05   ` Dan Williams
  2021-04-01  9:38 ` [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  11 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton, Joao Martins

Much like hugetlbfs or THPs, treat device pagemaps with
compound pages the same way GUP handles other compound pages.

Rather than incrementing the refcount for every 4K subpage, record
all subpages and increment the head page refcount by @refs *once*.

Performance measured by gup_benchmark improves considerably for
get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:

 $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
(get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
(pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
[altmap]
(get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
(pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms

 $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
(get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
(pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
[altmap with -m 127004]
(get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
(pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/gup.c | 52 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 20 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index b3e647c8b7ee..514f12157a0f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2159,31 +2159,54 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 }
 #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
 
+
+static int record_subpages(struct page *page, unsigned long addr,
+			   unsigned long end, struct page **pages)
+{
+	int nr;
+
+	for (nr = 0; addr != end; addr += PAGE_SIZE)
+		pages[nr++] = page++;
+
+	return nr;
+}
+
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 			     unsigned long end, unsigned int flags,
 			     struct page **pages, int *nr)
 {
-	int nr_start = *nr;
+	int refs, nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
 
 	do {
-		struct page *page = pfn_to_page(pfn);
+		struct page *head, *page = pfn_to_page(pfn);
+		unsigned long next;
 
 		pgmap = get_dev_pagemap(pfn, pgmap);
 		if (unlikely(!pgmap)) {
 			undo_dev_pagemap(nr, nr_start, flags, pages);
 			return 0;
 		}
-		SetPageReferenced(page);
-		pages[*nr] = page;
-		if (unlikely(!try_grab_page(page, flags))) {
-			undo_dev_pagemap(nr, nr_start, flags, pages);
+
+		head = compound_head(page);
+		next = PageCompound(head) ? end : addr + PAGE_SIZE;
+		refs = record_subpages(page, addr, next, pages + *nr);
+
+		SetPageReferenced(head);
+		if (!try_grab_compound_head(head, refs, flags)) {
+			/* @head is untouched on failure, so it is still valid here */
+			if (PageCompound(head)) {
+				ClearPageReferenced(head);
+				put_dev_pagemap(pgmap);
+			} else {
+				undo_dev_pagemap(nr, nr_start, flags, pages);
+			}
 			return 0;
 		}
-		(*nr)++;
-		pfn++;
-	} while (addr += PAGE_SIZE, addr != end);
+		*nr += refs;
+		pfn += refs;
+	} while (addr += (refs << PAGE_SHIFT), addr != end);
 
 	if (pgmap)
 		put_dev_pagemap(pgmap);
@@ -2243,17 +2266,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
-static int record_subpages(struct page *page, unsigned long addr,
-			   unsigned long end, struct page **pages)
-{
-	int nr;
-
-	for (nr = 0; addr != end; addr += PAGE_SIZE)
-		pages[nr++] = page++;
-
-	return nr;
-}
-
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (10 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages Joao Martins
@ 2021-04-01  9:38 ` Joao Martins
  11 siblings, 0 replies; 64+ messages in thread
From: Joao Martins @ 2021-04-01  9:38 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On 3/25/21 11:09 PM, Joao Martins wrote:

[...]

> Patch 11: Optimize grabbing page refcount changes given that we
> are working with compound pages i.e. we do 1 increment to the head
> page for a given set of N subpages compared as opposed to N individual writes.
> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
> improves considerably with DRAM stored struct pages. It also *greatly*
> improves pinning with altmap. Results with gup_test:
> 
>                                                    before     after
>     (16G get_user_pages_fast 2M page size)         ~59 ms -> ~6.1 ms
>     (16G pin_user_pages_fast 2M page size)         ~87 ms -> ~6.2 ms
>     (16G get_user_pages_fast altmap 2M page size) ~494 ms -> ~9 ms
>     (16G pin_user_pages_fast altmap 2M page size) ~494 ms -> ~10 ms
> 
>     altmap performance gets specially interesting when pinning a pmem dimm:
> 
>                                                    before     after
>     (128G get_user_pages_fast 2M page size)         ~492 ms -> ~49 ms
>     (128G pin_user_pages_fast 2M page size)         ~493 ms -> ~50 ms
>     (128G get_user_pages_fast altmap 2M page size)  ~3.91 ms -> ~70 ms
>     (128G pin_user_pages_fast altmap 2M page size)  ~3.97 ms -> ~74 ms
> 

Quick correction: These last two 3.91 and 3.97 on the left column are in *seconds* not
milliseconds. By mistake I added an extra 'm'. Sorry about that.

> The unpinning improvement patches are in mmotm/linux-next so removed from this
> series.
> 
> I have deferred the __get_user_pages() patch to outside this series
> (https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.martins@oracle.com/),
> as I found a simpler way to address it and that is also applicable to
> THP. But will submit that as a follow up of this.
> 
> Patches apply on top of linux-next tag next-20210325 (commit b4f20b70784a).
> 
> Comments and suggestions very much appreciated!
> 
> Changelog,
> 
>  RFC -> v1:
>  (New patches 1-3, 5-8 but the diffstat is that different)
>  * Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
>  * Fix/Massage commit messages to be clearer and remove the 'we' occurrences (Dan, John, Matthew)

I just noticed that I haven't fully removed the 'we' occurrences. Patches 7 and 8 had
their commit messages rewritten and I mistakenly re-introduced remnants of 'we'. I will
have that fixed for v2, although I'll still wait for comments on this series before following up.

>  * Use pfn_align to be clear it's nr of pages for @align value (John, Dan)
>  * Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
>  * Remove the gup_device_compound_huge special path and have the same code
>    work both ways while special casing when devmap page is compound (Jason, John)
>  * Avoid usage of vmemmap_populate_basepages() and introduce a first class
>    loop that doesn't care about passing an altmap for memmap reuse. (Dan)
>  * Completely rework the vmemmap_populate_compound() to avoid the sparse_add_section
>    hack into passing block across sparse_add_section calls. It's a lot easier to
>    follow and more explicit in what it does.
>  * Replace the vmemmap refactoring with adding a @pgmap argument and moving
>    parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a result)
>  * Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
>  * Improve memmap_init_zone_device() to initialize compound pages when
>    struct pages are cache warm. That led to an even further speedup from
>    the RFC series, from 190ms -> 80-120ms. Patches 2 and 3 are the new ones
>    as a result (Dan)
>  * Remove PGMAP_COMPOUND and use @align as the property to detect whether
>    or not to reuse vmemmap areas (Dan)
> 
> Thanks,
> 	Joao
> 
> Joao Martins (11):
>   memory-failure: fetch compound_head after pgmap_pfn_valid()
>   mm/page_alloc: split prep_compound_page into head and tail subparts
>   mm/page_alloc: refactor memmap_init_zone_device() page init
>   mm/memremap: add ZONE_DEVICE support for compound pages
>   mm/sparse-vmemmap: add a pgmap argument to section activation
>   mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
>   mm/sparse-vmemmap: populate compound pagemaps
>   mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
>   mm/page_alloc: reuse tail struct pages for compound pagemaps
>   device-dax: compound pagemap support
>   mm/gup: grab head page refcount once for group of subpages
> 
>  drivers/dax/device.c           |  58 +++++++--
>  include/linux/memory_hotplug.h |   5 +-
>  include/linux/memremap.h       |  13 ++
>  include/linux/mm.h             |   8 +-
>  mm/gup.c                       |  52 +++++---
>  mm/memory-failure.c            |   2 +
>  mm/memory_hotplug.c            |   3 +-
>  mm/memremap.c                  |   9 +-
>  mm/page_alloc.c                | 126 +++++++++++++------
>  mm/sparse-vmemmap.c            | 221 +++++++++++++++++++++++++++++----
>  mm/sparse.c                    |  24 ++--
>  11 files changed, 406 insertions(+), 115 deletions(-)
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-03-25 23:09 ` [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
@ 2021-04-24  0:12   ` Dan Williams
  2021-04-24 19:00     ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Dan Williams @ 2021-04-24  0:12 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
> dax_lock_page()).  For pagemap with compound pages fetch the
> compound_head in case we are handling a tail page memory failure.
>
> Currently this is a nop, but in the advent of compound pages in
> dev_pagemap it allows memory_failure_dev_pagemap() to keep working.
>
> Reported-by: Jane Chu <jane.chu@oracle.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/memory-failure.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 24210c9bd843..94240d772623 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1318,6 +1318,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>                 goto out;
>         }
>
> +       page = compound_head(page);

Unless / until we do compound pages for the filesystem-dax case, I
would add a comment like:

/* pages instantiated by device-dax (not filesystem-dax) may be
compound pages */


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts
  2021-03-25 23:09 ` [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
@ 2021-04-24  0:16   ` Dan Williams
  2021-04-24 19:05     ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Dan Williams @ 2021-04-24  0:16 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Split the utility function prep_compound_page() into head and tail
> counterparts, and use them accordingly.

To make this patch stand alone better lets add another sentence:

"This is in preparation for sharing the storage for / deduplicating
compound page metadata."

Other than that, looks good to me.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/page_alloc.c | 32 +++++++++++++++++++++-----------
>  1 file changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c53fe4fa10bf..43dd98446b0b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -692,24 +692,34 @@ void free_compound_page(struct page *page)
>         __free_pages_ok(page, compound_order(page), FPI_NONE);
>  }
>
> +static void prep_compound_head(struct page *page, unsigned int order)
> +{
> +       set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
> +       set_compound_order(page, order);
> +       atomic_set(compound_mapcount_ptr(page), -1);
> +       if (hpage_pincount_available(page))
> +               atomic_set(compound_pincount_ptr(page), 0);
> +}
> +
> +static void prep_compound_tail(struct page *head, int tail_idx)
> +{
> +       struct page *p = head + tail_idx;
> +
> +       set_page_count(p, 0);
> +       p->mapping = TAIL_MAPPING;
> +       set_compound_head(p, head);
> +}
> +
>  void prep_compound_page(struct page *page, unsigned int order)
>  {
>         int i;
>         int nr_pages = 1 << order;
>
>         __SetPageHead(page);
> -       for (i = 1; i < nr_pages; i++) {
> -               struct page *p = page + i;
> -               set_page_count(p, 0);
> -               p->mapping = TAIL_MAPPING;
> -               set_compound_head(p, page);
> -       }
> +       for (i = 1; i < nr_pages; i++)
> +               prep_compound_tail(page, i);
>
> -       set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
> -       set_compound_order(page, order);
> -       atomic_set(compound_mapcount_ptr(page), -1);
> -       if (hpage_pincount_available(page))
> -               atomic_set(compound_pincount_ptr(page), 0);
> +       prep_compound_head(page, order);
>  }
>
>  #ifdef CONFIG_DEBUG_PAGEALLOC
> --
> 2.17.1
>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init
  2021-03-25 23:09 ` [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
@ 2021-04-24  0:18   ` Dan Williams
  2021-04-24 19:05     ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Dan Williams @ 2021-04-24  0:18 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Move struct page init to a helper function __init_zone_device_page().

Same sentence addition suggestion from the last patch to make this
patch have some rationale for existing.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/page_alloc.c | 74 +++++++++++++++++++++++++++----------------------
>  1 file changed, 41 insertions(+), 33 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 43dd98446b0b..58974067bbd4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6237,6 +6237,46 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
>  }
>
>  #ifdef CONFIG_ZONE_DEVICE
> +static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
> +                                         unsigned long zone_idx, int nid,
> +                                         struct dev_pagemap *pgmap)
> +{
> +
> +       __init_single_page(page, pfn, zone_idx, nid);
> +
> +       /*
> +        * Mark page reserved as it will need to wait for onlining
> +        * phase for it to be fully associated with a zone.
> +        *
> +        * We can use the non-atomic __set_bit operation for setting
> +        * the flag as we are still initializing the pages.
> +        */
> +       __SetPageReserved(page);
> +
> +       /*
> +        * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
> +        * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
> +        * ever freed or placed on a driver-private list.
> +        */
> +       page->pgmap = pgmap;
> +       page->zone_device_data = NULL;
> +
> +       /*
> +        * Mark the block movable so that blocks are reserved for
> +        * movable at startup. This will force kernel allocations
> +        * to reserve their blocks rather than leaking throughout
> +        * the address space during boot when many long-lived
> +        * kernel allocations are made.
> +        *
> +        * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
> +        * because this is done early in section_activate()
> +        */
> +       if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
> +               set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> +               cond_resched();
> +       }
> +}
> +
>  void __ref memmap_init_zone_device(struct zone *zone,
>                                    unsigned long start_pfn,
>                                    unsigned long nr_pages,
> @@ -6265,39 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
>         for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>                 struct page *page = pfn_to_page(pfn);
>
> -               __init_single_page(page, pfn, zone_idx, nid);
> -
> -               /*
> -                * Mark page reserved as it will need to wait for onlining
> -                * phase for it to be fully associated with a zone.
> -                *
> -                * We can use the non-atomic __set_bit operation for setting
> -                * the flag as we are still initializing the pages.
> -                */
> -               __SetPageReserved(page);
> -
> -               /*
> -                * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
> -                * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
> -                * ever freed or placed on a driver-private list.
> -                */
> -               page->pgmap = pgmap;
> -               page->zone_device_data = NULL;
> -
> -               /*
> -                * Mark the block movable so that blocks are reserved for
> -                * movable at startup. This will force kernel allocations
> -                * to reserve their blocks rather than leaking throughout
> -                * the address space during boot when many long-lived
> -                * kernel allocations are made.
> -                *
> -                * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
> -                * because this is done early in section_activate()
> -                */
> -               if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
> -                       set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> -                       cond_resched();
> -               }
> +               __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
>         }
>
>         pr_info("%s initialised %lu pages in %ums\n", __func__,
> --
> 2.17.1
>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-04-24  0:12   ` Dan Williams
@ 2021-04-24 19:00     ` Joao Martins
  0 siblings, 0 replies; 64+ messages in thread
From: Joao Martins @ 2021-04-24 19:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton



On 4/24/21 1:12 AM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
>> dax_lock_page()).  For pagemap with compound pages fetch the
>> compound_head in case we are handling a tail page memory failure.
>>
>> Currently this is a nop, but in the advent of compound pages in
>> dev_pagemap it allows memory_failure_dev_pagemap() to keep working.
>>
>> Reported-by: Jane Chu <jane.chu@oracle.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/memory-failure.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 24210c9bd843..94240d772623 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1318,6 +1318,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>>                 goto out;
>>         }
>>
>> +       page = compound_head(page);
> 
> Unless / until we do compound pages for the filesystem-dax case, I
> would add a comment like:
> 
> /* pages instantiated by device-dax (not filesystem-dax) may be
> compound pages */
> 
I've fixed up with the comment.

Thanks!


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts
  2021-04-24  0:16   ` Dan Williams
@ 2021-04-24 19:05     ` Joao Martins
  0 siblings, 0 replies; 64+ messages in thread
From: Joao Martins @ 2021-04-24 19:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton



On 4/24/21 1:16 AM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Split the utility function prep_compound_page() into head and tail
>> counterparts, and use them accordingly.
> 
> To make this patch stand alone better lets add another sentence:
> 
> "This is in preparation for sharing the storage for / deduplicating
> compound page metadata."
> 
Yeap, I've fixed it up.

> Other than that, looks good to me.
> 
/me nods

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/page_alloc.c | 32 +++++++++++++++++++++-----------
>>  1 file changed, 21 insertions(+), 11 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index c53fe4fa10bf..43dd98446b0b 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -692,24 +692,34 @@ void free_compound_page(struct page *page)
>>         __free_pages_ok(page, compound_order(page), FPI_NONE);
>>  }
>>
>> +static void prep_compound_head(struct page *page, unsigned int order)
>> +{
>> +       set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
>> +       set_compound_order(page, order);
>> +       atomic_set(compound_mapcount_ptr(page), -1);
>> +       if (hpage_pincount_available(page))
>> +               atomic_set(compound_pincount_ptr(page), 0);
>> +}
>> +
>> +static void prep_compound_tail(struct page *head, int tail_idx)
>> +{
>> +       struct page *p = head + tail_idx;
>> +
>> +       set_page_count(p, 0);
>> +       p->mapping = TAIL_MAPPING;
>> +       set_compound_head(p, head);
>> +}
>> +
>>  void prep_compound_page(struct page *page, unsigned int order)
>>  {
>>         int i;
>>         int nr_pages = 1 << order;
>>
>>         __SetPageHead(page);
>> -       for (i = 1; i < nr_pages; i++) {
>> -               struct page *p = page + i;
>> -               set_page_count(p, 0);
>> -               p->mapping = TAIL_MAPPING;
>> -               set_compound_head(p, page);
>> -       }
>> +       for (i = 1; i < nr_pages; i++)
>> +               prep_compound_tail(page, i);
>>
>> -       set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
>> -       set_compound_order(page, order);
>> -       atomic_set(compound_mapcount_ptr(page), -1);
>> -       if (hpage_pincount_available(page))
>> -               atomic_set(compound_pincount_ptr(page), 0);
>> +       prep_compound_head(page, order);
>>  }
>>
>>  #ifdef CONFIG_DEBUG_PAGEALLOC
>> --
>> 2.17.1
>>
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init
  2021-04-24  0:18   ` Dan Williams
@ 2021-04-24 19:05     ` Joao Martins
  0 siblings, 0 replies; 64+ messages in thread
From: Joao Martins @ 2021-04-24 19:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton



On 4/24/21 1:18 AM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Move struct page init to a helper function __init_zone_device_page().
> 
> Same sentence addition suggestion from the last patch to make this
> patch have some rationale for existing.
> 
I have fixed this too, with the same message as the previous patch.

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/page_alloc.c | 74 +++++++++++++++++++++++++++----------------------
>>  1 file changed, 41 insertions(+), 33 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 43dd98446b0b..58974067bbd4 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6237,6 +6237,46 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
>>  }
>>
>>  #ifdef CONFIG_ZONE_DEVICE
>> +static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>> +                                         unsigned long zone_idx, int nid,
>> +                                         struct dev_pagemap *pgmap)
>> +{
>> +
>> +       __init_single_page(page, pfn, zone_idx, nid);
>> +
>> +       /*
>> +        * Mark page reserved as it will need to wait for onlining
>> +        * phase for it to be fully associated with a zone.
>> +        *
>> +        * We can use the non-atomic __set_bit operation for setting
>> +        * the flag as we are still initializing the pages.
>> +        */
>> +       __SetPageReserved(page);
>> +
>> +       /*
>> +        * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
>> +        * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
>> +        * ever freed or placed on a driver-private list.
>> +        */
>> +       page->pgmap = pgmap;
>> +       page->zone_device_data = NULL;
>> +
>> +       /*
>> +        * Mark the block movable so that blocks are reserved for
>> +        * movable at startup. This will force kernel allocations
>> +        * to reserve their blocks rather than leaking throughout
>> +        * the address space during boot when many long-lived
>> +        * kernel allocations are made.
>> +        *
>> +        * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
>> +        * because this is done early in section_activate()
>> +        */
>> +       if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
>> +               set_pageblock_migratetype(page, MIGRATE_MOVABLE);
>> +               cond_resched();
>> +       }
>> +}
>> +
>>  void __ref memmap_init_zone_device(struct zone *zone,
>>                                    unsigned long start_pfn,
>>                                    unsigned long nr_pages,
>> @@ -6265,39 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>         for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>>                 struct page *page = pfn_to_page(pfn);
>>
>> -               __init_single_page(page, pfn, zone_idx, nid);
>> -
>> -               /*
>> -                * Mark page reserved as it will need to wait for onlining
>> -                * phase for it to be fully associated with a zone.
>> -                *
>> -                * We can use the non-atomic __set_bit operation for setting
>> -                * the flag as we are still initializing the pages.
>> -                */
>> -               __SetPageReserved(page);
>> -
>> -               /*
>> -                * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
>> -                * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
>> -                * ever freed or placed on a driver-private list.
>> -                */
>> -               page->pgmap = pgmap;
>> -               page->zone_device_data = NULL;
>> -
>> -               /*
>> -                * Mark the block movable so that blocks are reserved for
>> -                * movable at startup. This will force kernel allocations
>> -                * to reserve their blocks rather than leaking throughout
>> -                * the address space during boot when many long-lived
>> -                * kernel allocations are made.
>> -                *
>> -                * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
>> -                * because this is done early in section_activate()
>> -                */
>> -               if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
>> -                       set_pageblock_migratetype(page, MIGRATE_MOVABLE);
>> -                       cond_resched();
>> -               }
>> +               __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
>>         }
>>
>>         pr_info("%s initialised %lu pages in %ums\n", __func__,
>> --
>> 2.17.1
>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-03-25 23:09 ` [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
@ 2021-05-05 18:44   ` Dan Williams
  2021-05-05 18:58     ` Matthew Wilcox
  2021-05-05 19:49     ` Joao Martins
  0 siblings, 2 replies; 64+ messages in thread
From: Dan Williams @ 2021-05-05 18:44 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Add a new align property for struct dev_pagemap which specifies that a
> pagemap is composed of a set of compound pages of size @align, instead
> of base pages. When these pages are initialised, most are initialised as
> tail pages instead of order-0 pages.
>
> For certain ZONE_DEVICE users like device-dax which have a fixed page
> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
> treating it the same way as THP or hugetlb pages.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/linux/memremap.h | 13 +++++++++++++
>  mm/memremap.c            |  8 ++++++--
>  mm/page_alloc.c          | 24 +++++++++++++++++++++++-
>  3 files changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index b46f63dcaed3..bb28d82dda5e 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -114,6 +114,7 @@ struct dev_pagemap {
>         struct completion done;
>         enum memory_type type;
>         unsigned int flags;
> +       unsigned long align;

I think this wants some kernel-doc above to indicate that non-zero
means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
means "use non-compound base pages". The non-zero value must be
PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE. Hmm, maybe it should be an
enum:

enum devmap_geometry {
    DEVMAP_PTE,
    DEVMAP_PMD,
    DEVMAP_PUD,
}

...because it's more than just an alignment it's a structural
definition of how the memmap is laid out.

>         const struct dev_pagemap_ops *ops;
>         void *owner;
>         int nr_range;
> @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
>         return NULL;
>  }
>
> +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
> +{
> +       if (!pgmap || !pgmap->align)
> +               return PAGE_SIZE;
> +       return pgmap->align;
> +}
> +
> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
> +{
> +       return PHYS_PFN(pgmap_align(pgmap));
> +}
> +
>  #ifdef CONFIG_ZONE_DEVICE
>  bool pfn_zone_device_reserved(unsigned long pfn);
>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 805d761740c4..d160853670c4 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>         memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>                                 PHYS_PFN(range->start),
>                                 PHYS_PFN(range_len(range)), pgmap);
> -       percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> -                       - pfn_first(pgmap, range_id));
> +       if (pgmap_align(pgmap) > PAGE_SIZE)
> +               percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> +                       - pfn_first(pgmap, range_id)) / pgmap_pfn_align(pgmap));
> +       else
> +               percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> +                               - pfn_first(pgmap, range_id));
>         return 0;
>
>  err_add_memory:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 58974067bbd4..3a77f9e43f3a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
>         unsigned long pfn, end_pfn = start_pfn + nr_pages;
>         struct pglist_data *pgdat = zone->zone_pgdat;
>         struct vmem_altmap *altmap = pgmap_altmap(pgmap);
> +       unsigned int pfn_align = pgmap_pfn_align(pgmap);
> +       unsigned int order_align = order_base_2(pfn_align);
>         unsigned long zone_idx = zone_idx(zone);
>         unsigned long start = jiffies;
>         int nid = pgdat->node_id;
> @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
>                 nr_pages = end_pfn - start_pfn;
>         }
>
> -       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> +       for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {

pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>=
PAGE_SHIFT" I missed somewhere?


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-05 18:44   ` Dan Williams
@ 2021-05-05 18:58     ` Matthew Wilcox
  2021-05-05 19:49     ` Joao Martins
  1 sibling, 0 replies; 64+ messages in thread
From: Matthew Wilcox @ 2021-05-05 18:58 UTC (permalink / raw)
  To: Dan Williams
  Cc: Joao Martins, Linux MM, Ira Weiny, linux-nvdimm, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On Wed, May 05, 2021 at 11:44:29AM -0700, Dan Williams wrote:
> > @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
> >         unsigned long pfn, end_pfn = start_pfn + nr_pages;
> >         struct pglist_data *pgdat = zone->zone_pgdat;
> >         struct vmem_altmap *altmap = pgmap_altmap(pgmap);
> > +       unsigned int pfn_align = pgmap_pfn_align(pgmap);
> > +       unsigned int order_align = order_base_2(pfn_align);
> >         unsigned long zone_idx = zone_idx(zone);
> >         unsigned long start = jiffies;
> >         int nid = pgdat->node_id;
> > @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
> >                 nr_pages = end_pfn - start_pfn;
> >         }
> >
> > -       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > +       for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
> 
> pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>=
> PAGE_SHIFT" I missed somewhere?

If something is measured in bytes, I like to use size_t (if it's
in memory) and loff_t (if it's on storage).  The compiler doesn't do
anything useful to warn you, but it's a nice indication to humans about
what's going on.  And it removes the temptation to do 'pfn_align >>=
PAGE_SHIFT' and suddenly take pfn_align from being measured in bytes to
being measured in pages.
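
i.e. something like (purely illustrative):

	size_t map_size = pgmap_align(pgmap);		/* bytes */
	unsigned long pfn_align = PHYS_PFN(map_size);	/* pages */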



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-05 18:44   ` Dan Williams
  2021-05-05 18:58     ` Matthew Wilcox
@ 2021-05-05 19:49     ` Joao Martins
  2021-05-05 22:20       ` Dan Williams
  1 sibling, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-05-05 19:49 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton



On 5/5/21 7:44 PM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Add a new align property for struct dev_pagemap which specifies that a
>> pagemap is composed of a set of compound pages of size @align, instead
>> of base pages. When these pages are initialised, most are initialised as
>> tail pages instead of order-0 pages.
>>
>> For certain ZONE_DEVICE users like device-dax which have a fixed page
>> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
>> treating it the same way as THP or hugetlb pages.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  include/linux/memremap.h | 13 +++++++++++++
>>  mm/memremap.c            |  8 ++++++--
>>  mm/page_alloc.c          | 24 +++++++++++++++++++++++-
>>  3 files changed, 42 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index b46f63dcaed3..bb28d82dda5e 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>         struct completion done;
>>         enum memory_type type;
>>         unsigned int flags;
>> +       unsigned long align;
> 
> I think this wants some kernel-doc above to indicate that non-zero
> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
> means "use non-compound base pages". 

Got it. Are you thinking a kernel_doc on top of the variable above, or preferably on top
of pgmap_align()?

> The non-zero value must be
> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE. 
> Hmm, maybe it should be an
> enum:
> 
> enum devmap_geometry {
>     DEVMAP_PTE,
>     DEVMAP_PMD,
>     DEVMAP_PUD,
> }
> 
I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
the whole dax/nvdimm align values change meanwhile (as a followup improvement)?

Although to be fair we only ever care about compound page size in this series (and
similarly dax/nvdimm @align properties).
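
Something like this (rough, untested sketch, naming aside) is what I
would have in mind for such a converter:

static inline unsigned long geometry_size(enum devmap_geometry geometry)
{
	/* DEVMAP_PTE (or an unset geometry) keeps today's base page behaviour */
	switch (geometry) {
	case DEVMAP_PMD:
		return PMD_SIZE;
	case DEVMAP_PUD:
		return PUD_SIZE;
	case DEVMAP_PTE:
	default:
		return PAGE_SIZE;
	}
}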

> ...because it's more than just an alignment it's a structural
> definition of how the memmap is laid out.
> 
>>         const struct dev_pagemap_ops *ops;
>>         void *owner;
>>         int nr_range;
>> @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
>>         return NULL;
>>  }
>>
>> +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
>> +{
>> +       if (!pgmap || !pgmap->align)
>> +               return PAGE_SIZE;
>> +       return pgmap->align;
>> +}
>> +
>> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
>> +{
>> +       return PHYS_PFN(pgmap_align(pgmap));
>> +}
>> +
>>  #ifdef CONFIG_ZONE_DEVICE
>>  bool pfn_zone_device_reserved(unsigned long pfn);
>>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 805d761740c4..d160853670c4 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>>         memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>>                                 PHYS_PFN(range->start),
>>                                 PHYS_PFN(range_len(range)), pgmap);
>> -       percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>> -                       - pfn_first(pgmap, range_id));
>> +       if (pgmap_align(pgmap) > PAGE_SIZE)
>> +               percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
>> +                       - pfn_first(pgmap, range_id)) / pgmap_pfn_align(pgmap));
>> +       else
>> +               percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>> +                               - pfn_first(pgmap, range_id));
>>         return 0;
>>
>>  err_add_memory:
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 58974067bbd4..3a77f9e43f3a 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>         unsigned long pfn, end_pfn = start_pfn + nr_pages;
>>         struct pglist_data *pgdat = zone->zone_pgdat;
>>         struct vmem_altmap *altmap = pgmap_altmap(pgmap);
>> +       unsigned int pfn_align = pgmap_pfn_align(pgmap);
>> +       unsigned int order_align = order_base_2(pfn_align);
>>         unsigned long zone_idx = zone_idx(zone);
>>         unsigned long start = jiffies;
>>         int nid = pgdat->node_id;
>> @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>                 nr_pages = end_pfn - start_pfn;
>>         }
>>
>> -       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>> +       for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
> 
> pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>=
> PAGE_SHIFT" I missed somewhere?
> 
@pfn_align is in pages too. It's pgmap_align() which is in bytes:

+static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
+{
+       return PHYS_PFN(pgmap_align(pgmap));
+}


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-05 19:49     ` Joao Martins
@ 2021-05-05 22:20       ` Dan Williams
  2021-05-05 22:36         ` Joao Martins
  2021-05-06  8:05         ` Aneesh Kumar K.V
  0 siblings, 2 replies; 64+ messages in thread
From: Dan Williams @ 2021-05-05 22:20 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
>
>
> On 5/5/21 7:44 PM, Dan Williams wrote:
> > On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> Add a new align property for struct dev_pagemap which specifies that a
> >> pagemap is composed of a set of compound pages of size @align, instead
> >> of base pages. When these pages are initialised, most are initialised as
> >> tail pages instead of order-0 pages.
> >>
> >> For certain ZONE_DEVICE users like device-dax which have a fixed page
> >> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
> >> treating it the same way as THP or hugetlb pages.
> >>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >>  include/linux/memremap.h | 13 +++++++++++++
> >>  mm/memremap.c            |  8 ++++++--
> >>  mm/page_alloc.c          | 24 +++++++++++++++++++++++-
> >>  3 files changed, 42 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> >> index b46f63dcaed3..bb28d82dda5e 100644
> >> --- a/include/linux/memremap.h
> >> +++ b/include/linux/memremap.h
> >> @@ -114,6 +114,7 @@ struct dev_pagemap {
> >>         struct completion done;
> >>         enum memory_type type;
> >>         unsigned int flags;
> >> +       unsigned long align;
> >
> > I think this wants some kernel-doc above to indicate that non-zero
> > means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
> > means "use non-compound base pages".
>
> Got it. Are you thinking a kernel_doc on top of the variable above, or preferably on top
> of pgmap_align()?

I was thinking in dev_pagemap because this value is more than just a
plain alignment its restructuring the layout and construction of the
memmap(), but when I say it that way it seems much less like a vanilla
align value compared to a construction / geometry mode setting.

>
> > The non-zero value must be
> > PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
> > Hmm, maybe it should be an
> > enum:
> >
> > enum devmap_geometry {
> >     DEVMAP_PTE,
> >     DEVMAP_PMD,
> >     DEVMAP_PUD,
> > }
> >
> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?

I think it is ok for dax/nvdimm to continue to maintain their align
value because it should be ok to have 4MB align if the device really
wanted. However, when it goes to map that alignment with
memremap_pages() it can pick a mode. For example, it's already the
case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
they're already separate concepts that can stay separate.
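
i.e. something like the below in device-dax (hypothetical 'geometry'
field, untested):

	if (dev_dax->align == PUD_SIZE)
		pgmap->geometry = DEVMAP_PUD;
	else if (dev_dax->align == PMD_SIZE)
		pgmap->geometry = DEVMAP_PMD;
	else
		pgmap->geometry = DEVMAP_PTE;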

>
> Although to be fair we only ever care about compound page size in this series (and
> similarly dax/nvdimm @align properties).
>
> > ...because it's more than just an alignment it's a structural
> > definition of how the memmap is laid out.
> >
> >>         const struct dev_pagemap_ops *ops;
> >>         void *owner;
> >>         int nr_range;
> >> @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
> >>         return NULL;
> >>  }
> >>
> >> +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
> >> +{
> >> +       if (!pgmap || !pgmap->align)
> >> +               return PAGE_SIZE;
> >> +       return pgmap->align;
> >> +}
> >> +
> >> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
> >> +{
> >> +       return PHYS_PFN(pgmap_align(pgmap));
> >> +}
> >> +
> >>  #ifdef CONFIG_ZONE_DEVICE
> >>  bool pfn_zone_device_reserved(unsigned long pfn);
> >>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> >> diff --git a/mm/memremap.c b/mm/memremap.c
> >> index 805d761740c4..d160853670c4 100644
> >> --- a/mm/memremap.c
> >> +++ b/mm/memremap.c
> >> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> >>         memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> >>                                 PHYS_PFN(range->start),
> >>                                 PHYS_PFN(range_len(range)), pgmap);
> >> -       percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> >> -                       - pfn_first(pgmap, range_id));
> >> +       if (pgmap_align(pgmap) > PAGE_SIZE)
> >> +               percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> >> +                       - pfn_first(pgmap, range_id)) / pgmap_pfn_align(pgmap));
> >> +       else
> >> +               percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> >> +                               - pfn_first(pgmap, range_id));
> >>         return 0;
> >>
> >>  err_add_memory:
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 58974067bbd4..3a77f9e43f3a 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
> >>         unsigned long pfn, end_pfn = start_pfn + nr_pages;
> >>         struct pglist_data *pgdat = zone->zone_pgdat;
> >>         struct vmem_altmap *altmap = pgmap_altmap(pgmap);
> >> +       unsigned int pfn_align = pgmap_pfn_align(pgmap);
> >> +       unsigned int order_align = order_base_2(pfn_align);
> >>         unsigned long zone_idx = zone_idx(zone);
> >>         unsigned long start = jiffies;
> >>         int nid = pgdat->node_id;
> >> @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
> >>                 nr_pages = end_pfn - start_pfn;
> >>         }
> >>
> >> -       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> >> +       for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
> >
> > pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>=
> > PAGE_SHIFT" I missed somewhere?
> >
> @pfn_align is in pages too. It's pgmap_align() which is in bytes:
>
> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
> +{
> +       return PHYS_PFN(pgmap_align(pgmap));
> +}

Ah yup, my eyes glazed over that. I think this is another place that
benefits from a more specific name than "align". "pfns_per_compound"
"compound_pfns"?


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation
  2021-03-25 23:09 ` [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
@ 2021-05-05 22:34   ` Dan Williams
  2021-05-05 22:37     ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Dan Williams @ 2021-05-05 22:34 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton, Oscar Salvador

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> @altmap is stored in a dev_pagemap, but it will be repurposed for
> hotplug memory for storing the memmap in the hotplugged memory[*] and
> reusing the altmap infrastructure to that end. This is to say that
> altmap can't be replaced with a @pgmap as it is going to cover more than
> dev_pagemap backend altmaps.

I was going to say, just pass the pgmap and lookup the altmap from
pgmap, but Oscar added a use case for altmap independent of pgmap. So
you might refresh this commit message to clarify why passing pgmap by
itself is not sufficient.
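
i.e. the simple version would have just been:

	struct vmem_altmap *altmap = pgmap_altmap(pgmap);

...but that doesn't cover the altmap user that has no pgmap.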

Other than that, looks good.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-05 22:20       ` Dan Williams
@ 2021-05-05 22:36         ` Joao Martins
  2021-05-05 23:03           ` Dan Williams
  2021-05-18 17:27           ` Joao Martins
  2021-05-06  8:05         ` Aneesh Kumar K.V
  1 sibling, 2 replies; 64+ messages in thread
From: Joao Martins @ 2021-05-05 22:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On 5/5/21 11:20 PM, Dan Williams wrote:
> On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 5/5/21 7:44 PM, Dan Williams wrote:
>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>
>>>> Add a new align property for struct dev_pagemap which specifies that a
>>>> pagemap is composed of a set of compound pages of size @align, instead
>>>> of base pages. When these pages are initialised, most are initialised as
>>>> tail pages instead of order-0 pages.
>>>>
>>>> For certain ZONE_DEVICE users like device-dax which have a fixed page
>>>> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
>>>> treating it the same way as THP or hugetlb pages.
>>>>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> ---
>>>>  include/linux/memremap.h | 13 +++++++++++++
>>>>  mm/memremap.c            |  8 ++++++--
>>>>  mm/page_alloc.c          | 24 +++++++++++++++++++++++-
>>>>  3 files changed, 42 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>> index b46f63dcaed3..bb28d82dda5e 100644
>>>> --- a/include/linux/memremap.h
>>>> +++ b/include/linux/memremap.h
>>>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>>>         struct completion done;
>>>>         enum memory_type type;
>>>>         unsigned int flags;
>>>> +       unsigned long align;
>>>
>>> I think this wants some kernel-doc above to indicate that non-zero
>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
>>> means "use non-compound base pages".
>>
>> Got it. Are you thinking a kernel_doc on top of the variable above, or preferably on top
>> of pgmap_align()?
> 
> I was thinking in dev_pagemap because this value is more than just a
> plain alignment its restructuring the layout and construction of the
> memmap(), but when I say it that way it seems much less like a vanilla
> align value compared to a construction / geometry mode setting.
> 
/me nods

>>
>>> The non-zero value must be
>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
>>> Hmm, maybe it should be an
>>> enum:
>>>
>>> enum devmap_geometry {
>>>     DEVMAP_PTE,
>>>     DEVMAP_PMD,
>>>     DEVMAP_PUD,
>>> }
>>>
>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
> 
> I think it is ok for dax/nvdimm to continue to maintain their align
> value because it should be ok to have 4MB align if the device really
> wanted. However, when it goes to map that alignment with
> memremap_pages() it can pick a mode. For example, it's already the
> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
> they're already separate concepts that can stay separate.
> 
Gotcha.

>>
>> Although to be fair we only ever care about compound page size in this series (and
>> similarly dax/nvdimm @align properties).
>>
>>> ...because it's more than just an alignment it's a structural
>>> definition of how the memmap is laid out.
>>>

Somehow I missed this other part of your response.

>>>>         const struct dev_pagemap_ops *ops;
>>>>         void *owner;
>>>>         int nr_range;
>>>> @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
>>>>         return NULL;
>>>>  }
>>>>
>>>> +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
>>>> +{
>>>> +       if (!pgmap || !pgmap->align)
>>>> +               return PAGE_SIZE;
>>>> +       return pgmap->align;
>>>> +}
>>>> +
>>>> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
>>>> +{
>>>> +       return PHYS_PFN(pgmap_align(pgmap));
>>>> +}
>>>> +
>>>>  #ifdef CONFIG_ZONE_DEVICE
>>>>  bool pfn_zone_device_reserved(unsigned long pfn);
>>>>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>> index 805d761740c4..d160853670c4 100644
>>>> --- a/mm/memremap.c
>>>> +++ b/mm/memremap.c
>>>> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>>>>         memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>>>>                                 PHYS_PFN(range->start),
>>>>                                 PHYS_PFN(range_len(range)), pgmap);
>>>> -       percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>>>> -                       - pfn_first(pgmap, range_id));
>>>> +       if (pgmap_align(pgmap) > PAGE_SIZE)
>>>> +               percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
>>>> +                       - pfn_first(pgmap, range_id)) / pgmap_pfn_align(pgmap));
>>>> +       else
>>>> +               percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>>>> +                               - pfn_first(pgmap, range_id));
>>>>         return 0;
>>>>
>>>>  err_add_memory:
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 58974067bbd4..3a77f9e43f3a 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>>>         unsigned long pfn, end_pfn = start_pfn + nr_pages;
>>>>         struct pglist_data *pgdat = zone->zone_pgdat;
>>>>         struct vmem_altmap *altmap = pgmap_altmap(pgmap);
>>>> +       unsigned int pfn_align = pgmap_pfn_align(pgmap);
>>>> +       unsigned int order_align = order_base_2(pfn_align);
>>>>         unsigned long zone_idx = zone_idx(zone);
>>>>         unsigned long start = jiffies;
>>>>         int nid = pgdat->node_id;
>>>> @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>>>                 nr_pages = end_pfn - start_pfn;
>>>>         }
>>>>
>>>> -       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>>>> +       for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
>>>
>>> pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>=
>>> PAGE_SHIFT" I missed somewhere?
>>>
>> @pfn_align is in pages too. It's pgmap_align() which is in bytes:
>>
>> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
>> +{
>> +       return PHYS_PFN(pgmap_align(pgmap));
>> +}
> 
> Ah yup, my eyes glazed over that. I think this is another place that
> benefits from a more specific name than "align". "pfns_per_compound"
> "compound_pfns"?
> 
We are still describing a page, just not a base page. So perhaps @pfns_per_hpage ?

I am fine with @pfns_per_compound or @compound_pfns as well.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation
  2021-05-05 22:34   ` Dan Williams
@ 2021-05-05 22:37     ` Joao Martins
  2021-05-05 23:14       ` Dan Williams
  0 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-05-05 22:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton, Oscar Salvador



On 5/5/21 11:34 PM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> @altmap is stored in a dev_pagemap, but it will be repurposed for
>> hotplug memory for storing the memmap in the hotplugged memory[*] and
>> reusing the altmap infrastructure to that end. This is to say that
>> altmap can't be replaced with a @pgmap as it is going to cover more than
>> dev_pagemap backend altmaps.
> 
> I was going to say, just pass the pgmap and lookup the altmap from
> pgmap, but Oscar added a use case for altmap independent of pgmap. So
> you might refresh this commit message to clarify why passing pgmap by
> itself is not sufficient.
> 
Isn't that what I am doing above with that exact paragraph? I even
reference his series at the end of commit description :) in [*]

> Other than that, looks good.
> 
/me nods


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
  2021-03-25 23:09 ` [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages() Joao Martins
@ 2021-05-05 22:43   ` Dan Williams
  2021-05-06 10:27     ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Dan Williams @ 2021-05-05 22:43 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

I suspect it's a good sign I'm only finding cosmetic and changelog
changes in the review... I have some more:

A year from now, if I'm tracking down a problem and looking through mm
commits, I would appreciate a subject line like the following:
"refactor core of vmemmap_populate_basepages() to helper" that gives
an idea of the impact and side effects of the change.

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>

I would add a lead in phrase like: "In preparation for describing a
memmap with compound pages, move the actual..."

> Move the actual pte population logic into a separate function
> vmemmap_populate_address() and have vmemmap_populate_basepages()
> walk through all base pages it needs to populate.

Aside from the above, looks good.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/sparse-vmemmap.c | 44 ++++++++++++++++++++++++++------------------
>  1 file changed, 26 insertions(+), 18 deletions(-)
>
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 370728c206ee..8814876edcfa 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -216,33 +216,41 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>         return pgd;
>  }
>
> -int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> -                                        int node, struct vmem_altmap *altmap)
> +static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> +                                             struct vmem_altmap *altmap)
>  {
> -       unsigned long addr = start;
>         pgd_t *pgd;
>         p4d_t *p4d;
>         pud_t *pud;
>         pmd_t *pmd;
>         pte_t *pte;
>
> +       pgd = vmemmap_pgd_populate(addr, node);
> +       if (!pgd)
> +               return -ENOMEM;
> +       p4d = vmemmap_p4d_populate(pgd, addr, node);
> +       if (!p4d)
> +               return -ENOMEM;
> +       pud = vmemmap_pud_populate(p4d, addr, node);
> +       if (!pud)
> +               return -ENOMEM;
> +       pmd = vmemmap_pmd_populate(pud, addr, node);
> +       if (!pmd)
> +               return -ENOMEM;
> +       pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> +       if (!pte)
> +               return -ENOMEM;
> +       vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
> +}
> +
> +int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> +                                        int node, struct vmem_altmap *altmap)
> +{
> +       unsigned long addr = start;
> +
>         for (; addr < end; addr += PAGE_SIZE) {
> -               pgd = vmemmap_pgd_populate(addr, node);
> -               if (!pgd)
> -                       return -ENOMEM;
> -               p4d = vmemmap_p4d_populate(pgd, addr, node);
> -               if (!p4d)
> -                       return -ENOMEM;
> -               pud = vmemmap_pud_populate(p4d, addr, node);
> -               if (!pud)
> -                       return -ENOMEM;
> -               pmd = vmemmap_pmd_populate(pud, addr, node);
> -               if (!pmd)
> -                       return -ENOMEM;
> -               pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> -               if (!pte)
> +               if (vmemmap_populate_address(addr, node, altmap))
>                         return -ENOMEM;
> -               vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
>         }
>
>         return 0;
> --
> 2.17.1
>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-05 22:36         ` Joao Martins
@ 2021-05-05 23:03           ` Dan Williams
  2021-05-06 10:12             ` Joao Martins
  2021-05-18 17:27           ` Joao Martins
  1 sibling, 1 reply; 64+ messages in thread
From: Dan Williams @ 2021-05-05 23:03 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On Wed, May 5, 2021 at 3:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
> > Ah yup, my eyes glazed over that. I think this is another place that
> > benefits from a more specific name than "align". "pfns_per_compound"
> > "compound_pfns"?
> >
> We are still describing a page, just not a base page. So perhaps @pfns_per_hpage ?
>
> I am fine with @pfns_per_compound or @compound_pfns as well.

My only concern about hpage is that hpage implies PMD, where compound
is generic across PMD and PUD.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation
  2021-05-05 22:37     ` Joao Martins
@ 2021-05-05 23:14       ` Dan Williams
  2021-05-06 10:24         ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Dan Williams @ 2021-05-05 23:14 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton, Oscar Salvador

On Wed, May 5, 2021 at 3:38 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
>
>
> On 5/5/21 11:34 PM, Dan Williams wrote:
> > On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> @altmap is stored in a dev_pagemap, but it will be repurposed for
> >> hotplug memory for storing the memmap in the hotplugged memory[*] and
> >> reusing the altmap infrastructure to that end. This is to say that
> >> altmap can't be replaced with a @pgmap as it is going to cover more than
> >> dev_pagemap backend altmaps.
> >
> > I was going to say, just pass the pgmap and lookup the altmap from
> > pgmap, but Oscar added a use case for altmap independent of pgmap. So
> > you might refresh this commit message to clarify why passing pgmap by
> > itself is not sufficient.
> >
> Isn't that what I am doing above with that exact paragraph? I even
> reference his series at the end of commit description :) in [*]

Oh, sorry, it didn't hit me explicitly that you were talking about
Oscar's work; I thought you were referring to your own changes in this
set. I see it now... at a minimum the tense needs updating since
Oscar's changes are in the past not the future anymore. If it helps,
the following reads more direct to me: "In support of using compound
pages for devmap mappings, plumb the pgmap down to the
vmemmap_populate* implementation. Note that while altmap is
retrievable from pgmap, the memory hotplug code passes altmap without
pgmap, so both need to be independently plumbed."


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
  2021-03-25 23:09 ` [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
@ 2021-05-06  1:18   ` Dan Williams
  2021-05-06 11:01     ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Dan Williams @ 2021-05-06  1:18 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> A compound pagemap is a dev_pagemap with @align > PAGE_SIZE and it
> means that pages are mapped at a given huge page alignment and utilize
> uses compound pages as opposed to order-0 pages.
>
> To minimize struct page overhead we take advantage of the fact that

I'm not sure if Andrew is a member of the "we" police, but I am on
team "imperative tense".

"Take advantage of the fact that most tail pages look the same (except
the first two) to minimize struct page overhead."

...I think all the other "we"s below can be dropped without losing any meaning.

> most tail pages look the same (except the first two). We allocate a
> separate page for the vmemmap area which contains the head page and
> separate for the next 64 pages. The rest of the subsections then reuse
> this tail vmemmap page to initialize the rest of the tail pages.
>
> Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
> when initializing compound pagemap with big enough @align (e.g. 1G
> PUD) it  will cross various sections. To be able to reuse tail pages
> across sections belonging to the same gigantic page we fetch the
> @range being mapped (nr_ranges + 1).  If the section being mapped is
> not offset 0 of the @align, then lookup the PFN of the struct page
> address that preceeds it and use that to populate the entire

s/preceeds/precedes/

> section.
>
> On compound pagemaps with 2M align, this lets mechanism saves 6 pages

s/lets mechanism saves 6 pages/this mechanism lets 6 pages be saved/

> out of the 8 necessary PFNs necessary to set the subsection's 512
> struct pages being mapped. On a 1G compound pagemap it saves
> 4094 pages.
>
> Altmap isn't supported yet, given various restrictions in altmap pfn
> allocator, thus fallback to the already in use vmemmap_populate().

Perhaps also note that altmap for devmap mappings was there to relieve
the pressure of inordinate amounts of memmap space to map terabytes of
PMEM. With compound pages the motivation for altmaps for PMEM is
reduced.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/linux/mm.h  |   2 +-
>  mm/memremap.c       |   1 +
>  mm/sparse-vmemmap.c | 139 ++++++++++++++++++++++++++++++++++++++++----
>  3 files changed, 131 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 61474602c2b1..49d717ae40ae 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3040,7 +3040,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>  pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> -                           struct vmem_altmap *altmap);
> +                           struct vmem_altmap *altmap, void *block);
>  void *vmemmap_alloc_block(unsigned long size, int node);
>  struct vmem_altmap;
>  void *vmemmap_alloc_block_buf(unsigned long size, int node,
> diff --git a/mm/memremap.c b/mm/memremap.c
> index d160853670c4..2e6bc0b1ff00 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>  {
>         struct mhp_params params = {
>                 .altmap = pgmap_altmap(pgmap),
> +               .pgmap = pgmap,
>                 .pgprot = PAGE_KERNEL,
>         };
>         const int nr_range = pgmap->nr_range;
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 8814876edcfa..f57c5eada099 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
>  }
>
>  pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> -                                      struct vmem_altmap *altmap)
> +                                      struct vmem_altmap *altmap, void *block)

For type-safety I would make this argument a 'struct page *' and drop
the virt_to_page().

>  {
>         pte_t *pte = pte_offset_kernel(pmd, addr);
>         if (pte_none(*pte)) {
>                 pte_t entry;
> -               void *p;
> -
> -               p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> -               if (!p)
> -                       return NULL;
> +               void *p = block;
> +
> +               if (!block) {
> +                       p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> +                       if (!p)
> +                               return NULL;
> +               } else if (!altmap) {
> +                       get_page(virt_to_page(block));

This is either super subtle, or there is a missing put_page(). I'm
assuming that in the shutdown path the same page will be freed
multiple times because the same page is mapped multiple times.

Have you validated that the accounting is correct and all allocated
pages get freed? I feel this needs more than a comment, it needs a
test to validate that the accounting continues to happen correctly as
future vmemmap changes land.

> +               }
>                 entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
>                 set_pte_at(&init_mm, addr, pte, entry);
>         }
> @@ -217,7 +221,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>  }
>
>  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> -                                             struct vmem_altmap *altmap)
> +                                             struct vmem_altmap *altmap,
> +                                             void *page, void **ptr)
>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
> @@ -237,10 +242,14 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>         pmd = vmemmap_pmd_populate(pud, addr, node);
>         if (!pmd)
>                 return -ENOMEM;
> -       pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> +       pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
>         if (!pte)
>                 return -ENOMEM;
>         vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
> +
> +       if (ptr)
> +               *ptr = __va(__pfn_to_phys(pte_pfn(*pte)));
> +       return 0;
>  }
>
>  int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> @@ -249,7 +258,110 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>         unsigned long addr = start;
>
>         for (; addr < end; addr += PAGE_SIZE) {
> -               if (vmemmap_populate_address(addr, node, altmap))
> +               if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static int __meminit vmemmap_populate_range(unsigned long start,
> +                                           unsigned long end,
> +                                           int node, void *page)
> +{
> +       unsigned long addr = start;
> +
> +       for (; addr < end; addr += PAGE_SIZE) {
> +               if (vmemmap_populate_address(addr, node, NULL, page, NULL))
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
> +                                                 void **ptr)
> +{
> +       return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
> +}
> +
> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)

I think this can be replaced with a call to follow_pte(&init_mm...).

> +{
> +       pgd_t *pgd;
> +       p4d_t *p4d;
> +       pud_t *pud;
> +       pmd_t *pmd;
> +       pte_t *pte;
> +
> +       pgd = pgd_offset_k(addr);
> +       if (pgd_none(*pgd))
> +               return NULL;
> +
> +       p4d = p4d_offset(pgd, addr);
> +       if (p4d_none(*p4d))
> +               return NULL;
> +
> +       pud = pud_offset(p4d, addr);
> +       if (pud_none(*pud))
> +               return NULL;
> +
> +       pmd = pmd_offset(pud, addr);
> +       if (pmd_none(*pmd))
> +               return NULL;
> +
> +       pte = pte_offset_kernel(pmd, addr);
> +       if (pte_none(*pte))
> +               return NULL;
> +
> +       return pte;
> +}
> +
> +static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
> +                                                    unsigned long start,
> +                                                    unsigned long end, int node,
> +                                                    struct dev_pagemap *pgmap)
> +{
> +       unsigned long offset, size, addr;
> +
> +       /*
> +        * For compound pages bigger than section size (e.g. 1G) fill the rest

Oh, I had to read this a couple times because I thought the "e.g." was
referring to section size. How about:

s/(e.g. 1G)/(e.g. x86 1G compound pages with 2M subsection size)/

> +        * of sections as tail pages.
> +        *
> +        * Note that memremap_pages() resets @nr_range value and will increment
> +        * it after each range successful onlining. Thus the value or @nr_range
> +        * at section memmap populate corresponds to the in-progress range
> +        * being onlined that we care about.
> +        */
> +       offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
> +       if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
> +           pgmap_align(pgmap) > SUBSECTION_SIZE) {
> +               pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
> +
> +               if (!ptep)
> +                       return -ENOMEM;

Side note: I had been resistant to adding 'struct page' for altmap
pages, but that would be a requirement to enable compound pages +
altmap.

> +

Perhaps a comment here to the effect "recall the page that was
allocated in a prior iteration and fill it in with tail pages".

> +               return vmemmap_populate_range(start, end, node,
> +                                             page_to_virt(pte_page(*ptep)));

If the @block parameter changes to a 'struct page *' it also saves
some typing here.

> +       }
> +
> +       size = min(end - start, pgmap_pfn_align(pgmap) * sizeof(struct page));
> +       for (addr = start; addr < end; addr += size) {
> +               unsigned long next = addr, last = addr + size;
> +               void *block;
> +
> +               /* Populate the head page vmemmap page */
> +               if (vmemmap_populate_page(addr, node, NULL))
> +                       return -ENOMEM;
> +
> +               /* Populate the tail pages vmemmap page */
> +               block = NULL;
> +               next = addr + PAGE_SIZE;
> +               if (vmemmap_populate_page(next, node, &block))
> +                       return -ENOMEM;
> +
> +               /* Reuse the previous page for the rest of tail pages */
> +               next += PAGE_SIZE;
> +               if (vmemmap_populate_range(next, last, node, block))
>                         return -ENOMEM;
>         }
>
> @@ -262,12 +374,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>  {
>         unsigned long start = (unsigned long) pfn_to_page(pfn);
>         unsigned long end = start + nr_pages * sizeof(struct page);
> +       unsigned int align = pgmap_align(pgmap);
> +       int r;
>
>         if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
>                 !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>                 return NULL;
>
> -       if (vmemmap_populate(start, end, nid, altmap))
> +       if (align > PAGE_SIZE && !altmap)

I also think this code will read better with the devmap_geometry enum.

> +               r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
> +       else
> +               r = vmemmap_populate(start, end, nid, altmap);
> +
> +       if (r < 0)
>                 return NULL;
>
>         return pfn_to_page(pfn);
> --
> 2.17.1
>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-05 22:20       ` Dan Williams
  2021-05-05 22:36         ` Joao Martins
@ 2021-05-06  8:05         ` Aneesh Kumar K.V
  2021-05-06 10:23           ` Joao Martins
  1 sibling, 1 reply; 64+ messages in thread
From: Aneesh Kumar K.V @ 2021-05-06  8:05 UTC (permalink / raw)
  To: Dan Williams, Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton



IIUC this series is about devdax namespaces with align of 1G or 2M where we can
save the vmemmap space by not allocating memory for tail struct pages?

Dan Williams <dan.j.williams@intel.com> writes:

> > enum:
>> >
>> > enum devmap_geometry {
>> >     DEVMAP_PTE,
>> >     DEVMAP_PMD,
>> >     DEVMAP_PUD,
>> > }
>> >
>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
>
> I think it is ok for dax/nvdimm to continue to maintain their align
> value because it should be ok to have 4MB align if the device really
> wanted. However, when it goes to map that alignment with
> memremap_pages() it can pick a mode. For example, it's already the
> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
> they're already separate concepts that can stay separate.

devdax namespace with align of 1G implies we expect to map them with 1G
pte entries? I didn't follow when you say we map them today with
DEVMAP_PTE entries.


-aneesh


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-05 23:03           ` Dan Williams
@ 2021-05-06 10:12             ` Joao Martins
  0 siblings, 0 replies; 64+ messages in thread
From: Joao Martins @ 2021-05-06 10:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On 5/6/21 12:03 AM, Dan Williams wrote:
> On Wed, May 5, 2021 at 3:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> [..]
>>> Ah yup, my eyes glazed over that. I think this is another place that
>>> benefits from a more specific name than "align". "pfns_per_compound"
>>> "compound_pfns"?
>>>
>> We are still describing a page, just not a base page. So perhaps @pfns_per_hpage ?
>>
>> I am fine with @pfns_per_compound or @compound_pfns as well.
> 
> My only concern about hpage is that hpage implies PMD, where compound
> is generic across PMD and PUD.
> 
True.

I will stick with your suggestions.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-06  8:05         ` Aneesh Kumar K.V
@ 2021-05-06 10:23           ` Joao Martins
  2021-05-06 11:43             ` Matthew Wilcox
  0 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-05-06 10:23 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton



On 5/6/21 9:05 AM, Aneesh Kumar K.V wrote:
> 
> 
> IIUC this series is about devdax namespaces with align of 1G or 2M where we can
> save the vmemmap space by not allocating memory for tail struct pages?
> 
Right.

It reuses base pages across the vmemmap, but for the base pages containing
only the tail struct pages.

> Dan Williams <dan.j.williams@intel.com> writes:
> 
>>> enum:
>>>>
>>>> enum devmap_geometry {
>>>>     DEVMAP_PTE,
>>>>     DEVMAP_PMD,
>>>>     DEVMAP_PUD,
>>>> }
>>>>
>>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
>>
>> I think it is ok for dax/nvdimm to continue to maintain their align
>> value because it should be ok to have 4MB align if the device really
>> wanted. However, when it goes to map that alignment with
>> memremap_pages() it can pick a mode. For example, it's already the
>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>> they're already separate concepts that can stay separate.
> 
> devdax namespace with align of 1G implies we expect to map them with 1G
> pte entries? I didn't follow when you say we map them today with
> DEVMAP_PTE entries.
> 
This sort of confusion is largely why Dan is suggesting @geometry for naming rather
than @align (which traditionally refers to page table entry sizes in pagemap-related stuff).

DEVMAP_{PTE,PMD,PUD} refers to the representation of metadata in base pages (DEVMAP_PTE),
compound pages of PMD order (DEVMAP_PMD) or compound pages of PUD order (DEVMAP_PUD).

So, today:

* namespaces with align of 1G would use *struct pages of order-0* (DEVMAP_PTE) backed with
PMD entries in the direct map.
* namespaces with align of 2M would use *struct pages of order-0* (DEVMAP_PTE) backed with
PMD entries in the direct map.

After this series:

* namespaces with align of 1G would use *compound struct pages of order-30* (DEVMAP_PUD)
backed with PMD entries in the direct map.
* namespaces with align of 1G would use *compound struct pages of order-21* (DEVMAP_PMD)
backed with PTE entries in the direct map.
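
To make the mapping between those DEVMAP_* values and the page size they imply
concrete, a minimal sketch of the converter asked about earlier could look like
the below (the enum is the one Dan proposed; the geometry_size() helper name is
hypothetical, not something from the posted patches):

enum devmap_geometry {
	DEVMAP_PTE,
	DEVMAP_PMD,
	DEVMAP_PUD,
};

/* page size the memmap describes for a given geometry */
static inline unsigned long geometry_size(enum devmap_geometry geometry)
{
	switch (geometry) {
	case DEVMAP_PMD:
		return PMD_SIZE;
	case DEVMAP_PUD:
		return PUD_SIZE;
	case DEVMAP_PTE:
	default:
		return PAGE_SIZE;
	}
}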


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation
  2021-05-05 23:14       ` Dan Williams
@ 2021-05-06 10:24         ` Joao Martins
  0 siblings, 0 replies; 64+ messages in thread
From: Joao Martins @ 2021-05-06 10:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton, Oscar Salvador



On 5/6/21 12:14 AM, Dan Williams wrote:
> On Wed, May 5, 2021 at 3:38 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>>
>>
>> On 5/5/21 11:34 PM, Dan Williams wrote:
>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>
>>>> @altmap is stored in a dev_pagemap, but it will be repurposed for
>>>> hotplug memory for storing the memmap in the hotplugged memory[*] and
>>>> reusing the altmap infrastructure to that end. This is to say that
>>>> altmap can't be replaced with a @pgmap as it is going to cover more than
>>>> dev_pagemap backend altmaps.
>>>
>>> I was going to say, just pass the pgmap and lookup the altmap from
>>> pgmap, but Oscar added a use case for altmap independent of pgmap. So
>>> you might refresh this commit message to clarify why passing pgmap by
>>> itself is not sufficient.
>>>
>> Isn't that what I am doing above with that exact paragraph? I even
>> reference his series at the end of commit description :) in [*]
> 
> Oh, sorry, it didn't hit me explicitly that you were talking about
> Oscar's work; I thought you were referring to your own changes in this
> set. I see it now... at a minimum the tense needs updating since
> Oscar's changes are in the past not the future anymore. 

/me nods

> If it helps,
> the following reads more direct to me: "In support of using compound
> pages for devmap mappings, plumb the pgmap down to the
> vmemmap_populate* implementation. Note that while altmap is
> retrievable from pgmap, the memory hotplug code passes altmap without
> pgmap, so both need to be independently plumbed."
> 
Let me use this one instead, definitely better than my strange english :)


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
  2021-05-05 22:43   ` Dan Williams
@ 2021-05-06 10:27     ` Joao Martins
  2021-05-06 18:36       ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-05-06 10:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton



On 5/5/21 11:43 PM, Dan Williams wrote:
> I suspect it's a good sign I'm only finding cosmetic and changelog
> changes in the review... 

Hopefully it continues that way, but the meat of the series is located
in patches 4, 6, 7 and 11. *Especially* 6 and 7.

I very strongly suspect I am gonna get non-cosmetic comments there.

> I have some more:
> 
> A year from now, if I'm tracking down a problem and looking through mm
> commits, I would appreciate a subject line like the following:
> "refactor core of vmemmap_populate_basepages() to helper" that gives
> an idea of the impact and side effects of the change.
> 
Fixed.

> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
> 
> I would add a lead in phrase like: "In preparation for describing a
> memmap with compound pages, move the actual..."
> 
>> Move the actual pte population logic into a separate function
>> vmemmap_populate_address() and have vmemmap_populate_basepages()
>> walk through all base pages it needs to populate.
> 
> Aside from the above, looks good.
>
Cool!


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
  2021-05-06  1:18   ` Dan Williams
@ 2021-05-06 11:01     ` Joao Martins
  2021-05-10 19:19       ` Dan Williams
  0 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-05-06 11:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On 5/6/21 2:18 AM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> A compound pagemap is a dev_pagemap with @align > PAGE_SIZE and it
>> means that pages are mapped at a given huge page alignment and utilize
>> uses compound pages as opposed to order-0 pages.
>>
>> To minimize struct page overhead we take advantage of the fact that
> 
> I'm not sure if Andrew is a member of the "we" police, but I am on
> team "imperative tense".
> 
That was my mistake once again. You did mention it over the RFC.
A few days after submitting the series I noticed[0] that this patch and the
next one still had a few instances of misplaced usage of 'we', in addition
to a little unit mistake I made in the cover letter.

[0] https://lore.kernel.org/linux-mm/a2b7dc3a-3f1d-b8e1-4770-e261055f2b17@oracle.com/

> "Take advantage of the fact that most tail pages look the same (except
> the first two) to minimize struct page overhead."
> 
> ...I think all the other "we"s below can be dropped without losing any meaning.
> 
Indeed.

>> most tail pages look the same (except the first two). We allocate a
>> separate page for the vmemmap area which contains the head page and
>> separate for the next 64 pages. The rest of the subsections then reuse
>> this tail vmemmap page to initialize the rest of the tail pages.
>>
>> Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
>> when initializing compound pagemap with big enough @align (e.g. 1G
>> PUD) it  will cross various sections. To be able to reuse tail pages
>> across sections belonging to the same gigantic page we fetch the
>> @range being mapped (nr_ranges + 1).  If the section being mapped is
>> not offset 0 of the @align, then lookup the PFN of the struct page
>> address that preceeds it and use that to populate the entire
> 
> s/preceeds/precedes/
> 
/me nods

>> section.
>>
>> On compound pagemaps with 2M align, this lets mechanism saves 6 pages
> 
> s/lets mechanism saves 6 pages/this mechanism lets 6 pages be saved/
> 
Will fix.

>> out of the 8 necessary PFNs necessary to set the subsection's 512
>> struct pages being mapped. On a 1G compound pagemap it saves
>> 4094 pages.
>>
>> Altmap isn't supported yet, given various restrictions in altmap pfn
>> allocator, thus fallback to the already in use vmemmap_populate().
> 
> Perhaps also note that altmap for devmap mappings was there to relieve
> the pressure of inordinate amounts of memmap space to map terabytes of
> PMEM. With compound pages the motivation for altmaps for PMEM is
> reduced.
> 
OK, I suppose it's worth highlighting it.

But perhaps how 'reduced' this motivation really is, is still not 100% clear to me.

Like we discussed over the RFC, with the DEVMAP_PMD geometry we still take a
non-trivial hit of ~4G per Tb. And altmap is a rather quick way of having
heavily disproportionate RAM-to-PMEM ratios working without paying the RAM
penalty. Especially with the pinning numbers so heavily reduced.
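
For context, a rough back-of-the-envelope for where that ~4G per Tb comes from,
assuming x86 4K base pages and 64-byte struct pages:

  2M compound page -> 512 struct pages = 32K of memmap = 8 vmemmap base pages
  with tail reuse  -> 2 of those 8 stay allocated      = 8K per 2M
  per 1Tb          -> (1T / 2M) * 8K                   = 4G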

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  include/linux/mm.h  |   2 +-
>>  mm/memremap.c       |   1 +
>>  mm/sparse-vmemmap.c | 139 ++++++++++++++++++++++++++++++++++++++++----
>>  3 files changed, 131 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 61474602c2b1..49d717ae40ae 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3040,7 +3040,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>>  pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>> -                           struct vmem_altmap *altmap);
>> +                           struct vmem_altmap *altmap, void *block);
>>  void *vmemmap_alloc_block(unsigned long size, int node);
>>  struct vmem_altmap;
>>  void *vmemmap_alloc_block_buf(unsigned long size, int node,
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index d160853670c4..2e6bc0b1ff00 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>>  {
>>         struct mhp_params params = {
>>                 .altmap = pgmap_altmap(pgmap),
>> +               .pgmap = pgmap,
>>                 .pgprot = PAGE_KERNEL,
>>         };
>>         const int nr_range = pgmap->nr_range;
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index 8814876edcfa..f57c5eada099 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
>>  }
>>
>>  pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>> -                                      struct vmem_altmap *altmap)
>> +                                      struct vmem_altmap *altmap, void *block)
> 
> For type-safety I would make this argument a 'struct page *' and drop
> the virt_to_page().
> 
Will do.

>>  {
>>         pte_t *pte = pte_offset_kernel(pmd, addr);
>>         if (pte_none(*pte)) {
>>                 pte_t entry;
>> -               void *p;
>> -
>> -               p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
>> -               if (!p)
>> -                       return NULL;
>> +               void *p = block;
>> +
>> +               if (!block) {
>> +                       p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
>> +                       if (!p)
>> +                               return NULL;
>> +               } else if (!altmap) {
>> +                       get_page(virt_to_page(block));
> 
> This is either super subtle, or there is a missing put_page(). I'm
> assuming that in the shutdown path the same page will be freed
> multiple times because the same page is mapped multiple times.
> 

It's the former (subtlety):

1) When a PTE/PMD entry is freed from the init_mm there's a free_pages() call to
this page allocated above. free_pages() at the top does a put_page_testzero() prior to
freeing the actual page.

2) Given this codepath is solely used by pgmap infra, the page allocator is already
available (i.e. slab_is_available()), hence we will always go towards the
alloc_pages_node() codepath.
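
To make the subtlety above easier to find later, a rough sketch of the kind of
comment that could sit next to the get_page() in the vmemmap_pte_populate() hunk
quoted earlier (the wording is illustrative, not from the patches):

		} else if (!altmap) {
			/*
			 * A reused vmemmap page is mapped by multiple PTEs in
			 * init_mm, so take an extra reference per extra
			 * mapping. Each teardown of such a PTE/PMD entry ends
			 * in free_pages(), whose put_page_testzero() only
			 * returns the page to the allocator once the last
			 * reference is dropped.
			 */
			get_page(virt_to_page(block));
		}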

> Have you validated that the accounting is correct and all allocated
> pages get freed? I feel this needs more than a comment, it needs a
> test to validate that the accounting continues to happen correctly as
> future vmemmap changes land.
> 

I can add a comment *at least*.

I checked the accounting. But let me be extra pedantic on this part and recheck
it once again, as I settled on this part some time ago. I will follow up on this
chunk should it turn out to be broken in some way.

>> +               }
>>                 entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
>>                 set_pte_at(&init_mm, addr, pte, entry);
>>         }
>> @@ -217,7 +221,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>>  }
>>
>>  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> -                                             struct vmem_altmap *altmap)
>> +                                             struct vmem_altmap *altmap,
>> +                                             void *page, void **ptr)
>>  {
>>         pgd_t *pgd;
>>         p4d_t *p4d;
>> @@ -237,10 +242,14 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>>         pmd = vmemmap_pmd_populate(pud, addr, node);
>>         if (!pmd)
>>                 return -ENOMEM;
>> -       pte = vmemmap_pte_populate(pmd, addr, node, altmap);
>> +       pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
>>         if (!pte)
>>                 return -ENOMEM;
>>         vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
>> +
>> +       if (ptr)
>> +               *ptr = __va(__pfn_to_phys(pte_pfn(*pte)));
>> +       return 0;
>>  }
>>
>>  int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>> @@ -249,7 +258,110 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>>         unsigned long addr = start;
>>
>>         for (; addr < end; addr += PAGE_SIZE) {
>> -               if (vmemmap_populate_address(addr, node, altmap))
>> +               if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
>> +                       return -ENOMEM;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static int __meminit vmemmap_populate_range(unsigned long start,
>> +                                           unsigned long end,
>> +                                           int node, void *page)
>> +{
>> +       unsigned long addr = start;
>> +
>> +       for (; addr < end; addr += PAGE_SIZE) {
>> +               if (vmemmap_populate_address(addr, node, NULL, page, NULL))
>> +                       return -ENOMEM;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>> +                                                 void **ptr)
>> +{
>> +       return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
>> +}
>> +
>> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
> 
> I think this can be replaced with a call to follow_pte(&init_mm...).
> 

Ah, of course! That would shorten things up too.

>> +{
>> +       pgd_t *pgd;
>> +       p4d_t *p4d;
>> +       pud_t *pud;
>> +       pmd_t *pmd;
>> +       pte_t *pte;
>> +
>> +       pgd = pgd_offset_k(addr);
>> +       if (pgd_none(*pgd))
>> +               return NULL;
>> +
>> +       p4d = p4d_offset(pgd, addr);
>> +       if (p4d_none(*p4d))
>> +               return NULL;
>> +
>> +       pud = pud_offset(p4d, addr);
>> +       if (pud_none(*pud))
>> +               return NULL;
>> +
>> +       pmd = pmd_offset(pud, addr);
>> +       if (pmd_none(*pmd))
>> +               return NULL;
>> +
>> +       pte = pte_offset_kernel(pmd, addr);
>> +       if (pte_none(*pte))
>> +               return NULL;
>> +
>> +       return pte;
>> +}
>> +
>> +static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>> +                                                    unsigned long start,
>> +                                                    unsigned long end, int node,
>> +                                                    struct dev_pagemap *pgmap)
>> +{
>> +       unsigned long offset, size, addr;
>> +
>> +       /*
>> +        * For compound pages bigger than section size (e.g. 1G) fill the rest
> 
> Oh, I had to read this a couple times because I thought the "e.g." was
> referring to section size. How about:
> 
> s/(e.g. 1G)/(e.g. x86 1G compound pages with 2M subsection size)/
> 
Will fix. But instead I'll fix to:

e.g. x86 1G compound pages with 128M section size

Because vmemmap_populate_compound_pages() is called per section, while being
subsection-aligned.

>> +        * of sections as tail pages.
>> +        *
>> +        * Note that memremap_pages() resets @nr_range value and will increment
>> +        * it after each range successful onlining. Thus the value or @nr_range
>> +        * at section memmap populate corresponds to the in-progress range
>> +        * being onlined that we care about.
>> +        */
>> +       offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>> +       if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
>> +           pgmap_align(pgmap) > SUBSECTION_SIZE) {
>> +               pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
>> +
>> +               if (!ptep)
>> +                       return -ENOMEM;
> 
> Side note: I had been resistant to adding 'struct page' for altmap
> pages, but that would be a requirement to enable compound pages +
> altmap.
> 
Fair enough.

I was thinking about something a little simpler, like adding a refcount to the
altmap pfn allocator, to avoid arch changes and be a little more tidy
on space (16 bytes rather than 64 bytes). But I suspect I can avoid the said
arch changes even with a backing struct page.

>> +
> 
> Perhaps a comment here to the effect "recall the page that was
> allocated in a prior iteration and fill it in with tail pages".
> 
Will add.

>> +               return vmemmap_populate_range(start, end, node,
>> +                                             page_to_virt(pte_page(*ptep)));
> 
> If the @block parameter changes to a 'struct page *' it also saves
> some typing here.
> 
Indeed.

>> +       }
>> +
>> +       size = min(end - start, pgmap_pfn_align(pgmap) * sizeof(struct page));
>> +       for (addr = start; addr < end; addr += size) {
>> +               unsigned long next = addr, last = addr + size;
>> +               void *block;
>> +
>> +               /* Populate the head page vmemmap page */
>> +               if (vmemmap_populate_page(addr, node, NULL))
>> +                       return -ENOMEM;
>> +
>> +               /* Populate the tail pages vmemmap page */
>> +               block = NULL;
>> +               next = addr + PAGE_SIZE;
>> +               if (vmemmap_populate_page(next, node, &block))
>> +                       return -ENOMEM;
>> +
>> +               /* Reuse the previous page for the rest of tail pages */
>> +               next += PAGE_SIZE;
>> +               if (vmemmap_populate_range(next, last, node, block))
>>                         return -ENOMEM;
>>         }
>>
>> @@ -262,12 +374,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>>  {
>>         unsigned long start = (unsigned long) pfn_to_page(pfn);
>>         unsigned long end = start + nr_pages * sizeof(struct page);
>> +       unsigned int align = pgmap_align(pgmap);
>> +       int r;
>>
>>         if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
>>                 !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>>                 return NULL;
>>
>> -       if (vmemmap_populate(start, end, nid, altmap))
>> +       if (align > PAGE_SIZE && !altmap)
> 
> I also think this code will read better with the devmap_geometry enum.
> 
True.

I am starting to like the ring of @geometry rather than @align to represent
how the metadata is used. It should avoid confusion between device-dax @align and pagemap
@align.

>> +               r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>> +       else
>> +               r = vmemmap_populate(start, end, nid, altmap);
>> +
>> +       if (r < 0)
>>                 return NULL;
>>
>>         return pfn_to_page(pfn);
>> --
>> 2.17.1
>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-06 10:23           ` Joao Martins
@ 2021-05-06 11:43             ` Matthew Wilcox
  2021-05-06 12:15               ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Matthew Wilcox @ 2021-05-06 11:43 UTC (permalink / raw)
  To: Joao Martins
  Cc: Aneesh Kumar K.V, Dan Williams, Linux MM, Ira Weiny,
	linux-nvdimm, Jason Gunthorpe, Jane Chu, Muchun Song,
	Mike Kravetz, Andrew Morton

On Thu, May 06, 2021 at 11:23:25AM +0100, Joao Martins wrote:
> >> I think it is ok for dax/nvdimm to continue to maintain their align
> >> value because it should be ok to have 4MB align if the device really
> >> wanted. However, when it goes to map that alignment with
> >> memremap_pages() it can pick a mode. For example, it's already the
> >> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
> >> they're already separate concepts that can stay separate.
> > 
> > devdax namespace with align of 1G implies we expect to map them with 1G
> > pte entries? I didn't follow when you say we map them today with
> > DEVMAP_PTE entries.
> 
> This sort of confusion is largely why Dan is suggesting @geometry for naming rather
> than @align (which traditionally refers to page table entry sizes in pagemap-related stuff).
> 
> DEVMAP_{PTE,PMD,PUD} refers to the representation of metadata in base pages (DEVMAP_PTE),
> compound pages of PMD order (DEVMAP_PMD) or compound pages of PUD order (DEVMAP_PUD).
> 
> So, today:
> 
> * namespaces with align of 1G would use *struct pages of order-0* (DEVMAP_PTE) backed with
> PMD entries in the direct map.
> * namespaces with align of 2M would use *struct pages of order-0* (DEVMAP_PTE) backed with
> PMD entries in the direct map.
> 
> After this series:
> 
> * namespaces with align of 1G would use *compound struct pages of order-30* (DEVMAP_PUD)
> backed with PMD entries in the direct map.

order-18

> * namespaces with align of 1G would use *compound struct pages of order-21* (DEVMAP_PMD)
> backed with PTE entries in the direct map.

i think you mean "align of 2M", and order-9.

(note that these numbers are/can be different for powerpc since it
supports different TLB page sizes.  please don't get locked into x86
sizes, and don't ignore the benefits *to software* of managing memory
in sizes other than just those supported by the hardware).


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-06 11:43             ` Matthew Wilcox
@ 2021-05-06 12:15               ` Joao Martins
  0 siblings, 0 replies; 64+ messages in thread
From: Joao Martins @ 2021-05-06 12:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Aneesh Kumar K.V, Dan Williams, Linux MM, Ira Weiny,
	linux-nvdimm, Jason Gunthorpe, Jane Chu, Muchun Song,
	Mike Kravetz, Andrew Morton

On 5/6/21 12:43 PM, Matthew Wilcox wrote:
> On Thu, May 06, 2021 at 11:23:25AM +0100, Joao Martins wrote:
>>>> I think it is ok for dax/nvdimm to continue to maintain their align
>>>> value because it should be ok to have 4MB align if the device really
>>>> wanted. However, when it goes to map that alignment with
>>>> memremap_pages() it can pick a mode. For example, it's already the
>>>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>>>> they're already separate concepts that can stay separate.
>>>
>>> devdax namespace with align of 1G implies we expect to map them with 1G
>>> pte entries? I didn't follow when you say we map them today with
>>> DEVMAP_PTE entries.
>>
>> This sort of confusion is largely why Dan is suggesting @geometry for naming rather
>> than @align (which traditionally refers to page table entry sizes in pagemap-related stuff).
>>
>> DEVMAP_{PTE,PMD,PUD} refers to the representation of metadata in base pages (DEVMAP_PTE),
>> compound pages of PMD order (DEVMAP_PMD) or compound pages of PUD order (DEVMAP_PUD).
>>
>> So, today:
>>
>> * namespaces with align of 1G would use *struct pages of order-0* (DEVMAP_PTE) backed with
>> PMD entries in the direct map.
>> * namespaces with align of 2M would use *struct pages of order-0* (DEVMAP_PTE) backed with
>> PMD entries in the direct map.
>>
>> After this series:
>>
>> * namespaces with align of 1G would use *compound struct pages of order-30* (DEVMAP_PUD)
>> backed with PMD entries in the direct map.
> 
> order-18
> 
Right, thanks for the correction.

Forgot to subtract PAGE_SHIFT (12).

>> * namespaces with align of 1G would use *compound struct pages of order-21* (DEVMAP_PMD)
>> backed with PTE entries in the direct map.
> 
> i think you mean "align of 2M", and order-9.
> 
True.

> (note that these numbers are/can be different for powerpc since it
> supports different TLB page sizes.  please don't get locked into x86
> sizes, and don't ignore the benefits *to software* of managing memory
> in sizes other than just those supported by the hardware).
> 
I am not ignoring that either, I think.

The page-size-looking values above are also what the consumer of this (device-dax) allows
you to use as @align, hence why you see DEVMAP_PTE, PMD and PUD as the sizes.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
  2021-05-06 10:27     ` Joao Martins
@ 2021-05-06 18:36       ` Joao Martins
  0 siblings, 0 replies; 64+ messages in thread
From: Joao Martins @ 2021-05-06 18:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On 5/6/21 11:27 AM, Joao Martins wrote:
> On 5/5/21 11:43 PM, Dan Williams wrote:
>> I suspect it's a good sign I'm only finding cosmetic and changelog
>> changes in the review... 
> 
> Hopefully it continues that way, but the meat of the series is located
> in patches 4, 6, 7 and 11. *Especially* 6 and 7.
> 
Ugh, I meant 7 and 8 *not* 6 and 7.

> I very strongly suspect I am gonna get non-cosmetic comments there.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
  2021-05-06 11:01     ` Joao Martins
@ 2021-05-10 19:19       ` Dan Williams
  2021-05-13 18:45         ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Dan Williams @ 2021-05-10 19:19 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On Thu, May 6, 2021 at 4:02 AM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
> >> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
> >
> > I think this can be replaced with a call to follow_pte(&init_mm...).
> >
>
> Ah, of course! That would shorten things up too.

Now that I look closely, I notice that function disclaims being
suitable as a general purpose pte lookup utility. If it works for you,
great, but if not I think it's past time to create such a helper. I
know Ira is looking for one, and I contributed to the proliferation
when I added dev_pagemap_mapping_shift().


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
  2021-05-10 19:19       ` Dan Williams
@ 2021-05-13 18:45         ` Joao Martins
  2021-06-16 15:05           ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-05-13 18:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On 5/10/21 8:19 PM, Dan Williams wrote:
> On Thu, May 6, 2021 at 4:02 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> [..]
>>>> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
>>>
>>> I think this can be replaced with a call to follow_pte(&init_mm...).
>>>
>>
>> Ah, of course! That would shorten things up too.
> 
> Now that I look closely, I notice that function disclaims being
> suitable as a general purpose pte lookup utility. 
> If it works for you,
> great, but if not I think it's past time to create such a helper. I
> know Ira is looking for one, and I contributed to the proliferation
> when I added dev_pagemap_mapping_shift().
> 
There are also x86 equivalents, like lookup_address() and lookup_address_in_pgd().
These two don't differ that much from vmemmap_lookup_address().

I can move this to an internal place e.g. mm/internal.h

The patch after this one changes vmemmap_lookup_address() to stop at the PMD (to reuse
that across the next sections).
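
For illustration only, a rough sketch of a lookup that stops at the PMD; the
helper name is hypothetical and the actual follow-up patch may well differ:

static pmd_t * __meminit vmemmap_lookup_pmd(unsigned long addr)
{
	pgd_t *pgd = pgd_offset_k(addr);
	p4d_t *p4d;
	pud_t *pud;

	if (pgd_none(*pgd))
		return NULL;

	p4d = p4d_offset(pgd, addr);
	if (p4d_none(*p4d))
		return NULL;

	pud = pud_offset(p4d, addr);
	if (pud_none(*pud))
		return NULL;

	/* caller decides whether the PMD maps a page table or a huge page */
	return pmd_offset(pud, addr);
}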


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-05 22:36         ` Joao Martins
  2021-05-05 23:03           ` Dan Williams
@ 2021-05-18 17:27           ` Joao Martins
  2021-05-18 19:56             ` Jane Chu
  2021-06-07 20:17             ` Dan Williams
  1 sibling, 2 replies; 64+ messages in thread
From: Joao Martins @ 2021-05-18 17:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On 5/5/21 11:36 PM, Joao Martins wrote:
> On 5/5/21 11:20 PM, Dan Williams wrote:
>> On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>> On 5/5/21 7:44 PM, Dan Williams wrote:
>>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>> index b46f63dcaed3..bb28d82dda5e 100644
>>>>> --- a/include/linux/memremap.h
>>>>> +++ b/include/linux/memremap.h
>>>>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>>>>         struct completion done;
>>>>>         enum memory_type type;
>>>>>         unsigned int flags;
>>>>> +       unsigned long align;
>>>>
>>>> I think this wants some kernel-doc above to indicate that non-zero
>>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
>>>> means "use non-compound base pages".

[...]

>>>> The non-zero value must be
>>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
>>>> Hmm, maybe it should be an
>>>> enum:
>>>>
>>>> enum devmap_geometry {
>>>>     DEVMAP_PTE,
>>>>     DEVMAP_PMD,
>>>>     DEVMAP_PUD,
>>>> }
>>>>
>>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
>>
>> I think it is ok for dax/nvdimm to continue to maintain their align
>> value because it should be ok to have 4MB align if the device really
>> wanted. However, when it goes to map that alignment with
>> memremap_pages() it can pick a mode. For example, it's already the
>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>> they're already separate concepts that can stay separate.
>>
> Gotcha.

I am reconsidering part of the above. In general, yes, the devmap @align carries a slightly
different meaning than the device @align, i.e. how the metadata is laid out, **but**
independently of what kind of page table entries we use for the vmemmap.

By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax already
validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE and PUD_SIZE),
2) tying the geometry of the metadata to the value we pick for @align at namespace
provisioning -- not the "align" we might use at mmap(); perhaps that's what you referred
to above? -- and 3) deriving the value of the geometry from the dax device @align, because
we will need to create compound pages representing a page size of the @align value.

Using your example above: you're saying that dax->align == 1G is mapped with DEVMAP_PTEs,
but in reality the vmemmap is populated with PMD/PUD page tables (depending on what the
arch decides to do at vmemmap_populate()) and uses base pages as its metadata regardless
of the device @align. What we really want to convey in @geometry is not page table sizes,
but just the page size used for the vmemmap of the dax device. Additionally, limiting its
value might not be desirable... if tomorrow Linux on some arch supports dax/nvdimm
devices with a 4M or 64K align, the value of @geometry will have to reflect the 4M to
create compound pages of order 10 for the said vmemmap.

I am going to wait until you finish reviewing the remaining four patches of this series,
but maybe this is a simple misnomer (s/align/geometry/) with a comment, but without the
DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and an enum, plus
a setter/getter to audit its value? Thoughts?
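
For illustration, a rough sketch of the misnomer fix -- the name is hypothetical and the
field is assumed to simply replace @align in struct dev_pagemap, keeping its unsigned
long type:

/*
 * @geometry: page size with which the device metadata (struct pages) is
 * laid out, i.e. the compound page size of the device pages. Zero or
 * PAGE_SIZE means "use non-compound base pages".
 */

static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
{
	if (pgmap && pgmap->geometry)
		return pgmap->geometry;
	return PAGE_SIZE;
}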

		Joao


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-18 17:27           ` Joao Martins
@ 2021-05-18 19:56             ` Jane Chu
  2021-05-19 11:29               ` Joao Martins
  2021-06-07 20:17             ` Dan Williams
  1 sibling, 1 reply; 64+ messages in thread
From: Jane Chu @ 2021-05-18 19:56 UTC (permalink / raw)
  To: Joao Martins, Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Muchun Song, Mike Kravetz, Andrew Morton

On 5/18/2021 10:27 AM, Joao Martins wrote:

> On 5/5/21 11:36 PM, Joao Martins wrote:
>> On 5/5/21 11:20 PM, Dan Williams wrote:
>>> On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> On 5/5/21 7:44 PM, Dan Williams wrote:
>>>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>> index b46f63dcaed3..bb28d82dda5e 100644
>>>>>> --- a/include/linux/memremap.h
>>>>>> +++ b/include/linux/memremap.h
>>>>>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>>>>>          struct completion done;
>>>>>>          enum memory_type type;
>>>>>>          unsigned int flags;
>>>>>> +       unsigned long align;
>>>>> I think this wants some kernel-doc above to indicate that non-zero
>>>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
>>>>> means "use non-compound base pages".
> [...]
>
>>>>> The non-zero value must be
>>>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
>>>>> Hmm, maybe it should be an
>>>>> enum:
>>>>>
>>>>> enum devmap_geometry {
>>>>>      DEVMAP_PTE,
>>>>>      DEVMAP_PMD,
>>>>>      DEVMAP_PUD,
>>>>> }
>>>>>
>>>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>>>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
>>> I think it is ok for dax/nvdimm to continue to maintain their align
>>> value because it should be ok to have 4MB align if the device really
>>> wanted. However, when it goes to map that alignment with
>>> memremap_pages() it can pick a mode. For example, it's already the
>>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>>> they're already separate concepts that can stay separate.
>>>
>> Gotcha.
> I am reconsidering part of the above. In general, yes, the meaning of devmap @align
> represents a slightly different variation of the device @align i.e. how the metadata is
> laid out **but** regardless of what kind of page table entries we use vmemmap.
>
> By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax already
> validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE and PUD_SIZE)
> 2) the geometry of metadata is very much tied to the value we pick to @align at namespace
> provisioning -- not the "align" we might use at mmap() perhaps that's what you referred
> above? -- and 3) the value of geometry actually derives from dax device @align because we
> will need to create compound pages representing a page size of @align value.
>
> Using your example above: you're saying that dax->align == 1G is mapped with DEVMAP_PTEs,
> in reality the vmemmap is populated with PMDs/PUDs page tables (depending on what archs
> decide to do at vmemmap_populate()) and uses base pages as its metadata regardless of what
> device @align. In reality what we want to convey in @geometry is not page table sizes, but
> just the page size used for the vmemmap of the dax device. Additionally, limiting its
> value might not be desirable... if tomorrow Linux for some arch supports dax/nvdimm
> devices with 4M align or 64K align, the value of @geometry will have to reflect the 4M to
> create compound pages of order 10 for the said vmemmap.
>
> I am going to wait until you finish reviewing the remaining four patches of this series,
> but maybe this is a simple misnomer (s/align/geometry/) with a comment but without
> DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and enum a
> setter/getter to audit its value? Thoughts?
>
> 		Joao

Good points there.

My understanding is that dax->align conveys a granularity of size while carving out a
namespace; it's a geometry attribute loosely akin to the sector size of a spindle disk.
I tend to think that the device pagesize has almost no relation to "align", in that it's
possible to have a 1G "align" and a 4K pagesize, or vice versa.  That is, with the advent
of compound page support, it is possible to totally separate the two concepts.

How about adding a new option to "ndctl create-namespace" that describes the device
creator's desired pagesize, and another parameter to describe whether the pagesize shall
be fixed or allowed to be split up, such that, if the intention is to never split up a
2M pagesize, then it would be possible to save a lot of metadata space on the device?

thanks,

-jane



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-18 19:56             ` Jane Chu
@ 2021-05-19 11:29               ` Joao Martins
  2021-05-19 18:36                 ` Jane Chu
  0 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-05-19 11:29 UTC (permalink / raw)
  To: Jane Chu, Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Muchun Song, Mike Kravetz, Andrew Morton



On 5/18/21 8:56 PM, Jane Chu wrote:
> On 5/18/2021 10:27 AM, Joao Martins wrote:
> 
>> On 5/5/21 11:36 PM, Joao Martins wrote:
>>> On 5/5/21 11:20 PM, Dan Williams wrote:
>>>> On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>> On 5/5/21 7:44 PM, Dan Williams wrote:
>>>>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>> index b46f63dcaed3..bb28d82dda5e 100644
>>>>>>> --- a/include/linux/memremap.h
>>>>>>> +++ b/include/linux/memremap.h
>>>>>>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>>>>>>          struct completion done;
>>>>>>>          enum memory_type type;
>>>>>>>          unsigned int flags;
>>>>>>> +       unsigned long align;
>>>>>> I think this wants some kernel-doc above to indicate that non-zero
>>>>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
>>>>>> means "use non-compound base pages".
>> [...]
>>
>>>>>> The non-zero value must be
>>>>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
>>>>>> Hmm, maybe it should be an
>>>>>> enum:
>>>>>>
>>>>>> enum devmap_geometry {
>>>>>>      DEVMAP_PTE,
>>>>>>      DEVMAP_PMD,
>>>>>>      DEVMAP_PUD,
>>>>>> }
>>>>>>
>>>>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>>>>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
>>>> I think it is ok for dax/nvdimm to continue to maintain their align
>>>> value because it should be ok to have 4MB align if the device really
>>>> wanted. However, when it goes to map that alignment with
>>>> memremap_pages() it can pick a mode. For example, it's already the
>>>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>>>> they're already separate concepts that can stay separate.
>>>>
>>> Gotcha.
>> I am reconsidering part of the above. In general, yes, the meaning of devmap @align
>> represents a slightly different variation of the device @align i.e. how the metadata is
>> laid out **but** regardless of what kind of page table entries we use vmemmap.
>>
>> By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax already
>> validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE and PUD_SIZE)
>> 2) the geometry of metadata is very much tied to the value we pick to @align at namespace
>> provisioning -- not the "align" we might use at mmap() perhaps that's what you referred
>> above? -- and 3) the value of geometry actually derives from dax device @align because we
>> will need to create compound pages representing a page size of @align value.
>>
>> Using your example above: you're saying that dax->align == 1G is mapped with DEVMAP_PTEs,
>> in reality the vmemmap is populated with PMDs/PUDs page tables (depending on what archs
>> decide to do at vmemmap_populate()) and uses base pages as its metadata regardless of what
>> device @align. In reality what we want to convey in @geometry is not page table sizes, but
>> just the page size used for the vmemmap of the dax device. Additionally, limiting its
>> value might not be desirable... if tomorrow Linux for some arch supports dax/nvdimm
>> devices with 4M align or 64K align, the value of @geometry will have to reflect the 4M to
>> create compound pages of order 10 for the said vmemmap.
>>
>> I am going to wait until you finish reviewing the remaining four patches of this series,
>> but maybe this is a simple misnomer (s/align/geometry/) with a comment but without
>> DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and enum a
>> setter/getter to audit its value? Thoughts?
> 
> Good points there.
> 
> My understanding is that  dax->align  conveys granularity of size while 
> carving out a namespace it's a geometry attribute loosely akin to sector size of a spindle 
> disk.  I tend to think that device pagesize  has almost no relation to "align" in that, it's 
> possible to have 1G "align" and 4K pagesize, or verse versa.  That is, with the advent of compound page 
> support, it is possible to totally separate the two concepts.
> 
> How about adding a new option to "ndctl create-namespace" that describes 
> device creator's desired pagesize, and another parameter to describe whether the pagesize shall 
> be fixed or allowed to be split up, such that, if the intention is to never split up 2M pagesize, then it 
> would be possible to save a lot metadata space on the device?

Maybe that can be selected by the driver too, but it's an interesting point you raise,
should we settle on the geometry (e.g. a geometry sysfs entry, IIUC your suggestion?).
device-dax, for example, would use geometry == align and therefore save space (like what
I propose in patch 10), whereas fsdax would retain the default, i.e. geometry = PAGE_SIZE
and align = PMD_SIZE, should it want to split pages.
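
To make that concrete, a minimal illustration assuming the hypothetical @geometry field
discussed above (not actual driver code):

	/* device-dax: pages are never split, so the metadata tracks the align */
	pgmap->geometry = dev_dax->align;	/* e.g. PMD_SIZE or PUD_SIZE */

	/* fsdax: PMDs may be split, so keep base page metadata */
	pgmap->geometry = PAGE_SIZE;		/* while the align stays PMD_SIZE */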

Interestingly, devmap poisoning always occurs at the @align level regardless of @geometry.

What I am not sure about is what value (vs added complexity) it brings to allow the
geometry *value* to be selectable by the user, given that so far we seem to only ever
initialize metadata as either sets of base pages [*] or sets of compound pages (of a
given size). And the difference between the two can possibly be summarized as
split-ability, like you say.

[*] which can optionally be morphed into compound pages by the driver


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-19 11:29               ` Joao Martins
@ 2021-05-19 18:36                 ` Jane Chu
  0 siblings, 0 replies; 64+ messages in thread
From: Jane Chu @ 2021-05-19 18:36 UTC (permalink / raw)
  To: Joao Martins, Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Muchun Song, Mike Kravetz, Andrew Morton,
	Jane Chu


On 5/19/2021 4:29 AM, Joao Martins wrote:
>
> On 5/18/21 8:56 PM, Jane Chu wrote:
>> On 5/18/2021 10:27 AM, Joao Martins wrote:
>>
>>> On 5/5/21 11:36 PM, Joao Martins wrote:
>>>> On 5/5/21 11:20 PM, Dan Williams wrote:
>>>>> On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>> On 5/5/21 7:44 PM, Dan Williams wrote:
>>>>>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>> index b46f63dcaed3..bb28d82dda5e 100644
>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>>>>>>>           struct completion done;
>>>>>>>>           enum memory_type type;
>>>>>>>>           unsigned int flags;
>>>>>>>> +       unsigned long align;
>>>>>>> I think this wants some kernel-doc above to indicate that non-zero
>>>>>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
>>>>>>> means "use non-compound base pages".
>>> [...]
>>>
>>>>>>> The non-zero value must be
>>>>>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
>>>>>>> Hmm, maybe it should be an
>>>>>>> enum:
>>>>>>>
>>>>>>> enum devmap_geometry {
>>>>>>>       DEVMAP_PTE,
>>>>>>>       DEVMAP_PMD,
>>>>>>>       DEVMAP_PUD,
>>>>>>> }
>>>>>>>
>>>>>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>>>>>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
>>>>> I think it is ok for dax/nvdimm to continue to maintain their align
>>>>> value because it should be ok to have 4MB align if the device really
>>>>> wanted. However, when it goes to map that alignment with
>>>>> memremap_pages() it can pick a mode. For example, it's already the
>>>>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>>>>> they're already separate concepts that can stay separate.
>>>>>
>>>> Gotcha.
>>> I am reconsidering part of the above. In general, yes, the meaning of devmap @align
>>> represents a slightly different variation of the device @align i.e. how the metadata is
>>> laid out **but** regardless of what kind of page table entries we use vmemmap.
>>>
>>> By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax already
>>> validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE and PUD_SIZE)
>>> 2) the geometry of metadata is very much tied to the value we pick to @align at namespace
>>> provisioning -- not the "align" we might use at mmap() perhaps that's what you referred
>>> above? -- and 3) the value of geometry actually derives from dax device @align because we
>>> will need to create compound pages representing a page size of @align value.
>>>
>>> Using your example above: you're saying that dax->align == 1G is mapped with DEVMAP_PTEs,
>>> in reality the vmemmap is populated with PMDs/PUDs page tables (depending on what archs
>>> decide to do at vmemmap_populate()) and uses base pages as its metadata regardless of what
>>> device @align. In reality what we want to convey in @geometry is not page table sizes, but
>>> just the page size used for the vmemmap of the dax device. Additionally, limiting its
>>> value might not be desirable... if tomorrow Linux for some arch supports dax/nvdimm
>>> devices with 4M align or 64K align, the value of @geometry will have to reflect the 4M to
>>> create compound pages of order 10 for the said vmemmap.
>>>
>>> I am going to wait until you finish reviewing the remaining four patches of this series,
>>> but maybe this is a simple misnomer (s/align/geometry/) with a comment but without
>>> DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and enum a
>>> setter/getter to audit its value? Thoughts?
>> Good points there.
>>
>> My understanding is that  dax->align  conveys granularity of size while
>> carving out a namespace it's a geometry attribute loosely akin to sector size of a spindle
>> disk.  I tend to think that device pagesize  has almost no relation to "align" in that, it's
>> possible to have 1G "align" and 4K pagesize, or verse versa.  That is, with the advent of compound page
>> support, it is possible to totally separate the two concepts.
>>
>> How about adding a new option to "ndctl create-namespace" that describes
>> device creator's desired pagesize, and another parameter to describe whether the pagesize shall
>> be fixed or allowed to be split up, such that, if the intention is to never split up 2M pagesize, then it
>> would be possible to save a lot metadata space on the device?
> Maybe that can be selected by the driver too, but it's an interesting point you raise
> should we settle with the geometry (e.g. like a geometry sysfs entry IIUC your
> suggestion?). device-dax for example would use geometry == align and therefore save space
> (like what I propose in patch 10). But fsdax would retain the default that is geometry =
> PAGE_SIZE and align = PMD_SIZE should it want to split pages.

Let's see, I think this is what we have today

       | align   hpagesize  geometry  hpage-splittable
=======================================================
devdax | 4K..1G  2M,1G       4K        artificially no
fsdax  | 4K..1G  2M          4K        yes

So a hard no-split means (hpagesize == geometry), and that does not apply to fsdax for
now. But is it not possible in the future?  Some customers would prefer an optional
guarantee that their DAX hpage is never split up, for the sake of RDMA efficiency.

>
> Interestingly, devmap poisoning always occur at @align level regardless of @geometry.
Yeah, it's a simplification that's not ideal, because after all, 
error-blast-radius != UserMapping-pagesize.
>
> What I am not sure is what value (vs added complexity) it brings to allow geometry *value*
> to be selecteable by user given that so far we seem to only ever initialize metadata as
> either sets of base pages [*] or sets of compound pages (of a size). And the difference
> between both can possibly be summarized to split-ability like you say.
>
> [*] that optionally can are morphed into compound pages by driver

Agreed.  For this series, it's simpler not to make the 
compound-page-size selectable.

thanks,

-jane



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
  2021-03-25 23:09 ` [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD " Joao Martins
@ 2021-06-01 19:30   ` Dan Williams
  2021-06-07 12:02     ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Dan Williams @ 2021-06-01 19:30 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

Sorry for the delay, and the sync up on IRC. I think this subject
needs to change to "optimize memory savings for compound pud memmap
geometry", and move this to the back of the series to make it clear
which patches are base functionality and which extend the idea
further. As far as I can see this patch can move to the end of the
series. Some additional changelog feedback below:


On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Right now basepages are used to populate the PUD tail pages, and it
> picks the address of the previous page of the subsection that preceeds

s/preceeds/precedes/

> the memmap we are initializing.  This is done when a given memmap
> address isn't aligned to the pgmap @align (which is safe to do because
> @ranges are guaranteed to be aligned to @align).

You know what would help is if you could draw an ascii art picture of
the before and after of the head page vs tail page arrangement. I can
see how the words line up to the code, but it takes a while to get the
picture in my head and I think future work in this area will benefit
from having a place in Documentation that draws a picture of the
layout of the various geometries.

I've used asciiflow.com for these types of diagrams in the past.

>
> For pagemaps with an align which spans various sections, this means
> that PMD pages are unnecessarily allocated for reusing the same tail
> pages.  Effectively, on x86 a PUD can span 8 sections (depending on
> config), and a page is being  allocated a page for the PMD to reuse
> the tail vmemmap across the rest of the PTEs. In short effecitvely the
> PMD cover the tail vmemmap areas all contain the same PFN. So instead
> of doing this way, populate a new PMD on the second section of the
> compound page (tail vmemmap PMD), and then the following sections
> utilize the preceding PMD we previously populated which only contain
> tail pages).
>
> After this scheme for an 1GB pagemap aligned area, the first PMD
> (section) would contain head page and 32767 tail pages, where the
> second PMD contains the full 32768 tail pages.  The latter page gets
> its PMD reused across future section mapping of the same pagemap.
>
> Besides fewer pagetable entries allocated, keeping parity with
> hugepages in the directmap (as done by vmemmap_populate_hugepages()),
> this further increases savings per compound page. For each PUD-aligned
> pagemap we go from 40960 bytes down to 16384 bytes. Rather than
> requiring 8 PMD page allocations we only need 2 (plus two base pages
> allocated for head and tail areas for the first PMD). 2M pages still
> require using base pages, though.

I would front load the above savings to the top of this discussion and
say something like:

"Currently, for compound PUD mappings, the implementation consumes X
GB per TB this can be optimized to Y GB per TB with the approach
detailed below."

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/linux/mm.h  |  3 +-
>  mm/sparse-vmemmap.c | 79 ++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 63 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 49d717ae40ae..9c1a676d6b95 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3038,7 +3038,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
>  pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
>  p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
> -pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
> +pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
> +                           void *block);
>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>                             struct vmem_altmap *altmap, void *block);
>  void *vmemmap_alloc_block(unsigned long size, int node);
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index f57c5eada099..291a8a32480a 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -172,13 +172,20 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
>         return p;
>  }
>
> -pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
> +pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
> +                                      void *block)
>  {
>         pmd_t *pmd = pmd_offset(pud, addr);
>         if (pmd_none(*pmd)) {
> -               void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
> -               if (!p)
> -                       return NULL;
> +               void *p = block;
> +
> +               if (!block) {
> +                       p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
> +                       if (!p)
> +                               return NULL;
> +               } else {
> +                       get_page(virt_to_page(block));
> +               }
>                 pmd_populate_kernel(&init_mm, pmd, p);
>         }
>         return pmd;
> @@ -220,15 +227,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>         return pgd;
>  }
>
> -static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> -                                             struct vmem_altmap *altmap,
> -                                             void *page, void **ptr)
> +static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
> +                                                 struct vmem_altmap *altmap,
> +                                                 void *page, pmd_t **ptr)
>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
>         pud_t *pud;
>         pmd_t *pmd;
> -       pte_t *pte;
>
>         pgd = vmemmap_pgd_populate(addr, node);
>         if (!pgd)
> @@ -239,9 +245,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>         pud = vmemmap_pud_populate(p4d, addr, node);
>         if (!pud)
>                 return -ENOMEM;
> -       pmd = vmemmap_pmd_populate(pud, addr, node);
> +       pmd = vmemmap_pmd_populate(pud, addr, node, page);
>         if (!pmd)
>                 return -ENOMEM;
> +       if (ptr)
> +               *ptr = pmd;
> +       return 0;
> +}
> +
> +static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> +                                             struct vmem_altmap *altmap,
> +                                             void *page, void **ptr)
> +{
> +       pmd_t *pmd;
> +       pte_t *pte;
> +
> +       if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
> +               return -ENOMEM;
> +
>         pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
>         if (!pte)
>                 return -ENOMEM;
> @@ -285,13 +306,26 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>         return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
>  }
>
> -static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
> +static int __meminit vmemmap_populate_pmd_range(unsigned long start,
> +                                               unsigned long end,
> +                                               int node, void *page)
> +{
> +       unsigned long addr = start;
> +
> +       for (; addr < end; addr += PMD_SIZE) {
> +               if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static pmd_t * __meminit vmemmap_lookup_address(unsigned long addr)
>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
>         pud_t *pud;
>         pmd_t *pmd;
> -       pte_t *pte;
>
>         pgd = pgd_offset_k(addr);
>         if (pgd_none(*pgd))
> @@ -309,11 +343,7 @@ static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
>         if (pmd_none(*pmd))
>                 return NULL;
>
> -       pte = pte_offset_kernel(pmd, addr);
> -       if (pte_none(*pte))
> -               return NULL;
> -
> -       return pte;
> +       return pmd;
>  }
>
>  static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
> @@ -335,9 +365,22 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>         offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>         if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
>             pgmap_align(pgmap) > SUBSECTION_SIZE) {
> -               pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
> +               pmd_t *pmdp;
> +               pte_t *ptep;
> +
> +               addr = start - PAGE_SIZE;
> +               pmdp = vmemmap_lookup_address(addr);
> +               if (!pmdp)
> +                       return -ENOMEM;
> +
> +               /* Reuse the tail pages vmemmap pmd page */

This comment really wants to be "See layout diagram in
Documentation/vm/compound_pagemaps.rst", because the reuse algorithm
is not immediately obvious.

Other than that the implementation looks ok to me, modulo previous
comments about @block type and the use of the "geometry" term.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-03-25 23:09 ` [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for " Joao Martins
@ 2021-06-01 23:35   ` Dan Williams
  2021-06-07 13:48     ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Dan Williams @ 2021-06-01 23:35 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

Just like patch 8, let's rename this "optimize memmap_init_zone_device()
for compound page maps"


On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> When a pgmap @align is set,

Rewrite this to talk about geometry rather than align per previous
patch comments.

> all pages are mapped at a given huge page
> alignment and thus uses compound pages to describe them as opposed to a

s/thus uses/use/

> struct per 4K.
>
> With @align > PAGE_SIZE and when struct pages are stored in ram
> (!altmap) most tail pages are reused. Consequently, the amount of unique
> struct pages is a lot smaller that the total amount of struct pages
> being mapped.
>
> When struct pages are initialize in memmap_init_zone_device, make
> sure that only unique struct pages are initialized i.e. the first 2
> 4K pages per @align which means 128 struct pages, instead of 32768 for
> 2M @align or 262144 for a 1G @align.

This is the important detail to put at the beginning and then follow
with the detailed description, i.e. something like:

"Currently memmap_init_zone_device() ends up initializing 32768 pages
when it only needs to initialize 128 given tail page reuse. That
number is worse with 1GB compound page geometries, 262144 instead of
128. Update memmap_init_zone_device() to skip redundant
initialization, details below:"

> Keep old behaviour with altmap given that it doesn't support reusal
> of tail vmemmap areas.

I think I like how you described this earlier as the (!altmap) path. So:

"The !altmap path is left alone since it does not support compound
page map geometries."

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/page_alloc.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3a77f9e43f3a..948dfad6754b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6277,6 +6277,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>         }
>  }
>
> +#define MEMMAP_NR_PAGES        (2 * (PAGE_SIZE/sizeof(struct page)))

This seems too generic a name for something that is specific to the
compound page init case... more below.

> +
>  void __ref memmap_init_zone_device(struct zone *zone,
>                                    unsigned long start_pfn,
>                                    unsigned long nr_pages,
> @@ -6287,6 +6289,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
>         struct vmem_altmap *altmap = pgmap_altmap(pgmap);
>         unsigned int pfn_align = pgmap_pfn_align(pgmap);
>         unsigned int order_align = order_base_2(pfn_align);
> +       unsigned long ntails = min_t(unsigned long, pfn_align, MEMMAP_NR_PAGES);

I'd rather delay this to a specific "if (compound)" branch below
because compound init is so different from the base page case.

>         unsigned long zone_idx = zone_idx(zone);
>         unsigned long start = jiffies;
>         int nid = pgdat->node_id;
> @@ -6302,6 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
>         if (altmap) {
>                 start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
>                 nr_pages = end_pfn - start_pfn;
> +               ntails = pfn_align;

Why would the altmap case need to init ntails, won't the inner loop of
compound page setup be skipped?

>         }

Given all of the above I'm wondering if there should be a new
"compound specific" flavor of this routine rather than trying to
cleverly intermingle the old path with the new. This is easier
because you have already done the work in previous patches to break
this into helpers. So just have memmap_init_zone_device() do it the
"old" way and memmap_init_compound() handle all the tail page init +
optimizations.
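
Roughly something like the below, as a sketch only -- the prep_compound_head(),
prep_compound_tail() and __init_zone_device_page() signatures are assumptions here, and
the "only init the first unique struct pages" optimization is left out for brevity:

static void __ref memmap_init_compound(struct page *head,
				       unsigned long head_pfn,
				       unsigned long zone_idx, int nid,
				       struct dev_pagemap *pgmap,
				       unsigned long nr_pages)
{
	unsigned long pfn, end_pfn = head_pfn + nr_pages;

	/* The head page is assumed to be initialized by the caller's loop. */
	__SetPageHead(head);
	for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_page(pfn);

		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
		prep_compound_tail(head, pfn - head_pfn);

		/* Tail pages start with a zero refcount, as in prep_compound_page(). */
		set_page_count(page, 0);
	}

	prep_compound_head(head, order_base_2(nr_pages));
}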


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 10/11] device-dax: compound pagemap support
  2021-03-25 23:09 ` [PATCH v1 10/11] device-dax: compound pagemap support Joao Martins
@ 2021-06-02  0:36   ` Dan Williams
  2021-06-07 13:59     ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Dan Williams @ 2021-06-02  0:36 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> dax devices are created with a fixed @align (huge page size) which
> is enforced through as well at mmap() of the device. Faults,
> consequently happen too at the specified @align specified at the
> creation, and those don't change through out dax device lifetime.
> MCEs poisons a whole dax huge page, as well as splits occurring at
> at the configured page size.

This paragraph last...

>
> Use the newly added compound pagemap facility which maps the
> assigned dax ranges as compound pages at a page size of @align.
> Currently, this means, that region/namespace bootstrap would take
> considerably less, given that you would initialize considerably less
> pages.

This paragraph should go first...

>
> On setups with 128G NVDIMMs the initialization with DRAM stored struct pages
> improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less than
> a 1msec with 1G pages.

This paragraph second...


The reason for this ordering is to have increasingly more detail as
the changelog is read so that people that don't care about the details
can get the main theme immediately, and others that wonder why
device-dax is able to support this can read deeper.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/dax/device.c | 58 ++++++++++++++++++++++++++++++++++----------
>  1 file changed, 45 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index db92573c94e8..e3dcc4ad1727 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -192,6 +192,43 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
>  }
>  #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>
> +static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
> +                            unsigned long fault_size,
> +                            struct address_space *f_mapping)
> +{
> +       unsigned long i;
> +       pgoff_t pgoff;
> +
> +       pgoff = linear_page_index(vmf->vma, vmf->address
> +                       & ~(fault_size - 1));

I know you are just copying this style from whomever wrote it this way
originally, but that person (me) was wrong; this should be:

pgoff = linear_page_index(vmf->vma, ALIGN(vmf->address, fault_size));

...you might do a lead-in cleanup patch before this one.


> +
> +       for (i = 0; i < fault_size / PAGE_SIZE; i++) {
> +               struct page *page;
> +
> +               page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
> +               if (page->mapping)
> +                       continue;
> +               page->mapping = f_mapping;
> +               page->index = pgoff + i;
> +       }
> +}
> +
> +static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
> +                                unsigned long fault_size,
> +                                struct address_space *f_mapping)
> +{
> +       struct page *head;
> +
> +       head = pfn_to_page(pfn_t_to_pfn(pfn));
> +       head = compound_head(head);
> +       if (head->mapping)
> +               return;
> +
> +       head->mapping = f_mapping;
> +       head->index = linear_page_index(vmf->vma, vmf->address
> +                       & ~(fault_size - 1));
> +}
> +
>  static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>                 enum page_entry_size pe_size)
>  {
> @@ -225,8 +262,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>         }
>
>         if (rc == VM_FAULT_NOPAGE) {
> -               unsigned long i;
> -               pgoff_t pgoff;
> +               struct dev_pagemap *pgmap = pfn_t_to_page(pfn)->pgmap;

The device should already know its pagemap...

There is a distinction in dev_dax_probe() for "static" vs "dynamic"
pgmap, but once the pgmap is allocated it should be fine to assign it
back to dev_dax->pgmap in the "dynamic" case. That could be a lead-in
patch to make dev_dax->pgmap always valid.
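
A minimal sketch of that lead-in, assuming dev_dax_probe() keeps its current shape where
@pgmap is devm-allocated when dev_dax->pgmap starts out NULL:

	/* After the "dynamic" allocation, record it so faults can rely on it. */
	if (!dev_dax->pgmap)
		dev_dax->pgmap = pgmap;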

>
>                 /*
>                  * In the device-dax case the only possibility for a
> @@ -234,17 +270,10 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>                  * mapped. No need to consider the zero page, or racing
>                  * conflicting mappings.
>                  */
> -               pgoff = linear_page_index(vmf->vma, vmf->address
> -                               & ~(fault_size - 1));
> -               for (i = 0; i < fault_size / PAGE_SIZE; i++) {
> -                       struct page *page;
> -
> -                       page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
> -                       if (page->mapping)
> -                               continue;
> -                       page->mapping = filp->f_mapping;
> -                       page->index = pgoff + i;
> -               }
> +               if (pgmap->align > PAGE_SIZE)
> +                       set_compound_mapping(vmf, pfn, fault_size, filp->f_mapping);
> +               else
> +                       set_page_mapping(vmf, pfn, fault_size, filp->f_mapping);
>         }
>         dax_read_unlock(id);
>
> @@ -426,6 +455,9 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>         }
>
>         pgmap->type = MEMORY_DEVICE_GENERIC;
> +       if (dev_dax->align > PAGE_SIZE)
> +               pgmap->align = dev_dax->align;

Just needs updates for whatever renames you do for the "compound
geometry" terminology rather than subtle side effects of "align".

Other than that, looks good to me.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages
  2021-03-25 23:09 ` [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages Joao Martins
@ 2021-06-02  1:05   ` Dan Williams
  2021-06-07 15:21     ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Dan Williams @ 2021-06-02  1:05 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, Matthew Wilcox, Jason Gunthorpe, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, nvdimm

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Much like hugetlbfs or THPs, treat device pagemaps with
> compound pages like the rest of GUP handling of compound pages.
>

How about:

"Use try_grab_compound_head() for device-dax GUP when configured with
a compound pagemap."

> Rather than incrementing the refcount every 4K, we record
> all sub pages and increment by @refs amount *once*.

"Rather than incrementing the refcount for each page, do one atomic
addition for all the pages to be pinned."

>
> Performance measured by gup_benchmark improves considerably
> get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:
>
>  $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
> (get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
> (pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
> [altmap]
> (get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
> (pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms

Hmm what is altmap representing here? The altmap case does not support
compound geometry, so this last test is comparing pinning this amount
of memory without compound pages where the memmap is in PMEM to the
speed *with* compound pages and the memmap in DRAM?

>
>  $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
> (get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
> (pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
> [altmap with -m 127004]
> (get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
> (pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/gup.c | 52 ++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 32 insertions(+), 20 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index b3e647c8b7ee..514f12157a0f 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2159,31 +2159,54 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  }
>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>
> +
> +static int record_subpages(struct page *page, unsigned long addr,
> +                          unsigned long end, struct page **pages)
> +{
> +       int nr;
> +
> +       for (nr = 0; addr != end; addr += PAGE_SIZE)
> +               pages[nr++] = page++;
> +
> +       return nr;
> +}
> +
>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>                              unsigned long end, unsigned int flags,
>                              struct page **pages, int *nr)
>  {
> -       int nr_start = *nr;
> +       int refs, nr_start = *nr;
>         struct dev_pagemap *pgmap = NULL;
>
>         do {
> -               struct page *page = pfn_to_page(pfn);
> +               struct page *head, *page = pfn_to_page(pfn);
> +               unsigned long next;
>
>                 pgmap = get_dev_pagemap(pfn, pgmap);
>                 if (unlikely(!pgmap)) {
>                         undo_dev_pagemap(nr, nr_start, flags, pages);
>                         return 0;
>                 }
> -               SetPageReferenced(page);
> -               pages[*nr] = page;
> -               if (unlikely(!try_grab_page(page, flags))) {
> -                       undo_dev_pagemap(nr, nr_start, flags, pages);
> +
> +               head = compound_head(page);
> +               next = PageCompound(head) ? end : addr + PAGE_SIZE;

This looks a tad messy, and makes assumptions that upper layers are
not sending this routine multiple huge pages to map. next should be
set to the next compound page, not end.
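
i.e. something like the following sketch, assuming @addr and @page advance together as
they do in this loop:

		head = compound_head(page);
		/* Bound the batch by the end of this compound page, not @end */
		if (PageCompound(head))
			next = min(end, addr + ((compound_nr(head) -
						(page - head)) << PAGE_SHIFT));
		else
			next = addr + PAGE_SIZE;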

> +               refs = record_subpages(page, addr, next, pages + *nr);
> +
> +               SetPageReferenced(head);
> +               head = try_grab_compound_head(head, refs, flags);
> +               if (!head) {
> +                       if (PageCompound(head)) {

@head is NULL here. I think you wanted to rename the result of
try_grab_compound_head() to something like pinned_head so that you
don't undo the work you did above. However I feel like there's one too
many PageCompound() checks.


> +                               ClearPageReferenced(head);
> +                               put_dev_pagemap(pgmap);
> +                       } else {
> +                               undo_dev_pagemap(nr, nr_start, flags, pages);
> +                       }
>                         return 0;
>                 }
> -               (*nr)++;
> -               pfn++;
> -       } while (addr += PAGE_SIZE, addr != end);
> +               *nr += refs;
> +               pfn += refs;
> +       } while (addr += (refs << PAGE_SHIFT), addr != end);
>
>         if (pgmap)
>                 put_dev_pagemap(pgmap);
> @@ -2243,17 +2266,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
>  }
>  #endif
>
> -static int record_subpages(struct page *page, unsigned long addr,
> -                          unsigned long end, struct page **pages)
> -{
> -       int nr;
> -
> -       for (nr = 0; addr != end; addr += PAGE_SIZE)
> -               pages[nr++] = page++;
> -
> -       return nr;
> -}
> -
>  #ifdef CONFIG_ARCH_HAS_HUGEPD
>  static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
>                                       unsigned long sz)
> --
> 2.17.1
>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
  2021-06-01 19:30   ` Dan Williams
@ 2021-06-07 12:02     ` Joao Martins
  2021-06-07 19:47       ` Dan Williams
  0 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-06-07 12:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton



On 6/1/21 8:30 PM, Dan Williams wrote:
> Sorry for the delay, and the sync up on IRC. I think this subject
> needs to change to "optimize memory savings for compound pud memmap
> geometry", and move this to the back of the series to make it clear
> which patches are base functionality and which extend the idea
> further. 

OK

> As far as I can see this patch can move to the end of the
> series. 

Maybe it's preferred that this patch be deferred out of the series as
a followup improvement, leaving this series with the base feature set only?

> Some additional changelog feedback below:
> 
> 
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Right now basepages are used to populate the PUD tail pages, and it
>> picks the address of the previous page of the subsection that preceeds
> 
> s/preceeds/precedes/
> 
Yeap.

>> the memmap we are initializing.  This is done when a given memmap
>> address isn't aligned to the pgmap @align (which is safe to do because
>> @ranges are guaranteed to be aligned to @align).
> 
> You know what would help is if you could draw an ascii art picture of
> the before and after of the head page vs tail page arrangement. I can
> see how the words line up to the code, but it takes a while to get the
> picture in my head and I think future work in this area will benefit
> from having a place in Documentation that draws a picture of the
> layout of the various geometries.
> 
Makes sense, I will add docs. Mike K. and others had similar trouble following
the page structs arrangement, which ultimately led to this section of commentary
at the beginning of the new source file added here ...

https://www.ozlabs.org/~akpm/mmotm/broken-out/mm-hugetlb-free-the-vmemmap-pages-associated-with-each-hugetlb-page.patch

> I've used asciiflow.com for these types of diagrams in the past.
>

... so perhaps I can borrow some of that and place it in
a common place like Documentation/vm/compound_pagemaps.rst

This patch specifically would need a new diagram added on top covering
the PMD page case.

>>
>> For pagemaps with an align which spans various sections, this means
>> that PMD pages are unnecessarily allocated for reusing the same tail
>> pages.  Effectively, on x86 a PUD can span 8 sections (depending on
>> config), and a page is being  allocated a page for the PMD to reuse
>> the tail vmemmap across the rest of the PTEs. In short effecitvely the
>> PMD cover the tail vmemmap areas all contain the same PFN. So instead
>> of doing this way, populate a new PMD on the second section of the
>> compound page (tail vmemmap PMD), and then the following sections
>> utilize the preceding PMD we previously populated which only contain
>> tail pages).
>>
>> After this scheme for an 1GB pagemap aligned area, the first PMD
>> (section) would contain head page and 32767 tail pages, where the
>> second PMD contains the full 32768 tail pages.  The latter page gets
>> its PMD reused across future section mapping of the same pagemap.
>>
>> Besides fewer pagetable entries allocated, keeping parity with
>> hugepages in the directmap (as done by vmemmap_populate_hugepages()),
>> this further increases savings per compound page. For each PUD-aligned
>> pagemap we go from 40960 bytes down to 16384 bytes. Rather than
>> requiring 8 PMD page allocations we only need 2 (plus two base pages
>> allocated for head and tail areas for the first PMD). 2M pages still
>> require using base pages, though.
> 
> I would front load the above savings to the top of this discussion and
> say something like:
> 
> "Currently, for compound PUD mappings, the implementation consumes X
> GB per TB this can be optimized to Y GB per TB with the approach
> detailed below."
> 
Got it.

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  include/linux/mm.h  |  3 +-
>>  mm/sparse-vmemmap.c | 79 ++++++++++++++++++++++++++++++++++-----------
>>  2 files changed, 63 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 49d717ae40ae..9c1a676d6b95 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3038,7 +3038,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
>>  pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
>>  p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>> -pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>> +pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
>> +                           void *block);
>>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>>                             struct vmem_altmap *altmap, void *block);
>>  void *vmemmap_alloc_block(unsigned long size, int node);
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index f57c5eada099..291a8a32480a 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -172,13 +172,20 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
>>         return p;
>>  }
>>
>> -pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
>> +pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
>> +                                      void *block)
>>  {
>>         pmd_t *pmd = pmd_offset(pud, addr);
>>         if (pmd_none(*pmd)) {
>> -               void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>> -               if (!p)
>> -                       return NULL;
>> +               void *p = block;
>> +
>> +               if (!block) {
>> +                       p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>> +                       if (!p)
>> +                               return NULL;
>> +               } else {
>> +                       get_page(virt_to_page(block));
>> +               }
>>                 pmd_populate_kernel(&init_mm, pmd, p);
>>         }
>>         return pmd;
>> @@ -220,15 +227,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>>         return pgd;
>>  }
>>
>> -static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> -                                             struct vmem_altmap *altmap,
>> -                                             void *page, void **ptr)
>> +static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
>> +                                                 struct vmem_altmap *altmap,
>> +                                                 void *page, pmd_t **ptr)
>>  {
>>         pgd_t *pgd;
>>         p4d_t *p4d;
>>         pud_t *pud;
>>         pmd_t *pmd;
>> -       pte_t *pte;
>>
>>         pgd = vmemmap_pgd_populate(addr, node);
>>         if (!pgd)
>> @@ -239,9 +245,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>>         pud = vmemmap_pud_populate(p4d, addr, node);
>>         if (!pud)
>>                 return -ENOMEM;
>> -       pmd = vmemmap_pmd_populate(pud, addr, node);
>> +       pmd = vmemmap_pmd_populate(pud, addr, node, page);
>>         if (!pmd)
>>                 return -ENOMEM;
>> +       if (ptr)
>> +               *ptr = pmd;
>> +       return 0;
>> +}
>> +
>> +static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> +                                             struct vmem_altmap *altmap,
>> +                                             void *page, void **ptr)
>> +{
>> +       pmd_t *pmd;
>> +       pte_t *pte;
>> +
>> +       if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
>> +               return -ENOMEM;
>> +
>>         pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
>>         if (!pte)
>>                 return -ENOMEM;
>> @@ -285,13 +306,26 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>>         return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
>>  }
>>
>> -static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
>> +static int __meminit vmemmap_populate_pmd_range(unsigned long start,
>> +                                               unsigned long end,
>> +                                               int node, void *page)
>> +{
>> +       unsigned long addr = start;
>> +
>> +       for (; addr < end; addr += PMD_SIZE) {
>> +               if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
>> +                       return -ENOMEM;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static pmd_t * __meminit vmemmap_lookup_address(unsigned long addr)
>>  {
>>         pgd_t *pgd;
>>         p4d_t *p4d;
>>         pud_t *pud;
>>         pmd_t *pmd;
>> -       pte_t *pte;
>>
>>         pgd = pgd_offset_k(addr);
>>         if (pgd_none(*pgd))
>> @@ -309,11 +343,7 @@ static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
>>         if (pmd_none(*pmd))
>>                 return NULL;
>>
>> -       pte = pte_offset_kernel(pmd, addr);
>> -       if (pte_none(*pte))
>> -               return NULL;
>> -
>> -       return pte;
>> +       return pmd;
>>  }
>>
>>  static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>> @@ -335,9 +365,22 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>>         offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>>         if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
>>             pgmap_align(pgmap) > SUBSECTION_SIZE) {
>> -               pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
>> +               pmd_t *pmdp;
>> +               pte_t *ptep;
>> +
>> +               addr = start - PAGE_SIZE;
>> +               pmdp = vmemmap_lookup_address(addr);
>> +               if (!pmdp)
>> +                       return -ENOMEM;
>> +
>> +               /* Reuse the tail pages vmemmap pmd page */
> 
> This comment really wants to be "See layout diagram in
> Documentation/vm/compound_pagemaps.rst", because the reuse algorithm
> is not immediately obvious.
>
ACK

> Other than that the implementation looks ok to me, modulo previous
> comments about @block type and the use of the "geometry" term.
> 
OK.

Btw, speaking of geometry, could you have a look at this thread:

https://lore.kernel.org/linux-mm/8c922a58-c901-1ad9-5d19-1182bd6dea1e@oracle.com/

.. and let me know what you think?


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-06-01 23:35   ` Dan Williams
@ 2021-06-07 13:48     ` Joao Martins
  2021-06-07 19:32       ` Dan Williams
  0 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-06-07 13:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Linux NVDIMM, Ira Weiny, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton



On 6/2/21 12:35 AM, Dan Williams wrote:
> Just like patch8 lets rename this "optimize memmap_init_zone_device()
> for compound page maps"
> 
> 
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> When a pgmap @align is set,
> 
> Rewrite this to talk about geometry rather than align per previous
> patch comments.
> 
OK.

>> all pages are mapped at a given huge page
>> alignment and thus uses compound pages to describe them as opposed to a
> 
> s/thus uses/use/
>
/me nods

>> struct per 4K.
>>
>> With @align > PAGE_SIZE and when struct pages are stored in ram
>> (!altmap) most tail pages are reused. Consequently, the amount of unique
>> struct pages is a lot smaller than the total amount of struct pages
>> being mapped.
>>
>> When struct pages are initialized in memmap_init_zone_device, make
>> sure that only unique struct pages are initialized i.e. the first 2
>> 4K pages per @align which means 128 struct pages, instead of 32768 for
>> 2M @align or 262144 for a 1G @align.
> 
> This is the important detail to put at the beginning and then follow
> with the detailed description, i.e. something like:
> 
> "Currently memmap_init_zone_device() ends up initializing 32768 pages
> when it only needs to initialize 128 given tail page reuse. That
> number is worse with 1GB compound page geometries, 262144 instead of
> 128. Update memmap_init_zone_device() to skip redundant
> initialization, details below:"
> 
/me nods

>> Keep old behaviour with altmap given that it doesn't support reusal
>> of tail vmemmap areas.
> 
> I think I like how you described this earlier as the (!altmap) path. So:
> 
> "The !altmap path is left alone since it does not support compound
> page map geometries."
> 
It's more accurate to say that it "does not support memory savings with
compound page map geometries."

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/page_alloc.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3a77f9e43f3a..948dfad6754b 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6277,6 +6277,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>>         }
>>  }
>>
>> +#define MEMMAP_NR_PAGES        (2 * (PAGE_SIZE/sizeof(struct page)))
> 
> This seems too generic a name for something that is specific to the
> compound page init case... more below.
> 
>> +
>>  void __ref memmap_init_zone_device(struct zone *zone,
>>                                    unsigned long start_pfn,
>>                                    unsigned long nr_pages,
>> @@ -6287,6 +6289,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>         struct vmem_altmap *altmap = pgmap_altmap(pgmap);
>>         unsigned int pfn_align = pgmap_pfn_align(pgmap);
>>         unsigned int order_align = order_base_2(pfn_align);
>> +       unsigned long ntails = min_t(unsigned long, pfn_align, MEMMAP_NR_PAGES);
> 
> I'd rather delay this to a specific "if (compound)" branch below
> because compound init is so different from the base page case.
> 
Perhaps this is not obvious, but both the altmap *and* the non-altmap case use compound pages.

The difference in the ntails default value refers to the number of *unique* struct pages (...)

>>         unsigned long zone_idx = zone_idx(zone);
>>         unsigned long start = jiffies;
>>         int nid = pgdat->node_id;
>> @@ -6302,6 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>         if (altmap) {
>>                 start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
>>                 nr_pages = end_pfn - start_pfn;
>> +               ntails = pfn_align;
> 
> Why would the altmap case need to init ntails, won't the inner loop of
> compound page setup be skipped?
> 
(...) Which means the answer here is no. The altmap case does initialize tail pages, but
contrary to the non-altmap case, all tail struct pages are unique (no savings) and thus all
of them need to be initialized.
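
As a minimal sketch of that distinction (illustrative only, not the patch itself;
unique_struct_pages() is a made-up helper name, MEMMAP_NR_PAGES is the macro from the
hunk above):

	/*
	 * Number of struct pages that need to be initialized per @pfn_align
	 * chunk.  Without an altmap only the first two vmemmap pages hold
	 * unique struct pages, i.e. 2 * (PAGE_SIZE / sizeof(struct page)),
	 * which is 128 with 64-byte struct pages.  With an altmap there is
	 * no vmemmap reuse, so all pfn_align struct pages are unique.
	 */
	static unsigned long unique_struct_pages(struct vmem_altmap *altmap,
						 unsigned long pfn_align)
	{
		if (altmap)
			return pfn_align;

		return min_t(unsigned long, pfn_align, MEMMAP_NR_PAGES);
	}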

>>         }
> 
> Given all of the above I'm wondering if there should be a new
> "compound specific" flavor of this routine rather than trying to
> cleverly inter mingle the old path with the new. This is easier
> because you have already done the work in previous patches to break
> this into helpers. So just have memmap_init_zone_device() do it the
> "old" way and memmap_init_compound() handle all the tail page init +
> optimizations.
> 
I can separate it out, should be easier to follow.

Albeit just a note, I think memmap_init_compound() should be the new normal as metadata
more accurately represents what goes on the page tables. That's regardless of
vmemmap-based gains, and hence why my train of thought was to not separate it.

After this series, all devmap pages where @geometry matches @align will have compound
pages be used instead. And we enable that in device-dax as first user (in the next patch).
altmap or not so far just differentiates on the uniqueness of struct pages as the former
doesn't reuse base pages that only contain tail pages and consequently makes us initialize
all tail struct pages.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 10/11] device-dax: compound pagemap support
  2021-06-02  0:36   ` Dan Williams
@ 2021-06-07 13:59     ` Joao Martins
  0 siblings, 0 replies; 64+ messages in thread
From: Joao Martins @ 2021-06-07 13:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, Matthew Wilcox, Jason Gunthorpe, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Linux NVDIMM



On 6/2/21 1:36 AM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> dax devices are created with a fixed @align (huge page size) which
>> is also enforced at mmap() of the device. Faults consequently happen
>> at the @align specified at creation time, and that does not change
>> throughout the dax device's lifetime. An MCE poisons a whole dax huge
>> page, and splits also occur at the configured page size.
> 
> This paragraph last...
> 
/me nods

>>
>> Use the newly added compound pagemap facility which maps the
>> assigned dax ranges as compound pages at a page size of @align.
>> Currently, this means that region/namespace bootstrap would take
>> considerably less time, given that you would initialize considerably
>> fewer struct pages.
> 
> This paragraph should go first...
> 
/me nods

>>
>> On setups with 128G NVDIMMs the initialization with DRAM stored struct pages
>> improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less than
>> a 1msec with 1G pages.
> 
> This paragraph second...
> 
/me nods

> 
> The reason for this ordering is to have increasingly more detail as
> the changelog is read so that people that don't care about the details
> can get the main theme immediately, and others that wonder why
> device-dax is able to support this can read deeper.
> 
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  drivers/dax/device.c | 58 ++++++++++++++++++++++++++++++++++----------
>>  1 file changed, 45 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>> index db92573c94e8..e3dcc4ad1727 100644
>> --- a/drivers/dax/device.c
>> +++ b/drivers/dax/device.c
>> @@ -192,6 +192,43 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
>>  }
>>  #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>>
>> +static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
>> +                            unsigned long fault_size,
>> +                            struct address_space *f_mapping)
>> +{
>> +       unsigned long i;
>> +       pgoff_t pgoff;
>> +
>> +       pgoff = linear_page_index(vmf->vma, vmf->address
>> +                       & ~(fault_size - 1));
> 
> I know you are just copying this style from whomever wrote it this way
> originally, but that person (me) was wrong this should be:
> 
> pgoff = linear_page_index(vmf->vma, ALIGN(vmf->address, fault_size));
> 
> ...you might do a lead-in cleanup patch before this one.
> 
Yeap, will do.
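
Presumably with ALIGN_DOWN() rather than ALIGN() though, since the intent of the
existing mask is to round the faulting address down to the start of the huge page,
i.e. something like (sketch only):

	pgoff = linear_page_index(vmf->vma,
			ALIGN_DOWN(vmf->address, fault_size));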

> 
>> +
>> +       for (i = 0; i < fault_size / PAGE_SIZE; i++) {
>> +               struct page *page;
>> +
>> +               page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
>> +               if (page->mapping)
>> +                       continue;
>> +               page->mapping = f_mapping;
>> +               page->index = pgoff + i;
>> +       }
>> +}
>> +
>> +static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
>> +                                unsigned long fault_size,
>> +                                struct address_space *f_mapping)
>> +{
>> +       struct page *head;
>> +
>> +       head = pfn_to_page(pfn_t_to_pfn(pfn));
>> +       head = compound_head(head);
>> +       if (head->mapping)
>> +               return;
>> +
>> +       head->mapping = f_mapping;
>> +       head->index = linear_page_index(vmf->vma, vmf->address
>> +                       & ~(fault_size - 1));
>> +}
>> +
>>  static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>>                 enum page_entry_size pe_size)
>>  {
>> @@ -225,8 +262,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>>         }
>>
>>         if (rc == VM_FAULT_NOPAGE) {
>> -               unsigned long i;
>> -               pgoff_t pgoff;
>> +               struct dev_pagemap *pgmap = pfn_t_to_page(pfn)->pgmap;
> 
> The device should already know its pagemap...
> 
> There is a distinction in dev_dax_probe() for "static" vs "dynamic"
> pgmap, but once the pgmap is allocated it should be fine to assign it
> back to dev_dax->pgmap in the "dynamic" case. That could be a lead-in
> patch to make dev_dax->pgmap always valid.
> 
I suppose you mean to always set dev_dax->pgmap at the end of the
'if (!pgmap)' in dev_dax_probe() after we allocate the pgmap.

I will make this a separate cleanup patch as you suggested.
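
Roughly something like this at the end of that branch (sketch only; the allocation
and range setup details in dev_dax_probe() are elided):

	if (!pgmap) {
		pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
		if (!pgmap)
			return -ENOMEM;

		/* ... fill in pgmap->range / pgmap->nr_range as today ... */

		/* keep dev_dax->pgmap always valid, also for the dynamic case */
		dev_dax->pgmap = pgmap;
	}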

>>
>>                 /*
>>                  * In the device-dax case the only possibility for a
>> @@ -234,17 +270,10 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>>                  * mapped. No need to consider the zero page, or racing
>>                  * conflicting mappings.
>>                  */
>> -               pgoff = linear_page_index(vmf->vma, vmf->address
>> -                               & ~(fault_size - 1));
>> -               for (i = 0; i < fault_size / PAGE_SIZE; i++) {
>> -                       struct page *page;
>> -
>> -                       page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
>> -                       if (page->mapping)
>> -                               continue;
>> -                       page->mapping = filp->f_mapping;
>> -                       page->index = pgoff + i;
>> -               }
>> +               if (pgmap->align > PAGE_SIZE)
>> +                       set_compound_mapping(vmf, pfn, fault_size, filp->f_mapping);
>> +               else
>> +                       set_page_mapping(vmf, pfn, fault_size, filp->f_mapping);
>>         }
>>         dax_read_unlock(id);
>>
>> @@ -426,6 +455,9 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>>         }
>>
>>         pgmap->type = MEMORY_DEVICE_GENERIC;
>> +       if (dev_dax->align > PAGE_SIZE)
>> +               pgmap->align = dev_dax->align;
> 
> Just needs updates for whatever renames you do for the "compound
> geometry" terminology rather than subtle side effects of "align".
> 
> Other than that, looks good to me.
> 
OK, will do.

Thanks!


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages
  2021-06-02  1:05   ` Dan Williams
@ 2021-06-07 15:21     ` Joao Martins
  2021-06-07 19:22       ` Dan Williams
  0 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-06-07 15:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, Matthew Wilcox, Jason Gunthorpe, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, nvdimm

On 6/2/21 2:05 AM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Much like hugetlbfs or THPs, treat device pagemaps with
>> compound pages like the rest of GUP handling of compound pages.
>>
> 
> How about:
> 
> "Use try_grab_compound_head() for device-dax GUP when configured with
> a compound pagemap."
> 
Yeap, a bit clearer indeed.

>> Rather than incrementing the refcount every 4K, we record
>> all sub pages and increment by @refs amount *once*.
> 
> "Rather than incrementing the refcount for each page, do one atomic
> addition for all the pages to be pinned."
> 
ACK.

>>
>> Performance measured by gup_benchmark improves considerably
>> get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:
>>
>>  $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
>> (get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
>> (pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
>> [altmap]
>> (get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
>> (pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms
> 
> Hmm what is altmap representing here? The altmap case does not support
> compound geometry, 

It does support compound geometry, and so we use compound pages in the altmap case too.
What altmap doesn't support is the memory savings in the vmemmap that can be
achieved when using compound pages. That's what is represented here.

> so this last test is comparing pinning this amount
> of memory without compound pages where the memmap is in PMEM to the
> speed *with* compound pages and the memmap in DRAM?
> 
The test compares pinning this amount of memory with compound pages placed
in PMEM and in DRAM. It just exposes how inefficient this can get if huge pages aren't
represented with compound pages.

>>
>>  $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
>> (get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
>> (pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
>> [altmap with -m 127004]
>> (get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
>> (pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/gup.c | 52 ++++++++++++++++++++++++++++++++--------------------
>>  1 file changed, 32 insertions(+), 20 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index b3e647c8b7ee..514f12157a0f 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2159,31 +2159,54 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>  }
>>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>>
>> +
>> +static int record_subpages(struct page *page, unsigned long addr,
>> +                          unsigned long end, struct page **pages)
>> +{
>> +       int nr;
>> +
>> +       for (nr = 0; addr != end; addr += PAGE_SIZE)
>> +               pages[nr++] = page++;
>> +
>> +       return nr;
>> +}
>> +
>>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>                              unsigned long end, unsigned int flags,
>>                              struct page **pages, int *nr)
>>  {
>> -       int nr_start = *nr;
>> +       int refs, nr_start = *nr;
>>         struct dev_pagemap *pgmap = NULL;
>>
>>         do {
>> -               struct page *page = pfn_to_page(pfn);
>> +               struct page *head, *page = pfn_to_page(pfn);
>> +               unsigned long next;
>>
>>                 pgmap = get_dev_pagemap(pfn, pgmap);
>>                 if (unlikely(!pgmap)) {
>>                         undo_dev_pagemap(nr, nr_start, flags, pages);
>>                         return 0;
>>                 }
>> -               SetPageReferenced(page);
>> -               pages[*nr] = page;
>> -               if (unlikely(!try_grab_page(page, flags))) {
>> -                       undo_dev_pagemap(nr, nr_start, flags, pages);
>> +
>> +               head = compound_head(page);
>> +               next = PageCompound(head) ? end : addr + PAGE_SIZE;
> 
> This looks a tad messy, and makes assumptions that upper layers are
> not sending this routine multiple huge pages to map. next should be
> set to the next compound page, not end.

Although for devmap (and the same could be said for hugetlbfs), __gup_device_huge() (as
called by __gup_device_huge_{pud,pmd}) would only ever be called on a compound page that
represents the same level, as opposed to many compound pages, i.e. @end already represents
the start of the next compound page at the PMD or PUD level.

But of course, should we represent devmap pages in geometries other than PMD or PUD size
(i.e. hpagesize/align values other than those), then it's true that relying on @end being
the next compound page is fragile. But so is the rest of the surrounding code.
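
For reference, the more defensive variant would look something like this (sketch
only; it bounds @next by the current compound page via compound_nr() instead of
trusting @end):

	head = compound_head(page);
	if (PageCompound(head))
		next = min(end, addr + ((compound_nr(head) - (page - head))
					<< PAGE_SHIFT));
	else
		next = addr + PAGE_SIZE;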

> 
>> +               refs = record_subpages(page, addr, next, pages + *nr);
>> +
>> +               SetPageReferenced(head);
>> +               head = try_grab_compound_head(head, refs, flags);
>> +               if (!head) {
>> +                       if (PageCompound(head)) {
> 
> @head is NULL here, I think you wanted to rename the result of
> try_grab_compound_head() to something like pinned_head so that you
> don't undo the work you did above. 

Yes. pinned_head is what I actually should have written. Let me fix that.

> However I feel like there's one too
> PageCompund() checks.
> 

I agree, but I am not fully sure how I can remove them :(

The previous approach was to separate the logic into two distinct helpers namely
__gup_device_huge() and __gup_device_compound_huge(). But that sort of special casing
wasn't a good idea, so I tried merging both cases in __gup_device_huge() solely
differentiating on PageCompound().

I could make this slightly less bad by moving the error case PageCompound checks to
undo_dev_pagemap() and record_subpages().

But we still have the pagemap refcount to be taken until your other series removes the
need for it. So perhaps I should place the remaining PageCompound based check inside
record_subpages() to accommodate the PAGE_SIZE geometry case (as similarly hinted by Jason in
the previous version but that I didn't fully address).
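
To make that concrete, a rough sketch of the record_subpages() change (illustrative
only, not tested):

	/* Record either the single base page or all subpages of the range. */
	static int record_subpages(struct page *page, unsigned long addr,
				   unsigned long end, struct page **pages)
	{
		int nr;

		/*
		 * PAGE_SIZE geometry: not a compound page, so there is only
		 * one struct page to record for this iteration.
		 */
		if (!PageCompound(page)) {
			pages[0] = page;
			return 1;
		}

		for (nr = 0; addr != end; addr += PAGE_SIZE)
			pages[nr++] = page++;

		return nr;
	}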

How does the above sound?

Long term, once we stop having devmap use non-compound struct pages on PMDs/PUDs and the
pgmap refcount on gup is removed, then perhaps we can move to the existing regular huge page
path that is not devmap specific.

> 
>> +                               ClearPageReferenced(head);
>> +                               put_dev_pagemap(pgmap);
>> +                       } else {
>> +                               undo_dev_pagemap(nr, nr_start, flags, pages);
>> +                       }
>>                         return 0;
>>                 }
>> -               (*nr)++;
>> -               pfn++;
>> -       } while (addr += PAGE_SIZE, addr != end);
>> +               *nr += refs;
>> +               pfn += refs;
>> +       } while (addr += (refs << PAGE_SHIFT), addr != end);
>>
>>         if (pgmap)
>>                 put_dev_pagemap(pgmap);
>> @@ -2243,17 +2266,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
>>  }
>>  #endif
>>
>> -static int record_subpages(struct page *page, unsigned long addr,
>> -                          unsigned long end, struct page **pages)
>> -{
>> -       int nr;
>> -
>> -       for (nr = 0; addr != end; addr += PAGE_SIZE)
>> -               pages[nr++] = page++;
>> -
>> -       return nr;
>> -}
>> -
>>  #ifdef CONFIG_ARCH_HAS_HUGEPD
>>  static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
>>                                       unsigned long sz)
>> --
>> 2.17.1
>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages
  2021-06-07 15:21     ` Joao Martins
@ 2021-06-07 19:22       ` Dan Williams
  0 siblings, 0 replies; 64+ messages in thread
From: Dan Williams @ 2021-06-07 19:22 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, Matthew Wilcox, Jason Gunthorpe, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, nvdimm

On Mon, Jun 7, 2021 at 8:22 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 6/2/21 2:05 AM, Dan Williams wrote:
> > On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> Much like hugetlbfs or THPs, treat device pagemaps with
> >> compound pages like the rest of GUP handling of compound pages.
> >>
> >
> > How about:
> >
> > "Use try_grab_compound_head() for device-dax GUP when configured with
> > a compound pagemap."
> >
> Yeap, a bit clearer indeed.
>
> >> Rather than incrementing the refcount every 4K, we record
> >> all sub pages and increment by @refs amount *once*.
> >
> > "Rather than incrementing the refcount for each page, do one atomic
> > addition for all the pages to be pinned."
> >
> ACK.
>
> >>
> >> Performance measured by gup_benchmark improves considerably
> >> get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:
> >>
> >>  $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
> >> (get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
> >> (pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
> >> [altmap]
> >> (get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
> >> (pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms
> >
> > Hmm what is altmap representing here? The altmap case does not support
> > compound geometry,
>
> It does support compound geometry, and so we use compound pages in the altmap case too.
> What altmap doesn't support is the memory savings in the vmemmap that can be
> achieved when using compound pages. That's what is represented here.

Ah, I missed that detail; it might be good to mention this in the
Documentation/vm/ overview doc for this capability.

>
> > so this last test is comparing pinning this amount
> > of memory without compound pages where the memmap is in PMEM to the
> > speed *with* compound pages and the memmap in DRAM?
> >
> The test compares pinning this amount of memory with compound pages placed
> in PMEM and in DRAM. It just exposes how inefficient this can get if huge pages aren't
> represented with compound pages.

Got it.

>
> >>
> >>  $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
> >> (get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
> >> (pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
> >> [altmap with -m 127004]
> >> (get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
> >> (pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms
> >>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >>  mm/gup.c | 52 ++++++++++++++++++++++++++++++++--------------------
> >>  1 file changed, 32 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index b3e647c8b7ee..514f12157a0f 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> >> @@ -2159,31 +2159,54 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >>  }
> >>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
> >>
> >> +
> >> +static int record_subpages(struct page *page, unsigned long addr,
> >> +                          unsigned long end, struct page **pages)
> >> +{
> >> +       int nr;
> >> +
> >> +       for (nr = 0; addr != end; addr += PAGE_SIZE)
> >> +               pages[nr++] = page++;
> >> +
> >> +       return nr;
> >> +}
> >> +
> >>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> >>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> >>                              unsigned long end, unsigned int flags,
> >>                              struct page **pages, int *nr)
> >>  {
> >> -       int nr_start = *nr;
> >> +       int refs, nr_start = *nr;
> >>         struct dev_pagemap *pgmap = NULL;
> >>
> >>         do {
> >> -               struct page *page = pfn_to_page(pfn);
> >> +               struct page *head, *page = pfn_to_page(pfn);
> >> +               unsigned long next;
> >>
> >>                 pgmap = get_dev_pagemap(pfn, pgmap);
> >>                 if (unlikely(!pgmap)) {
> >>                         undo_dev_pagemap(nr, nr_start, flags, pages);
> >>                         return 0;
> >>                 }
> >> -               SetPageReferenced(page);
> >> -               pages[*nr] = page;
> >> -               if (unlikely(!try_grab_page(page, flags))) {
> >> -                       undo_dev_pagemap(nr, nr_start, flags, pages);
> >> +
> >> +               head = compound_head(page);
> >> +               next = PageCompound(head) ? end : addr + PAGE_SIZE;
> >
> > This looks a tad messy, and makes assumptions that upper layers are
> > not sending this routine multiple huge pages to map. next should be
> > set to the next compound page, not end.
>
> Although for devmap (and the same could be said for hugetlbfs), __gup_device_huge() (as
> called by __gup_device_huge_{pud,pmd}) would only ever be called on a compound page that
> represents the same level, as opposed to many compound pages, i.e. @end already represents
> the start of the next compound page at the PMD or PUD level.
>
> But of course, should we represent devmap pages in geometries other than PMD or PUD size
> (i.e. hpagesize/align values other than those), then it's true that relying on @end being
> the next compound page is fragile. But so is the rest of the surrounding code.

Ok, for now maybe a:

/* @end is assumed to be limited to at most 1 compound page */

...would remind whoever refactors this later about the assumption.

>
> >
> >> +               refs = record_subpages(page, addr, next, pages + *nr);
> >> +
> >> +               SetPageReferenced(head);
> >> +               head = try_grab_compound_head(head, refs, flags);
> >> +               if (!head) {
> >> +                       if (PageCompound(head)) {
> >
> > @head is NULL here, I think you wanted to rename the result of
> > try_grab_compound_head() to something like pinned_head so that you
> > don't undo the work you did above.
>
> Yes. pinned_head is what I actually should have written. Let me fix that.
>
> > However I feel like there's one too
> > PageCompund() checks.
> >
>
> I agree, but I am not fully sure how I can remove them :(

If you fix the bug above that's sufficient for me; I may be wishing
for something prettier that is not possible in practice...

>
> The previous approach was to separate the logic into two distinct helpers namely
> __gup_device_huge() and __gup_device_compound_huge(). But that sort of special casing
> wasn't a good idea, so I tried merging both cases in __gup_device_huge() solely
> differentiating on PageCompound().
>
> I could make this slightly less bad by moving the error case PageCompound checks to
> undo_dev_pagemap() and record_subpages().
>
> But we still have the pagemap refcount to be taken until your other series removes the
> need for it. So perhaps I should place the remaining PageCompound based check inside
> record_subpages() to accommodate the PAGE_SIZE geometry case (as similarly hinted by Jason in
> the previous version but that I didn't fully address).
>
> How does the above sound?

Sounds worth a try, but not a hard requirement for this to move
forward from my perspective.

>
> Long term, once we stop having devmap use non-compound struct pages on PMDs/PUDs and the
> pgmap refcount on gup is removed, then perhaps we can move to the existing regular huge page
> path that is not devmap specific.
>

Ok.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-06-07 13:48     ` Joao Martins
@ 2021-06-07 19:32       ` Dan Williams
  2021-06-14 18:41         ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Dan Williams @ 2021-06-07 19:32 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Linux NVDIMM, Ira Weiny, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On Mon, Jun 7, 2021 at 6:49 AM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
> > Given all of the above I'm wondering if there should be a new
> > "compound specific" flavor of this routine rather than trying to
> > cleverly inter mingle the old path with the new. This is easier
> > because you have already done the work in previous patches to break
> > this into helpers. So just have memmap_init_zone_device() do it the
> > "old" way and memmap_init_compound() handle all the tail page init +
> > optimizations.
> >
> I can separate it out, should be easier to follow.
>
> Albeit just a note, I think memmap_init_compound() should be the new normal as metadata
> more accurately represents what goes on the page tables. That's regardless of
> vmemmap-based gains, and hence why my train of thought was to not separate it.
>
> After this series, all devmap pages where @geometry matches @align will have compound
> pages be used instead. And we enable that in device-dax as first user (in the next patch).
> altmap or not so far just differentiates on the uniqueness of struct pages as the former
> doesn't reuse base pages that only contain tail pages and consequently makes us initialize
> all tail struct pages.

I think combining the cases into a common central routine makes the
code that much harder to maintain. A small duplication cost is worth
it in my opinion to help readability / maintainability.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
  2021-06-07 12:02     ` Joao Martins
@ 2021-06-07 19:47       ` Dan Williams
  0 siblings, 0 replies; 64+ messages in thread
From: Dan Williams @ 2021-06-07 19:47 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On Mon, Jun 7, 2021 at 5:03 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
>
>
> On 6/1/21 8:30 PM, Dan Williams wrote:
> > Sorry for the delay, and the sync up on IRC. I think this subject
> > needs to change to "optimize memory savings for compound pud memmap
> > geometry", and move this to the back of the series to make it clear
> > which patches are base functionality and which extend the idea
> > further.
>
> OK
>
> > As far as I can see this patch can move to the end of the
> > series.
>
> Maybe it's preferred that this patch be deferred out of the series as
> a followup improvement, leaving this series with the base feature set only?

My preference is to keep it in, just clarify that it's an optimization
and make sure it does not come before any fundamental patches that
implement the base support.

> > Some additional changelog feedback below:
> >
> >
> > On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> Right now basepages are used to populate the PUD tail pages, and it
> >> picks the address of the previous page of the subsection that preceeds
> >
> > s/preceeds/precedes/
> >
> Yeap.
>
> >> the memmap we are initializing.  This is done when a given memmap
> >> address isn't aligned to the pgmap @align (which is safe to do because
> >> @ranges are guaranteed to be aligned to @align).
> >
> > You know what would help is if you could draw an ascii art picture of
> > the before and after of the head page vs tail page arrangement. I can
> > see how the words line up to the code, but it takes a while to get the
> > picture in my head and I think future work in this area will benefit
> > from having a place in Documentation that draws a picture of the
> > layout of the various geometries.
> >
> Makes sense, I will add docs. Mike K. and others had similar trouble following
> the page structs arrangement, which ultimately led to this section of commentary
> at the beginning of the new source file added here ...
>
> https://www.ozlabs.org/~akpm/mmotm/broken-out/mm-hugetlb-free-the-vmemmap-pages-associated-with-each-hugetlb-page.patch

Ah, looks good, but that really belongs in Documentation/ not a comment block.

>
> > I've used asciiflow.com for these types of diagrams in the past.
> >
>
> ... so perhaps I can borrow some of that and place it in
> a common place such as Documentation/vm/compound_pagemaps.rst
>
> This patch specifically would need a new diagram added on top covering
> the PMD page case.

Sounds good.

[..]
> > Other than that the implementation looks ok to me, modulo previous
> > comments about @block type and the use of the "geometry" term.
> >
> OK.
>
> Btw, speaking of geometry, could you have a look at this thread:
>
> https://lore.kernel.org/linux-mm/8c922a58-c901-1ad9-5d19-1182bd6dea1e@oracle.com/
>
> .. and let me know what you think?

Ok, will take a look


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-18 17:27           ` Joao Martins
  2021-05-18 19:56             ` Jane Chu
@ 2021-06-07 20:17             ` Dan Williams
  2021-06-07 20:47               ` Joao Martins
  1 sibling, 1 reply; 64+ messages in thread
From: Dan Williams @ 2021-06-07 20:17 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On Tue, May 18, 2021 at 10:28 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 5/5/21 11:36 PM, Joao Martins wrote:
> > On 5/5/21 11:20 PM, Dan Williams wrote:
> >> On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>> On 5/5/21 7:44 PM, Dan Williams wrote:
> >>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> >>>>> index b46f63dcaed3..bb28d82dda5e 100644
> >>>>> --- a/include/linux/memremap.h
> >>>>> +++ b/include/linux/memremap.h
> >>>>> @@ -114,6 +114,7 @@ struct dev_pagemap {
> >>>>>         struct completion done;
> >>>>>         enum memory_type type;
> >>>>>         unsigned int flags;
> >>>>> +       unsigned long align;
> >>>>
> >>>> I think this wants some kernel-doc above to indicate that non-zero
> >>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
> >>>> means "use non-compound base pages".
>
> [...]
>
> >>>> The non-zero value must be
> >>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
> >>>> Hmm, maybe it should be an
> >>>> enum:
> >>>>
> >>>> enum devmap_geometry {
> >>>>     DEVMAP_PTE,
> >>>>     DEVMAP_PMD,
> >>>>     DEVMAP_PUD,
> >>>> }
> >>>>
> >>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
> >>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
> >>
> >> I think it is ok for dax/nvdimm to continue to maintain their align
> >> value because it should be ok to have 4MB align if the device really
> >> wanted. However, when it goes to map that alignment with
> >> memremap_pages() it can pick a mode. For example, it's already the
> >> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
> >> they're already separate concepts that can stay separate.
> >>
> > Gotcha.
>
> I am reconsidering part of the above. In general, yes, the meaning of devmap @align
> represents a slightly different variation of the device @align, i.e. how the metadata is
> laid out, **but** regardless of what kind of page table entries we use for the vmemmap.
>
> By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax already
> validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE and PUD_SIZE)
> 2) the geometry of metadata is very much tied to the value we pick to @align at namespace
> provisioning -- not the "align" we might use at mmap() perhaps that's what you referred
> above? -- and 3) the value of geometry actually derives from dax device @align because we
> will need to create compound pages representing a page size of @align value.
>
> Using your example above: you're saying that dax->align == 1G is mapped with DEVMAP_PTEs,
> in reality the vmemmap is populated with PMDs/PUDs page tables (depending on what archs
> decide to do at vmemmap_populate()) and uses base pages as its metadata regardless of what
> device @align. In reality what we want to convey in @geometry is not page table sizes, but
> just the page size used for the vmemmap of the dax device.

Good point, the names "PTE, PMD, PUD" imply the hardware mapping size,
not the software compound page size.

> Additionally, limiting its
> value might not be desirable... if tomorrow Linux for some arch supports dax/nvdimm
> devices with 4M align or 64K align, the value of @geometry will have to reflect the 4M to
> create compound pages of order 10 for the said vmemmap.
>
> I am going to wait until you finish reviewing the remaining four patches of this series,
> but maybe this is a simple misnomer (s/align/geometry/) with a comment but without
> DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and enum a
> setter/getter to audit its value? Thoughts?

I do see what you mean about the confusion DEVMAP_{PTE,PMD,PUD}
introduces, but I still think the device-dax align and the
organization of the 'struct page' metadata are distinct concepts. So
I'm happy with any color of the bikeshed as long as the 2 concepts are
distinct. How about calling it  "compound_page_order"? Open to other
ideas...


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-06-07 20:17             ` Dan Williams
@ 2021-06-07 20:47               ` Joao Martins
  2021-06-07 21:00                 ` Joao Martins
  0 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-06-07 20:47 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton



On 6/7/21 9:17 PM, Dan Williams wrote:
> On Tue, May 18, 2021 at 10:28 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> On 5/5/21 11:36 PM, Joao Martins wrote:
>>> On 5/5/21 11:20 PM, Dan Williams wrote:
>>>> On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>> On 5/5/21 7:44 PM, Dan Williams wrote:
>>>>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>> index b46f63dcaed3..bb28d82dda5e 100644
>>>>>>> --- a/include/linux/memremap.h
>>>>>>> +++ b/include/linux/memremap.h
>>>>>>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>>>>>>         struct completion done;
>>>>>>>         enum memory_type type;
>>>>>>>         unsigned int flags;
>>>>>>> +       unsigned long align;
>>>>>>
>>>>>> I think this wants some kernel-doc above to indicate that non-zero
>>>>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
>>>>>> means "use non-compound base pages".
>>
>> [...]
>>
>>>>>> The non-zero value must be
>>>>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
>>>>>> Hmm, maybe it should be an
>>>>>> enum:
>>>>>>
>>>>>> enum devmap_geometry {
>>>>>>     DEVMAP_PTE,
>>>>>>     DEVMAP_PMD,
>>>>>>     DEVMAP_PUD,
>>>>>> }
>>>>>>
>>>>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>>>>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
>>>>
>>>> I think it is ok for dax/nvdimm to continue to maintain their align
>>>> value because it should be ok to have 4MB align if the device really
>>>> wanted. However, when it goes to map that alignment with
>>>> memremap_pages() it can pick a mode. For example, it's already the
>>>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>>>> they're already separate concepts that can stay separate.
>>>>
>>> Gotcha.
>>
>> I am reconsidering part of the above. In general, yes, the meaning of devmap @align
>> represents a slightly different variation of the device @align, i.e. how the metadata is
>> laid out, **but** regardless of what kind of page table entries we use for the vmemmap.
>>
>> By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax already
>> validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE and PUD_SIZE)
>> 2) the geometry of metadata is very much tied to the value we pick to @align at namespace
>> provisioning -- not the "align" we might use at mmap() perhaps that's what you referred
>> above? -- and 3) the value of geometry actually derives from dax device @align because we
>> will need to create compound pages representing a page size of @align value.
>>
>> Using your example above: you're saying that dax->align == 1G is mapped with DEVMAP_PTEs,
>> in reality the vmemmap is populated with PMDs/PUDs page tables (depending on what archs
>> decide to do at vmemmap_populate()) and uses base pages as its metadata regardless of what
>> device @align. In reality what we want to convey in @geometry is not page table sizes, but
>> just the page size used for the vmemmap of the dax device.
> 
> Good point, the names "PTE, PMD, PUD" imply the hardware mapping size,
> not the software compound page size.
> 
>> Additionally, limiting its
>> value might not be desirable... if tomorrow Linux for some arch supports dax/nvdimm
>> devices with 4M align or 64K align, the value of @geometry will have to reflect the 4M to
>> create compound pages of order 10 for the said vmemmap.
>>
>> I am going to wait until you finish reviewing the remaining four patches of this series,
>> but maybe this is a simple misnomer (s/align/geometry/) with a comment but without
>> DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and enum a
>> setter/getter to audit its value? Thoughts?
> 
> I do see what you mean about the confusion DEVMAP_{PTE,PMD,PUD}
> introduces, but I still think the device-dax align and the
> organization of the 'struct page' metadata are distinct concepts. So
> I'm happy with any color of the bikeshed as long as the 2 concepts are
> distinct. How about calling it  "compound_page_order"? Open to other
> ideas...
> 
I actually like the name of @geometry. The only thing better would be @vmemmap_geometry,
solely because it makes it clear that it's the vmemmap that we are talking about -- but that
might be unnecessarily verbose. And I still agree that it is a separate concept that should be
named differently *at least*.

But naming aside, I was trying to get at was to avoid a second geometry value validation
i.e. to be validated the value and set with a value such as DEVMAP_PTE, DEVMAP_PMD and
DEVMAP_PUD. That to me sounds a little redundant, when the geometry value depends on what
align is going to be used from. Here my mention of @align refers to what's used to create
the dax device, not the mmap() align [which can be lower than the device one]. The dax
device align is the one used to decide whether to use PTEs, PMDs or PUDs at dax fault handler.

So separate concepts, but still its value dependent on one another. At least unless we
want to allow geometry values different than those set by --align as Jane suggested.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-06-07 20:47               ` Joao Martins
@ 2021-06-07 21:00                 ` Joao Martins
  2021-06-07 21:57                   ` Dan Williams
  0 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-06-07 21:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, Matthew Wilcox, Jason Gunthorpe, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Linux NVDIMM



On 6/7/21 9:47 PM, Joao Martins wrote:
> On 6/7/21 9:17 PM, Dan Williams wrote:
>> On Tue, May 18, 2021 at 10:28 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>
>>> On 5/5/21 11:36 PM, Joao Martins wrote:
>>>> On 5/5/21 11:20 PM, Dan Williams wrote:
>>>>> On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>> On 5/5/21 7:44 PM, Dan Williams wrote:
>>>>>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>> index b46f63dcaed3..bb28d82dda5e 100644
>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>>>>>>>         struct completion done;
>>>>>>>>         enum memory_type type;
>>>>>>>>         unsigned int flags;
>>>>>>>> +       unsigned long align;
>>>>>>>
>>>>>>> I think this wants some kernel-doc above to indicate that non-zero
>>>>>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
>>>>>>> means "use non-compound base pages".
>>>
>>> [...]
>>>
>>>>>>> The non-zero value must be
>>>>>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
>>>>>>> Hmm, maybe it should be an
>>>>>>> enum:
>>>>>>>
>>>>>>> enum devmap_geometry {
>>>>>>>     DEVMAP_PTE,
>>>>>>>     DEVMAP_PMD,
>>>>>>>     DEVMAP_PUD,
>>>>>>> }
>>>>>>>
>>>>>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>>>>>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
>>>>>
>>>>> I think it is ok for dax/nvdimm to continue to maintain their align
>>>>> value because it should be ok to have 4MB align if the device really
>>>>> wanted. However, when it goes to map that alignment with
>>>>> memremap_pages() it can pick a mode. For example, it's already the
>>>>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>>>>> they're already separate concepts that can stay separate.
>>>>>
>>>> Gotcha.
>>>
>>> I am reconsidering part of the above. In general, yes, the meaning of devmap @align
>>> represents a slightly different variation of the device @align, i.e. how the metadata is
>>> laid out, **but** regardless of what kind of page table entries we use for the vmemmap.
>>>
>>> By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax already
>>> validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE and PUD_SIZE)
>>> 2) the geometry of metadata is very much tied to the value we pick to @align at namespace
>>> provisioning -- not the "align" we might use at mmap() perhaps that's what you referred
>>> above? -- and 3) the value of geometry actually derives from dax device @align because we
>>> will need to create compound pages representing a page size of @align value.
>>>
>>> Using your example above: you're saying that dax->align == 1G is mapped with DEVMAP_PTEs,
>>> in reality the vmemmap is populated with PMDs/PUDs page tables (depending on what archs
>>> decide to do at vmemmap_populate()) and uses base pages as its metadata regardless of what
>>> device @align. In reality what we want to convey in @geometry is not page table sizes, but
>>> just the page size used for the vmemmap of the dax device.
>>
>> Good point, the names "PTE, PMD, PUD" imply the hardware mapping size,
>> not the software compound page size.
>>
>>> Additionally, limiting its
>>> value might not be desirable... if tomorrow Linux for some arch supports dax/nvdimm
>>> devices with 4M align or 64K align, the value of @geometry will have to reflect the 4M to
>>> create compound pages of order 10 for the said vmemmap.
>>>
>>> I am going to wait until you finish reviewing the remaining four patches of this series,
>>> but maybe this is a simple misnomer (s/align/geometry/) with a comment but without
>>> DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and enum a
>>> setter/getter to audit its value? Thoughts?
>>
>> I do see what you mean about the confusion DEVMAP_{PTE,PMD,PUD}
>> introduces, but I still think the device-dax align and the
>> organization of the 'struct page' metadata are distinct concepts. So
>> I'm happy with any color of the bikeshed as long as the 2 concepts are
>> distinct. How about calling it  "compound_page_order"? Open to other
>> ideas...
>>
> I actually like the name of @geometry. The only thing better would be @vmemmap_geometry,
> solely because it makes it clear that it's the vmemmap that we are talking about -- but that
> might be unnecessarily verbose. And I still agree that it is a separate concept that should be
> named differently *at least*.
> 
> But naming aside, I was trying to get at was to avoid a second geometry value validation
> i.e. to be validated the value and set with a value such as DEVMAP_PTE, DEVMAP_PMD and
> DEVMAP_PUD. 

Sorry my english keeps getting broken, I meant this instead:

But naming aside, what I am trying to get at is to remove the second geometry value
validation i.e. for @geometry to not be validated a second time to be set to DEVMAP_PTE,
DEVMAP_PMD or DEVMAP_PUD.

> That to me sounds a little redundant, when the geometry value depends on what
> align is going to be used from. Here my mention of @align refers to what's used to create
> the dax device, not the mmap() align [which can be lower than the device one]. The dax
> device align is the one used to decide whether to use PTEs, PMDs or PUDs at dax fault handler.
> 
> So separate concepts, but still its value dependent on one another. At least unless we
> want to allow geometry values different than those set by --align as Jane suggested.
> 

And I should add:

I can maintain the DEVMAP_* enum values, but then these will need to be changed in tandem
anytime a new @align value is supported. Or instead we use the name @geometry, albeit
still as an unsigned long type. Or, rather than an unsigned long, perhaps make it another
type with its value obtained/changed via a getter/setter.
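
E.g. something along these lines (sketch only; the struct field, its kernel-doc
wording and the helper names are illustrative, not final):

	struct dev_pagemap {
		...
		/*
		 * @geometry: size in bytes of the compound pages used for the
		 * device memmap (allowing tail page reuse where possible);
		 * 0 or PAGE_SIZE means plain non-compound base pages.
		 */
		unsigned long geometry;
		...
	};

	static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
	{
		/* Treat an unset value as the base page case. */
		return pgmap->geometry ?: PAGE_SIZE;
	}

	static inline void pgmap_set_geometry(struct dev_pagemap *pgmap,
					      unsigned long size)
	{
		pgmap->geometry = size;
	}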


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-06-07 21:00                 ` Joao Martins
@ 2021-06-07 21:57                   ` Dan Williams
  0 siblings, 0 replies; 64+ messages in thread
From: Dan Williams @ 2021-06-07 21:57 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, Matthew Wilcox, Jason Gunthorpe, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Linux NVDIMM

On Mon, Jun 7, 2021 at 2:02 PM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
> > But naming aside, I was trying to get at was to avoid a second geometry value validation
> > i.e. to be validated the value and set with a value such as DEVMAP_PTE, DEVMAP_PMD and
> > DEVMAP_PUD.
>
> Sorry my english keeps getting broken, I meant this instead:
>
> But naming aside, what I am trying to get at is to remove the second geometry value
> validation i.e. for @geometry to not be validated a second time to be set to DEVMAP_PTE,
> DEVMAP_PMD or DEVMAP_PUD.
>
> > That to me sounds a little redundant, when the geometry value depends on what
> > align is going to be used from. Here my metnion of @align refers to what's used to create
> > the dax device, not the mmap() align [which can be lower than the device one]. The dax
> > device align is the one used to decide whether to use PTEs, PMDs or PUDs at dax fault handler.
> >
> > So separate concepts, but still its value dependent on one another. At least unless we
> > want to allow geometry values different than those set by --align as Jane suggested.
> >
>
> And I should add:
>
> I can maintain the DEVMAP_* enum values, but then these will need to be changed in tandem
> anytime a new @align value is supported. Or instead we use the name @geometry, albeit
> still as an unsigned long type. Or, rather than an unsigned long, perhaps make it another
> type with its value obtained/changed via a getter/setter.

No, feel free to drop the enum and use an explicit "geometry" value
for the vmemmap.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-06-07 19:32       ` Dan Williams
@ 2021-06-14 18:41         ` Joao Martins
  2021-06-14 23:07           ` Dan Williams
  0 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-06-14 18:41 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Linux NVDIMM, Ira Weiny, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton



On 6/7/21 8:32 PM, Dan Williams wrote:
> On Mon, Jun 7, 2021 at 6:49 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> [..]
>>> Given all of the above I'm wondering if there should be a new
>>> "compound specific" flavor of this routine rather than trying to
>>> cleverly inter mingle the old path with the new. This is easier
>>> because you have already done the work in previous patches to break
>>> this into helpers. So just have memmap_init_zone_device() do it the
>>> "old" way and memmap_init_compound() handle all the tail page init +
>>> optimizations.
>>>
>> I can separate it out, should be easier to follow.
>>
>> Albeit just a note, I think memmap_init_compound() should be the new normal as metadata
>> more accurately represents what goes on the page tables. That's regardless of
>> vmemmap-based gains, and hence why my train of thought was to not separate it.
>>
>> After this series, all devmap pages where @geometry matches @align will have compound
>> pages be used instead. And we enable that in device-dax as first user (in the next patch).
>> altmap or not so far just differentiates on the uniqueness of struct pages as the former
>> doesn't reuse base pages that only contain tail pages and consequently makes us initialize
>> all tail struct pages.
> 
> I think combining the cases into a common central routine makes the
> code that much harder to maintain. A small duplication cost is worth
> it in my opinion to help readability / maintainability.
> 
I am addressing this comment and taking a step back. By just moving the tail page init to
memmap_init_compound() this gets a lot more readable. Albeit now I think that having separate
top-level loops over the pfns doesn't bring much improvement there.

Here's what I have by moving just the tail init to a separate routine. See your original
suggestion after the scissors mark. I have a slight inclination towards the first one, but
no really strong preference. Thoughts?

[...]

static void __ref memmap_init_compound(struct page *page, unsigned long pfn,
                                        unsigned long zone_idx, int nid,
                                        struct dev_pagemap *pgmap,
                                        unsigned long nr_pages)
{
        unsigned int order_align = order_base_2(nr_pages);
        unsigned long i;

        __SetPageHead(page);

        for (i = 1; i < nr_pages; i++) {
                __init_zone_device_page(page + i, pfn + i, zone_idx,
                                        nid, pgmap);
                prep_compound_tail(page, i);

                /*
                 * The first and second tail pages need to be
                 * initialized first, hence the head page is
                 * prepared last.
                 */
                if (i == 2)
                        prep_compound_head(page, order_align);
        }
}

void __ref memmap_init_zone_device(struct zone *zone,
                                   unsigned long start_pfn,
                                   unsigned long nr_pages,
                                   struct dev_pagemap *pgmap)
{
        unsigned long pfn, end_pfn = start_pfn + nr_pages;
        struct pglist_data *pgdat = zone->zone_pgdat;
        struct vmem_altmap *altmap = pgmap_altmap(pgmap);
        unsigned long pfns_per_compound = pgmap_pfn_geometry(pgmap);
        unsigned long zone_idx = zone_idx(zone);
        unsigned long start = jiffies;
        int nid = pgdat->node_id;

        if (WARN_ON_ONCE(!pgmap || zone_idx(zone) != ZONE_DEVICE))
                return;

        /*
         * The call to memmap_init_zone should have already taken care
         * of the pages reserved for the memmap, so we can just jump to
         * the end of that region and start processing the device pages.
         */
        if (altmap) {
                start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
                nr_pages = end_pfn - start_pfn;
        }

        for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
                struct page *page = pfn_to_page(pfn);

                __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);

                if (pfns_per_compound == 1)
                        continue;

                memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
                                     pfns_per_compound);
        }

        pr_info("%s initialised %lu pages in %ums\n", __func__,
                nr_pages, jiffies_to_msecs(jiffies - start));
}


[...]

--->8-----
Whereas your original suggestion would look more like this:

[...]

static void __ref memmap_init_compound(unsigned long zone_idx, int nid,
                                        struct dev_pagemap *pgmap,
                                        unsigned long start_pfn,
                                        unsigned long end_pfn)
{
        unsigned long pfns_per_compound = pgmap_pfn_geometry(pgmap);
        unsigned int order_align = order_base_2(pfns_per_compound);
        unsigned long pfn, i;

        for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
                struct page *page = pfn_to_page(pfn);

                __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);

                __SetPageHead(page);

                for (i = 1; i < pfns_per_compound; i++) {
                        __init_zone_device_page(page + i, pfn + i, zone_idx,
                                                nid, pgmap);
                        prep_compound_tail(page, i);

                        /*
                         * The first and second tail pages need to be
                         * initialized first (prep_compound_head() stores
                         * metadata in them), hence the head page is
                         * prepared last.
                         */
                        if (i == 2)
                                prep_compound_head(page, order_align);
                }
        }
}

static void __ref memmap_init_base(unsigned long zone_idx, int nid,
                                         struct dev_pagemap *pgmap,
                                         unsigned long start_pfn,
                                         unsigned long end_pfn)
{
        unsigned long pfn;

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                struct page *page = pfn_to_page(pfn);

                __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
        }
}

void __ref memmap_init_zone_device(struct zone *zone,
                                   unsigned long start_pfn,
                                   unsigned long nr_pages,
                                   struct dev_pagemap *pgmap)
{
        unsigned long pfn, end_pfn = start_pfn + nr_pages;
        struct pglist_data *pgdat = zone->zone_pgdat;
        struct vmem_altmap *altmap = pgmap_altmap(pgmap);
        bool compound = pgmap_geometry(pgmap) > PAGE_SIZE;
        unsigned long zone_idx = zone_idx(zone);
        unsigned long start = jiffies;
        int nid = pgdat->node_id;

        if (WARN_ON_ONCE(!pgmap || zone_idx(zone) != ZONE_DEVICE))
                return;

        /*
         * The call to memmap_init_zone should have already taken care
         * of the pages reserved for the memmap, so we can just jump to
         * the end of that region and start processing the device pages.
         */
        if (altmap) {
                start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
                nr_pages = end_pfn - start_pfn;
        }

        if (compound)
                memmap_init_compound(zone_idx, nid, pgmap, start_pfn, end_pfn);
        else
                memmap_init_base(zone_idx, nid, pgmap, start_pfn, end_pfn);

        pr_info("%s initialised %lu pages in %ums\n", __func__,
                nr_pages, jiffies_to_msecs(jiffies - start));
}
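
For reference on the `i == 2` check that both variants rely on: prep_compound_head() writes
into the first and second tail struct pages, so those two tails have to go through
__init_zone_device_page() before the head metadata is written. Roughly, going by how
prep_compound_page() is open coded today (the exact bodies and field placement below are my
reading of mainline, not text quoted from the patch):

static void prep_compound_head(struct page *page, unsigned int order)
{
        set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);       /* page[1].compound_dtor */
        set_compound_order(page, order);                        /* page[1].compound_order */
        atomic_set(compound_mapcount_ptr(page), -1);            /* page[1].compound_mapcount */
        if (hpage_pincount_available(page))
                atomic_set(compound_pincount_ptr(page), 0);     /* page[2].hpage_pinned_refcount */
}

static void prep_compound_tail(struct page *head, int tail_idx)
{
        struct page *p = head + tail_idx;

        p->mapping = TAIL_MAPPING;
        set_compound_head(p, head);
}

For scale, with 4K base pages a 2M geometry gives pfns_per_compound = 512 and order_align = 9,
while a 1G geometry gives 262144 and order_align = 18.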



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-06-14 18:41         ` Joao Martins
@ 2021-06-14 23:07           ` Dan Williams
  0 siblings, 0 replies; 64+ messages in thread
From: Dan Williams @ 2021-06-14 23:07 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Linux NVDIMM, Ira Weiny, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On Mon, Jun 14, 2021 at 11:42 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
>
>
> On 6/7/21 8:32 PM, Dan Williams wrote:
> > On Mon, Jun 7, 2021 at 6:49 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> > [..]
> >>> Given all of the above I'm wondering if there should be a new
> >>> "compound specific" flavor of this routine rather than trying to
> >>> cleverly inter mingle the old path with the new. This is easier
> >>> because you have already done the work in previous patches to break
> >>> this into helpers. So just have memmap_init_zone_device() do it the
> >>> "old" way and memmap_init_compound() handle all the tail page init +
> >>> optimizations.
> >>>
> >> I can separate it out, should be easier to follow.
> >>
> >> Albeit just a note, I think memmap_init_compound() should be the new normal as metadata
> >> more accurately represents what goes on the page tables. That's regardless of
> >> vmemmap-based gains, and hence why my train of thought was to not separate it.
> >>
> >> After this series, all devmap pages where @geometry matches @align will have compound
> >> pages be used instead. And we enable that in device-dax as first user (in the next patch).
> >> altmap or not so far just differentiates on the uniqueness of struct pages as the former
> >> doesn't reuse base pages that only contain tail pages and consequently makes us initialize
> >> all tail struct pages.
> >
> > I think combining the cases into a common central routine makes the
> > code that much harder to maintain. A small duplication cost is worth
> > it in my opinion to help readability / maintainability.
> >
> I am addressing this comment and taking a step back. Just moving the tail page init to
> memmap_init_compound() makes this a lot more readable, although now I think that having
> separate top-level loops over pfns doesn't bring much improvement.
>
> Here's what I have after moving just the tail init into a separate routine. See your
> original suggestion after the scissors mark. I have a slight inclination towards the
> first one, but no strong preference. Thoughts?

I like your "first one" too.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
  2021-05-13 18:45         ` Joao Martins
@ 2021-06-16 15:05           ` Joao Martins
  2021-06-16 23:35             ` Dan Williams
  0 siblings, 1 reply; 64+ messages in thread
From: Joao Martins @ 2021-06-16 15:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On 5/13/21 7:45 PM, Joao Martins wrote:
> On 5/10/21 8:19 PM, Dan Williams wrote:
>> On Thu, May 6, 2021 at 4:02 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>> [..]
>>>>> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
>>>>
>>>> I think this can be replaced with a call to follow_pte(&init_mm...).
>>>
>>> Ah, of course! That would shorten things up too.
>>
>> Now that I look closely, I notice that function disclaims being
>> suitable as a general purpose pte lookup utility. 
>> If it works for you,
>> great, but if not I think it's past time to create such a helper. I
>> know Ira is looking for one, and I contributed to the proliferation
>> when I added dev_pagemap_mapping_shift().
>>
> There's also x86 equivalents, like lookup_address() and lookup_address_in_pgd().
> These two don't differ that much from vmemmap_lookup_address().
> 
> I can move this to an internal place e.g. mm/internal.h
> 
> The patch after this one, changes vmemmap_lookup_address() to stop at the PMD (to reuse
> that across the next sections).
> 
Thinking again about this particular comment, and more about the actual need for the lookup
function: it is very specific to the 1G geometry case being spread over multiple sections.

Given that section population *needs* to succeed for the follow-up sections to continue to
be populated, perhaps we can simplify this a whole lot, e.g. the diff below compares this
patch before and after.

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 3d671e3e804d..c06796fcc77d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -554,37 +554,6 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
        return vmemmap_populate_address(addr, node, NULL, NULL, page);
 }

-static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
-{
-       pgd_t *pgd;
-       p4d_t *p4d;
-       pud_t *pud;
-       pmd_t *pmd;
-       pte_t *pte;
-
-       pgd = pgd_offset_k(addr);
-       if (pgd_none(*pgd))
-               return NULL;
-
-       p4d = p4d_offset(pgd, addr);
-       if (p4d_none(*p4d))
-               return NULL;
-
-       pud = pud_offset(p4d, addr);
-       if (pud_none(*pud))
-               return NULL;
-
-       pmd = pmd_offset(pud, addr);
-       if (pmd_none(*pmd))
-               return NULL;
-
-       pte = pte_offset_kernel(pmd, addr);
-       if (pte_none(*pte))
-               return NULL;
-
-       return pte;
-}
-
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
                                                     unsigned long start,
                                                     unsigned long end, int node,
@@ -605,8 +574,10 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
        offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
        if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
            pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
-               pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
+               pte_t *ptep;

+               addr = start - PAGE_SIZE;
+               ptep = pte_offset_kernel(pmd_off_k(addr), addr);
                if (!ptep)
                        return -ENOMEM;

The 'if (!ptep)' cannot happen in practice AFAIU (pte_offset_kernel() never returns NULL), so
we could remove that check as well.

Thoughts?


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
  2021-06-16 15:05           ` Joao Martins
@ 2021-06-16 23:35             ` Dan Williams
  0 siblings, 0 replies; 64+ messages in thread
From: Dan Williams @ 2021-06-16 23:35 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On Wed, Jun 16, 2021 at 8:06 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 5/13/21 7:45 PM, Joao Martins wrote:
> > On 5/10/21 8:19 PM, Dan Williams wrote:
> >> On Thu, May 6, 2021 at 4:02 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >> [..]
> >>>>> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
> >>>>
> >>>> I think this can be replaced with a call to follow_pte(&init_mm...).
> >>>
> >>> Ah, of course! That would shorten things up too.
> >>
> >> Now that I look closely, I notice that function disclaims being
> >> suitable as a general purpose pte lookup utility.
> >> If it works for you,
> >> great, but if not I think it's past time to create such a helper. I
> >> know Ira is looking for one, and I contributed to the proliferation
> >> when I added dev_pagemap_mapping_shift().
> >>
> > There's also x86 equivalents, like lookup_address() and lookup_address_in_pgd().
> > These two don't differ that much from vmemmap_lookup_address().
> >
> > I can move this to an internal place e.g. mm/internal.h
> >
> > The patch after this one, changes vmemmap_lookup_address() to stop at the PMD (to reuse
> > that across the next sections).
> >
> Thinking again about this particular comment, and more about the actual need for the lookup
> function: it is very specific to the 1G geometry case being spread over multiple sections.
>
> Given that section population *needs* to succeed for the follow-up sections to continue to
> be populated, perhaps we can simplify this a whole lot, e.g. the diff below compares this
> patch before and after.
>
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 3d671e3e804d..c06796fcc77d 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -554,37 +554,6 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>         return vmemmap_populate_address(addr, node, NULL, NULL, page);
>  }
>
> -static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
> -{
> -       pgd_t *pgd;
> -       p4d_t *p4d;
> -       pud_t *pud;
> -       pmd_t *pmd;
> -       pte_t *pte;
> -
> -       pgd = pgd_offset_k(addr);
> -       if (pgd_none(*pgd))
> -               return NULL;
> -
> -       p4d = p4d_offset(pgd, addr);
> -       if (p4d_none(*p4d))
> -               return NULL;
> -
> -       pud = pud_offset(p4d, addr);
> -       if (pud_none(*pud))
> -               return NULL;
> -
> -       pmd = pmd_offset(pud, addr);
> -       if (pmd_none(*pmd))
> -               return NULL;
> -
> -       pte = pte_offset_kernel(pmd, addr);
> -       if (pte_none(*pte))
> -               return NULL;
> -
> -       return pte;
> -}
> -
>  static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>                                                      unsigned long start,
>                                                      unsigned long end, int node,
> @@ -605,8 +574,10 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>         offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>         if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
>             pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
> -               pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
> +               pte_t *ptep;
>
> +               addr = start - PAGE_SIZE;
> +               ptep = pte_offset_kernel(pmd_off_k(addr), addr);

It deserves a comment about the guarantee, but looks good.

>                 if (!ptep)
>                         return -ENOMEM;
>
> The 'if (!ptep)' cannot happen in practice AFAIU (pte_offset_kernel() never returns NULL), so
> we could remove that check as well.
>
> Thoughts?
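
As a sketch of the kind of comment being asked for above (the wording is illustrative only,
not from the posted diff; the guarantee is the one argued above, that the preceding range
must already have been populated for this code to be reached):

                /*
                 * Reuse the tail-page vmemmap of the preceding compound
                 * page. Population is sequential and must have succeeded
                 * for the preceding range, so pmd_off_k() cannot find a
                 * missing PMD here and pte_offset_kernel() never returns
                 * NULL.
                 */
                addr = start - PAGE_SIZE;
                ptep = pte_offset_kernel(pmd_off_k(addr), addr);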


^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2021-06-16 23:35 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
2021-03-25 23:09 ` [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
2021-04-24  0:12   ` Dan Williams
2021-04-24 19:00     ` Joao Martins
2021-03-25 23:09 ` [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
2021-04-24  0:16   ` Dan Williams
2021-04-24 19:05     ` Joao Martins
2021-03-25 23:09 ` [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
2021-04-24  0:18   ` Dan Williams
2021-04-24 19:05     ` Joao Martins
2021-03-25 23:09 ` [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
2021-05-05 18:44   ` Dan Williams
2021-05-05 18:58     ` Matthew Wilcox
2021-05-05 19:49     ` Joao Martins
2021-05-05 22:20       ` Dan Williams
2021-05-05 22:36         ` Joao Martins
2021-05-05 23:03           ` Dan Williams
2021-05-06 10:12             ` Joao Martins
2021-05-18 17:27           ` Joao Martins
2021-05-18 19:56             ` Jane Chu
2021-05-19 11:29               ` Joao Martins
2021-05-19 18:36                 ` Jane Chu
2021-06-07 20:17             ` Dan Williams
2021-06-07 20:47               ` Joao Martins
2021-06-07 21:00                 ` Joao Martins
2021-06-07 21:57                   ` Dan Williams
2021-05-06  8:05         ` Aneesh Kumar K.V
2021-05-06 10:23           ` Joao Martins
2021-05-06 11:43             ` Matthew Wilcox
2021-05-06 12:15               ` Joao Martins
2021-03-25 23:09 ` [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
2021-05-05 22:34   ` Dan Williams
2021-05-05 22:37     ` Joao Martins
2021-05-05 23:14       ` Dan Williams
2021-05-06 10:24         ` Joao Martins
2021-03-25 23:09 ` [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages() Joao Martins
2021-05-05 22:43   ` Dan Williams
2021-05-06 10:27     ` Joao Martins
2021-05-06 18:36       ` Joao Martins
2021-03-25 23:09 ` [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
2021-05-06  1:18   ` Dan Williams
2021-05-06 11:01     ` Joao Martins
2021-05-10 19:19       ` Dan Williams
2021-05-13 18:45         ` Joao Martins
2021-06-16 15:05           ` Joao Martins
2021-06-16 23:35             ` Dan Williams
2021-03-25 23:09 ` [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD " Joao Martins
2021-06-01 19:30   ` Dan Williams
2021-06-07 12:02     ` Joao Martins
2021-06-07 19:47       ` Dan Williams
2021-03-25 23:09 ` [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for " Joao Martins
2021-06-01 23:35   ` Dan Williams
2021-06-07 13:48     ` Joao Martins
2021-06-07 19:32       ` Dan Williams
2021-06-14 18:41         ` Joao Martins
2021-06-14 23:07           ` Dan Williams
2021-03-25 23:09 ` [PATCH v1 10/11] device-dax: compound pagemap support Joao Martins
2021-06-02  0:36   ` Dan Williams
2021-06-07 13:59     ` Joao Martins
2021-03-25 23:09 ` [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages Joao Martins
2021-06-02  1:05   ` Dan Williams
2021-06-07 15:21     ` Joao Martins
2021-06-07 19:22       ` Dan Williams
2021-04-01  9:38 ` [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).