linux-nvdimm.lists.01.org archive mirror
* [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps
@ 2021-03-25 23:09 Joao Martins
  2021-03-25 23:09 ` [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
                   ` (11 more replies)
  0 siblings, 12 replies; 44+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Hey,

This series attempts to minimize 'struct page' overhead by pursuing a
similar approach as Muchun Song's series "Free some vmemmap pages of
hugetlb page"[0], but applied to devmap/ZONE_DEVICE.

[0] https://lore.kernel.org/linux-mm/20210308102807.59745-1-songmuchun@bytedance.com/

The link above describes it quite nicely, but the idea is to reuse tail
page vmemmap areas, in particular the areas which only describe tail pages.
A vmemmap page describes 64 struct pages, so the first vmemmap page for a
given ZONE_DEVICE area would contain the head page and 63 tail pages. The
second vmemmap page would contain only tail pages, and that's what gets
reused across the rest of the subsection/section. The bigger the page size,
the bigger the savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save
4094 vmemmap pages).

This series also takes one step further for 1GB pages and *also* reuses
PMD pages which only contain tail pages, which allows us to keep parity
with the current hugepage-based memmap. This further lets us more than
halve the overhead with 1GB pages (40M -> 16M per Tb).

In terms of savings, per 1Tb of memory, the struct page cost would go down
with compound pagemap:

* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
* with 1G pages we lose 16MB instead of 16G (0.0014% instead of 1.5% of total memory)

Along the way I've extended it past 'struct page' overhead, *trying* to
address a few performance issues we knew about for pmem, specifically
{pin,get}_user_pages_fast() with device-dax VMAs, which are really slow
even in the fast variants. THP is great in the -fast variants, but
everything except hugetlbfs performs rather poorly with non-fast gup. I
deferred the __get_user_pages() improvements to a follow-up series I have
stashed, as that is orthogonal to device-dax (THP suffers from the same
syndrome).

So to summarize what the series does:

Patch 1: Prepare hwpoisoning to work with dax compound pages.

Patches 2-4: Have memmap_init_zone_device() initialize its metadata as
compound pages. We split the current utility function prep_compound_page()
into head and tail parts and use those two helpers where appropriate, to
take advantage of caches being warm after __init_single_page(). Since the
RFC this also lets us further speed up init time from 190ms down to 80ms.

Patches 5-10: Much like Muchun's series, we reuse PTE (and PMD) tail page
vmemmap areas across a given page size (namely @align, as it is referred
to by the remaining memremap/dax code) and enable memremap to initialize
the ZONE_DEVICE pages as compound pages of a given @align order. The main
difference, though, is that contrary to the hugetlbfs series, there is no
remapping of the vmemmap area, because we are populating it as opposed to
remapping it. IOW there is no freeing of pages of already-initialized
vmemmap as in the hugetlbfs case, which simplifies the logic (besides not
being arch-specific). After these, there's a quite visible improvement in
pmem memmap region bootstrap, given that we initialize fewer struct pages
depending on the page size, with DRAM-backed struct pages. altmap sees no
difference in bootstrap.

    NVDIMM namespace bootstrap improves from ~268-358 ms to ~78-100 ms/<1 ms on 128G NVDIMMs
    with 2M and 1G respectively.

Patch 11: Optimize grabbing page refcounts given that we are working
with compound pages, i.e. we do 1 increment to the head page for a given
set of N subpages, as opposed to N individual writes.
{get,pin}_user_pages_fast() for ZONE_DEVICE with a compound pagemap
consequently improves considerably with DRAM-stored struct pages. It also
*greatly* improves pinning with altmap. Results with gup_test:

                                                   before     after
    (16G get_user_pages_fast 2M page size)         ~59 ms -> ~6.1 ms
    (16G pin_user_pages_fast 2M page size)         ~87 ms -> ~6.2 ms
    (16G get_user_pages_fast altmap 2M page size) ~494 ms -> ~9 ms
    (16G pin_user_pages_fast altmap 2M page size) ~494 ms -> ~10 ms

    altmap performance gets especially interesting when pinning a pmem dimm:

                                                   before     after
    (128G get_user_pages_fast 2M page size)         ~492 ms -> ~49 ms
    (128G pin_user_pages_fast 2M page size)         ~493 ms -> ~50 ms
    (128G get_user_pages_fast altmap 2M page size)  ~3.91 s -> ~70 ms
    (128G pin_user_pages_fast altmap 2M page size)  ~3.97 s -> ~74 ms

The unpinning improvement patches are in mmotm/linux-next so removed from this
series.

I have deferred the __get_user_pages() patch to outside this series
(https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.martins@oracle.com/),
as I found a simpler way to address it that is also applicable to THP. I
will submit that as a follow-up to this series.

Patches apply on top of linux-next tag next-20210325 (commit b4f20b70784a).

Comments and suggestions very much appreciated!

Changelog,

 RFC -> v1:
 (New patches 1-3, 5-8, but the diffstat isn't that different)
 * Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
 * Fix/Massage commit messages to be more clear and remove the 'we' occurrences (Dan, John, Matthew)
 * Use pfn_align to make clear it's the nr of pages for the @align value (John, Dan)
 * Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
 * Remove the gup_device_compound_huge special path and have the same code
   work both ways while special casing when devmap page is compound (Jason, John)
 * Avoid usage of vmemmap_populate_basepages() and introduce a first class
   loop that doesn't care about passing an altmap for memmap reuse. (Dan)
 * Completely rework the vmemmap_populate_compound() to avoid the sparse_add_section
   hack into passing block across sparse_add_section calls. It's a lot easier to
   follow and more explicit in what it does.
 * Replace the vmemmap refactoring with adding a @pgmap argument and moving
   parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a result)
 * Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
 * Improve memmap_init_zone_device() to initialize compound pages when
   struct pages are cache-warm. That led to an even further speedup over
   the RFC series, from 190ms -> 80-120ms. Patches 2 and 3 are the new
   ones as a result (Dan)
 * Remove PGMAP_COMPOUND and use @align as the property to detect whether
   or not to reuse vmemmap areas (Dan)

Thanks,
	Joao

Joao Martins (11):
  memory-failure: fetch compound_head after pgmap_pfn_valid()
  mm/page_alloc: split prep_compound_page into head and tail subparts
  mm/page_alloc: refactor memmap_init_zone_device() page init
  mm/memremap: add ZONE_DEVICE support for compound pages
  mm/sparse-vmemmap: add a pgmap argument to section activation
  mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
  mm/sparse-vmemmap: populate compound pagemaps
  mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
  mm/page_alloc: reuse tail struct pages for compound pagemaps
  device-dax: compound pagemap support
  mm/gup: grab head page refcount once for group of subpages

 drivers/dax/device.c           |  58 +++++++--
 include/linux/memory_hotplug.h |   5 +-
 include/linux/memremap.h       |  13 ++
 include/linux/mm.h             |   8 +-
 mm/gup.c                       |  52 +++++---
 mm/memory-failure.c            |   2 +
 mm/memory_hotplug.c            |   3 +-
 mm/memremap.c                  |   9 +-
 mm/page_alloc.c                | 126 +++++++++++++------
 mm/sparse-vmemmap.c            | 221 +++++++++++++++++++++++++++++----
 mm/sparse.c                    |  24 ++--
 11 files changed, 406 insertions(+), 115 deletions(-)

-- 
2.17.1
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org


* [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-04-24  0:12   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
dax_lock_page()).  For pagemaps with compound pages, fetch the
compound_head in case we are handling a tail page memory failure.

Currently this is a nop, but with the advent of compound pages in
dev_pagemap it allows memory_failure_dev_pagemap() to keep working.

Reported-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/memory-failure.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..94240d772623 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1318,6 +1318,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		goto out;
 	}
 
+	page = compound_head(page);
+
 	/*
 	 * Prevent the inode from being freed while we are interrogating
 	 * the address_space, typically this would be handled by
-- 
2.17.1


* [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  2021-03-25 23:09 ` [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-04-24  0:16   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Split the utility function prep_compound_page() into head and tail
counterparts, and use them accordingly.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/page_alloc.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c53fe4fa10bf..43dd98446b0b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -692,24 +692,34 @@ void free_compound_page(struct page *page)
 	__free_pages_ok(page, compound_order(page), FPI_NONE);
 }
 
+static void prep_compound_head(struct page *page, unsigned int order)
+{
+	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
+	set_compound_order(page, order);
+	atomic_set(compound_mapcount_ptr(page), -1);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+}
+
+static void prep_compound_tail(struct page *head, int tail_idx)
+{
+	struct page *p = head + tail_idx;
+
+	set_page_count(p, 0);
+	p->mapping = TAIL_MAPPING;
+	set_compound_head(p, head);
+}
+
 void prep_compound_page(struct page *page, unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
 
 	__SetPageHead(page);
-	for (i = 1; i < nr_pages; i++) {
-		struct page *p = page + i;
-		set_page_count(p, 0);
-		p->mapping = TAIL_MAPPING;
-		set_compound_head(p, page);
-	}
+	for (i = 1; i < nr_pages; i++)
+		prep_compound_tail(page, i);
 
-	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
-	set_compound_order(page, order);
-	atomic_set(compound_mapcount_ptr(page), -1);
-	if (hpage_pincount_available(page))
-		atomic_set(compound_pincount_ptr(page), 0);
+	prep_compound_head(page, order);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
-- 
2.17.1


* [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  2021-03-25 23:09 ` [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
  2021-03-25 23:09 ` [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-04-24  0:18   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Move the struct page init to a helper function __init_zone_device_page().

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/page_alloc.c | 74 +++++++++++++++++++++++++++----------------------
 1 file changed, 41 insertions(+), 33 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 43dd98446b0b..58974067bbd4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6237,6 +6237,46 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
 }
 
 #ifdef CONFIG_ZONE_DEVICE
+static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
+					  unsigned long zone_idx, int nid,
+					  struct dev_pagemap *pgmap)
+{
+
+	__init_single_page(page, pfn, zone_idx, nid);
+
+	/*
+	 * Mark page reserved as it will need to wait for onlining
+	 * phase for it to be fully associated with a zone.
+	 *
+	 * We can use the non-atomic __set_bit operation for setting
+	 * the flag as we are still initializing the pages.
+	 */
+	__SetPageReserved(page);
+
+	/*
+	 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
+	 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
+	 * ever freed or placed on a driver-private list.
+	 */
+	page->pgmap = pgmap;
+	page->zone_device_data = NULL;
+
+	/*
+	 * Mark the block movable so that blocks are reserved for
+	 * movable at startup. This will force kernel allocations
+	 * to reserve their blocks rather than leaking throughout
+	 * the address space during boot when many long-lived
+	 * kernel allocations are made.
+	 *
+	 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
+	 * because this is done early in section_activate()
+	 */
+	if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
+		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+		cond_resched();
+	}
+}
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6265,39 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
-		__init_single_page(page, pfn, zone_idx, nid);
-
-		/*
-		 * Mark page reserved as it will need to wait for onlining
-		 * phase for it to be fully associated with a zone.
-		 *
-		 * We can use the non-atomic __set_bit operation for setting
-		 * the flag as we are still initializing the pages.
-		 */
-		__SetPageReserved(page);
-
-		/*
-		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
-		 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
-		 * ever freed or placed on a driver-private list.
-		 */
-		page->pgmap = pgmap;
-		page->zone_device_data = NULL;
-
-		/*
-		 * Mark the block movable so that blocks are reserved for
-		 * movable at startup. This will force kernel allocations
-		 * to reserve their blocks rather than leaking throughout
-		 * the address space during boot when many long-lived
-		 * kernel allocations are made.
-		 *
-		 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
-		 * because this is done early in section_activate()
-		 */
-		if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
-			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-			cond_resched();
-		}
+		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1


* [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (2 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-05-05 18:44   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Add a new align property for struct dev_pagemap which specifies that a
pagemap is composed of a set of compound pages of size @align, instead
of base pages. When these pages are initialised, most are initialised as
tail pages instead of order-0 pages.

For certain ZONE_DEVICE users like device-dax which have a fixed page
size, this creates an opportunity to optimize GUP and GUP-fast walkers,
treating it the same way as THP or hugetlb pages.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/memremap.h | 13 +++++++++++++
 mm/memremap.c            |  8 ++++++--
 mm/page_alloc.c          | 24 +++++++++++++++++++++++-
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index b46f63dcaed3..bb28d82dda5e 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -114,6 +114,7 @@ struct dev_pagemap {
 	struct completion done;
 	enum memory_type type;
 	unsigned int flags;
+	unsigned long align;
 	const struct dev_pagemap_ops *ops;
 	void *owner;
 	int nr_range;
@@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
 	return NULL;
 }
 
+static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
+{
+	if (!pgmap || !pgmap->align)
+		return PAGE_SIZE;
+	return pgmap->align;
+}
+
+static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
+{
+	return PHYS_PFN(pgmap_align(pgmap));
+}
+
 #ifdef CONFIG_ZONE_DEVICE
 bool pfn_zone_device_reserved(unsigned long pfn);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
diff --git a/mm/memremap.c b/mm/memremap.c
index 805d761740c4..d160853670c4 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
 				PHYS_PFN(range->start),
 				PHYS_PFN(range_len(range)), pgmap);
-	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
-			- pfn_first(pgmap, range_id));
+	if (pgmap_align(pgmap) > PAGE_SIZE)
+		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
+			- pfn_first(pgmap, range_id)) / pgmap_pfn_align(pgmap));
+	else
+		percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
+				- pfn_first(pgmap, range_id));
 	return 0;
 
 err_add_memory:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 58974067bbd4..3a77f9e43f3a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	unsigned long pfn, end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
+	unsigned int pfn_align = pgmap_pfn_align(pgmap);
+	unsigned int order_align = order_base_2(pfn_align);
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
@@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		nr_pages = end_pfn - start_pfn;
 	}
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
 		struct page *page = pfn_to_page(pfn);
+		unsigned long i;
 
 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+
+		if (pfn_align == 1)
+			continue;
+
+		__SetPageHead(page);
+
+		for (i = 1; i < pfn_align; i++) {
+			__init_zone_device_page(page + i, pfn + i, zone_idx,
+						nid, pgmap);
+			prep_compound_tail(page, i);
+
+			/*
+			 * The first and second tail pages need to be
+			 * initialized first, hence the head page is
+			 * prepared last.
+			 */
+			if (i == 2)
+				prep_compound_head(page, order_align);
+		}
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1


* [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (3 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-05-05 22:34   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages() Joao Martins
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

@altmap is stored in a dev_pagemap, but it will be repurposed by memory
hotplug for storing the memmap in the hotplugged memory[*], reusing the
altmap infrastructure to that end. This is to say that @altmap can't be
replaced with a @pgmap, as it is going to cover more than dev_pagemap
backed altmaps.

So in addition to @altmap, pass @pgmap to sparse section populate
functions namely:

	sparse_add_section
	  section_activate
	    populate_section_memmap
   	      __populate_section_memmap

Passing @pgmap allows __populate_section_memmap() to both fetch the
@align for which memmap metadata is created and to let sparse-vmemmap
fetch the pgmap ranges that correlate to a given section, to pick whether
to just reuse tail pages from previously onlined sections.

[*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/memory_hotplug.h |  5 ++++-
 include/linux/mm.h             |  3 ++-
 mm/memory_hotplug.c            |  3 ++-
 mm/sparse-vmemmap.c            |  3 ++-
 mm/sparse.c                    | 24 +++++++++++++++---------
 5 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index a85d4b7d15c2..45532192934c 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -14,6 +14,7 @@ struct mem_section;
 struct memory_block;
 struct resource;
 struct vmem_altmap;
+struct dev_pagemap;
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 struct page *pfn_to_online_page(unsigned long pfn);
@@ -72,6 +73,7 @@ typedef int __bitwise mhp_t;
 struct mhp_params {
 	struct vmem_altmap *altmap;
 	pgprot_t pgprot;
+	struct dev_pagemap *pgmap;
 };
 
 bool mhp_range_allowed(u64 start, u64 size, bool need_mapping);
@@ -360,7 +362,8 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
 				       unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern int sparse_add_section(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap);
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap);
 extern void sparse_remove_section(struct mem_section *ms,
 		unsigned long pfn, unsigned long nr_pages,
 		unsigned long map_offset, struct vmem_altmap *altmap);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cb1e191da319..61474602c2b1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3033,7 +3033,8 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
 
 void *sparse_buffer_alloc(unsigned long size);
 struct page * __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap);
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e0b499a1fee4..2df3b2a7b4b5 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -384,7 +384,8 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages = min(end_pfn - pfn,
 				   SECTION_ALIGN_UP(pfn + 1) - pfn);
-		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap);
+		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap,
+					 params->pgmap);
 		if (err)
 			break;
 		cond_resched();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 16183d85a7d5..370728c206ee 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -249,7 +249,8 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 }
 
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
diff --git a/mm/sparse.c b/mm/sparse.c
index be66a62e22c3..c2abf1281a89 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -443,7 +443,8 @@ static unsigned long __init section_map_size(void)
 }
 
 struct page __init *__populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long size = section_map_size();
 	struct page *map = sparse_buffer_alloc(size);
@@ -542,7 +543,7 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 			break;
 
 		map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
-				nid, NULL);
+				nid, NULL, NULL);
 		if (!map) {
 			pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
 			       __func__, nid);
@@ -648,9 +649,10 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 static struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
-	return __populate_section_memmap(pfn, nr_pages, nid, altmap);
+	return __populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
 }
 
 static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
@@ -719,7 +721,8 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
 }
 #else
 struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	return kvmalloc_node(array_size(sizeof(struct page),
 					PAGES_PER_SECTION), GFP_KERNEL, nid);
@@ -842,7 +845,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 }
 
 static struct page * __meminit section_activate(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	struct mem_section *ms = __pfn_to_section(pfn);
 	struct mem_section_usage *usage = NULL;
@@ -874,7 +878,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
 	if (nr_pages < PAGES_PER_SECTION && early_section(ms))
 		return pfn_to_page(pfn);
 
-	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap);
+	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
 	if (!memmap) {
 		section_deactivate(pfn, nr_pages, altmap);
 		return ERR_PTR(-ENOMEM);
@@ -889,6 +893,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  * @start_pfn: start pfn of the memory range
  * @nr_pages: number of pfns to add in the section
  * @altmap: device page map
+ * @pgmap: device page map object that owns the section
  *
  * This is only intended for hotplug.
  *
@@ -902,7 +907,8 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  * * -ENOMEM	- Out of memory.
  */
 int __meminit sparse_add_section(int nid, unsigned long start_pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
 	struct mem_section *ms;
@@ -913,7 +919,7 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 	if (ret < 0)
 		return ret;
 
-	memmap = section_activate(nid, start_pfn, nr_pages, altmap);
+	memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
 	if (IS_ERR(memmap))
 		return PTR_ERR(memmap);
 
-- 
2.17.1


* [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (4 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-05-05 22:43   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Move the actual pte population logic into a separate function
vmemmap_populate_address() and have vmemmap_populate_basepages()
walk through all base pages it needs to populate.
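
The shape of that refactor can be sketched in plain C (illustrative names, not the kernel code); note the helper must return 0 on its success path for the caller's error check to work:

```c
#include <assert.h>
#include <errno.h>

#define PAGE_SIZE 4096UL

/* Stand-in for the pgd/p4d/pud/pmd/pte population chain; here we
 * just pretend allocation fails at one specific address. */
static int vmemmap_populate_address_sketch(unsigned long addr)
{
	if (addr == 0x2000)
		return -ENOMEM;
	return 0;	/* the success path must return 0 */
}

/* The range walker now only loops and propagates the error. */
static int vmemmap_populate_basepages_sketch(unsigned long start,
					     unsigned long end)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += PAGE_SIZE)
		if (vmemmap_populate_address_sketch(addr))
			return -ENOMEM;
	return 0;
}
```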

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/sparse-vmemmap.c | 44 ++++++++++++++++++++++++++------------------
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 370728c206ee..8814876edcfa 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -216,33 +216,41 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
-					 int node, struct vmem_altmap *altmap)
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+					      struct vmem_altmap *altmap)
 {
-	unsigned long addr = start;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
+	pgd = vmemmap_pgd_populate(addr, node);
+	if (!pgd)
+		return -ENOMEM;
+	p4d = vmemmap_p4d_populate(pgd, addr, node);
+	if (!p4d)
+		return -ENOMEM;
+	pud = vmemmap_pud_populate(p4d, addr, node);
+	if (!pud)
+		return -ENOMEM;
+	pmd = vmemmap_pmd_populate(pud, addr, node);
+	if (!pmd)
+		return -ENOMEM;
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+	if (!pte)
+		return -ENOMEM;
+	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+}
+
+int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
+					 int node, struct vmem_altmap *altmap)
+{
+	unsigned long addr = start;
+
 	for (; addr < end; addr += PAGE_SIZE) {
-		pgd = vmemmap_pgd_populate(addr, node);
-		if (!pgd)
-			return -ENOMEM;
-		p4d = vmemmap_p4d_populate(pgd, addr, node);
-		if (!p4d)
-			return -ENOMEM;
-		pud = vmemmap_pud_populate(p4d, addr, node);
-		if (!pud)
-			return -ENOMEM;
-		pmd = vmemmap_pmd_populate(pud, addr, node);
-		if (!pmd)
-			return -ENOMEM;
-		pte = vmemmap_pte_populate(pmd, addr, node, altmap);
-		if (!pte)
+		if (vmemmap_populate_address(addr, node, altmap))
 			return -ENOMEM;
-		vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
 	}
 
 	return 0;
-- 
2.17.1

* [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (5 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages() Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-05-06  1:18   ` Dan Williams
  2021-03-25 23:09 ` [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD " Joao Martins
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 44+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

A compound pagemap is a dev_pagemap with @align > PAGE_SIZE, meaning
that pages are mapped at a given huge page alignment and use compound
pages, as opposed to order-0 pages.

To minimize struct page overhead we take advantage of the fact that
most tail pages look the same (except the first two). We allocate one
vmemmap page for the area that contains the head page and another for
the next 64 pages. The rest of the subsections then reuse this tail
vmemmap page to initialize the rest of the tail pages.

Sections are arch-dependent (e.g. on x86 they are 64M, 128M or 512M)
and when initializing a compound pagemap with a big enough @align
(e.g. 1G PUD) it will cross various sections. To be able to reuse
tail pages across sections belonging to the same gigantic page we
fetch the @range being mapped (nr_ranges + 1). If the section being
mapped is not at offset 0 of the @align, then look up the PFN of the
struct page address that precedes it and use that to populate the
entire section.

On compound pagemaps with a 2M align, this mechanism saves 6 pages
out of the 8 vmemmap pages necessary to describe the subsection's 512
struct pages being mapped. On a 1G compound pagemap it saves
4094 pages.
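
The arithmetic behind those numbers can be sketched as follows,
assuming x86-64 values (4K base pages, 64-byte struct page); this is
a model of the savings, not kernel code:

```c
#include <assert.h>

#define PAGE_SIZE		4096UL
#define STRUCT_PAGE_SIZE	64UL
/* each vmemmap page describes 64 struct pages */
#define PAGES_PER_VMEMMAP_PAGE	(PAGE_SIZE / STRUCT_PAGE_SIZE)

/* vmemmap pages needed to describe one compound page of @align
 * bytes when every vmemmap page is allocated individually */
static unsigned long vmemmap_pages(unsigned long align)
{
	return (align / PAGE_SIZE) / PAGES_PER_VMEMMAP_PAGE;
}

/* with tail reuse only two unique vmemmap pages remain: one with
 * the head page (plus 63 tails) and one tail-only page that all
 * the remaining PTEs point at */
static unsigned long vmemmap_pages_saved(unsigned long align)
{
	return vmemmap_pages(align) - 2;
}
```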

Altmap isn't supported yet, given various restrictions in the altmap
pfn allocator; thus fall back to the already in-use vmemmap_populate().

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/mm.h  |   2 +-
 mm/memremap.c       |   1 +
 mm/sparse-vmemmap.c | 139 ++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 131 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 61474602c2b1..49d717ae40ae 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3040,7 +3040,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-			    struct vmem_altmap *altmap);
+			    struct vmem_altmap *altmap, void *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
 struct vmem_altmap;
 void *vmemmap_alloc_block_buf(unsigned long size, int node,
diff --git a/mm/memremap.c b/mm/memremap.c
index d160853670c4..2e6bc0b1ff00 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 {
 	struct mhp_params params = {
 		.altmap = pgmap_altmap(pgmap),
+		.pgmap = pgmap,
 		.pgprot = PAGE_KERNEL,
 	};
 	const int nr_range = pgmap->nr_range;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 8814876edcfa..f57c5eada099 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 }
 
 pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-				       struct vmem_altmap *altmap)
+				       struct vmem_altmap *altmap, void *block)
 {
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte)) {
 		pte_t entry;
-		void *p;
-
-		p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
-		if (!p)
-			return NULL;
+		void *p = block;
+
+		if (!block) {
+			p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+			if (!p)
+				return NULL;
+		} else if (!altmap) {
+			get_page(virt_to_page(block));
+		}
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
 		set_pte_at(&init_mm, addr, pte, entry);
 	}
@@ -217,7 +221,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 }
 
 static int __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap)
+					      struct vmem_altmap *altmap,
+					      void *page, void **ptr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -237,10 +242,14 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pmd = vmemmap_pmd_populate(pud, addr, node);
 	if (!pmd)
 		return -ENOMEM;
-	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
 	if (!pte)
 		return -ENOMEM;
 	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+
+	if (ptr)
+		*ptr = __va(__pfn_to_phys(pte_pfn(*pte)));
+	return 0;
 }
 
 int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
@@ -249,7 +258,110 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 	unsigned long addr = start;
 
 	for (; addr < end; addr += PAGE_SIZE) {
-		if (vmemmap_populate_address(addr, node, altmap))
+		if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int __meminit vmemmap_populate_range(unsigned long start,
+					    unsigned long end,
+					    int node, void *page)
+{
+	unsigned long addr = start;
+
+	for (; addr < end; addr += PAGE_SIZE) {
+		if (vmemmap_populate_address(addr, node, NULL, page, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
+						  void **ptr)
+{
+	return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
+}
+
+static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = pgd_offset_k(addr);
+	if (pgd_none(*pgd))
+		return NULL;
+
+	p4d = p4d_offset(pgd, addr);
+	if (p4d_none(*p4d))
+		return NULL;
+
+	pud = pud_offset(p4d, addr);
+	if (pud_none(*pud))
+		return NULL;
+
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd))
+		return NULL;
+
+	pte = pte_offset_kernel(pmd, addr);
+	if (pte_none(*pte))
+		return NULL;
+
+	return pte;
+}
+
+static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
+						     unsigned long start,
+						     unsigned long end, int node,
+						     struct dev_pagemap *pgmap)
+{
+	unsigned long offset, size, addr;
+
+	/*
+	 * For compound pages bigger than section size (e.g. 1G) fill the rest
+	 * of sections as tail pages.
+	 *
+	 * Note that memremap_pages() resets @nr_range value and will increment
+	 * it after each range successful onlining. Thus the value or @nr_range
+	 * at section memmap populate corresponds to the in-progress range
+	 * being onlined that we care about.
+	 */
+	offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
+	if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
+	    pgmap_align(pgmap) > SUBSECTION_SIZE) {
+		pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
+
+		if (!ptep)
+			return -ENOMEM;
+
+		return vmemmap_populate_range(start, end, node,
+					      page_to_virt(pte_page(*ptep)));
+	}
+
+	size = min(end - start, pgmap_pfn_align(pgmap) * sizeof(struct page));
+	for (addr = start; addr < end; addr += size) {
+		unsigned long next = addr, last = addr + size;
+		void *block;
+
+		/* Populate the head page vmemmap page */
+		if (vmemmap_populate_page(addr, node, NULL))
+			return -ENOMEM;
+
+		/* Populate the tail pages vmemmap page */
+		block = NULL;
+		next = addr + PAGE_SIZE;
+		if (vmemmap_populate_page(next, node, &block))
+			return -ENOMEM;
+
+		/* Reuse the previous page for the rest of tail pages */
+		next += PAGE_SIZE;
+		if (vmemmap_populate_range(next, last, node, block))
 			return -ENOMEM;
 	}
 
@@ -262,12 +374,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
+	unsigned int align = pgmap_align(pgmap);
+	int r;
 
 	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
 		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
 		return NULL;
 
-	if (vmemmap_populate(start, end, nid, altmap))
+	if (align > PAGE_SIZE && !altmap)
+		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
+	else
+		r = vmemmap_populate(start, end, nid, altmap);
+
+	if (r < 0)
 		return NULL;
 
 	return pfn_to_page(pfn);
-- 
2.17.1

* [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (6 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-03-25 23:09 ` [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for " Joao Martins
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Right now basepages are used to populate the PUD tail pages, and it
picks the address of the previous page of the subsection that precedes
the memmap being initialized. This is done when a given memmap
address isn't aligned to the pgmap @align (which is safe to do because
@ranges are guaranteed to be aligned to @align).

For pagemaps with an align which spans various sections, this means
that PMD pages are unnecessarily allocated for reusing the same tail
pages. Effectively, on x86 a PUD can span 8 sections (depending on
config), and a page is allocated for each PMD just to reuse the same
tail vmemmap page across the rest of the PTEs. In short, the PTEs
covered by those PMDs all contain the same PFN. So instead, populate
a new PMD on the second section of the compound page (the tail
vmemmap PMD), and then have the following sections reuse that
previously populated PMD, which only contains tail pages.

With this scheme, for a 1GB pagemap-aligned area, the first PMD
(section) contains the head page and 32767 tail pages, while the
second PMD contains the full 32768 tail pages. The latter gets its
PMD reused across future section mappings of the same pagemap.

Besides allocating fewer page table entries, and keeping parity with
hugepages in the directmap (as done by vmemmap_populate_hugepages()),
this further increases savings per compound page. For each PUD-aligned
pagemap we go from 40960 bytes down to 16384 bytes. Rather than
requiring 8 PMD page allocations we only need 2 (plus the two base
pages allocated for the head and tail areas of the first PMD). 2M
pages still require using base pages, though.
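
The 40960 -> 16384 figure can be derived as follows, assuming x86-64
with 128M sections, so a 1G compound page spans 8 sections; a rough
accounting model, not kernel code:

```c
#include <assert.h>

#define PAGE_SIZE	4096UL
#define SECTIONS_PER_1G	8UL	/* 1G / 128M sections */

/* before: one PMD page per section, plus the two base pages backing
 * the head and tail vmemmap areas of the first PMD */
static unsigned long pud_pagemap_bytes_before(void)
{
	return (SECTIONS_PER_1G + 2) * PAGE_SIZE;
}

/* after: only the first two PMDs get their own page (head section
 * and tail section); the remaining six sections reuse the tail PMD */
static unsigned long pud_pagemap_bytes_after(void)
{
	return (2 + 2) * PAGE_SIZE;
}
```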

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/mm.h  |  3 +-
 mm/sparse-vmemmap.c | 79 ++++++++++++++++++++++++++++++++++-----------
 2 files changed, 63 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 49d717ae40ae..9c1a676d6b95 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3038,7 +3038,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
-pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
+pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+			    void *block);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
 			    struct vmem_altmap *altmap, void *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index f57c5eada099..291a8a32480a 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -172,13 +172,20 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
 	return p;
 }
 
-pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
+pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+				       void *block)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
-		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
-		if (!p)
-			return NULL;
+		void *p = block;
+
+		if (!block) {
+			p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
+			if (!p)
+				return NULL;
+		} else {
+			get_page(virt_to_page(block));
+		}
 		pmd_populate_kernel(&init_mm, pmd, p);
 	}
 	return pmd;
@@ -220,15 +227,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-static int __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap,
-					      void *page, void **ptr)
+static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
+						  struct vmem_altmap *altmap,
+						  void *page, pmd_t **ptr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte;
 
 	pgd = vmemmap_pgd_populate(addr, node);
 	if (!pgd)
@@ -239,9 +245,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pud = vmemmap_pud_populate(p4d, addr, node);
 	if (!pud)
 		return -ENOMEM;
-	pmd = vmemmap_pmd_populate(pud, addr, node);
+	pmd = vmemmap_pmd_populate(pud, addr, node, page);
 	if (!pmd)
 		return -ENOMEM;
+	if (ptr)
+		*ptr = pmd;
+	return 0;
+}
+
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+					      struct vmem_altmap *altmap,
+					      void *page, void **ptr)
+{
+	pmd_t *pmd;
+	pte_t *pte;
+
+	if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
+		return -ENOMEM;
+
 	pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
 	if (!pte)
 		return -ENOMEM;
@@ -285,13 +306,26 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
 	return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
 }
 
-static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
+static int __meminit vmemmap_populate_pmd_range(unsigned long start,
+						unsigned long end,
+						int node, void *page)
+{
+	unsigned long addr = start;
+
+	for (; addr < end; addr += PMD_SIZE) {
+		if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static pmd_t * __meminit vmemmap_lookup_address(unsigned long addr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte;
 
 	pgd = pgd_offset_k(addr);
 	if (pgd_none(*pgd))
@@ -309,11 +343,7 @@ static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
 	if (pmd_none(*pmd))
 		return NULL;
 
-	pte = pte_offset_kernel(pmd, addr);
-	if (pte_none(*pte))
-		return NULL;
-
-	return pte;
+	return pmd;
 }
 
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
@@ -335,9 +365,22 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 	offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
 	if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
 	    pgmap_align(pgmap) > SUBSECTION_SIZE) {
-		pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		addr = start - PAGE_SIZE;
+		pmdp = vmemmap_lookup_address(addr);
+		if (!pmdp)
+			return -ENOMEM;
+
+		/* Reuse the tail pages vmemmap pmd page */
+		if (offset % pgmap->align > PFN_PHYS(PAGES_PER_SECTION))
+			return vmemmap_populate_pmd_range(start, end, node,
+						page_to_virt(pmd_page(*pmdp)));
 
-		if (!ptep)
+		/* Populate the tail pages vmemmap pmd page */
+		ptep = pte_offset_kernel(pmdp, addr);
+		if (pte_none(*ptep))
 			return -ENOMEM;
 
 		return vmemmap_populate_range(start, end, node,
-- 
2.17.1

* [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (7 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD " Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-03-25 23:09 ` [PATCH v1 10/11] device-dax: compound pagemap support Joao Martins
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 44+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

When a pgmap @align is set, all pages are mapped at a given huge page
alignment and thus use compound pages to describe them, as opposed to
one struct page per 4K.

With @align > PAGE_SIZE and when struct pages are stored in RAM
(!altmap), most tail pages are reused. Consequently, the amount of
unique struct pages is a lot smaller than the total amount of struct
pages being mapped.

When struct pages are initialized in memmap_init_zone_device(), make
sure that only unique struct pages are initialized, i.e. the first
two 4K pages per @align, which means 128 struct pages, instead of 512
for a 2M @align or 262144 for a 1G @align.
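
The clamp on how many struct pages get initialized per compound page
can be modelled like this (hypothetical helper name; a sketch of the
min_t() logic, not the kernel code):

```c
#include <assert.h>

#define PAGE_SIZE		4096UL
#define STRUCT_PAGE_SIZE	64UL
/* the first two 4K vmemmap pages worth of struct pages are unique */
#define MEMMAP_NR_PAGES		(2 * (PAGE_SIZE / STRUCT_PAGE_SIZE))

static unsigned long unique_struct_pages(unsigned long pfn_align,
					 int has_altmap)
{
	/* altmap does not reuse tail areas: initialize them all */
	if (has_altmap)
		return pfn_align;
	return pfn_align < MEMMAP_NR_PAGES ? pfn_align : MEMMAP_NR_PAGES;
}
```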

Keep the old behaviour with altmap, given that it doesn't support
reuse of tail vmemmap areas.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/page_alloc.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3a77f9e43f3a..948dfad6754b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6277,6 +6277,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	}
 }
 
+#define MEMMAP_NR_PAGES	(2 * (PAGE_SIZE/sizeof(struct page)))
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6287,6 +6289,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
 	unsigned int pfn_align = pgmap_pfn_align(pgmap);
 	unsigned int order_align = order_base_2(pfn_align);
+	unsigned long ntails = min_t(unsigned long, pfn_align, MEMMAP_NR_PAGES);
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
@@ -6302,6 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	if (altmap) {
 		start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
 		nr_pages = end_pfn - start_pfn;
+		ntails = pfn_align;
 	}
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
@@ -6315,7 +6319,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 
 		__SetPageHead(page);
 
-		for (i = 1; i < pfn_align; i++) {
+		for (i = 1; i < ntails; i++) {
 			__init_zone_device_page(page + i, pfn + i, zone_idx,
 						nid, pgmap);
 			prep_compound_tail(page, i);
-- 
2.17.1

* [PATCH v1 10/11] device-dax: compound pagemap support
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (8 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for " Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-03-25 23:09 ` [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages Joao Martins
  2021-04-01  9:38 ` [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  11 siblings, 0 replies; 44+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

dax devices are created with a fixed @align (huge page size) which
is also enforced at mmap() of the device. Faults consequently happen
at the @align specified at creation time, and that doesn't change
throughout the dax device's lifetime. An MCE poisons a whole dax huge
page, and splits also occur at the configured page size.

Use the newly added compound pagemap facility which maps the
assigned dax ranges as compound pages at a page size of @align.
Currently, this means that region/namespace bootstrap takes
considerably less time, given that considerably fewer pages are
initialized.

On setups with 128G NVDIMMs, initialization with DRAM-stored struct
pages improves from ~268-358 ms to ~78-100 ms with 2M pages, and to
less than 1 ms with 1G pages.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/device.c | 58 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 45 insertions(+), 13 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index db92573c94e8..e3dcc4ad1727 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -192,6 +192,43 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 }
 #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
+static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
+			     unsigned long fault_size,
+			     struct address_space *f_mapping)
+{
+	unsigned long i;
+	pgoff_t pgoff;
+
+	pgoff = linear_page_index(vmf->vma, vmf->address
+			& ~(fault_size - 1));
+
+	for (i = 0; i < fault_size / PAGE_SIZE; i++) {
+		struct page *page;
+
+		page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+		if (page->mapping)
+			continue;
+		page->mapping = f_mapping;
+		page->index = pgoff + i;
+	}
+}
+
+static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
+				 unsigned long fault_size,
+				 struct address_space *f_mapping)
+{
+	struct page *head;
+
+	head = pfn_to_page(pfn_t_to_pfn(pfn));
+	head = compound_head(head);
+	if (head->mapping)
+		return;
+
+	head->mapping = f_mapping;
+	head->index = linear_page_index(vmf->vma, vmf->address
+			& ~(fault_size - 1));
+}
+
 static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
@@ -225,8 +262,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 	}
 
 	if (rc == VM_FAULT_NOPAGE) {
-		unsigned long i;
-		pgoff_t pgoff;
+		struct dev_pagemap *pgmap = pfn_t_to_page(pfn)->pgmap;
 
 		/*
 		 * In the device-dax case the only possibility for a
@@ -234,17 +270,10 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		 * mapped. No need to consider the zero page, or racing
 		 * conflicting mappings.
 		 */
-		pgoff = linear_page_index(vmf->vma, vmf->address
-				& ~(fault_size - 1));
-		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
-			struct page *page;
-
-			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
-			if (page->mapping)
-				continue;
-			page->mapping = filp->f_mapping;
-			page->index = pgoff + i;
-		}
+		if (pgmap->align > PAGE_SIZE)
+			set_compound_mapping(vmf, pfn, fault_size, filp->f_mapping);
+		else
+			set_page_mapping(vmf, pfn, fault_size, filp->f_mapping);
 	}
 	dax_read_unlock(id);
 
@@ -426,6 +455,9 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 	}
 
 	pgmap->type = MEMORY_DEVICE_GENERIC;
+	if (dev_dax->align > PAGE_SIZE)
+		pgmap->align = dev_dax->align;
+
 	addr = devm_memremap_pages(dev, pgmap);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
-- 
2.17.1

* [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (9 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 10/11] device-dax: compound pagemap support Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-04-01  9:38 ` [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  11 siblings, 0 replies; 44+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Much like hugetlbfs or THPs, treat device pagemaps with
compound pages like the rest of GUP handling of compound pages.

Rather than incrementing the refcount once per 4K page, record
all subpages and increment the head page refcount by @refs *once*.
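
The batching idea can be illustrated outside the kernel like this (a
toy model where a plain counter stands in for the atomic head-page
refcount; not the GUP implementation itself):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

struct toy_page {
	int refcount;
};

/* record one entry per 4K subpage, without touching refcounts yet */
static int record_subpages(struct toy_page *page, unsigned long addr,
			   unsigned long end, struct toy_page **pages)
{
	int nr;

	for (nr = 0; addr != end; addr += PAGE_SIZE)
		pages[nr++] = page++;
	return nr;
}

/* then bump the head page once by the number of recorded subpages,
 * instead of once per subpage */
static void grab_compound_head(struct toy_page *head, int refs)
{
	head->refcount += refs;
}
```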

Performance measured by gup_benchmark improves considerably for
get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:

 $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
(get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
(pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
[altmap]
(get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
(pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms

 $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
(get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
(pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
[altmap with -m 127004]
(get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
(pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/gup.c | 52 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 20 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index b3e647c8b7ee..514f12157a0f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2159,31 +2159,54 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 }
 #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
 
+
+static int record_subpages(struct page *page, unsigned long addr,
+			   unsigned long end, struct page **pages)
+{
+	int nr;
+
+	for (nr = 0; addr != end; addr += PAGE_SIZE)
+		pages[nr++] = page++;
+
+	return nr;
+}
+
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 			     unsigned long end, unsigned int flags,
 			     struct page **pages, int *nr)
 {
-	int nr_start = *nr;
+	int refs, nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
 
 	do {
-		struct page *page = pfn_to_page(pfn);
+		struct page *head, *page = pfn_to_page(pfn);
+		unsigned long next;
 
 		pgmap = get_dev_pagemap(pfn, pgmap);
 		if (unlikely(!pgmap)) {
 			undo_dev_pagemap(nr, nr_start, flags, pages);
 			return 0;
 		}
-		SetPageReferenced(page);
-		pages[*nr] = page;
-		if (unlikely(!try_grab_page(page, flags))) {
-			undo_dev_pagemap(nr, nr_start, flags, pages);
+
+		head = compound_head(page);
+		next = PageCompound(head) ? end : addr + PAGE_SIZE;
+		refs = record_subpages(page, addr, next, pages + *nr);
+
+		SetPageReferenced(head);
+		head = try_grab_compound_head(head, refs, flags);
+		if (!head) {
+			if (PageCompound(head)) {
+				ClearPageReferenced(head);
+				put_dev_pagemap(pgmap);
+			} else {
+				undo_dev_pagemap(nr, nr_start, flags, pages);
+			}
 			return 0;
 		}
-		(*nr)++;
-		pfn++;
-	} while (addr += PAGE_SIZE, addr != end);
+		*nr += refs;
+		pfn += refs;
+	} while (addr += (refs << PAGE_SHIFT), addr != end);
 
 	if (pgmap)
 		put_dev_pagemap(pgmap);
@@ -2243,17 +2266,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
-static int record_subpages(struct page *page, unsigned long addr,
-			   unsigned long end, struct page **pages)
-{
-	int nr;
-
-	for (nr = 0; addr != end; addr += PAGE_SIZE)
-		pages[nr++] = page++;
-
-	return nr;
-}
-
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
-- 
2.17.1

* Re: [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (10 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages Joao Martins
@ 2021-04-01  9:38 ` Joao Martins
  11 siblings, 0 replies; 44+ messages in thread
From: Joao Martins @ 2021-04-01  9:38 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton

On 3/25/21 11:09 PM, Joao Martins wrote:

[...]

> Patch 11: Optimize grabbing page refcount changes given that we
> are working with compound pages i.e. we do 1 increment to the head
> page for a given set of N subpages compared as opposed to N individual writes.
> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
> improves considerably with DRAM stored struct pages. It also *greatly*
> improves pinning with altmap. Results with gup_test:
> 
>                                                    before     after
>     (16G get_user_pages_fast 2M page size)         ~59 ms -> ~6.1 ms
>     (16G pin_user_pages_fast 2M page size)         ~87 ms -> ~6.2 ms
>     (16G get_user_pages_fast altmap 2M page size) ~494 ms -> ~9 ms
>     (16G pin_user_pages_fast altmap 2M page size) ~494 ms -> ~10 ms
> 
>     altmap performance gets especially interesting when pinning a pmem dimm:
> 
>                                                    before     after
>     (128G get_user_pages_fast 2M page size)         ~492 ms -> ~49 ms
>     (128G pin_user_pages_fast 2M page size)         ~493 ms -> ~50 ms
>     (128G get_user_pages_fast altmap 2M page size)  ~3.91 ms -> ~70 ms
>     (128G pin_user_pages_fast altmap 2M page size)  ~3.97 ms -> ~74 ms
> 

Quick correction: the last two values in the left column, 3.91 and 3.97, are in *seconds*,
not milliseconds. I mistakenly added an extra 'm'. Sorry about that.
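The batching that patch 11 describes can be sketched in userspace roughly as follows. This is a conceptual model only, not the kernel's code: `struct fake_page` and both helpers are invented for illustration, with an atomic refcount standing in for the head page's `_refcount`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Minimal stand-in for struct page; only the refcount matters here. */
struct fake_page {
	atomic_int refcount;
};

/* Before patch 11: one atomic increment per subpage, i.e. N atomic ops. */
static void grab_individually(struct fake_page *head, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		atomic_fetch_add(&head->refcount, 1);
}

/* After patch 11: a single atomic add of N on the compound head. */
static void grab_batched(struct fake_page *head, size_t nr)
{
	atomic_fetch_add(&head->refcount, (int)nr);
}
```

Both paths leave the refcount in the same state; the win is doing one atomic operation instead of N, which is where the ~10x gup_test improvement above comes from.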

> The unpinning improvement patches are in mmotm/linux-next so removed from this
> series.
> 
> I have deferred the __get_user_pages() patch to outside this series
> (https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.martins@oracle.com/),
> as I found a simpler way to address it that is also applicable to
> THP. I will submit that as a follow-up to this series.
> 
> Patches apply on top of linux-next tag next-20210325 (commit b4f20b70784a).
> 
> Comments and suggestions very much appreciated!
> 
> Changelog,
> 
>  RFC -> v1:
>  (New patches 1-3 and 5-8, but the diffstat isn't that different)
>  * Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
>  * Fix/Massage commit messages to be clearer and remove the 'we' occurrences (Dan, John, Matthew)

I just noticed that I haven't fully removed the 'we' occurrences. Patches 7 and 8 had
their commit messages rewritten and I mistakenly re-introduced remnants of 'we'. I will
have that fixed for v2, although I'll still wait for comments on this series before following up.

>  * Use pfn_align to be clear it's nr of pages for @align value (John, Dan)
>  * Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
>  * Remove the gup_device_compound_huge special path and have the same code
>    work both ways while special casing when devmap page is compound (Jason, John)
>  * Avoid usage of vmemmap_populate_basepages() and introduce a first class
>    loop that doesn't care about passing an altmap for memmap reuse. (Dan)
>  * Completely rework the vmemmap_populate_compound() to avoid the sparse_add_section
>    hack into passing block across sparse_add_section calls. It's a lot easier to
>    follow and more explicit in what it does.
>  * Replace the vmemmap refactoring with adding a @pgmap argument and moving
>    parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a result)
>  * Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
>  * Improve memmap_init_zone_device() to initialize compound pages when
>    struct pages are cache warm. That led to an even further speedup over
>    the RFC series, from 190ms -> 80-120ms. Patches 2 and 3 are new
>    as a result (Dan)
>  * Remove PGMAP_COMPOUND and use @align as the property to detect whether
>    or not to reuse vmemmap areas (Dan)
> 
> Thanks,
> 	Joao
> 
> Joao Martins (11):
>   memory-failure: fetch compound_head after pgmap_pfn_valid()
>   mm/page_alloc: split prep_compound_page into head and tail subparts
>   mm/page_alloc: refactor memmap_init_zone_device() page init
>   mm/memremap: add ZONE_DEVICE support for compound pages
>   mm/sparse-vmemmap: add a pgmap argument to section activation
>   mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
>   mm/sparse-vmemmap: populate compound pagemaps
>   mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
>   mm/page_alloc: reuse tail struct pages for compound pagemaps
>   device-dax: compound pagemap support
>   mm/gup: grab head page refcount once for group of subpages
> 
>  drivers/dax/device.c           |  58 +++++++--
>  include/linux/memory_hotplug.h |   5 +-
>  include/linux/memremap.h       |  13 ++
>  include/linux/mm.h             |   8 +-
>  mm/gup.c                       |  52 +++++---
>  mm/memory-failure.c            |   2 +
>  mm/memory_hotplug.c            |   3 +-
>  mm/memremap.c                  |   9 +-
>  mm/page_alloc.c                | 126 +++++++++++++------
>  mm/sparse-vmemmap.c            | 221 +++++++++++++++++++++++++++++----
>  mm/sparse.c                    |  24 ++--
>  11 files changed, 406 insertions(+), 115 deletions(-)
> 
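As a sanity check on the savings quoted in the cover letter (6 vmemmap pages saved per 2M page, 4094 per 1G page), the arithmetic can be sketched as below. It assumes 4K base pages and a 64-byte `struct page` (the x86-64 case); the helper names are made up for illustration.

```c
#include <assert.h>

#define PAGE_SIZE	4096UL
#define STRUCT_PAGE	64UL	/* sizeof(struct page) on x86-64 */

/* vmemmap pages needed to describe one compound page of @align bytes,
 * without any reuse. */
static unsigned long vmemmap_pages(unsigned long align)
{
	return (align / PAGE_SIZE) * STRUCT_PAGE / PAGE_SIZE;
}

/* With the tail-page reuse scheme, only the vmemmap page holding the
 * head plus one tail-only vmemmap page need unique backing; everything
 * else maps to the shared tail page. */
static unsigned long vmemmap_pages_saved(unsigned long align)
{
	return vmemmap_pages(align) - 2;
}
```

For a 2M page that is 8 vmemmap pages reduced to 2 (6 saved); for a 1G page, 4096 reduced to 2 (4094 saved), matching the numbers in the series.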

* Re: [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-03-25 23:09 ` [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
@ 2021-04-24  0:12   ` Dan Williams
  2021-04-24 19:00     ` Joao Martins
  0 siblings, 1 reply; 44+ messages in thread
From: Dan Williams @ 2021-04-24  0:12 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
> dax_lock_page()).  For pagemaps with compound pages, fetch the
> compound_head in case we are handling a tail-page memory failure.
>
> Currently this is a nop, but with the advent of compound pages in
> dev_pagemap it allows memory_failure_dev_pagemap() to keep working.
>
> Reported-by: Jane Chu <jane.chu@oracle.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/memory-failure.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 24210c9bd843..94240d772623 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1318,6 +1318,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>                 goto out;
>         }
>
> +       page = compound_head(page);

Unless / until we do compound pages for the filesystem-dax case, I
would add a comment like:

/* pages instantiated by device-dax (not filesystem-dax) may be
compound pages */

* Re: [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts
  2021-03-25 23:09 ` [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
@ 2021-04-24  0:16   ` Dan Williams
  2021-04-24 19:05     ` Joao Martins
  0 siblings, 1 reply; 44+ messages in thread
From: Dan Williams @ 2021-04-24  0:16 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Split the utility function prep_compound_page() into head and tail
> counterparts, and use them accordingly.

To make this patch stand alone better, let's add another sentence:

"This is in preparation for sharing the storage for / deduplicating
compound page metadata."

Other than that, looks good to me.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/page_alloc.c | 32 +++++++++++++++++++++-----------
>  1 file changed, 21 insertions(+), 11 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c53fe4fa10bf..43dd98446b0b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -692,24 +692,34 @@ void free_compound_page(struct page *page)
>         __free_pages_ok(page, compound_order(page), FPI_NONE);
>  }
>
> +static void prep_compound_head(struct page *page, unsigned int order)
> +{
> +       set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
> +       set_compound_order(page, order);
> +       atomic_set(compound_mapcount_ptr(page), -1);
> +       if (hpage_pincount_available(page))
> +               atomic_set(compound_pincount_ptr(page), 0);
> +}
> +
> +static void prep_compound_tail(struct page *head, int tail_idx)
> +{
> +       struct page *p = head + tail_idx;
> +
> +       set_page_count(p, 0);
> +       p->mapping = TAIL_MAPPING;
> +       set_compound_head(p, head);
> +}
> +
>  void prep_compound_page(struct page *page, unsigned int order)
>  {
>         int i;
>         int nr_pages = 1 << order;
>
>         __SetPageHead(page);
> -       for (i = 1; i < nr_pages; i++) {
> -               struct page *p = page + i;
> -               set_page_count(p, 0);
> -               p->mapping = TAIL_MAPPING;
> -               set_compound_head(p, page);
> -       }
> +       for (i = 1; i < nr_pages; i++)
> +               prep_compound_tail(page, i);
>
> -       set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
> -       set_compound_order(page, order);
> -       atomic_set(compound_mapcount_ptr(page), -1);
> -       if (hpage_pincount_available(page))
> -               atomic_set(compound_pincount_ptr(page), 0);
> +       prep_compound_head(page, order);
>  }
>
>  #ifdef CONFIG_DEBUG_PAGEALLOC
> --
> 2.17.1
>

* Re: [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init
  2021-03-25 23:09 ` [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
@ 2021-04-24  0:18   ` Dan Williams
  2021-04-24 19:05     ` Joao Martins
  0 siblings, 1 reply; 44+ messages in thread
From: Dan Williams @ 2021-04-24  0:18 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Move struct page init to a helper function __init_zone_device_page().

Same sentence addition suggestion from the last patch to make this
patch have some rationale for existing.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/page_alloc.c | 74 +++++++++++++++++++++++++++----------------------
>  1 file changed, 41 insertions(+), 33 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 43dd98446b0b..58974067bbd4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6237,6 +6237,46 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
>  }
>
>  #ifdef CONFIG_ZONE_DEVICE
> +static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
> +                                         unsigned long zone_idx, int nid,
> +                                         struct dev_pagemap *pgmap)
> +{
> +
> +       __init_single_page(page, pfn, zone_idx, nid);
> +
> +       /*
> +        * Mark page reserved as it will need to wait for onlining
> +        * phase for it to be fully associated with a zone.
> +        *
> +        * We can use the non-atomic __set_bit operation for setting
> +        * the flag as we are still initializing the pages.
> +        */
> +       __SetPageReserved(page);
> +
> +       /*
> +        * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
> +        * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
> +        * ever freed or placed on a driver-private list.
> +        */
> +       page->pgmap = pgmap;
> +       page->zone_device_data = NULL;
> +
> +       /*
> +        * Mark the block movable so that blocks are reserved for
> +        * movable at startup. This will force kernel allocations
> +        * to reserve their blocks rather than leaking throughout
> +        * the address space during boot when many long-lived
> +        * kernel allocations are made.
> +        *
> +        * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
> +        * because this is done early in section_activate()
> +        */
> +       if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
> +               set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> +               cond_resched();
> +       }
> +}
> +
>  void __ref memmap_init_zone_device(struct zone *zone,
>                                    unsigned long start_pfn,
>                                    unsigned long nr_pages,
> @@ -6265,39 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
>         for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>                 struct page *page = pfn_to_page(pfn);
>
> -               __init_single_page(page, pfn, zone_idx, nid);
> -
> -               /*
> -                * Mark page reserved as it will need to wait for onlining
> -                * phase for it to be fully associated with a zone.
> -                *
> -                * We can use the non-atomic __set_bit operation for setting
> -                * the flag as we are still initializing the pages.
> -                */
> -               __SetPageReserved(page);
> -
> -               /*
> -                * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
> -                * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
> -                * ever freed or placed on a driver-private list.
> -                */
> -               page->pgmap = pgmap;
> -               page->zone_device_data = NULL;
> -
> -               /*
> -                * Mark the block movable so that blocks are reserved for
> -                * movable at startup. This will force kernel allocations
> -                * to reserve their blocks rather than leaking throughout
> -                * the address space during boot when many long-lived
> -                * kernel allocations are made.
> -                *
> -                * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
> -                * because this is done early in section_activate()
> -                */
> -               if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
> -                       set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> -                       cond_resched();
> -               }
> +               __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
>         }
>
>         pr_info("%s initialised %lu pages in %ums\n", __func__,
> --
> 2.17.1
>

* Re: [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-04-24  0:12   ` Dan Williams
@ 2021-04-24 19:00     ` Joao Martins
  0 siblings, 0 replies; 44+ messages in thread
From: Joao Martins @ 2021-04-24 19:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton



On 4/24/21 1:12 AM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
>> dax_lock_page()).  For pagemaps with compound pages, fetch the
>> compound_head in case we are handling a tail-page memory failure.
>>
>> Currently this is a nop, but with the advent of compound pages in
>> dev_pagemap it allows memory_failure_dev_pagemap() to keep working.
>>
>> Reported-by: Jane Chu <jane.chu@oracle.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/memory-failure.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 24210c9bd843..94240d772623 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1318,6 +1318,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>>                 goto out;
>>         }
>>
>> +       page = compound_head(page);
> 
> Unless / until we do compound pages for the filesystem-dax case, I
> would add a comment like:
> 
> /* pages instantiated by device-dax (not filesystem-dax) may be
> compound pages */
> 
I've fixed up with the comment.

Thanks!
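For illustration, the head/tail relationship that the `compound_head()` fix relies on can be modeled in userspace roughly like this. It is a toy model: the real `struct page` encodes the head pointer in `compound_head` with the low bit set, rather than a plain pointer as here.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a compound page: each tail records its head. */
struct toy_page {
	struct toy_page *head;	/* NULL on the head page itself */
};

/* Resolve any page, head or tail, to the head of its compound page. */
static struct toy_page *toy_compound_head(struct toy_page *p)
{
	return p->head ? p->head : p;
}

/* Link @nr pages in a flat array into one compound page. */
static void toy_prep_compound(struct toy_page *pages, size_t nr)
{
	pages[0].head = NULL;
	for (size_t i = 1; i < nr; i++)
		pages[i].head = &pages[0];
}
```

The memory-failure path may be handed the pfn of any subpage; resolving to the head first is what lets the rest of the handler keep working once dev_pagemap pages become compound.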

* Re: [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts
  2021-04-24  0:16   ` Dan Williams
@ 2021-04-24 19:05     ` Joao Martins
  0 siblings, 0 replies; 44+ messages in thread
From: Joao Martins @ 2021-04-24 19:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton



On 4/24/21 1:16 AM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Split the utility function prep_compound_page() into head and tail
>> counterparts, and use them accordingly.
> 
> To make this patch stand alone better lets add another sentence:
> 
> "This is in preparation for sharing the storage for / deduplicating
> compound page metadata."
> 
Yeap, I've fixed it up.

> Other than that, looks good to me.
> 
/me nods

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/page_alloc.c | 32 +++++++++++++++++++++-----------
>>  1 file changed, 21 insertions(+), 11 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index c53fe4fa10bf..43dd98446b0b 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -692,24 +692,34 @@ void free_compound_page(struct page *page)
>>         __free_pages_ok(page, compound_order(page), FPI_NONE);
>>  }
>>
>> +static void prep_compound_head(struct page *page, unsigned int order)
>> +{
>> +       set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
>> +       set_compound_order(page, order);
>> +       atomic_set(compound_mapcount_ptr(page), -1);
>> +       if (hpage_pincount_available(page))
>> +               atomic_set(compound_pincount_ptr(page), 0);
>> +}
>> +
>> +static void prep_compound_tail(struct page *head, int tail_idx)
>> +{
>> +       struct page *p = head + tail_idx;
>> +
>> +       set_page_count(p, 0);
>> +       p->mapping = TAIL_MAPPING;
>> +       set_compound_head(p, head);
>> +}
>> +
>>  void prep_compound_page(struct page *page, unsigned int order)
>>  {
>>         int i;
>>         int nr_pages = 1 << order;
>>
>>         __SetPageHead(page);
>> -       for (i = 1; i < nr_pages; i++) {
>> -               struct page *p = page + i;
>> -               set_page_count(p, 0);
>> -               p->mapping = TAIL_MAPPING;
>> -               set_compound_head(p, page);
>> -       }
>> +       for (i = 1; i < nr_pages; i++)
>> +               prep_compound_tail(page, i);
>>
>> -       set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
>> -       set_compound_order(page, order);
>> -       atomic_set(compound_mapcount_ptr(page), -1);
>> -       if (hpage_pincount_available(page))
>> -               atomic_set(compound_pincount_ptr(page), 0);
>> +       prep_compound_head(page, order);
>>  }
>>
>>  #ifdef CONFIG_DEBUG_PAGEALLOC
>> --
>> 2.17.1
>>
> 

* Re: [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init
  2021-04-24  0:18   ` Dan Williams
@ 2021-04-24 19:05     ` Joao Martins
  0 siblings, 0 replies; 44+ messages in thread
From: Joao Martins @ 2021-04-24 19:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton



On 4/24/21 1:18 AM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Move struct page init to a helper function __init_zone_device_page().
> 
> Same sentence addition suggestion from the last patch to make this
> patch have some rationale for existing.
> 
I have fixed this too, with the same message as the previous patch.

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/page_alloc.c | 74 +++++++++++++++++++++++++++----------------------
>>  1 file changed, 41 insertions(+), 33 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 43dd98446b0b..58974067bbd4 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6237,6 +6237,46 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
>>  }
>>
>>  #ifdef CONFIG_ZONE_DEVICE
>> +static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>> +                                         unsigned long zone_idx, int nid,
>> +                                         struct dev_pagemap *pgmap)
>> +{
>> +
>> +       __init_single_page(page, pfn, zone_idx, nid);
>> +
>> +       /*
>> +        * Mark page reserved as it will need to wait for onlining
>> +        * phase for it to be fully associated with a zone.
>> +        *
>> +        * We can use the non-atomic __set_bit operation for setting
>> +        * the flag as we are still initializing the pages.
>> +        */
>> +       __SetPageReserved(page);
>> +
>> +       /*
>> +        * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
>> +        * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
>> +        * ever freed or placed on a driver-private list.
>> +        */
>> +       page->pgmap = pgmap;
>> +       page->zone_device_data = NULL;
>> +
>> +       /*
>> +        * Mark the block movable so that blocks are reserved for
>> +        * movable at startup. This will force kernel allocations
>> +        * to reserve their blocks rather than leaking throughout
>> +        * the address space during boot when many long-lived
>> +        * kernel allocations are made.
>> +        *
>> +        * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
>> +        * because this is done early in section_activate()
>> +        */
>> +       if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
>> +               set_pageblock_migratetype(page, MIGRATE_MOVABLE);
>> +               cond_resched();
>> +       }
>> +}
>> +
>>  void __ref memmap_init_zone_device(struct zone *zone,
>>                                    unsigned long start_pfn,
>>                                    unsigned long nr_pages,
>> @@ -6265,39 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>         for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>>                 struct page *page = pfn_to_page(pfn);
>>
>> -               __init_single_page(page, pfn, zone_idx, nid);
>> -
>> -               /*
>> -                * Mark page reserved as it will need to wait for onlining
>> -                * phase for it to be fully associated with a zone.
>> -                *
>> -                * We can use the non-atomic __set_bit operation for setting
>> -                * the flag as we are still initializing the pages.
>> -                */
>> -               __SetPageReserved(page);
>> -
>> -               /*
>> -                * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
>> -                * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
>> -                * ever freed or placed on a driver-private list.
>> -                */
>> -               page->pgmap = pgmap;
>> -               page->zone_device_data = NULL;
>> -
>> -               /*
>> -                * Mark the block movable so that blocks are reserved for
>> -                * movable at startup. This will force kernel allocations
>> -                * to reserve their blocks rather than leaking throughout
>> -                * the address space during boot when many long-lived
>> -                * kernel allocations are made.
>> -                *
>> -                * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
>> -                * because this is done early in section_activate()
>> -                */
>> -               if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
>> -                       set_pageblock_migratetype(page, MIGRATE_MOVABLE);
>> -                       cond_resched();
>> -               }
>> +               __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
>>         }
>>
>>         pr_info("%s initialised %lu pages in %ums\n", __func__,
>> --
>> 2.17.1
>>

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-03-25 23:09 ` [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
@ 2021-05-05 18:44   ` Dan Williams
  2021-05-05 18:58     ` Matthew Wilcox
  2021-05-05 19:49     ` Joao Martins
  0 siblings, 2 replies; 44+ messages in thread
From: Dan Williams @ 2021-05-05 18:44 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Add a new align property for struct dev_pagemap which specifies that a
> pagemap is composed of a set of compound pages of size @align, instead
> of base pages. When these pages are initialised, most are initialised as
> tail pages instead of order-0 pages.
>
> For certain ZONE_DEVICE users like device-dax which have a fixed page
> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
> treating it the same way as THP or hugetlb pages.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/linux/memremap.h | 13 +++++++++++++
>  mm/memremap.c            |  8 ++++++--
>  mm/page_alloc.c          | 24 +++++++++++++++++++++++-
>  3 files changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index b46f63dcaed3..bb28d82dda5e 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -114,6 +114,7 @@ struct dev_pagemap {
>         struct completion done;
>         enum memory_type type;
>         unsigned int flags;
> +       unsigned long align;

I think this wants some kernel-doc above to indicate that non-zero
means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
means "use non-compound base pages". The non-zero value must be
PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE. Hmm, maybe it should be an
enum:

enum devmap_geometry {
    DEVMAP_PTE,
    DEVMAP_PMD,
    DEVMAP_PUD,
}

...because it's more than just an alignment it's a structural
definition of how the memmap is laid out.
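A minimal sketch of how such an enum might be derived from the byte alignment, assuming x86-64 huge-page sizes (2M PMD, 1G PUD). The translation helper is hypothetical, not part of the patch or of Dan's suggestion:

```c
#include <assert.h>

#define PMD_SIZE	(1UL << 21)	/* 2M on x86-64 */
#define PUD_SIZE	(1UL << 30)	/* 1G on x86-64 */

enum devmap_geometry {
	DEVMAP_PTE,
	DEVMAP_PMD,
	DEVMAP_PUD,
};

/* Hypothetical helper: map a dev_pagemap byte alignment to the enum;
 * zero (or PAGE_SIZE) falls back to non-compound base pages. */
static enum devmap_geometry devmap_geometry_of(unsigned long align)
{
	if (align == PUD_SIZE)
		return DEVMAP_PUD;
	if (align == PMD_SIZE)
		return DEVMAP_PMD;
	return DEVMAP_PTE;
}
```

The point of the enum over a raw alignment is that only these three layouts are structurally meaningful to the memmap, so invalid alignments cannot be expressed at all.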

>         const struct dev_pagemap_ops *ops;
>         void *owner;
>         int nr_range;
> @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
>         return NULL;
>  }
>
> +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
> +{
> +       if (!pgmap || !pgmap->align)
> +               return PAGE_SIZE;
> +       return pgmap->align;
> +}
> +
> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
> +{
> +       return PHYS_PFN(pgmap_align(pgmap));
> +}
> +
>  #ifdef CONFIG_ZONE_DEVICE
>  bool pfn_zone_device_reserved(unsigned long pfn);
>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 805d761740c4..d160853670c4 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>         memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>                                 PHYS_PFN(range->start),
>                                 PHYS_PFN(range_len(range)), pgmap);
> -       percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> -                       - pfn_first(pgmap, range_id));
> +       if (pgmap_align(pgmap) > PAGE_SIZE)
> +               percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> +                       - pfn_first(pgmap, range_id)) / pgmap_pfn_align(pgmap));
> +       else
> +               percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> +                               - pfn_first(pgmap, range_id));
>         return 0;
>
>  err_add_memory:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 58974067bbd4..3a77f9e43f3a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
>         unsigned long pfn, end_pfn = start_pfn + nr_pages;
>         struct pglist_data *pgdat = zone->zone_pgdat;
>         struct vmem_altmap *altmap = pgmap_altmap(pgmap);
> +       unsigned int pfn_align = pgmap_pfn_align(pgmap);
> +       unsigned int order_align = order_base_2(pfn_align);
>         unsigned long zone_idx = zone_idx(zone);
>         unsigned long start = jiffies;
>         int nid = pgdat->node_id;
> @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
>                 nr_pages = end_pfn - start_pfn;
>         }
>
> -       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> +       for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {

pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>=
PAGE_SHIFT" I missed somewhere?

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-05 18:44   ` Dan Williams
@ 2021-05-05 18:58     ` Matthew Wilcox
  2021-05-05 19:49     ` Joao Martins
  1 sibling, 0 replies; 44+ messages in thread
From: Matthew Wilcox @ 2021-05-05 18:58 UTC (permalink / raw)
  To: Dan Williams
  Cc: Joao Martins, Linux MM, linux-nvdimm, Jason Gunthorpe, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton

On Wed, May 05, 2021 at 11:44:29AM -0700, Dan Williams wrote:
> > @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
> >         unsigned long pfn, end_pfn = start_pfn + nr_pages;
> >         struct pglist_data *pgdat = zone->zone_pgdat;
> >         struct vmem_altmap *altmap = pgmap_altmap(pgmap);
> > +       unsigned int pfn_align = pgmap_pfn_align(pgmap);
> > +       unsigned int order_align = order_base_2(pfn_align);
> >         unsigned long zone_idx = zone_idx(zone);
> >         unsigned long start = jiffies;
> >         int nid = pgdat->node_id;
> > @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
> >                 nr_pages = end_pfn - start_pfn;
> >         }
> >
> > -       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > +       for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
> 
> pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>=
> PAGE_SHIFT" I missed somewhere?

If something is measured in bytes, I like to use size_t (if it's
in memory) and loff_t (if it's on storage).  The compiler doesn't do
anything useful to warn you, but it's a nice indication to humans about
what's going on.  And it removes the temptation to do 'pfn_align >>=
PAGE_SHIFT' and suddenly take pfn_align from being measured in bytes to
being measured in pages.
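One way to go a step further than the naming convention and have the compiler actually object to mixed units is to wrap the page count in a distinct struct type, sketched below. The type and helper names are invented for illustration; the kernel itself mostly relies on naming plus macros like PHYS_PFN() for conversions.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT	12

/* Byte counts stay in size_t; page counts get a wrapper type so a
 * conversion between the two is always an explicit call, never a
 * silent reuse of the same variable in a different unit. */
typedef struct { unsigned long val; } nr_pages_t;

static nr_pages_t bytes_to_pages(size_t bytes)
{
	return (nr_pages_t){ bytes >> PAGE_SHIFT };
}

static size_t pages_to_bytes(nr_pages_t pages)
{
	return (size_t)pages.val << PAGE_SHIFT;
}
```

With this, an expression like `pfn += align_bytes` simply fails to compile, which is exactly the class of mixup being discussed above.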

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-05 18:44   ` Dan Williams
  2021-05-05 18:58     ` Matthew Wilcox
@ 2021-05-05 19:49     ` Joao Martins
  2021-05-05 22:20       ` Dan Williams
  1 sibling, 1 reply; 44+ messages in thread
From: Joao Martins @ 2021-05-05 19:49 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton



On 5/5/21 7:44 PM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Add a new align property for struct dev_pagemap which specifies that a
>> pagemap is composed of a set of compound pages of size @align, instead
>> of base pages. When these pages are initialised, most are initialised as
>> tail pages instead of order-0 pages.
>>
>> For certain ZONE_DEVICE users like device-dax which have a fixed page
>> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
>> treating it the same way as THP or hugetlb pages.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  include/linux/memremap.h | 13 +++++++++++++
>>  mm/memremap.c            |  8 ++++++--
>>  mm/page_alloc.c          | 24 +++++++++++++++++++++++-
>>  3 files changed, 42 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index b46f63dcaed3..bb28d82dda5e 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>         struct completion done;
>>         enum memory_type type;
>>         unsigned int flags;
>> +       unsigned long align;
> 
> I think this wants some kernel-doc above to indicate that non-zero
> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
> means "use non-compound base pages". 

Got it. Are you thinking of a kernel-doc comment on top of the variable above, or
preferably on top of pgmap_align()?

> The non-zero value must be
> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE. 
> Hmm, maybe it should be an
> enum:
> 
> enum devmap_geometry {
>     DEVMAP_PTE,
>     DEVMAP_PMD,
>     DEVMAP_PUD,
> }
> 
I suppose a converter between devmap_geometry and page size would be needed too? And
maybe the dax/nvdimm align values should change as well in the meantime (as a followup
improvement)?

Although to be fair we only ever care about compound page size in this series (and
similarly dax/nvdimm @align properties).

> ...because it's more than just an alignment it's a structural
> definition of how the memmap is laid out.
> 
>>         const struct dev_pagemap_ops *ops;
>>         void *owner;
>>         int nr_range;
>> @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
>>         return NULL;
>>  }
>>
>> +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
>> +{
>> +       if (!pgmap || !pgmap->align)
>> +               return PAGE_SIZE;
>> +       return pgmap->align;
>> +}
>> +
>> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
>> +{
>> +       return PHYS_PFN(pgmap_align(pgmap));
>> +}
>> +
>>  #ifdef CONFIG_ZONE_DEVICE
>>  bool pfn_zone_device_reserved(unsigned long pfn);
>>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index 805d761740c4..d160853670c4 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>>         memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>>                                 PHYS_PFN(range->start),
>>                                 PHYS_PFN(range_len(range)), pgmap);
>> -       percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>> -                       - pfn_first(pgmap, range_id));
>> +       if (pgmap_align(pgmap) > PAGE_SIZE)
>> +               percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
>> +                       - pfn_first(pgmap, range_id)) / pgmap_pfn_align(pgmap));
>> +       else
>> +               percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>> +                               - pfn_first(pgmap, range_id));
>>         return 0;
>>
>>  err_add_memory:
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 58974067bbd4..3a77f9e43f3a 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>         unsigned long pfn, end_pfn = start_pfn + nr_pages;
>>         struct pglist_data *pgdat = zone->zone_pgdat;
>>         struct vmem_altmap *altmap = pgmap_altmap(pgmap);
>> +       unsigned int pfn_align = pgmap_pfn_align(pgmap);
>> +       unsigned int order_align = order_base_2(pfn_align);
>>         unsigned long zone_idx = zone_idx(zone);
>>         unsigned long start = jiffies;
>>         int nid = pgdat->node_id;
>> @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>                 nr_pages = end_pfn - start_pfn;
>>         }
>>
>> -       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>> +       for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
> 
> pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>=
> PAGE_SHIFT" I missed somewhere?
> 
@pfn_align is in pages too. It's pgmap_align() which is in bytes:

+static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
+{
+       return PHYS_PFN(pgmap_align(pgmap));
+}

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-05 19:49     ` Joao Martins
@ 2021-05-05 22:20       ` Dan Williams
  2021-05-05 22:36         ` Joao Martins
  2021-05-06  8:05         ` Aneesh Kumar K.V
  0 siblings, 2 replies; 44+ messages in thread
From: Dan Williams @ 2021-05-05 22:20 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
>
>
> On 5/5/21 7:44 PM, Dan Williams wrote:
> > On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> Add a new align property for struct dev_pagemap which specifies that a
> >> pagemap is composed of a set of compound pages of size @align, instead
> >> of base pages. When these pages are initialised, most are initialised as
> >> tail pages instead of order-0 pages.
> >>
> >> For certain ZONE_DEVICE users like device-dax which have a fixed page
> >> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
> >> treating it the same way as THP or hugetlb pages.
> >>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >>  include/linux/memremap.h | 13 +++++++++++++
> >>  mm/memremap.c            |  8 ++++++--
> >>  mm/page_alloc.c          | 24 +++++++++++++++++++++++-
> >>  3 files changed, 42 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> >> index b46f63dcaed3..bb28d82dda5e 100644
> >> --- a/include/linux/memremap.h
> >> +++ b/include/linux/memremap.h
> >> @@ -114,6 +114,7 @@ struct dev_pagemap {
> >>         struct completion done;
> >>         enum memory_type type;
> >>         unsigned int flags;
> >> +       unsigned long align;
> >
> > I think this wants some kernel-doc above to indicate that non-zero
> > means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
> > means "use non-compound base pages".
>
> Got it. Are you thinking of a kernel-doc comment on top of the variable above, or
> preferably on top of pgmap_align()?

I was thinking in dev_pagemap, because this value is more than just a
plain alignment: it's restructuring the layout and construction of the
memmap. But when I say it that way, it seems much less like a vanilla
align value and more like a construction / geometry mode setting.

>
> > The non-zero value must be
> > PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
> > Hmm, maybe it should be an
> > enum:
> >
> > enum devmap_geometry {
> >     DEVMAP_PTE,
> >     DEVMAP_PMD,
> >     DEVMAP_PUD,
> > }
> >
> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?

I think it is ok for dax/nvdimm to continue to maintain their align
value because it should be ok to have 4MB align if the device really
wanted. However, when it goes to map that alignment with
memremap_pages() it can pick a mode. For example, it's already the
case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
they're already separate concepts that can stay separate.

>
> Although to be fair we only ever care about compound page size in this series (and
> similarly dax/nvdimm @align properties).
>
> > ...because it's more than just an alignment it's a structural
> > definition of how the memmap is laid out.
> >
> >>         const struct dev_pagemap_ops *ops;
> >>         void *owner;
> >>         int nr_range;
> >> @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
> >>         return NULL;
> >>  }
> >>
> >> +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
> >> +{
> >> +       if (!pgmap || !pgmap->align)
> >> +               return PAGE_SIZE;
> >> +       return pgmap->align;
> >> +}
> >> +
> >> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
> >> +{
> >> +       return PHYS_PFN(pgmap_align(pgmap));
> >> +}
> >> +
> >>  #ifdef CONFIG_ZONE_DEVICE
> >>  bool pfn_zone_device_reserved(unsigned long pfn);
> >>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
> >> diff --git a/mm/memremap.c b/mm/memremap.c
> >> index 805d761740c4..d160853670c4 100644
> >> --- a/mm/memremap.c
> >> +++ b/mm/memremap.c
> >> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> >>         memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> >>                                 PHYS_PFN(range->start),
> >>                                 PHYS_PFN(range_len(range)), pgmap);
> >> -       percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> >> -                       - pfn_first(pgmap, range_id));
> >> +       if (pgmap_align(pgmap) > PAGE_SIZE)
> >> +               percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> >> +                       - pfn_first(pgmap, range_id)) / pgmap_pfn_align(pgmap));
> >> +       else
> >> +               percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> >> +                               - pfn_first(pgmap, range_id));
> >>         return 0;
> >>
> >>  err_add_memory:
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 58974067bbd4..3a77f9e43f3a 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
> >>         unsigned long pfn, end_pfn = start_pfn + nr_pages;
> >>         struct pglist_data *pgdat = zone->zone_pgdat;
> >>         struct vmem_altmap *altmap = pgmap_altmap(pgmap);
> >> +       unsigned int pfn_align = pgmap_pfn_align(pgmap);
> >> +       unsigned int order_align = order_base_2(pfn_align);
> >>         unsigned long zone_idx = zone_idx(zone);
> >>         unsigned long start = jiffies;
> >>         int nid = pgdat->node_id;
> >> @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
> >>                 nr_pages = end_pfn - start_pfn;
> >>         }
> >>
> >> -       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> >> +       for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
> >
> > pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>=
> > PAGE_SHIFT" I missed somewhere?
> >
> @pfn_align is in pages too. It's pgmap_align() which is in bytes:
>
> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
> +{
> +       return PHYS_PFN(pgmap_align(pgmap));
> +}

Ah yup, my eyes glazed over that. I think this is another place that
benefits from a more specific name than "align". "pfns_per_compound"
"compound_pfns"?

* Re: [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation
  2021-03-25 23:09 ` [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
@ 2021-05-05 22:34   ` Dan Williams
  2021-05-05 22:37     ` Joao Martins
  0 siblings, 1 reply; 44+ messages in thread
From: Dan Williams @ 2021-05-05 22:34 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Oscar Salvador

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> @altmap is stored in a dev_pagemap, but it will be repurposed for
> hotplug memory for storing the memmap in the hotplugged memory[*] and
> reusing the altmap infrastructure to that end. This is to say that
> altmap can't be replaced with a @pgmap as it is going to cover more than
> dev_pagemap backend altmaps.

I was going to say, just pass the pgmap and lookup the altmap from
pgmap, but Oscar added a use case for altmap independent of pgmap. So
you might refresh this commit message to clarify why passing pgmap by
itself is not sufficient.

Other than that, looks good.

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-05 22:20       ` Dan Williams
@ 2021-05-05 22:36         ` Joao Martins
  2021-05-05 23:03           ` Dan Williams
  2021-05-18 17:27           ` Joao Martins
  2021-05-06  8:05         ` Aneesh Kumar K.V
  1 sibling, 2 replies; 44+ messages in thread
From: Joao Martins @ 2021-05-05 22:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On 5/5/21 11:20 PM, Dan Williams wrote:
> On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 5/5/21 7:44 PM, Dan Williams wrote:
>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>
>>>> Add a new align property for struct dev_pagemap which specifies that a
>>>> pagemap is composed of a set of compound pages of size @align, instead
>>>> of base pages. When these pages are initialised, most are initialised as
>>>> tail pages instead of order-0 pages.
>>>>
>>>> For certain ZONE_DEVICE users like device-dax which have a fixed page
>>>> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
>>>> treating it the same way as THP or hugetlb pages.
>>>>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> ---
>>>>  include/linux/memremap.h | 13 +++++++++++++
>>>>  mm/memremap.c            |  8 ++++++--
>>>>  mm/page_alloc.c          | 24 +++++++++++++++++++++++-
>>>>  3 files changed, 42 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>> index b46f63dcaed3..bb28d82dda5e 100644
>>>> --- a/include/linux/memremap.h
>>>> +++ b/include/linux/memremap.h
>>>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>>>         struct completion done;
>>>>         enum memory_type type;
>>>>         unsigned int flags;
>>>> +       unsigned long align;
>>>
>>> I think this wants some kernel-doc above to indicate that non-zero
>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
>>> means "use non-compound base pages".
>>
>> Got it. Are you thinking of a kernel-doc comment on top of the variable above, or
>> preferably on top of pgmap_align()?
> 
> I was thinking in dev_pagemap, because this value is more than just a
> plain alignment: it's restructuring the layout and construction of the
> memmap. But when I say it that way, it seems much less like a vanilla
> align value and more like a construction / geometry mode setting.
> 
/me nods

>>
>>> The non-zero value must be
>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
>>> Hmm, maybe it should be an
>>> enum:
>>>
>>> enum devmap_geometry {
>>>     DEVMAP_PTE,
>>>     DEVMAP_PMD,
>>>     DEVMAP_PUD,
>>> }
>>>
>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
> 
> I think it is ok for dax/nvdimm to continue to maintain their align
> value because it should be ok to have 4MB align if the device really
> wanted. However, when it goes to map that alignment with
> memremap_pages() it can pick a mode. For example, it's already the
> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
> they're already separate concepts that can stay separate.
> 
Gotcha.

>>
>> Although to be fair we only ever care about compound page size in this series (and
>> similarly dax/nvdimm @align properties).
>>
>>> ...because it's more than just an alignment it's a structural
>>> definition of how the memmap is laid out.
>>>

Somehow I missed this other part of your response.

>>>>         const struct dev_pagemap_ops *ops;
>>>>         void *owner;
>>>>         int nr_range;
>>>> @@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
>>>>         return NULL;
>>>>  }
>>>>
>>>> +static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
>>>> +{
>>>> +       if (!pgmap || !pgmap->align)
>>>> +               return PAGE_SIZE;
>>>> +       return pgmap->align;
>>>> +}
>>>> +
>>>> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
>>>> +{
>>>> +       return PHYS_PFN(pgmap_align(pgmap));
>>>> +}
>>>> +
>>>>  #ifdef CONFIG_ZONE_DEVICE
>>>>  bool pfn_zone_device_reserved(unsigned long pfn);
>>>>  void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>> index 805d761740c4..d160853670c4 100644
>>>> --- a/mm/memremap.c
>>>> +++ b/mm/memremap.c
>>>> @@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>>>>         memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>>>>                                 PHYS_PFN(range->start),
>>>>                                 PHYS_PFN(range_len(range)), pgmap);
>>>> -       percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>>>> -                       - pfn_first(pgmap, range_id));
>>>> +       if (pgmap_align(pgmap) > PAGE_SIZE)
>>>> +               percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
>>>> +                       - pfn_first(pgmap, range_id)) / pgmap_pfn_align(pgmap));
>>>> +       else
>>>> +               percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>>>> +                               - pfn_first(pgmap, range_id));
>>>>         return 0;
>>>>
>>>>  err_add_memory:
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 58974067bbd4..3a77f9e43f3a 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>>>         unsigned long pfn, end_pfn = start_pfn + nr_pages;
>>>>         struct pglist_data *pgdat = zone->zone_pgdat;
>>>>         struct vmem_altmap *altmap = pgmap_altmap(pgmap);
>>>> +       unsigned int pfn_align = pgmap_pfn_align(pgmap);
>>>> +       unsigned int order_align = order_base_2(pfn_align);
>>>>         unsigned long zone_idx = zone_idx(zone);
>>>>         unsigned long start = jiffies;
>>>>         int nid = pgdat->node_id;
>>>> @@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>>>                 nr_pages = end_pfn - start_pfn;
>>>>         }
>>>>
>>>> -       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>>>> +       for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
>>>
>>> pfn_align is in bytes and pfn is in pages... is there a "pfn_align >>=
>>> PAGE_SHIFT" I missed somewhere?
>>>
>> @pfn_align is in pages too. It's pgmap_align() which is in bytes:
>>
>> +static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
>> +{
>> +       return PHYS_PFN(pgmap_align(pgmap));
>> +}
> 
> Ah yup, my eyes glazed over that. I think this is another place that
> benefits from a more specific name than "align". "pfns_per_compound"
> "compound_pfns"?
> 
We are still describing a page, just not a base page. So perhaps @pfns_per_hpage ?

I am fine with @pfns_per_compound or @compound_pfns as well.

* Re: [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation
  2021-05-05 22:34   ` Dan Williams
@ 2021-05-05 22:37     ` Joao Martins
  2021-05-05 23:14       ` Dan Williams
  0 siblings, 1 reply; 44+ messages in thread
From: Joao Martins @ 2021-05-05 22:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Oscar Salvador



On 5/5/21 11:34 PM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> @altmap is stored in a dev_pagemap, but it will be repurposed for
>> hotplug memory for storing the memmap in the hotplugged memory[*] and
>> reusing the altmap infrastructure to that end. This is to say that
>> altmap can't be replaced with a @pgmap as it is going to cover more than
>> dev_pagemap backend altmaps.
> 
> I was going to say, just pass the pgmap and lookup the altmap from
> pgmap, but Oscar added a use case for altmap independent of pgmap. So
> you might refresh this commit message to clarify why passing pgmap by
> itself is not sufficient.
> 
Isn't that what I am doing above with that exact paragraph? I even
reference his series at the end of the commit description in [*] :)

> Other than that, looks good.
> 
/me nods

* Re: [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
  2021-03-25 23:09 ` [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages() Joao Martins
@ 2021-05-05 22:43   ` Dan Williams
  2021-05-06 10:27     ` Joao Martins
  0 siblings, 1 reply; 44+ messages in thread
From: Dan Williams @ 2021-05-05 22:43 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

I suspect it's a good sign I'm only finding cosmetic and changelog
changes in the review... I have some more:

A year from now, if I'm tracking down a problem and looking through mm
commits, I would appreciate a subject line like the following:
"refactor core of vmemmap_populate_basepages() to helper", which gives
an idea of the impact and side effects of the change.

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>

I would add a lead in phrase like: "In preparation for describing a
memmap with compound pages, move the actual..."

> Move the actual pte population logic into a separate function
> vmemmap_populate_address() and have vmemmap_populate_basepages()
> walk through all base pages it needs to populate.

Aside from the above, looks good.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/sparse-vmemmap.c | 44 ++++++++++++++++++++++++++------------------
>  1 file changed, 26 insertions(+), 18 deletions(-)
>
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 370728c206ee..8814876edcfa 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -216,33 +216,41 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>         return pgd;
>  }
>
> -int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> -                                        int node, struct vmem_altmap *altmap)
> +static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> +                                             struct vmem_altmap *altmap)
>  {
> -       unsigned long addr = start;
>         pgd_t *pgd;
>         p4d_t *p4d;
>         pud_t *pud;
>         pmd_t *pmd;
>         pte_t *pte;
>
> +       pgd = vmemmap_pgd_populate(addr, node);
> +       if (!pgd)
> +               return -ENOMEM;
> +       p4d = vmemmap_p4d_populate(pgd, addr, node);
> +       if (!p4d)
> +               return -ENOMEM;
> +       pud = vmemmap_pud_populate(p4d, addr, node);
> +       if (!pud)
> +               return -ENOMEM;
> +       pmd = vmemmap_pmd_populate(pud, addr, node);
> +       if (!pmd)
> +               return -ENOMEM;
> +       pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> +       if (!pte)
> +               return -ENOMEM;
> +       vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
> +
> +       return 0;
> +}
> +
> +int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> +                                        int node, struct vmem_altmap *altmap)
> +{
> +       unsigned long addr = start;
> +
>         for (; addr < end; addr += PAGE_SIZE) {
> -               pgd = vmemmap_pgd_populate(addr, node);
> -               if (!pgd)
> -                       return -ENOMEM;
> -               p4d = vmemmap_p4d_populate(pgd, addr, node);
> -               if (!p4d)
> -                       return -ENOMEM;
> -               pud = vmemmap_pud_populate(p4d, addr, node);
> -               if (!pud)
> -                       return -ENOMEM;
> -               pmd = vmemmap_pmd_populate(pud, addr, node);
> -               if (!pmd)
> -                       return -ENOMEM;
> -               pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> -               if (!pte)
> +               if (vmemmap_populate_address(addr, node, altmap))
>                         return -ENOMEM;
> -               vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
>         }
>
>         return 0;
> --
> 2.17.1
>

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-05 22:36         ` Joao Martins
@ 2021-05-05 23:03           ` Dan Williams
  2021-05-06 10:12             ` Joao Martins
  2021-05-18 17:27           ` Joao Martins
  1 sibling, 1 reply; 44+ messages in thread
From: Dan Williams @ 2021-05-05 23:03 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On Wed, May 5, 2021 at 3:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
> > Ah yup, my eyes glazed over that. I think this is another place that
> > benefits from a more specific name than "align". "pfns_per_compound"
> > "compound_pfns"?
> >
> We are still describing a page, just not a base page. So perhaps @pfns_per_hpage ?
>
> I am fine with @pfns_per_compound or @compound_pfns as well.

My only concern about hpage is that hpage implies PMD, where compound
is generic across PMD and PUD.

* Re: [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation
  2021-05-05 22:37     ` Joao Martins
@ 2021-05-05 23:14       ` Dan Williams
  2021-05-06 10:24         ` Joao Martins
  0 siblings, 1 reply; 44+ messages in thread
From: Dan Williams @ 2021-05-05 23:14 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Oscar Salvador

On Wed, May 5, 2021 at 3:38 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
>
>
> On 5/5/21 11:34 PM, Dan Williams wrote:
> > On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> @altmap is stored in a dev_pagemap, but it will be repurposed for
> >> hotplug memory for storing the memmap in the hotplugged memory[*] and
> >> reusing the altmap infrastructure to that end. This is to say that
> >> altmap can't be replaced with a @pgmap as it is going to cover more than
> >> dev_pagemap backend altmaps.
> >
> > I was going to say, just pass the pgmap and lookup the altmap from
> > pgmap, but Oscar added a use case for altmap independent of pgmap. So
> > you might refresh this commit message to clarify why passing pgmap by
> > itself is not sufficient.
> >
> Isn't that what I am doing above with that exact paragraph? I even
> reference his series at the end of the commit description in [*] :)

Oh, sorry, it didn't hit me explicitly that you were talking about
Oscar's work; I thought you were referring to your own changes in this
set. I see it now... at a minimum the tense needs updating, since
Oscar's changes are now in the past, not the future. If it helps,
the following reads more direct to me: "In support of using compound
pages for devmap mappings, plumb the pgmap down to the
vmemmap_populate* implementation. Note that while altmap is
retrievable from pgmap the memory hotplug codes passes altmap without
pgmap, so both need to be independently plumbed."

* Re: [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
  2021-03-25 23:09 ` [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
@ 2021-05-06  1:18   ` Dan Williams
  2021-05-06 11:01     ` Joao Martins
  0 siblings, 1 reply; 44+ messages in thread
From: Dan Williams @ 2021-05-06  1:18 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> A compound pagemap is a dev_pagemap with @align > PAGE_SIZE and it
> means that pages are mapped at a given huge page alignment and utilize
> uses compound pages as opposed to order-0 pages.
>
> To minimize struct page overhead we take advantage of the fact that

I'm not sure if Andrew is a member of the "we" police, but I am on
team "imperative tense".

"Take advantage of the fact that most tail pages look the same (except
the first two) to minimize struct page overhead."

...I think all the other "we"s below can be dropped without losing any meaning.

> most tail pages look the same (except the first two). We allocate a
> separate page for the vmemmap area which contains the head page and
> separate for the next 64 pages. The rest of the subsections then reuse
> this tail vmemmap page to initialize the rest of the tail pages.
>
> Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
> when initializing compound pagemap with big enough @align (e.g. 1G
> PUD) it  will cross various sections. To be able to reuse tail pages
> across sections belonging to the same gigantic page we fetch the
> @range being mapped (nr_ranges + 1).  If the section being mapped is
> not offset 0 of the @align, then lookup the PFN of the struct page
> address that preceeds it and use that to populate the entire

s/preceeds/precedes/

> section.
>
> On compound pagemaps with 2M align, this lets mechanism saves 6 pages

s/lets mechanism saves 6 pages/this mechanism lets 6 pages be saved/

> out of the 8 necessary PFNs necessary to set the subsection's 512
> struct pages being mapped. On a 1G compound pagemap it saves
> 4094 pages.
>
> Altmap isn't supported yet, given various restrictions in altmap pfn
> allocator, thus fallback to the already in use vmemmap_populate().

Perhaps also note that altmap for devmap mappings was there to relieve
the pressure of inordinate amounts of memmap space to map terabytes of
PMEM. With compound pages the motivation for altmaps for PMEM is
reduced.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/linux/mm.h  |   2 +-
>  mm/memremap.c       |   1 +
>  mm/sparse-vmemmap.c | 139 ++++++++++++++++++++++++++++++++++++++++----
>  3 files changed, 131 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 61474602c2b1..49d717ae40ae 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3040,7 +3040,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>  pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> -                           struct vmem_altmap *altmap);
> +                           struct vmem_altmap *altmap, void *block);
>  void *vmemmap_alloc_block(unsigned long size, int node);
>  struct vmem_altmap;
>  void *vmemmap_alloc_block_buf(unsigned long size, int node,
> diff --git a/mm/memremap.c b/mm/memremap.c
> index d160853670c4..2e6bc0b1ff00 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>  {
>         struct mhp_params params = {
>                 .altmap = pgmap_altmap(pgmap),
> +               .pgmap = pgmap,
>                 .pgprot = PAGE_KERNEL,
>         };
>         const int nr_range = pgmap->nr_range;
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 8814876edcfa..f57c5eada099 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
>  }
>
>  pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> -                                      struct vmem_altmap *altmap)
> +                                      struct vmem_altmap *altmap, void *block)

For type-safety I would make this argument a 'struct page *' and drop
the virt_to_page().

>  {
>         pte_t *pte = pte_offset_kernel(pmd, addr);
>         if (pte_none(*pte)) {
>                 pte_t entry;
> -               void *p;
> -
> -               p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> -               if (!p)
> -                       return NULL;
> +               void *p = block;
> +
> +               if (!block) {
> +                       p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> +                       if (!p)
> +                               return NULL;
> +               } else if (!altmap) {
> +                       get_page(virt_to_page(block));

This is either super subtle, or there is a missing put_page(). I'm
assuming that in the shutdown path the same page will be freed
multiple times because the same page is mapped multiple times.

Have you validated that the accounting is correct and all allocated
pages get freed? I feel this needs more than a comment, it needs a
test to validate that the accounting continues to happen correctly as
future vmemmap changes land.

> +               }
>                 entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
>                 set_pte_at(&init_mm, addr, pte, entry);
>         }
> @@ -217,7 +221,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>  }
>
>  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> -                                             struct vmem_altmap *altmap)
> +                                             struct vmem_altmap *altmap,
> +                                             void *page, void **ptr)
>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
> @@ -237,10 +242,14 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>         pmd = vmemmap_pmd_populate(pud, addr, node);
>         if (!pmd)
>                 return -ENOMEM;
> -       pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> +       pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
>         if (!pte)
>                 return -ENOMEM;
>         vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
> +
> +       if (ptr)
> +               *ptr = __va(__pfn_to_phys(pte_pfn(*pte)));
> +       return 0;
>  }
>
>  int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> @@ -249,7 +258,110 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>         unsigned long addr = start;
>
>         for (; addr < end; addr += PAGE_SIZE) {
> -               if (vmemmap_populate_address(addr, node, altmap))
> +               if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static int __meminit vmemmap_populate_range(unsigned long start,
> +                                           unsigned long end,
> +                                           int node, void *page)
> +{
> +       unsigned long addr = start;
> +
> +       for (; addr < end; addr += PAGE_SIZE) {
> +               if (vmemmap_populate_address(addr, node, NULL, page, NULL))
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
> +                                                 void **ptr)
> +{
> +       return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
> +}
> +
> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)

I think this can be replaced with a call to follow_pte(&init_mm...).

> +{
> +       pgd_t *pgd;
> +       p4d_t *p4d;
> +       pud_t *pud;
> +       pmd_t *pmd;
> +       pte_t *pte;
> +
> +       pgd = pgd_offset_k(addr);
> +       if (pgd_none(*pgd))
> +               return NULL;
> +
> +       p4d = p4d_offset(pgd, addr);
> +       if (p4d_none(*p4d))
> +               return NULL;
> +
> +       pud = pud_offset(p4d, addr);
> +       if (pud_none(*pud))
> +               return NULL;
> +
> +       pmd = pmd_offset(pud, addr);
> +       if (pmd_none(*pmd))
> +               return NULL;
> +
> +       pte = pte_offset_kernel(pmd, addr);
> +       if (pte_none(*pte))
> +               return NULL;
> +
> +       return pte;
> +}
> +
> +static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
> +                                                    unsigned long start,
> +                                                    unsigned long end, int node,
> +                                                    struct dev_pagemap *pgmap)
> +{
> +       unsigned long offset, size, addr;
> +
> +       /*
> +        * For compound pages bigger than section size (e.g. 1G) fill the rest

Oh, I had to read this a couple times because I thought the "e.g." was
referring to section size. How about:

s/(e.g. 1G)/(e.g. x86 1G compound pages with 2M subsection size)/

> +        * of sections as tail pages.
> +        *
> +        * Note that memremap_pages() resets @nr_range value and will increment
> +        * it after each range successful onlining. Thus the value or @nr_range
> +        * at section memmap populate corresponds to the in-progress range
> +        * being onlined that we care about.
> +        */
> +       offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
> +       if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
> +           pgmap_align(pgmap) > SUBSECTION_SIZE) {
> +               pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
> +
> +               if (!ptep)
> +                       return -ENOMEM;

Side note: I had been resistant to adding 'struct page' for altmap
pages, but that would be a requirement to enable compound pages +
altmap.

> +

Perhaps a comment here to the effect "recall the page that was
allocated in a prior iteration and fill it in with tail pages".

> +               return vmemmap_populate_range(start, end, node,
> +                                             page_to_virt(pte_page(*ptep)));

If the @block parameter changes to a 'struct page *' it also saves
some typing here.

> +       }
> +
> +       size = min(end - start, pgmap_pfn_align(pgmap) * sizeof(struct page));
> +       for (addr = start; addr < end; addr += size) {
> +               unsigned long next = addr, last = addr + size;
> +               void *block;
> +
> +               /* Populate the head page vmemmap page */
> +               if (vmemmap_populate_page(addr, node, NULL))
> +                       return -ENOMEM;
> +
> +               /* Populate the tail pages vmemmap page */
> +               block = NULL;
> +               next = addr + PAGE_SIZE;
> +               if (vmemmap_populate_page(next, node, &block))
> +                       return -ENOMEM;
> +
> +               /* Reuse the previous page for the rest of tail pages */
> +               next += PAGE_SIZE;
> +               if (vmemmap_populate_range(next, last, node, block))
>                         return -ENOMEM;
>         }
>
> @@ -262,12 +374,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>  {
>         unsigned long start = (unsigned long) pfn_to_page(pfn);
>         unsigned long end = start + nr_pages * sizeof(struct page);
> +       unsigned int align = pgmap_align(pgmap);
> +       int r;
>
>         if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
>                 !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>                 return NULL;
>
> -       if (vmemmap_populate(start, end, nid, altmap))
> +       if (align > PAGE_SIZE && !altmap)

I also think this code will read better the devmap_geometry enum.

> +               r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
> +       else
> +               r = vmemmap_populate(start, end, nid, altmap);
> +
> +       if (r < 0)
>                 return NULL;
>
>         return pfn_to_page(pfn);
> --
> 2.17.1
>

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-05 22:20       ` Dan Williams
  2021-05-05 22:36         ` Joao Martins
@ 2021-05-06  8:05         ` Aneesh Kumar K.V
  2021-05-06 10:23           ` Joao Martins
  1 sibling, 1 reply; 44+ messages in thread
From: Aneesh Kumar K.V @ 2021-05-06  8:05 UTC (permalink / raw)
  To: Dan Williams, Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton



IIUC this series is about devdax namespaces with align of 1G or 2M where we can
save the vmemmap space by not allocating memory for tail struct pages?

Dan Williams <dan.j.williams@intel.com> writes:

> > enum:
>> >
>> > enum devmap_geometry {
>> >     DEVMAP_PTE,
>> >     DEVMAP_PMD,
>> >     DEVMAP_PUD,
>> > }
>> >
>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
>
> I think it is ok for dax/nvdimm to continue to maintain their align
> value because it should be ok to have 4MB align if the device really
> wanted. However, when it goes to map that alignment with
> memremap_pages() it can pick a mode. For example, it's already the
> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
> they're already separate concepts that can stay separate.

devdax namespace with align of 1G implies we expect to map them with 1G
pte entries? I didn't follow when you say we map them today with
DEVMAP_PTE entries.


-aneesh

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-05 23:03           ` Dan Williams
@ 2021-05-06 10:12             ` Joao Martins
  0 siblings, 0 replies; 44+ messages in thread
From: Joao Martins @ 2021-05-06 10:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On 5/6/21 12:03 AM, Dan Williams wrote:
> On Wed, May 5, 2021 at 3:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> [..]
>>> Ah yup, my eyes glazed over that. I think this is another place that
>>> benefits from a more specific name than "align". "pfns_per_compound"
>>> "compound_pfns"?
>>>
>> We are still describing a page, just not a base page. So perhaps @pfns_per_hpage ?
>>
>> I am fine with @pfns_per_compound or @compound_pfns as well.
> 
> My only concern about hpage is that hpage implies PMD, where compound
> is generic across PMD and PUD.
> 
True.

I will stick with your suggestions.

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-06  8:05         ` Aneesh Kumar K.V
@ 2021-05-06 10:23           ` Joao Martins
  2021-05-06 11:43             ` Matthew Wilcox
  0 siblings, 1 reply; 44+ messages in thread
From: Joao Martins @ 2021-05-06 10:23 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton



On 5/6/21 9:05 AM, Aneesh Kumar K.V wrote:
> 
> 
> IIUC this series is about devdax namespaces with align of 1G or 2M where we can
> save the vmemmap space by not allocating memory for tail struct pages?
> 
Right.

It reuses base pages across the vmemmap, but for the base pages containing
only the tail struct pages.

> Dan Williams <dan.j.williams@intel.com> writes:
> 
>>> enum:
>>>>
>>>> enum devmap_geometry {
>>>>     DEVMAP_PTE,
>>>>     DEVMAP_PMD,
>>>>     DEVMAP_PUD,
>>>> }
>>>>
>>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
>>
>> I think it is ok for dax/nvdimm to continue to maintain their align
>> value because it should be ok to have 4MB align if the device really
>> wanted. However, when it goes to map that alignment with
>> memremap_pages() it can pick a mode. For example, it's already the
>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>> they're already separate concepts that can stay separate.
> 
> devdax namespace with align of 1G implies we expect to map them with 1G
> pte entries? I didn't follow when you say we map them today with
> DEVMAP_PTE entries.
> 
This sort of confusion is largely why Dan is suggesting a @geometry for naming rather
than @align (which traditionally refers to page table entry sizes in pagemap-related stuff).

DEVMAP_{PTE,PMD,PUD} refers to the representation of metadata in base pages (DEVMAP_PTE),
compound pages of PMD order (DEVMAP_PMD) or compound pages of PUD order (DEVMAP_PUD).

So, today:

* namespaces with align of 1G would use *struct pages of order-0* (DEVMAP_PTE) backed with
PMD entries in the direct map.
* namespaces with align of 2M would use *struct pages of order-0* (DEVMAP_PTE) backed with
PMD entries in the direct map.

After this series:

* namespaces with align of 1G would use *compound struct pages of order-30* (DEVMAP_PUD)
backed with PMD entries in the direct map.
* namespaces with align of 2M would use *compound struct pages of order-21* (DEVMAP_PMD)
backed with PTE entries in the direct map.

* Re: [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation
  2021-05-05 23:14       ` Dan Williams
@ 2021-05-06 10:24         ` Joao Martins
  0 siblings, 0 replies; 44+ messages in thread
From: Joao Martins @ 2021-05-06 10:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Oscar Salvador



On 5/6/21 12:14 AM, Dan Williams wrote:
> On Wed, May 5, 2021 at 3:38 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>>
>>
>> On 5/5/21 11:34 PM, Dan Williams wrote:
>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>
>>>> @altmap is stored in a dev_pagemap, but it will be repurposed for
>>>> hotplug memory for storing the memmap in the hotplugged memory[*] and
>>>> reusing the altmap infrastructure to that end. This is to say that
>>>> altmap can't be replaced with a @pgmap as it is going to cover more than
>>>> dev_pagemap backend altmaps.
>>>
>>> I was going to say, just pass the pgmap and lookup the altmap from
>>> pgmap, but Oscar added a use case for altmap independent of pgmap. So
>>> you might refresh this commit message to clarify why passing pgmap by
>>> itself is not sufficient.
>>>
>> Isn't that what I am doing above with that exact paragraph? I even
>> reference his series at the end of commit description :) in [*]
> 
> Oh, sorry, it didn't hit me explicitly that you were talking about
> Oscar's work I thought you were referring to your own changes in this
> set. I see it now... at a minimum the tense needs updating since
> Oscar's changes are in the past not the future anymore. 

/me nods

> If it helps,
> the following reads more direct to me: "In support of using compound
> pages for devmap mappings, plumb the pgmap down to the
> vmemmap_populate* implementation. Note that while altmap is
> retrievable from pgmap the memory hotplug code passes altmap without
> pgmap, so both need to be independently plumbed."
> 
Let me use this one instead, definitely better than my strange english :)

* Re: [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
  2021-05-05 22:43   ` Dan Williams
@ 2021-05-06 10:27     ` Joao Martins
  2021-05-06 18:36       ` Joao Martins
  0 siblings, 1 reply; 44+ messages in thread
From: Joao Martins @ 2021-05-06 10:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton



On 5/5/21 11:43 PM, Dan Williams wrote:
> I suspect it's a good sign I'm only finding cosmetic and changelog
> changes in the review... 

Hopefully it continues that way, but the meat of the series is located
in patches 4, 6, 7 and 11. *Especially* 6 and 7.

I very strongly suspect I am gonna get non-cosmetic comments there.

> I have some more:
> 
> A year for now if I'm tracking down a problem and looking through mm
> commits I would appreciate a subject line like the following:
> "refactor core of vmemmap_populate_basepages() to helper" that gives
> an idea of the impact and side effects of the change.
> 
Fixed.

> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
> 
> I would add a lead in phrase like: "In preparation for describing a
> memmap with compound pages, move the actual..."
> 
>> Move the actual pte population logic into a separate function
>> vmemmap_populate_address() and have vmemmap_populate_basepages()
>> walk through all base pages it needs to populate.
> 
> Aside from the above, looks good.
>
Cool!

* Re: [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
  2021-05-06  1:18   ` Dan Williams
@ 2021-05-06 11:01     ` Joao Martins
  2021-05-10 19:19       ` Dan Williams
  0 siblings, 1 reply; 44+ messages in thread
From: Joao Martins @ 2021-05-06 11:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On 5/6/21 2:18 AM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> A compound pagemap is a dev_pagemap with @align > PAGE_SIZE and it
>> means that pages are mapped at a given huge page alignment and utilize
>> uses compound pages as opposed to order-0 pages.
>>
>> To minimize struct page overhead we take advantage of the fact that
> 
> I'm not sure if Andrew is a member of the "we" police, but I am on
> team "imperative tense".
> 
That was my mistake once again. You did mention it over the RFC.
A few days after submitting the series I noticed[0] this patch and the
next one still had a few instances of misplaced usage of 'we', in addition
to a little unit mistake I made in the cover letter.

[0] https://lore.kernel.org/linux-mm/a2b7dc3a-3f1d-b8e1-4770-e261055f2b17@oracle.com/

> "Take advantage of the fact that most tail pages look the same (except
> the first two) to minimize struct page overhead."
> 
> ...I think all the other "we"s below can be dropped without losing any meaning.
> 
Indeed.

>> most tail pages look the same (except the first two). We allocate a
>> separate page for the vmemmap area which contains the head page and
>> separate for the next 64 pages. The rest of the subsections then reuse
>> this tail vmemmap page to initialize the rest of the tail pages.
>>
>> Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
>> when initializing compound pagemap with big enough @align (e.g. 1G
>> PUD) it  will cross various sections. To be able to reuse tail pages
>> across sections belonging to the same gigantic page we fetch the
>> @range being mapped (nr_ranges + 1).  If the section being mapped is
>> not offset 0 of the @align, then lookup the PFN of the struct page
>> address that preceeds it and use that to populate the entire
> 
> s/preceeds/precedes/
> 
/me nods

>> section.
>>
>> On compound pagemaps with 2M align, this lets mechanism saves 6 pages
> 
> s/lets mechanism saves 6 pages/this mechanism lets 6 pages be saved/
> 
Will fix.

>> out of the 8 necessary PFNs necessary to set the subsection's 512
>> struct pages being mapped. On a 1G compound pagemap it saves
>> 4094 pages.
>>
>> Altmap isn't supported yet, given various restrictions in altmap pfn
>> allocator, thus fallback to the already in use vmemmap_populate().
> 
> Perhaps also note that altmap for devmap mappings was there to relieve
> the pressure of inordinate amounts of memmap space to map terabytes of
> PMEM. With compound pages the motivation for altmaps for PMEM is
> reduced.
> 
OK, I suppose it's worth highlighting it.

But how 'reduced' this motivation really is, is perhaps still not 100% clear to me.

As we discussed over the RFC, with the DEVMAP_PMD geometry we still take a
non-trivial hit of ~4G per Tb. And altmap is a rather quick way of having
heavily disproportioned RAM to PMEM ratios working without paying the RAM
penalty. Especially with the pinning numbers so heavily reduced.

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  include/linux/mm.h  |   2 +-
>>  mm/memremap.c       |   1 +
>>  mm/sparse-vmemmap.c | 139 ++++++++++++++++++++++++++++++++++++++++----
>>  3 files changed, 131 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 61474602c2b1..49d717ae40ae 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3040,7 +3040,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>>  pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>> -                           struct vmem_altmap *altmap);
>> +                           struct vmem_altmap *altmap, void *block);
>>  void *vmemmap_alloc_block(unsigned long size, int node);
>>  struct vmem_altmap;
>>  void *vmemmap_alloc_block_buf(unsigned long size, int node,
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index d160853670c4..2e6bc0b1ff00 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>>  {
>>         struct mhp_params params = {
>>                 .altmap = pgmap_altmap(pgmap),
>> +               .pgmap = pgmap,
>>                 .pgprot = PAGE_KERNEL,
>>         };
>>         const int nr_range = pgmap->nr_range;
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index 8814876edcfa..f57c5eada099 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
>>  }
>>
>>  pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>> -                                      struct vmem_altmap *altmap)
>> +                                      struct vmem_altmap *altmap, void *block)
> 
> For type-safety I would make this argument a 'struct page *' and drop
> the virt_to_page().
> 
Will do.

>>  {
>>         pte_t *pte = pte_offset_kernel(pmd, addr);
>>         if (pte_none(*pte)) {
>>                 pte_t entry;
>> -               void *p;
>> -
>> -               p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
>> -               if (!p)
>> -                       return NULL;
>> +               void *p = block;
>> +
>> +               if (!block) {
>> +                       p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
>> +                       if (!p)
>> +                               return NULL;
>> +               } else if (!altmap) {
>> +                       get_page(virt_to_page(block));
> 
> This is either super subtle, or there is a missing put_page(). I'm
> assuming that in the shutdown path the same page will be freed
> multiple times because the same page is mapped multiple times.
> 

It's the former (subtlety):

1) When a PTE/PMD entry is freed from the init_mm there's a free_pages() call to
this page allocated above. free_pages() at the top does a put_page_testzero() prior to
freeing the actual page.

2) Given this codepath is solely used by pgmap infra, the page allocator is
already available (i.e. slab_is_available()), hence we will always go towards the
alloc_pages_node() codepath.

> Have you validated that the accounting is correct and all allocated
> pages get freed? I feel this needs more than a comment, it needs a
> test to validate that the accounting continues to happen correctly as
> future vmemmap changes land.
> 

I can add a comment *at least*.

I checked the accounting. But let me be extra pedantic on this part and recheck
it once again, as I settled on this part some time ago. I will follow up on this
chunk, should this part be broken in some way.

>> +               }
>>                 entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
>>                 set_pte_at(&init_mm, addr, pte, entry);
>>         }
>> @@ -217,7 +221,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>>  }
>>
>>  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> -                                             struct vmem_altmap *altmap)
>> +                                             struct vmem_altmap *altmap,
>> +                                             void *page, void **ptr)
>>  {
>>         pgd_t *pgd;
>>         p4d_t *p4d;
>> @@ -237,10 +242,14 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>>         pmd = vmemmap_pmd_populate(pud, addr, node);
>>         if (!pmd)
>>                 return -ENOMEM;
>> -       pte = vmemmap_pte_populate(pmd, addr, node, altmap);
>> +       pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
>>         if (!pte)
>>                 return -ENOMEM;
>>         vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
>> +
>> +       if (ptr)
>> +               *ptr = __va(__pfn_to_phys(pte_pfn(*pte)));
>> +       return 0;
>>  }
>>
>>  int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>> @@ -249,7 +258,110 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>>         unsigned long addr = start;
>>
>>         for (; addr < end; addr += PAGE_SIZE) {
>> -               if (vmemmap_populate_address(addr, node, altmap))
>> +               if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
>> +                       return -ENOMEM;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static int __meminit vmemmap_populate_range(unsigned long start,
>> +                                           unsigned long end,
>> +                                           int node, void *page)
>> +{
>> +       unsigned long addr = start;
>> +
>> +       for (; addr < end; addr += PAGE_SIZE) {
>> +               if (vmemmap_populate_address(addr, node, NULL, page, NULL))
>> +                       return -ENOMEM;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>> +                                                 void **ptr)
>> +{
>> +       return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
>> +}
>> +
>> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
> 
> I think this can be replaced with a call to follow_pte(&init_mm...).
> 

Ah, of course! That would shorten things up too.

>> +{
>> +       pgd_t *pgd;
>> +       p4d_t *p4d;
>> +       pud_t *pud;
>> +       pmd_t *pmd;
>> +       pte_t *pte;
>> +
>> +       pgd = pgd_offset_k(addr);
>> +       if (pgd_none(*pgd))
>> +               return NULL;
>> +
>> +       p4d = p4d_offset(pgd, addr);
>> +       if (p4d_none(*p4d))
>> +               return NULL;
>> +
>> +       pud = pud_offset(p4d, addr);
>> +       if (pud_none(*pud))
>> +               return NULL;
>> +
>> +       pmd = pmd_offset(pud, addr);
>> +       if (pmd_none(*pmd))
>> +               return NULL;
>> +
>> +       pte = pte_offset_kernel(pmd, addr);
>> +       if (pte_none(*pte))
>> +               return NULL;
>> +
>> +       return pte;
>> +}
>> +
>> +static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>> +                                                    unsigned long start,
>> +                                                    unsigned long end, int node,
>> +                                                    struct dev_pagemap *pgmap)
>> +{
>> +       unsigned long offset, size, addr;
>> +
>> +       /*
>> +        * For compound pages bigger than section size (e.g. 1G) fill the rest
> 
> Oh, I had to read this a couple times because I thought the "e.g." was
> referring to section size. How about:
> 
> s/(e.g. 1G)/(e.g. x86 1G compound pages with 2M subsection size)/
> 
Will fix. But instead I'll fix to:

e.g. x86 1G compound pages with 128M section size

Because vmemmap_populate_compound_pages() is called per section, while being
subsection-aligned.

>> +        * of sections as tail pages.
>> +        *
>> +        * Note that memremap_pages() resets @nr_range value and will increment
>> +        * it after each successful range onlining. Thus the value of @nr_range
>> +        * at section memmap populate corresponds to the in-progress range
>> +        * being onlined that we care about.
>> +        */
>> +       offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>> +       if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
>> +           pgmap_align(pgmap) > SUBSECTION_SIZE) {
>> +               pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
>> +
>> +               if (!ptep)
>> +                       return -ENOMEM;
> 
> Side note: I had been resistant to adding 'struct page' for altmap
> pages, but that would be a requirement to enable compound pages +
> altmap.
> 
Fair enough.

I was thinking about something a little simpler, like adding a refcount to the
altmap pfn allocator, to avoid arch changes and be a little more tidy
on space (16 bytes rather than 64 bytes). But I suspect I can avoid the said
arch changes even with a backing struct page.

>> +
> 
> Perhaps a comment here to the effect "recall the page that was
> allocated in a prior iteration and fill it in with tail pages".
> 
Will add.

>> +               return vmemmap_populate_range(start, end, node,
>> +                                             page_to_virt(pte_page(*ptep)));
> 
> If the @block parameter changes to a 'struct page *' it also saves
> some typing here.
> 
Indeed.

>> +       }
>> +
>> +       size = min(end - start, pgmap_pfn_align(pgmap) * sizeof(struct page));
>> +       for (addr = start; addr < end; addr += size) {
>> +               unsigned long next = addr, last = addr + size;
>> +               void *block;
>> +
>> +               /* Populate the head page vmemmap page */
>> +               if (vmemmap_populate_page(addr, node, NULL))
>> +                       return -ENOMEM;
>> +
>> +               /* Populate the tail pages vmemmap page */
>> +               block = NULL;
>> +               next = addr + PAGE_SIZE;
>> +               if (vmemmap_populate_page(next, node, &block))
>> +                       return -ENOMEM;
>> +
>> +               /* Reuse the previous page for the rest of tail pages */
>> +               next += PAGE_SIZE;
>> +               if (vmemmap_populate_range(next, last, node, block))
>>                         return -ENOMEM;
>>         }
>>
>> @@ -262,12 +374,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>>  {
>>         unsigned long start = (unsigned long) pfn_to_page(pfn);
>>         unsigned long end = start + nr_pages * sizeof(struct page);
>> +       unsigned int align = pgmap_align(pgmap);
>> +       int r;
>>
>>         if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
>>                 !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>>                 return NULL;
>>
>> -       if (vmemmap_populate(start, end, nid, altmap))
>> +       if (align > PAGE_SIZE && !altmap)
> 
> I also think this code will read better the devmap_geometry enum.
> 
True.

I am starting to like the ring of it with @geometry rather than @align to represent
how metadata is used. Should avoid confusion between device-dax @align and pagemap
@align.

>> +               r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>> +       else
>> +               r = vmemmap_populate(start, end, nid, altmap);
>> +
>> +       if (r < 0)
>>                 return NULL;
>>
>>         return pfn_to_page(pfn);
>> --
>> 2.17.1
>>
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-06 10:23           ` Joao Martins
@ 2021-05-06 11:43             ` Matthew Wilcox
  2021-05-06 12:15               ` Joao Martins
  0 siblings, 1 reply; 44+ messages in thread
From: Matthew Wilcox @ 2021-05-06 11:43 UTC (permalink / raw)
  To: Joao Martins
  Cc: Aneesh Kumar K.V, Linux MM, linux-nvdimm, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On Thu, May 06, 2021 at 11:23:25AM +0100, Joao Martins wrote:
> >> I think it is ok for dax/nvdimm to continue to maintain their align
> >> value because it should be ok to have 4MB align if the device really
> >> wanted. However, when it goes to map that alignment with
> >> memremap_pages() it can pick a mode. For example, it's already the
> >> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
> >> they're already separate concepts that can stay separate.
> > 
> > devdax namespace with align of 1G implies we expect to map them with 1G
> > pte entries? I didn't follow when you say we map them today with
> > DEVMAP_PTE entries.
> 
> This sort of confusion is largely why Dan is suggesting @geometry for naming rather
> than @align (which traditionally refers to page tables entry sizes in pagemap-related stuff).
> 
> DEVMAP_{PTE,PMD,PUD} refers to the representation of metadata in base pages (DEVMAP_PTE),
> compound pages of PMD order (DEVMAP_PMD) or compound pages of PUD order (DEVMAP_PUD).
> 
> So, today:
> 
> * namespaces with align of 1G would use *struct pages of order-0* (DEVMAP_PTE) backed with
> PMD entries in the direct map.
> * namespaces with align of 2M would use *struct pages of order-0* (DEVMAP_PTE) backed with
> PMD entries in the direct map.
> 
> After this series:
> 
> * namespaces with align of 1G would use *compound struct pages of order-30* (DEVMAP_PUD)
> backed with PMD entries in the direct map.

order-18

> * namespaces with align of 1G would use *compound struct pages of order-21* (DEVMAP_PMD)
> backed with PTE entries in the direct map.

i think you mean "align of 2M", and order-9.

(note that these numbers are/can be different for powerpc since it
supports different TLB page sizes.  please don't get locked into x86
sizes, and don't ignore the benefits *to software* of managing memory
in sizes other than just those supported by the hardware).

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-06 11:43             ` Matthew Wilcox
@ 2021-05-06 12:15               ` Joao Martins
  0 siblings, 0 replies; 44+ messages in thread
From: Joao Martins @ 2021-05-06 12:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Aneesh Kumar K.V, Linux MM, linux-nvdimm, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On 5/6/21 12:43 PM, Matthew Wilcox wrote:
> On Thu, May 06, 2021 at 11:23:25AM +0100, Joao Martins wrote:
>>>> I think it is ok for dax/nvdimm to continue to maintain their align
>>>> value because it should be ok to have 4MB align if the device really
>>>> wanted. However, when it goes to map that alignment with
>>>> memremap_pages() it can pick a mode. For example, it's already the
>>>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>>>> they're already separate concepts that can stay separate.
>>>
>>> devdax namespace with align of 1G implies we expect to map them with 1G
>>> pte entries? I didn't follow when you say we map them today with
>>> DEVMAP_PTE entries.
>>
>> This sort of confusion is largely why Dan is suggesting @geometry for naming rather
>> than @align (which traditionally refers to page tables entry sizes in pagemap-related stuff).
>>
>> DEVMAP_{PTE,PMD,PUD} refers to the representation of metadata in base pages (DEVMAP_PTE),
>> compound pages of PMD order (DEVMAP_PMD) or compound pages of PUD order (DEVMAP_PUD).
>>
>> So, today:
>>
>> * namespaces with align of 1G would use *struct pages of order-0* (DEVMAP_PTE) backed with
>> PMD entries in the direct map.
>> * namespaces with align of 2M would use *struct pages of order-0* (DEVMAP_PTE) backed with
>> PMD entries in the direct map.
>>
>> After this series:
>>
>> * namespaces with align of 1G would use *compound struct pages of order-30* (DEVMAP_PUD)
>> backed with PMD entries in the direct map.
> 
> order-18
> 
Right, thanks for the correction.

Forgot to subtract PAGE_SIZE order (12).

>> * namespaces with align of 1G would use *compound struct pages of order-21* (DEVMAP_PMD)
>> backed with PTE entries in the direct map.
> 
> i think you mean "align of 2M", and order-9.
> 
True.

> (note that these numbers are/can be different for powerpc since it
> supports different TLB page sizes.  please don't get locked into x86
> sizes, and don't ignore the benefits *to software* of managing memory
> in sizes other than just those supported by the hardware).
> 
I am not ignoring that either, I think.

The page-size-looking values above are also what the consumer of this (device-dax) allows
you to use as @align, hence why you see DEVMAP_PTE, PMD and PUD as the sizes.

* Re: [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
  2021-05-06 10:27     ` Joao Martins
@ 2021-05-06 18:36       ` Joao Martins
  0 siblings, 0 replies; 44+ messages in thread
From: Joao Martins @ 2021-05-06 18:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On 5/6/21 11:27 AM, Joao Martins wrote:
> On 5/5/21 11:43 PM, Dan Williams wrote:
>> I suspect it's a good sign I'm only finding cosmetic and changelog
>> changes in the review... 
> 
> Hopefully it continues that way, but the meat of the series is located
> in patches 4, 6, 7 and 11. *Especially* 6 and 7.
> 
Ugh, I meant 7 and 8 *not* 6 and 7.

> I very strongly suspect I am gonna get non-cosmetic comments there.

* Re: [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
  2021-05-06 11:01     ` Joao Martins
@ 2021-05-10 19:19       ` Dan Williams
  2021-05-13 18:45         ` Joao Martins
  0 siblings, 1 reply; 44+ messages in thread
From: Dan Williams @ 2021-05-10 19:19 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On Thu, May 6, 2021 at 4:02 AM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
> >> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
> >
> > I think this can be replaced with a call to follow_pte(&init_mm...).
> >
>
> Ah, of course! That would shorten things up too.

Now that I look closely, I notice that function disclaims being
suitable as a general purpose pte lookup utility. If it works for you,
great, but if not I think it's past time to create such a helper. I
know Ira is looking for one, and I contributed to the proliferation
when I added dev_pagemap_mapping_shift().

* Re: [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
  2021-05-10 19:19       ` Dan Williams
@ 2021-05-13 18:45         ` Joao Martins
  0 siblings, 0 replies; 44+ messages in thread
From: Joao Martins @ 2021-05-13 18:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On 5/10/21 8:19 PM, Dan Williams wrote:
> On Thu, May 6, 2021 at 4:02 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> [..]
>>>> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
>>>
>>> I think this can be replaced with a call to follow_pte(&init_mm...).
>>>
>>
>> Ah, of course! That would shorten things up too.
> 
> Now that I look closely, I notice that function disclaims being
> suitable as a general purpose pte lookup utility. 
> If it works for you,
> great, but if not I think it's past time to create such a helper. I
> know Ira is looking for one, and I contributed to the proliferation
> when I added dev_pagemap_mapping_shift().
> 
There are also x86 equivalents, like lookup_address() and lookup_address_in_pgd().
These two don't differ that much from vmemmap_lookup_address().

I can move this to an internal place e.g. mm/internal.h

The patch after this one changes vmemmap_lookup_address() to stop at the PMD (to reuse
that across the next sections).

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-05 22:36         ` Joao Martins
  2021-05-05 23:03           ` Dan Williams
@ 2021-05-18 17:27           ` Joao Martins
  2021-05-18 19:56             ` Jane Chu
  1 sibling, 1 reply; 44+ messages in thread
From: Joao Martins @ 2021-05-18 17:27 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton

On 5/5/21 11:36 PM, Joao Martins wrote:
> On 5/5/21 11:20 PM, Dan Williams wrote:
>> On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>> On 5/5/21 7:44 PM, Dan Williams wrote:
>>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>> index b46f63dcaed3..bb28d82dda5e 100644
>>>>> --- a/include/linux/memremap.h
>>>>> +++ b/include/linux/memremap.h
>>>>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>>>>         struct completion done;
>>>>>         enum memory_type type;
>>>>>         unsigned int flags;
>>>>> +       unsigned long align;
>>>>
>>>> I think this wants some kernel-doc above to indicate that non-zero
>>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
>>>> means "use non-compound base pages".

[...]

>>>> The non-zero value must be
>>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
>>>> Hmm, maybe it should be an
>>>> enum:
>>>>
>>>> enum devmap_geometry {
>>>>     DEVMAP_PTE,
>>>>     DEVMAP_PMD,
>>>>     DEVMAP_PUD,
>>>> }
>>>>
>>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
>>
>> I think it is ok for dax/nvdimm to continue to maintain their align
>> value because it should be ok to have 4MB align if the device really
>> wanted. However, when it goes to map that alignment with
>> memremap_pages() it can pick a mode. For example, it's already the
>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>> they're already separate concepts that can stay separate.
>>
> Gotcha.

I am reconsidering part of the above. In general, yes, the meaning of devmap @align
represents a slightly different variation of the device @align i.e. how the metadata is
laid out **but** regardless of what kind of page table entries the vmemmap uses.

By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax already
validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE and PUD_SIZE)
2) the geometry of metadata is very much tied to the value we pick to @align at namespace
provisioning -- not the "align" we might use at mmap(); perhaps that's what you referred to
above? -- and 3) the value of geometry actually derives from the dax device @align because we
will need to create compound pages representing a page size of @align value.

Using your example above: you're saying that dax->align == 1G is mapped with DEVMAP_PTEs,
while in reality the vmemmap is populated with PMD/PUD page tables (depending on what archs
decide to do at vmemmap_populate()) and uses base pages as its metadata regardless of the
device @align. In reality what we want to convey in @geometry is not page table sizes, but
just the page size used for the vmemmap of the dax device. Additionally, limiting its
value might not be desirable... if tomorrow Linux for some arch supports dax/nvdimm
devices with 4M align or 64K align, the value of @geometry will have to reflect the 4M to
create compound pages of order 10 for the said vmemmap.

I am going to wait until you finish reviewing the remaining four patches of this series,
but maybe this is a simple misnomer (s/align/geometry/) with a comment but without
DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and enum a
setter/getter to audit its value? Thoughts?

		Joao

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-18 17:27           ` Joao Martins
@ 2021-05-18 19:56             ` Jane Chu
  2021-05-19 11:29               ` Joao Martins
  0 siblings, 1 reply; 44+ messages in thread
From: Jane Chu @ 2021-05-18 19:56 UTC (permalink / raw)
  To: Joao Martins, Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton

On 5/18/2021 10:27 AM, Joao Martins wrote:

> On 5/5/21 11:36 PM, Joao Martins wrote:
>> On 5/5/21 11:20 PM, Dan Williams wrote:
>>> On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> On 5/5/21 7:44 PM, Dan Williams wrote:
>>>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>> index b46f63dcaed3..bb28d82dda5e 100644
>>>>>> --- a/include/linux/memremap.h
>>>>>> +++ b/include/linux/memremap.h
>>>>>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>>>>>          struct completion done;
>>>>>>          enum memory_type type;
>>>>>>          unsigned int flags;
>>>>>> +       unsigned long align;
>>>>> I think this wants some kernel-doc above to indicate that non-zero
>>>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
>>>>> means "use non-compound base pages".
> [...]
>
>>>>> The non-zero value must be
>>>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
>>>>> Hmm, maybe it should be an
>>>>> enum:
>>>>>
>>>>> enum devmap_geometry {
>>>>>      DEVMAP_PTE,
>>>>>      DEVMAP_PMD,
>>>>>      DEVMAP_PUD,
>>>>> }
>>>>>
>>>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>>>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
>>> I think it is ok for dax/nvdimm to continue to maintain their align
>>> value because it should be ok to have 4MB align if the device really
>>> wanted. However, when it goes to map that alignment with
>>> memremap_pages() it can pick a mode. For example, it's already the
>>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>>> they're already separate concepts that can stay separate.
>>>
>> Gotcha.
> I am reconsidering part of the above. In general, yes, the meaning of devmap @align
> represents a slightly different variation of the device @align i.e. how the metadata is
> laid out **but** regardless of what kind of page table entries we use vmemmap.
>
> By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax already
> validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE and PUD_SIZE)
> 2) the geometry of metadata is very much tied to the value we pick to @align at namespace
> provisioning -- not the "align" we might use at mmap() perhaps that's what you referred
> above? -- and 3) the value of geometry actually derives from dax device @align because we
> will need to create compound pages representing a page size of @align value.
>
> Using your example above: you're saying that dax->align == 1G is mapped with DEVMAP_PTEs,
> in reality the vmemmap is populated with PMDs/PUDs page tables (depending on what archs
> decide to do at vmemmap_populate()) and uses base pages as its metadata regardless of what
> device @align. In reality what we want to convey in @geometry is not page table sizes, but
> just the page size used for the vmemmap of the dax device. Additionally, limiting its
> value might not be desirable... if tomorrow Linux for some arch supports dax/nvdimm
> devices with 4M align or 64K align, the value of @geometry will have to reflect the 4M to
> create compound pages of order 10 for the said vmemmap.
>
> I am going to wait until you finish reviewing the remaining four patches of this series,
> but maybe this is a simple misnomer (s/align/geometry/) with a comment but without
> DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and enum a
> setter/getter to audit its value? Thoughts?
>
> 		Joao

Good points there.

My understanding is that dax->align conveys granularity of size while carving out a
namespace; it's a geometry attribute loosely akin to the sector size of a spindle disk. I
tend to think that device pagesize has almost no relation to "align" in that it's possible
to have 1G "align" and 4K pagesize, or vice versa. That is, with the advent of compound
page support, it is possible to totally separate the two concepts.

How about adding a new option to "ndctl create-namespace" that describes the device
creator's desired pagesize, and another parameter to describe whether the pagesize shall
be fixed or allowed to be split up, such that, if the intention is to never split up a 2M
pagesize, then it would be possible to save a lot of metadata space on the device?

thanks,

-jane

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-05-18 19:56             ` Jane Chu
@ 2021-05-19 11:29               ` Joao Martins
  0 siblings, 0 replies; 44+ messages in thread
From: Joao Martins @ 2021-05-19 11:29 UTC (permalink / raw)
  To: Jane Chu, Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton



On 5/18/21 8:56 PM, Jane Chu wrote:
> On 5/18/2021 10:27 AM, Joao Martins wrote:
> 
>> On 5/5/21 11:36 PM, Joao Martins wrote:
>>> On 5/5/21 11:20 PM, Dan Williams wrote:
>>>> On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>> On 5/5/21 7:44 PM, Dan Williams wrote:
>>>>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>> index b46f63dcaed3..bb28d82dda5e 100644
>>>>>>> --- a/include/linux/memremap.h
>>>>>>> +++ b/include/linux/memremap.h
>>>>>>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>>>>>>          struct completion done;
>>>>>>>          enum memory_type type;
>>>>>>>          unsigned int flags;
>>>>>>> +       unsigned long align;
>>>>>> I think this wants some kernel-doc above to indicate that non-zero
>>>>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
>>>>>> means "use non-compound base pages".
>> [...]
>>
>>>>>> The non-zero value must be
>>>>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
>>>>>> Hmm, maybe it should be an
>>>>>> enum:
>>>>>>
>>>>>> enum devmap_geometry {
>>>>>>      DEVMAP_PTE,
>>>>>>      DEVMAP_PMD,
>>>>>>      DEVMAP_PUD,
>>>>>> }
>>>>>>
>>>>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>>>>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
>>>> I think it is ok for dax/nvdimm to continue to maintain their align
>>>> value because it should be ok to have 4MB align if the device really
>>>> wanted. However, when it goes to map that alignment with
>>>> memremap_pages() it can pick a mode. For example, it's already the
>>>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>>>> they're already separate concepts that can stay separate.
>>>>
>>> Gotcha.
>> I am reconsidering part of the above. In general, yes, the meaning of devmap @align
>> represents a slightly different variation of the device @align i.e. how the metadata is
>> laid out **but** regardless of what kind of page table entries we use vmemmap.
>>
>> By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax already
>> validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE and PUD_SIZE)
>> 2) the geometry of metadata is very much tied to the value we pick to @align at namespace
>> provisioning -- not the "align" we might use at mmap() perhaps that's what you referred
>> above? -- and 3) the value of geometry actually derives from dax device @align because we
>> will need to create compound pages representing a page size of @align value.
>>
>> Using your example above: you're saying that dax->align == 1G is mapped with DEVMAP_PTEs,
>> in reality the vmemmap is populated with PMDs/PUDs page tables (depending on what archs
>> decide to do at vmemmap_populate()) and uses base pages as its metadata regardless of what
>> device @align. In reality what we want to convey in @geometry is not page table sizes, but
>> just the page size used for the vmemmap of the dax device. Additionally, limiting its
>> value might not be desirable... if tomorrow Linux for some arch supports dax/nvdimm
>> devices with 4M align or 64K align, the value of @geometry will have to reflect the 4M to
>> create compound pages of order 10 for the said vmemmap.
>>
>> I am going to wait until you finish reviewing the remaining four patches of this series,
>> but maybe this is a simple misnomer (s/align/geometry/) with a comment but without
>> DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and enum a
>> setter/getter to audit its value? Thoughts?
> 
> Good points there.
> 
> My understanding is that  dax->align  conveys granularity of size while 
> carving out a namespace it's a geometry attribute loosely akin to sector size of a spindle 
> disk.  I tend to think that device pagesize  has almost no relation to "align" in that, it's 
> possible to have 1G "align" and 4K pagesize, or vice versa.  That is, with the advent of compound page 
> support, it is possible to totally separate the two concepts.
> 
> How about adding a new option to "ndctl create-namespace" that describes 
> device creator's desired pagesize, and another parameter to describe whether the pagesize shall 
> be fixed or allowed to be split up, such that, if the intention is to never split up 2M pagesize, then it 
> would be possible to save a lot of metadata space on the device?

Maybe that can be selected by the driver too, but it's an interesting point you raise,
should we settle on the geometry (e.g. a geometry sysfs entry, IIUC your
suggestion?). device-dax, for example, would use geometry == align and therefore save
space (like what I propose in patch 10). But fsdax would retain the default, that is
geometry = PAGE_SIZE and align = PMD_SIZE, should it want to split pages.

Interestingly, devmap poisoning always occur at @align level regardless of @geometry.

What I am not sure about is what value (vs added complexity) it brings to allow the
geometry *value* to be selectable by the user, given that so far we seem to only ever
initialize metadata as either sets of base pages [*] or sets of compound pages (of a
size). And the difference between both can possibly be summarized to split-ability like
you say.

[*] that can optionally be morphed into compound pages by the driver

end of thread, other threads:[~2021-05-19 11:29 UTC | newest]

Thread overview: 44+ messages
2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
2021-03-25 23:09 ` [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
2021-04-24  0:12   ` Dan Williams
2021-04-24 19:00     ` Joao Martins
2021-03-25 23:09 ` [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
2021-04-24  0:16   ` Dan Williams
2021-04-24 19:05     ` Joao Martins
2021-03-25 23:09 ` [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
2021-04-24  0:18   ` Dan Williams
2021-04-24 19:05     ` Joao Martins
2021-03-25 23:09 ` [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
2021-05-05 18:44   ` Dan Williams
2021-05-05 18:58     ` Matthew Wilcox
2021-05-05 19:49     ` Joao Martins
2021-05-05 22:20       ` Dan Williams
2021-05-05 22:36         ` Joao Martins
2021-05-05 23:03           ` Dan Williams
2021-05-06 10:12             ` Joao Martins
2021-05-18 17:27           ` Joao Martins
2021-05-18 19:56             ` Jane Chu
2021-05-19 11:29               ` Joao Martins
2021-05-06  8:05         ` Aneesh Kumar K.V
2021-05-06 10:23           ` Joao Martins
2021-05-06 11:43             ` Matthew Wilcox
2021-05-06 12:15               ` Joao Martins
2021-03-25 23:09 ` [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
2021-05-05 22:34   ` Dan Williams
2021-05-05 22:37     ` Joao Martins
2021-05-05 23:14       ` Dan Williams
2021-05-06 10:24         ` Joao Martins
2021-03-25 23:09 ` [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages() Joao Martins
2021-05-05 22:43   ` Dan Williams
2021-05-06 10:27     ` Joao Martins
2021-05-06 18:36       ` Joao Martins
2021-03-25 23:09 ` [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
2021-05-06  1:18   ` Dan Williams
2021-05-06 11:01     ` Joao Martins
2021-05-10 19:19       ` Dan Williams
2021-05-13 18:45         ` Joao Martins
2021-03-25 23:09 ` [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD " Joao Martins
2021-03-25 23:09 ` [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for " Joao Martins
2021-03-25 23:09 ` [PATCH v1 10/11] device-dax: compound pagemap support Joao Martins
2021-03-25 23:09 ` [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages Joao Martins
2021-04-01  9:38 ` [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
