nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps
@ 2021-03-25 23:09 Joao Martins
  2021-03-25 23:09 ` [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
                   ` (11 more replies)
  0 siblings, 12 replies; 23+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Hey,

This series, attempts at minimizing 'struct page' overhead by
pursuing a similar approach as Muchun Song series "Free some vmemmap
pages of hugetlb page"[0] but applied to devmap/ZONE_DEVICE. 

[0] https://lore.kernel.org/linux-mm/20210308102807.59745-1-songmuchun@bytedance.com/

The link above describes it quite nicely, but the idea is to reuse tail
page vmemmap areas, particular the area which only describes tail pages.
So a vmemmap page describes 64 struct pages, and the first page for a given
ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
vmemmap page would contain only tail pages, and that's what gets reused across
the rest of the subsection/section. The bigger the page size, the bigger the
savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).

This series also takes one step further on 1GB pages and *also* reuse PMD pages
which only contain tail pages which allows to keep parity with current hugepage
based memmap. This further let us more than halve the overhead with 1GB pages
(40M -> 16M per Tb)

In terms of savings, per 1Tb of memory, the struct page cost would go down
with compound pagemap:

* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
* with 1G pages we lose 16MB instead of 16G (0.0014% instead of 1.5% of total memory)

Along the way I've extended it past 'struct page' overhead *trying* to address a
few performance issues we knew about for pmem, specifically on the
{pin,get}_user_pages_fast with device-dax vmas which are really
slow even of the fast variants. THP is great on -fast variants but all except
hugetlbfs perform rather poorly on non-fast gup. Although I deferred the
__get_user_pages() improvements (in a follow up series I have stashed as its
ortogonal to device-dax as THP suffers from the same syndrome).

So to summarize what the series does:

Patch 1: Prepare hwpoisoning to work with dax compound pages.

Patches 2-4: Have memmap_init_zone_device() initialize its metadata as compound
pages. We split the current utility function of prep_compound_page() into head
and tail and use those two helpers where appropriate to take advantage of caches
being warm after __init_single_page(). Since RFC this also lets us further speed
up from 190ms down to 80ms init time.

Patches 5-10: Much like Muchun series, we reuse PTE (and PMD) tail page vmemmap
areas across a given page size (namely @align was referred by remaining
memremap/dax code) and enabling of memremap to initialize the ZONE_DEVICE pages
as compound pages or a given @align order. The main difference though, is that
contrary to the hugetlbfs series, there's no vmemmap for the area, because we
are populating it as opposed to remapping it. IOW no freeing of pages of
already initialized vmemmap like the case for hugetlbfs, which simplifies the
logic (besides not being arch-specific). After these, there's quite visible
region bootstrap of pmem memmap given that we would initialize fewer struct
pages depending on the page size with DRAM backed struct pages. altmap sees no
difference in bootstrap.

    NVDIMM namespace bootstrap improves from ~268-358 ms to ~78-100/<1ms on 128G NVDIMMs
    with 2M and 1G respectivally.

Patch 11: Optimize grabbing page refcount changes given that we
are working with compound pages i.e. we do 1 increment to the head
page for a given set of N subpages compared as opposed to N individual writes.
{get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
improves considerably with DRAM stored struct pages. It also *greatly*
improves pinning with altmap. Results with gup_test:

                                                   before     after
    (16G get_user_pages_fast 2M page size)         ~59 ms -> ~6.1 ms
    (16G pin_user_pages_fast 2M page size)         ~87 ms -> ~6.2 ms
    (16G get_user_pages_fast altmap 2M page size) ~494 ms -> ~9 ms
    (16G pin_user_pages_fast altmap 2M page size) ~494 ms -> ~10 ms

    altmap performance gets specially interesting when pinning a pmem dimm:

                                                   before     after
    (128G get_user_pages_fast 2M page size)         ~492 ms -> ~49 ms
    (128G pin_user_pages_fast 2M page size)         ~493 ms -> ~50 ms
    (128G get_user_pages_fast altmap 2M page size)  ~3.91 ms -> ~70 ms
    (128G pin_user_pages_fast altmap 2M page size)  ~3.97 ms -> ~74 ms

The unpinning improvement patches are in mmotm/linux-next so removed from this
series.

I have deferred the __get_user_pages() patch to outside this series
(https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.martins@oracle.com/),
as I found an simpler way to address it and that is also applicable to
THP. But will submit that as a follow up of this.

Patches apply on top of linux-next tag next-20210325 (commit b4f20b70784a).

Comments and suggestions very much appreciated!

Changelog,

 RFC -> v1:
 (New patches 1-3, 5-8 but the diffstat is that different)
 * Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
 * Fix/Massage commit messages to be more clear and remove the 'we' occurences (Dan, John, Matthew)
 * Use pfn_align to be clear it's nr of pages for @align value (John, Dan)
 * Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
 * Remove the gup_device_compound_huge special path and have the same code
   work both ways while special casing when devmap page is compound (Jason, John)
 * Avoid usage of vmemmap_populate_basepages() and introduce a first class
   loop that doesn't care about passing an altmap for memmap reuse. (Dan)
 * Completely rework the vmemmap_populate_compound() to avoid the sparse_add_section
   hack into passing block across sparse_add_section calls. It's a lot easier to
   follow and more explicit in what it does.
 * Replace the vmemmap refactoring with adding a @pgmap argument and moving
   parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a result)
 * Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
 * Improve memmap_init_zone_device() to initialize compound pages when
   struct pages are cache warm. That lead to a even further speed up further
   from RFC series from 190ms -> 80-120ms. Patches 2 and 3 are the new ones
   as a result (Dan)
 * Remove PGMAP_COMPOUND and use @align as the property to detect whether
   or not to reuse vmemmap areas (Dan)

Thanks,
	Joao

Joao Martins (11):
  memory-failure: fetch compound_head after pgmap_pfn_valid()
  mm/page_alloc: split prep_compound_page into head and tail subparts
  mm/page_alloc: refactor memmap_init_zone_device() page init
  mm/memremap: add ZONE_DEVICE support for compound pages
  mm/sparse-vmemmap: add a pgmap argument to section activation
  mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
  mm/sparse-vmemmap: populate compound pagemaps
  mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
  mm/page_alloc: reuse tail struct pages for compound pagemaps
  device-dax: compound pagemap support
  mm/gup: grab head page refcount once for group of subpages

 drivers/dax/device.c           |  58 +++++++--
 include/linux/memory_hotplug.h |   5 +-
 include/linux/memremap.h       |  13 ++
 include/linux/mm.h             |   8 +-
 mm/gup.c                       |  52 +++++---
 mm/memory-failure.c            |   2 +
 mm/memory_hotplug.c            |   3 +-
 mm/memremap.c                  |   9 +-
 mm/page_alloc.c                | 126 +++++++++++++------
 mm/sparse-vmemmap.c            | 221 +++++++++++++++++++++++++++++----
 mm/sparse.c                    |  24 ++--
 11 files changed, 406 insertions(+), 115 deletions(-)

-- 
2.17.1
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-03-25 23:09 ` [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
dax_lock_page()).  For pagemap with compound pages fetch the
compound_head in case we are handling a tail page memory failure.

Currently this is a nop, but in the advent of compound pages in
dev_pagemap it allows memory_failure_dev_pagemap() to keep working.

Reported-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/memory-failure.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..94240d772623 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1318,6 +1318,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		goto out;
 	}
 
+	page = compound_head(page);
+
 	/*
 	 * Prevent the inode from being freed while we are interrogating
 	 * the address_space, typically this would be handled by
-- 
2.17.1
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  2021-03-25 23:09 ` [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-03-25 23:09 ` [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Split the utility function prep_compound_page() into head and tail
counterparts, and use them accordingly.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/page_alloc.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c53fe4fa10bf..43dd98446b0b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -692,24 +692,34 @@ void free_compound_page(struct page *page)
 	__free_pages_ok(page, compound_order(page), FPI_NONE);
 }
 
+static void prep_compound_head(struct page *page, unsigned int order)
+{
+	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
+	set_compound_order(page, order);
+	atomic_set(compound_mapcount_ptr(page), -1);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+}
+
+static void prep_compound_tail(struct page *head, int tail_idx)
+{
+	struct page *p = head + tail_idx;
+
+	set_page_count(p, 0);
+	p->mapping = TAIL_MAPPING;
+	set_compound_head(p, head);
+}
+
 void prep_compound_page(struct page *page, unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
 
 	__SetPageHead(page);
-	for (i = 1; i < nr_pages; i++) {
-		struct page *p = page + i;
-		set_page_count(p, 0);
-		p->mapping = TAIL_MAPPING;
-		set_compound_head(p, page);
-	}
+	for (i = 1; i < nr_pages; i++)
+		prep_compound_tail(page, i);
 
-	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
-	set_compound_order(page, order);
-	atomic_set(compound_mapcount_ptr(page), -1);
-	if (hpage_pincount_available(page))
-		atomic_set(compound_pincount_ptr(page), 0);
+	prep_compound_head(page, order);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
-- 
2.17.1
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  2021-03-25 23:09 ` [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
  2021-03-25 23:09 ` [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-03-25 23:09 ` [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Move struct page init to an helper function __init_zone_device_page().

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/page_alloc.c | 74 +++++++++++++++++++++++++++----------------------
 1 file changed, 41 insertions(+), 33 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 43dd98446b0b..58974067bbd4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6237,6 +6237,46 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
 }
 
 #ifdef CONFIG_ZONE_DEVICE
+static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
+					  unsigned long zone_idx, int nid,
+					  struct dev_pagemap *pgmap)
+{
+
+	__init_single_page(page, pfn, zone_idx, nid);
+
+	/*
+	 * Mark page reserved as it will need to wait for onlining
+	 * phase for it to be fully associated with a zone.
+	 *
+	 * We can use the non-atomic __set_bit operation for setting
+	 * the flag as we are still initializing the pages.
+	 */
+	__SetPageReserved(page);
+
+	/*
+	 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
+	 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
+	 * ever freed or placed on a driver-private list.
+	 */
+	page->pgmap = pgmap;
+	page->zone_device_data = NULL;
+
+	/*
+	 * Mark the block movable so that blocks are reserved for
+	 * movable at startup. This will force kernel allocations
+	 * to reserve their blocks rather than leaking throughout
+	 * the address space during boot when many long-lived
+	 * kernel allocations are made.
+	 *
+	 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
+	 * because this is done early in section_activate()
+	 */
+	if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
+		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+		cond_resched();
+	}
+}
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6265,39 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
-		__init_single_page(page, pfn, zone_idx, nid);
-
-		/*
-		 * Mark page reserved as it will need to wait for onlining
-		 * phase for it to be fully associated with a zone.
-		 *
-		 * We can use the non-atomic __set_bit operation for setting
-		 * the flag as we are still initializing the pages.
-		 */
-		__SetPageReserved(page);
-
-		/*
-		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
-		 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
-		 * ever freed or placed on a driver-private list.
-		 */
-		page->pgmap = pgmap;
-		page->zone_device_data = NULL;
-
-		/*
-		 * Mark the block movable so that blocks are reserved for
-		 * movable at startup. This will force kernel allocations
-		 * to reserve their blocks rather than leaking throughout
-		 * the address space during boot when many long-lived
-		 * kernel allocations are made.
-		 *
-		 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
-		 * because this is done early in section_activate()
-		 */
-		if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
-			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-			cond_resched();
-		}
+		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (2 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
       [not found]   ` <CAPcyv4gs_rHL7FPqyQEb3yT4jrv8Wo_xA2ojKsppoBfmDocq8A@mail.gmail.com>
  2021-03-25 23:09 ` [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 23+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Add a new align property for struct dev_pagemap which specifies that a
pagemap is composed of a set of compound pages of size @align, instead
of base pages. When these pages are initialised, most are initialised as
tail pages instead of order-0 pages.

For certain ZONE_DEVICE users like device-dax which have a fixed page
size, this creates an opportunity to optimize GUP and GUP-fast walkers,
treating it the same way as THP or hugetlb pages.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/memremap.h | 13 +++++++++++++
 mm/memremap.c            |  8 ++++++--
 mm/page_alloc.c          | 24 +++++++++++++++++++++++-
 3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index b46f63dcaed3..bb28d82dda5e 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -114,6 +114,7 @@ struct dev_pagemap {
 	struct completion done;
 	enum memory_type type;
 	unsigned int flags;
+	unsigned long align;
 	const struct dev_pagemap_ops *ops;
 	void *owner;
 	int nr_range;
@@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
 	return NULL;
 }
 
+static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
+{
+	if (!pgmap || !pgmap->align)
+		return PAGE_SIZE;
+	return pgmap->align;
+}
+
+static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
+{
+	return PHYS_PFN(pgmap_align(pgmap));
+}
+
 #ifdef CONFIG_ZONE_DEVICE
 bool pfn_zone_device_reserved(unsigned long pfn);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
diff --git a/mm/memremap.c b/mm/memremap.c
index 805d761740c4..d160853670c4 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
 				PHYS_PFN(range->start),
 				PHYS_PFN(range_len(range)), pgmap);
-	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
-			- pfn_first(pgmap, range_id));
+	if (pgmap_align(pgmap) > PAGE_SIZE)
+		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
+			- pfn_first(pgmap, range_id)) / pgmap_pfn_align(pgmap));
+	else
+		percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
+				- pfn_first(pgmap, range_id));
 	return 0;
 
 err_add_memory:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 58974067bbd4..3a77f9e43f3a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	unsigned long pfn, end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
+	unsigned int pfn_align = pgmap_pfn_align(pgmap);
+	unsigned int order_align = order_base_2(pfn_align);
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
@@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		nr_pages = end_pfn - start_pfn;
 	}
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
 		struct page *page = pfn_to_page(pfn);
+		unsigned long i;
 
 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+
+		if (pfn_align == 1)
+			continue;
+
+		__SetPageHead(page);
+
+		for (i = 1; i < pfn_align; i++) {
+			__init_zone_device_page(page + i, pfn + i, zone_idx,
+						nid, pgmap);
+			prep_compound_tail(page, i);
+
+			/*
+			 * The first and second tail pages need to
+			 * initialized first, hence the head page is
+			 * prepared last.
+			 */
+			if (i == 2)
+				prep_compound_head(page, order_align);
+		}
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (3 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-03-25 23:09 ` [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages() Joao Martins
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

@altmap is stored in a dev_pagemap, but it will be repurposed for
hotplug memory for storing the memmap in the hotplugged memory[*] and
reusing the altmap infrastructure to that end. This is to say that
altmap can't be replaced with a @pgmap as it is going to cover more than
dev_pagemap backend altmaps.

So in addition to @altmap, pass @pgmap to sparse section populate
functions namely:

	sparse_add_section
	  section_activate
	    populate_section_memmap
   	      __populate_section_memmap

Passing @pgmap allows __populate_section_memmap() to both fetch the align
in which memmap metadata is created for and also to let sparse-vmemmap
fetch pgmap ranges to co-relate to a given section and pick whether to just
reuse tail pages from past onlined sections.

[*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/memory_hotplug.h |  5 ++++-
 include/linux/mm.h             |  3 ++-
 mm/memory_hotplug.c            |  3 ++-
 mm/sparse-vmemmap.c            |  3 ++-
 mm/sparse.c                    | 24 +++++++++++++++---------
 5 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index a85d4b7d15c2..45532192934c 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -14,6 +14,7 @@ struct mem_section;
 struct memory_block;
 struct resource;
 struct vmem_altmap;
+struct dev_pagemap;
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 struct page *pfn_to_online_page(unsigned long pfn);
@@ -72,6 +73,7 @@ typedef int __bitwise mhp_t;
 struct mhp_params {
 	struct vmem_altmap *altmap;
 	pgprot_t pgprot;
+	struct dev_pagemap *pgmap;
 };
 
 bool mhp_range_allowed(u64 start, u64 size, bool need_mapping);
@@ -360,7 +362,8 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
 				       unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern int sparse_add_section(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap);
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap);
 extern void sparse_remove_section(struct mem_section *ms,
 		unsigned long pfn, unsigned long nr_pages,
 		unsigned long map_offset, struct vmem_altmap *altmap);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cb1e191da319..61474602c2b1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3033,7 +3033,8 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
 
 void *sparse_buffer_alloc(unsigned long size);
 struct page * __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap);
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e0b499a1fee4..2df3b2a7b4b5 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -384,7 +384,8 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages = min(end_pfn - pfn,
 				   SECTION_ALIGN_UP(pfn + 1) - pfn);
-		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap);
+		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap,
+					 params->pgmap);
 		if (err)
 			break;
 		cond_resched();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 16183d85a7d5..370728c206ee 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -249,7 +249,8 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 }
 
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
diff --git a/mm/sparse.c b/mm/sparse.c
index be66a62e22c3..c2abf1281a89 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -443,7 +443,8 @@ static unsigned long __init section_map_size(void)
 }
 
 struct page __init *__populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long size = section_map_size();
 	struct page *map = sparse_buffer_alloc(size);
@@ -542,7 +543,7 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 			break;
 
 		map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
-				nid, NULL);
+				nid, NULL, NULL);
 		if (!map) {
 			pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
 			       __func__, nid);
@@ -648,9 +649,10 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 static struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
-	return __populate_section_memmap(pfn, nr_pages, nid, altmap);
+	return __populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
 }
 
 static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
@@ -719,7 +721,8 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
 }
 #else
 struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	return kvmalloc_node(array_size(sizeof(struct page),
 					PAGES_PER_SECTION), GFP_KERNEL, nid);
@@ -842,7 +845,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 }
 
 static struct page * __meminit section_activate(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	struct mem_section *ms = __pfn_to_section(pfn);
 	struct mem_section_usage *usage = NULL;
@@ -874,7 +878,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
 	if (nr_pages < PAGES_PER_SECTION && early_section(ms))
 		return pfn_to_page(pfn);
 
-	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap);
+	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
 	if (!memmap) {
 		section_deactivate(pfn, nr_pages, altmap);
 		return ERR_PTR(-ENOMEM);
@@ -889,6 +893,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  * @start_pfn: start pfn of the memory range
  * @nr_pages: number of pfns to add in the section
  * @altmap: device page map
+ * @pgmap: device page map object that owns the section
  *
  * This is only intended for hotplug.
  *
@@ -902,7 +907,8 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  * * -ENOMEM	- Out of memory.
  */
 int __meminit sparse_add_section(int nid, unsigned long start_pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
 	struct mem_section *ms;
@@ -913,7 +919,7 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 	if (ret < 0)
 		return ret;
 
-	memmap = section_activate(nid, start_pfn, nr_pages, altmap);
+	memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
 	if (IS_ERR(memmap))
 		return PTR_ERR(memmap);
 
-- 
2.17.1
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (4 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-03-25 23:09 ` [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Move the actual pte population logic into a separate function
vmemmap_populate_address() and have vmemmap_populate_basepages()
walk through all base pages it needs to populate.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/sparse-vmemmap.c | 44 ++++++++++++++++++++++++++------------------
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 370728c206ee..8814876edcfa 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -216,33 +216,41 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
-					 int node, struct vmem_altmap *altmap)
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+					      struct vmem_altmap *altmap)
 {
-	unsigned long addr = start;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
+	pgd = vmemmap_pgd_populate(addr, node);
+	if (!pgd)
+		return -ENOMEM;
+	p4d = vmemmap_p4d_populate(pgd, addr, node);
+	if (!p4d)
+		return -ENOMEM;
+	pud = vmemmap_pud_populate(p4d, addr, node);
+	if (!pud)
+		return -ENOMEM;
+	pmd = vmemmap_pmd_populate(pud, addr, node);
+	if (!pmd)
+		return -ENOMEM;
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+	if (!pte)
+		return -ENOMEM;
+	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+}
+
+int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
+					 int node, struct vmem_altmap *altmap)
+{
+	unsigned long addr = start;
+
 	for (; addr < end; addr += PAGE_SIZE) {
-		pgd = vmemmap_pgd_populate(addr, node);
-		if (!pgd)
-			return -ENOMEM;
-		p4d = vmemmap_p4d_populate(pgd, addr, node);
-		if (!p4d)
-			return -ENOMEM;
-		pud = vmemmap_pud_populate(p4d, addr, node);
-		if (!pud)
-			return -ENOMEM;
-		pmd = vmemmap_pmd_populate(pud, addr, node);
-		if (!pmd)
-			return -ENOMEM;
-		pte = vmemmap_pte_populate(pmd, addr, node, altmap);
-		if (!pte)
+		if (vmemmap_populate_address(addr, node, altmap))
 			return -ENOMEM;
-		vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
 	}
 
 	return 0;
-- 
2.17.1
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (5 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages() Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-03-25 23:09 ` [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD " Joao Martins
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

A compound pagemap is a dev_pagemap with @align > PAGE_SIZE and it
means that pages are mapped at a given huge page alignment and utilize
uses compound pages as opposed to order-0 pages.

To minimize struct page overhead we take advantage of the fact that
most tail pages look the same (except the first two). We allocate a
separate page for the vmemmap area which contains the head page and
separate for the next 64 pages. The rest of the subsections then reuse
this tail vmemmap page to initialize the rest of the tail pages.

Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
when initializing compound pagemap with big enough @align (e.g. 1G
PUD) it  will cross various sections. To be able to reuse tail pages
across sections belonging to the same gigantic page we fetch the
@range being mapped (nr_ranges + 1).  If the section being mapped is
not offset 0 of the @align, then lookup the PFN of the struct page
address that preceeds it and use that to populate the entire
section.

On compound pagemaps with 2M align, this lets mechanism saves 6 pages
out of the 8 necessary PFNs necessary to set the subsection's 512
struct pages being mapped. On a 1G compound pagemap it saves
4094 pages.

Altmap isn't supported yet, given various restrictions in altmap pfn
allocator, thus fallback to the already in use vmemmap_populate().

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/mm.h  |   2 +-
 mm/memremap.c       |   1 +
 mm/sparse-vmemmap.c | 139 ++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 131 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 61474602c2b1..49d717ae40ae 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3040,7 +3040,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-			    struct vmem_altmap *altmap);
+			    struct vmem_altmap *altmap, void *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
 struct vmem_altmap;
 void *vmemmap_alloc_block_buf(unsigned long size, int node,
diff --git a/mm/memremap.c b/mm/memremap.c
index d160853670c4..2e6bc0b1ff00 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 {
 	struct mhp_params params = {
 		.altmap = pgmap_altmap(pgmap),
+		.pgmap = pgmap,
 		.pgprot = PAGE_KERNEL,
 	};
 	const int nr_range = pgmap->nr_range;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 8814876edcfa..f57c5eada099 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 }
 
 pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-				       struct vmem_altmap *altmap)
+				       struct vmem_altmap *altmap, void *block)
 {
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte)) {
 		pte_t entry;
-		void *p;
-
-		p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
-		if (!p)
-			return NULL;
+		void *p = block;
+
+		if (!block) {
+			p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+			if (!p)
+				return NULL;
+		} else if (!altmap) {
+			get_page(virt_to_page(block));
+		}
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
 		set_pte_at(&init_mm, addr, pte, entry);
 	}
@@ -217,7 +221,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 }
 
 static int __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap)
+					      struct vmem_altmap *altmap,
+					      void *page, void **ptr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -237,10 +242,14 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pmd = vmemmap_pmd_populate(pud, addr, node);
 	if (!pmd)
 		return -ENOMEM;
-	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
 	if (!pte)
 		return -ENOMEM;
 	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+
+	if (ptr)
+		*ptr = __va(__pfn_to_phys(pte_pfn(*pte)));
+	return 0;
 }
 
 int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
@@ -249,7 +258,110 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 	unsigned long addr = start;
 
 	for (; addr < end; addr += PAGE_SIZE) {
-		if (vmemmap_populate_address(addr, node, altmap))
+		if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int __meminit vmemmap_populate_range(unsigned long start,
+					    unsigned long end,
+					    int node, void *page)
+{
+	unsigned long addr = start;
+
+	for (; addr < end; addr += PAGE_SIZE) {
+		if (vmemmap_populate_address(addr, node, NULL, page, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
+						  void **ptr)
+{
+	return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
+}
+
+static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = pgd_offset_k(addr);
+	if (pgd_none(*pgd))
+		return NULL;
+
+	p4d = p4d_offset(pgd, addr);
+	if (p4d_none(*p4d))
+		return NULL;
+
+	pud = pud_offset(p4d, addr);
+	if (pud_none(*pud))
+		return NULL;
+
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd))
+		return NULL;
+
+	pte = pte_offset_kernel(pmd, addr);
+	if (pte_none(*pte))
+		return NULL;
+
+	return pte;
+}
+
+static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
+						     unsigned long start,
+						     unsigned long end, int node,
+						     struct dev_pagemap *pgmap)
+{
+	unsigned long offset, size, addr;
+
+	/*
+	 * For compound pages bigger than section size (e.g. 1G) fill the rest
+	 * of sections as tail pages.
+	 *
+	 * Note that memremap_pages() resets @nr_range value and will increment
+	 * it after each range successful onlining. Thus the value or @nr_range
+	 * at section memmap populate corresponds to the in-progress range
+	 * being onlined that we care about.
+	 */
+	offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
+	if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
+	    pgmap_align(pgmap) > SUBSECTION_SIZE) {
+		pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
+
+		if (!ptep)
+			return -ENOMEM;
+
+		return vmemmap_populate_range(start, end, node,
+					      page_to_virt(pte_page(*ptep)));
+	}
+
+	size = min(end - start, pgmap_pfn_align(pgmap) * sizeof(struct page));
+	for (addr = start; addr < end; addr += size) {
+		unsigned long next = addr, last = addr + size;
+		void *block;
+
+		/* Populate the head page vmemmap page */
+		if (vmemmap_populate_page(addr, node, NULL))
+			return -ENOMEM;
+
+		/* Populate the tail pages vmemmap page */
+		block = NULL;
+		next = addr + PAGE_SIZE;
+		if (vmemmap_populate_page(next, node, &block))
+			return -ENOMEM;
+
+		/* Reuse the previous page for the rest of tail pages */
+		next += PAGE_SIZE;
+		if (vmemmap_populate_range(next, last, node, block))
 			return -ENOMEM;
 	}
 
@@ -262,12 +374,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
+	unsigned int align = pgmap_align(pgmap);
+	int r;
 
 	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
 		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
 		return NULL;
 
-	if (vmemmap_populate(start, end, nid, altmap))
+	if (align > PAGE_SIZE && !altmap)
+		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
+	else
+		r = vmemmap_populate(start, end, nid, altmap);
+
+	if (r < 0)
 		return NULL;
 
 	return pfn_to_page(pfn);
-- 
2.17.1
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (6 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-03-25 23:09 ` [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for " Joao Martins
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Right now basepages are used to populate the PUD tail pages, and it
picks the address of the previous page of the subsection that preceeds
the memmap we are initializing.  This is done when a given memmap
address isn't aligned to the pgmap @align (which is safe to do because
@ranges are guaranteed to be aligned to @align).

For pagemaps with an align which spans various sections, this means
that PMD pages are unnecessarily allocated for reusing the same tail
pages.  Effectively, on x86 a PUD can span 8 sections (depending on
config), and a page is being  allocated a page for the PMD to reuse
the tail vmemmap across the rest of the PTEs. In short effecitvely the
PMD cover the tail vmemmap areas all contain the same PFN. So instead
of doing this way, populate a new PMD on the second section of the
compound page (tail vmemmap PMD), and then the following sections
utilize the preceding PMD we previously populated which only contain
tail pages).

After this scheme for an 1GB pagemap aligned area, the first PMD
(section) would contain head page and 32767 tail pages, where the
second PMD contains the full 32768 tail pages.  The latter page gets
its PMD reused across future section mapping of the same pagemap.

Besides fewer pagetable entries allocated, keeping parity with
hugepages in the directmap (as done by vmemmap_populate_hugepages()),
this further increases savings per compound page. For each PUD-aligned
pagemap we go from 40960 bytes down to 16384 bytes. Rather than
requiring 8 PMD page allocations we only need 2 (plus two base pages
allocated for head and tail areas for the first PMD). 2M pages still
require using base pages, though.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/mm.h  |  3 +-
 mm/sparse-vmemmap.c | 79 ++++++++++++++++++++++++++++++++++-----------
 2 files changed, 63 insertions(+), 19 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 49d717ae40ae..9c1a676d6b95 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3038,7 +3038,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
-pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
+pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+			    void *block);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
 			    struct vmem_altmap *altmap, void *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index f57c5eada099..291a8a32480a 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -172,13 +172,20 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
 	return p;
 }
 
-pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
+pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+				       void *block)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
-		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
-		if (!p)
-			return NULL;
+		void *p = block;
+
+		if (!block) {
+			p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
+			if (!p)
+				return NULL;
+		} else {
+			get_page(virt_to_page(block));
+		}
 		pmd_populate_kernel(&init_mm, pmd, p);
 	}
 	return pmd;
@@ -220,15 +227,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-static int __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap,
-					      void *page, void **ptr)
+static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
+						  struct vmem_altmap *altmap,
+						  void *page, pmd_t **ptr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte;
 
 	pgd = vmemmap_pgd_populate(addr, node);
 	if (!pgd)
@@ -239,9 +245,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pud = vmemmap_pud_populate(p4d, addr, node);
 	if (!pud)
 		return -ENOMEM;
-	pmd = vmemmap_pmd_populate(pud, addr, node);
+	pmd = vmemmap_pmd_populate(pud, addr, node, page);
 	if (!pmd)
 		return -ENOMEM;
+	if (ptr)
+		*ptr = pmd;
+	return 0;
+}
+
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+					      struct vmem_altmap *altmap,
+					      void *page, void **ptr)
+{
+	pmd_t *pmd;
+	pte_t *pte;
+
+	if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
+		return -ENOMEM;
+
 	pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
 	if (!pte)
 		return -ENOMEM;
@@ -285,13 +306,26 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
 	return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
 }
 
-static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
+static int __meminit vmemmap_populate_pmd_range(unsigned long start,
+						unsigned long end,
+						int node, void *page)
+{
+	unsigned long addr = start;
+
+	for (; addr < end; addr += PMD_SIZE) {
+		if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static pmd_t * __meminit vmemmap_lookup_address(unsigned long addr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte;
 
 	pgd = pgd_offset_k(addr);
 	if (pgd_none(*pgd))
@@ -309,11 +343,7 @@ static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
 	if (pmd_none(*pmd))
 		return NULL;
 
-	pte = pte_offset_kernel(pmd, addr);
-	if (pte_none(*pte))
-		return NULL;
-
-	return pte;
+	return pmd;
 }
 
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
@@ -335,9 +365,22 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 	offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
 	if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
 	    pgmap_align(pgmap) > SUBSECTION_SIZE) {
-		pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		addr = start - PAGE_SIZE;
+		pmdp = vmemmap_lookup_address(addr);
+		if (!pmdp)
+			return -ENOMEM;
+
+		/* Reuse the tail pages vmemmap pmd page */
+		if (offset % pgmap->align > PFN_PHYS(PAGES_PER_SECTION))
+			return vmemmap_populate_pmd_range(start, end, node,
+						page_to_virt(pmd_page(*pmdp)));
 
-		if (!ptep)
+		/* Populate the tail pages vmemmap pmd page */
+		ptep = pte_offset_kernel(pmdp, addr);
+		if (pte_none(*ptep))
 			return -ENOMEM;
 
 		return vmemmap_populate_range(start, end, node,
-- 
2.17.1
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (7 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD " Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
       [not found]   ` <CAPcyv4gtSqfmuAaX9cs63OvLkf-h4B_5fPiEnM9p9cqLZztXpg@mail.gmail.com>
  2021-03-25 23:09 ` [PATCH v1 10/11] device-dax: compound pagemap support Joao Martins
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 23+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

When a pgmap @align is set, all pages are mapped at a given huge page
alignment and thus uses compound pages to describe them as opposed to a
struct per 4K.

With @align > PAGE_SIZE and when struct pages are stored in ram
(!altmap) most tail pages are reused. Consequently, the amount of unique
struct pages is a lot smaller that the total amount of struct pages
being mapped.

When struct pages are initialize in memmap_init_zone_device, make
sure that only unique struct pages are initialized i.e. the first 2
4K pages per @align which means 128 struct pages, instead of 32768 for
2M @align or 262144 for a 1G @align.

Keep old behaviour with altmap given that it doesn't support reusal
of tail vmemmap areas.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/page_alloc.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3a77f9e43f3a..948dfad6754b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6277,6 +6277,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	}
 }
 
+#define MEMMAP_NR_PAGES	(2 * (PAGE_SIZE/sizeof(struct page)))
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6287,6 +6289,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
 	unsigned int pfn_align = pgmap_pfn_align(pgmap);
 	unsigned int order_align = order_base_2(pfn_align);
+	unsigned long ntails = min_t(unsigned long, pfn_align, MEMMAP_NR_PAGES);
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
@@ -6302,6 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	if (altmap) {
 		start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
 		nr_pages = end_pfn - start_pfn;
+		ntails = pfn_align;
 	}
 
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
@@ -6315,7 +6319,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 
 		__SetPageHead(page);
 
-		for (i = 1; i < pfn_align; i++) {
+		for (i = 1; i < ntails; i++) {
 			__init_zone_device_page(page + i, pfn + i, zone_idx,
 						nid, pgmap);
 			prep_compound_tail(page, i);
-- 
2.17.1
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v1 10/11] device-dax: compound pagemap support
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (8 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for " Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
       [not found]   ` <CAPcyv4jeY0K7ciWeCLjxXmiWs7NNeM-_zEdZ2XAdYnyZc9PvWA@mail.gmail.com>
  2021-03-25 23:09 ` [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages Joao Martins
  2021-04-01  9:38 ` [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  11 siblings, 1 reply; 23+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

dax devices are created with a fixed @align (huge page size) which
is enforced through as well at mmap() of the device. Faults,
consequently happen too at the specified @align specified at the
creation, and those don't change through out dax device lifetime.
MCEs poisons a whole dax huge page, as well as splits occurring at
at the configured page size.

Use the newly added compound pagemap facility which maps the
assigned dax ranges as compound pages at a page size of @align.
Currently, this means, that region/namespace bootstrap would take
considerably less, given that you would initialize considerably less
pages.

On setups with 128G NVDIMMs the initialization with DRAM stored struct pages
improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less than
a 1msec with 1G pages.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/device.c | 58 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 45 insertions(+), 13 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index db92573c94e8..e3dcc4ad1727 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -192,6 +192,43 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 }
 #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
+static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
+			     unsigned long fault_size,
+			     struct address_space *f_mapping)
+{
+	unsigned long i;
+	pgoff_t pgoff;
+
+	pgoff = linear_page_index(vmf->vma, vmf->address
+			& ~(fault_size - 1));
+
+	for (i = 0; i < fault_size / PAGE_SIZE; i++) {
+		struct page *page;
+
+		page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+		if (page->mapping)
+			continue;
+		page->mapping = f_mapping;
+		page->index = pgoff + i;
+	}
+}
+
+static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
+				 unsigned long fault_size,
+				 struct address_space *f_mapping)
+{
+	struct page *head;
+
+	head = pfn_to_page(pfn_t_to_pfn(pfn));
+	head = compound_head(head);
+	if (head->mapping)
+		return;
+
+	head->mapping = f_mapping;
+	head->index = linear_page_index(vmf->vma, vmf->address
+			& ~(fault_size - 1));
+}
+
 static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
@@ -225,8 +262,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 	}
 
 	if (rc == VM_FAULT_NOPAGE) {
-		unsigned long i;
-		pgoff_t pgoff;
+		struct dev_pagemap *pgmap = pfn_t_to_page(pfn)->pgmap;
 
 		/*
 		 * In the device-dax case the only possibility for a
@@ -234,17 +270,10 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		 * mapped. No need to consider the zero page, or racing
 		 * conflicting mappings.
 		 */
-		pgoff = linear_page_index(vmf->vma, vmf->address
-				& ~(fault_size - 1));
-		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
-			struct page *page;
-
-			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
-			if (page->mapping)
-				continue;
-			page->mapping = filp->f_mapping;
-			page->index = pgoff + i;
-		}
+		if (pgmap->align > PAGE_SIZE)
+			set_compound_mapping(vmf, pfn, fault_size, filp->f_mapping);
+		else
+			set_page_mapping(vmf, pfn, fault_size, filp->f_mapping);
 	}
 	dax_read_unlock(id);
 
@@ -426,6 +455,9 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 	}
 
 	pgmap->type = MEMORY_DEVICE_GENERIC;
+	if (dev_dax->align > PAGE_SIZE)
+		pgmap->align = dev_dax->align;
+
 	addr = devm_memremap_pages(dev, pgmap);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
-- 
2.17.1
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (9 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 10/11] device-dax: compound pagemap support Joao Martins
@ 2021-03-25 23:09 ` Joao Martins
  2021-06-02  1:05   ` Dan Williams
  2021-04-01  9:38 ` [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  11 siblings, 1 reply; 23+ messages in thread
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Much like hugetlbfs or THPs, treat device pagemaps with
compound pages like the rest of GUP handling of compound pages.

Rather than incrementing the refcount every 4K, we record
all sub pages and increment by @refs amount *once*.

Performance measured by gup_benchmark improves considerably
get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:

 $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
(get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
(pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
[altmap]
(get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
(pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms

 $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
(get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
(pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
[altmap with -m 127004]
(get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
(pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/gup.c | 52 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 32 insertions(+), 20 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index b3e647c8b7ee..514f12157a0f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2159,31 +2159,54 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 }
 #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
 
+
+static int record_subpages(struct page *page, unsigned long addr,
+			   unsigned long end, struct page **pages)
+{
+	int nr;
+
+	for (nr = 0; addr != end; addr += PAGE_SIZE)
+		pages[nr++] = page++;
+
+	return nr;
+}
+
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 			     unsigned long end, unsigned int flags,
 			     struct page **pages, int *nr)
 {
-	int nr_start = *nr;
+	int refs, nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
 
 	do {
-		struct page *page = pfn_to_page(pfn);
+		struct page *head, *page = pfn_to_page(pfn);
+		unsigned long next;
 
 		pgmap = get_dev_pagemap(pfn, pgmap);
 		if (unlikely(!pgmap)) {
 			undo_dev_pagemap(nr, nr_start, flags, pages);
 			return 0;
 		}
-		SetPageReferenced(page);
-		pages[*nr] = page;
-		if (unlikely(!try_grab_page(page, flags))) {
-			undo_dev_pagemap(nr, nr_start, flags, pages);
+
+		head = compound_head(page);
+		next = PageCompound(head) ? end : addr + PAGE_SIZE;
+		refs = record_subpages(page, addr, next, pages + *nr);
+
+		SetPageReferenced(head);
+		head = try_grab_compound_head(head, refs, flags);
+		if (!head) {
+			if (PageCompound(head)) {
+				ClearPageReferenced(head);
+				put_dev_pagemap(pgmap);
+			} else {
+				undo_dev_pagemap(nr, nr_start, flags, pages);
+			}
 			return 0;
 		}
-		(*nr)++;
-		pfn++;
-	} while (addr += PAGE_SIZE, addr != end);
+		*nr += refs;
+		pfn += refs;
+	} while (addr += (refs << PAGE_SHIFT), addr != end);
 
 	if (pgmap)
 		put_dev_pagemap(pgmap);
@@ -2243,17 +2266,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
-static int record_subpages(struct page *page, unsigned long addr,
-			   unsigned long end, struct page **pages)
-{
-	int nr;
-
-	for (nr = 0; addr != end; addr += PAGE_SIZE)
-		pages[nr++] = page++;
-
-	return nr;
-}
-
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
-- 
2.17.1
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps
  2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (10 preceding siblings ...)
  2021-03-25 23:09 ` [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages Joao Martins
@ 2021-04-01  9:38 ` Joao Martins
  11 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-04-01  9:38 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton

On 3/25/21 11:09 PM, Joao Martins wrote:

[...]

> Patch 11: Optimize grabbing page refcount changes given that we
> are working with compound pages i.e. we do 1 increment to the head
> page for a given set of N subpages compared as opposed to N individual writes.
> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
> improves considerably with DRAM stored struct pages. It also *greatly*
> improves pinning with altmap. Results with gup_test:
> 
>                                                    before     after
>     (16G get_user_pages_fast 2M page size)         ~59 ms -> ~6.1 ms
>     (16G pin_user_pages_fast 2M page size)         ~87 ms -> ~6.2 ms
>     (16G get_user_pages_fast altmap 2M page size) ~494 ms -> ~9 ms
>     (16G pin_user_pages_fast altmap 2M page size) ~494 ms -> ~10 ms
> 
>     altmap performance gets specially interesting when pinning a pmem dimm:
> 
>                                                    before     after
>     (128G get_user_pages_fast 2M page size)         ~492 ms -> ~49 ms
>     (128G pin_user_pages_fast 2M page size)         ~493 ms -> ~50 ms
>     (128G get_user_pages_fast altmap 2M page size)  ~3.91 ms -> ~70 ms
>     (128G pin_user_pages_fast altmap 2M page size)  ~3.97 ms -> ~74 ms
> 

Quick correction: These last two 3.91 and 3.97 on the left column are in *seconds* not
milliseconds. By mistake I added an extra 'm'. Sorry about that.

> The unpinning improvement patches are in mmotm/linux-next so removed from this
> series.
> 
> I have deferred the __get_user_pages() patch to outside this series
> (https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.martins@oracle.com/),
> as I found an simpler way to address it and that is also applicable to
> THP. But will submit that as a follow up of this.
> 
> Patches apply on top of linux-next tag next-20210325 (commit b4f20b70784a).
> 
> Comments and suggestions very much appreciated!
> 
> Changelog,
> 
>  RFC -> v1:
>  (New patches 1-3, 5-8 but the diffstat is that different)
>  * Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
>  * Fix/Massage commit messages to be more clear and remove the 'we' occurences (Dan, John, Matthew)

I just noticed that I haven't fully removed the 'we' occurrences. Patches 7 and 8 had
their commit messages rewritten and I mistakenly re-introduced remnants of 'we'. I will
have it fixed for (v2) albeit I'll still wait for comments on this series before following up.

>  * Use pfn_align to be clear it's nr of pages for @align value (John, Dan)
>  * Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
>  * Remove the gup_device_compound_huge special path and have the same code
>    work both ways while special casing when devmap page is compound (Jason, John)
>  * Avoid usage of vmemmap_populate_basepages() and introduce a first class
>    loop that doesn't care about passing an altmap for memmap reuse. (Dan)
>  * Completely rework the vmemmap_populate_compound() to avoid the sparse_add_section
>    hack into passing block across sparse_add_section calls. It's a lot easier to
>    follow and more explicit in what it does.
>  * Replace the vmemmap refactoring with adding a @pgmap argument and moving
>    parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a result)
>  * Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
>  * Improve memmap_init_zone_device() to initialize compound pages when
>    struct pages are cache warm. That lead to a even further speed up further
>    from RFC series from 190ms -> 80-120ms. Patches 2 and 3 are the new ones
>    as a result (Dan)
>  * Remove PGMAP_COMPOUND and use @align as the property to detect whether
>    or not to reuse vmemmap areas (Dan)
> 
> Thanks,
> 	Joao
> 
> Joao Martins (11):
>   memory-failure: fetch compound_head after pgmap_pfn_valid()
>   mm/page_alloc: split prep_compound_page into head and tail subparts
>   mm/page_alloc: refactor memmap_init_zone_device() page init
>   mm/memremap: add ZONE_DEVICE support for compound pages
>   mm/sparse-vmemmap: add a pgmap argument to section activation
>   mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
>   mm/sparse-vmemmap: populate compound pagemaps
>   mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
>   mm/page_alloc: reuse tail struct pages for compound pagemaps
>   device-dax: compound pagemap support
>   mm/gup: grab head page refcount once for group of subpages
> 
>  drivers/dax/device.c           |  58 +++++++--
>  include/linux/memory_hotplug.h |   5 +-
>  include/linux/memremap.h       |  13 ++
>  include/linux/mm.h             |   8 +-
>  mm/gup.c                       |  52 +++++---
>  mm/memory-failure.c            |   2 +
>  mm/memory_hotplug.c            |   3 +-
>  mm/memremap.c                  |   9 +-
>  mm/page_alloc.c                | 126 +++++++++++++------
>  mm/sparse-vmemmap.c            | 221 +++++++++++++++++++++++++++++----
>  mm/sparse.c                    |  24 ++--
>  11 files changed, 406 insertions(+), 115 deletions(-)
> 
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages
  2021-03-25 23:09 ` [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages Joao Martins
@ 2021-06-02  1:05   ` Dan Williams
  2021-06-07 15:21     ` Joao Martins
  0 siblings, 1 reply; 23+ messages in thread
From: Dan Williams @ 2021-06-02  1:05 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, Matthew Wilcox, Jason Gunthorpe, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, nvdimm

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Much like hugetlbfs or THPs, treat device pagemaps with
> compound pages like the rest of GUP handling of compound pages.
>

How about:

"Use try_grab_compound_head() for device-dax GUP when configured with
a compound pagemap."

> Rather than incrementing the refcount every 4K, we record
> all sub pages and increment by @refs amount *once*.

"Rather than incrementing the refcount for each page, do one atomic
addition for all the pages to be pinned."

>
> Performance measured by gup_benchmark improves considerably
> get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:
>
>  $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
> (get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
> (pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
> [altmap]
> (get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
> (pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms

Hmm what is altmap representing here? The altmap case does not support
compound geometry, so this last test is comparing pinning this amount
of memory without compound pages where the memmap is in PMEM to the
speed *with* compound pages and the memmap in DRAM?

>
>  $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
> (get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
> (pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
> [altmap with -m 127004]
> (get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
> (pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/gup.c | 52 ++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 32 insertions(+), 20 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index b3e647c8b7ee..514f12157a0f 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2159,31 +2159,54 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  }
>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>
> +
> +static int record_subpages(struct page *page, unsigned long addr,
> +                          unsigned long end, struct page **pages)
> +{
> +       int nr;
> +
> +       for (nr = 0; addr != end; addr += PAGE_SIZE)
> +               pages[nr++] = page++;
> +
> +       return nr;
> +}
> +
>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>                              unsigned long end, unsigned int flags,
>                              struct page **pages, int *nr)
>  {
> -       int nr_start = *nr;
> +       int refs, nr_start = *nr;
>         struct dev_pagemap *pgmap = NULL;
>
>         do {
> -               struct page *page = pfn_to_page(pfn);
> +               struct page *head, *page = pfn_to_page(pfn);
> +               unsigned long next;
>
>                 pgmap = get_dev_pagemap(pfn, pgmap);
>                 if (unlikely(!pgmap)) {
>                         undo_dev_pagemap(nr, nr_start, flags, pages);
>                         return 0;
>                 }
> -               SetPageReferenced(page);
> -               pages[*nr] = page;
> -               if (unlikely(!try_grab_page(page, flags))) {
> -                       undo_dev_pagemap(nr, nr_start, flags, pages);
> +
> +               head = compound_head(page);
> +               next = PageCompound(head) ? end : addr + PAGE_SIZE;

This looks a tad messy, and makes assumptions that upper layers are
not sending this routine multiple huge pages to map. next should be
set to the next compound page, not end.

> +               refs = record_subpages(page, addr, next, pages + *nr);
> +
> +               SetPageReferenced(head);
> +               head = try_grab_compound_head(head, refs, flags);
> +               if (!head) {
> +                       if (PageCompound(head)) {

@head is NULL here, I think you wanted to rename the result of
try_grab_compound_head() to something like pinned_head so that you
don't undo the work you did above. However I feel like there's one too
PageCompund() checks.


> +                               ClearPageReferenced(head);
> +                               put_dev_pagemap(pgmap);
> +                       } else {
> +                               undo_dev_pagemap(nr, nr_start, flags, pages);
> +                       }
>                         return 0;
>                 }
> -               (*nr)++;
> -               pfn++;
> -       } while (addr += PAGE_SIZE, addr != end);
> +               *nr += refs;
> +               pfn += refs;
> +       } while (addr += (refs << PAGE_SHIFT), addr != end);
>
>         if (pgmap)
>                 put_dev_pagemap(pgmap);
> @@ -2243,17 +2266,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
>  }
>  #endif
>
> -static int record_subpages(struct page *page, unsigned long addr,
> -                          unsigned long end, struct page **pages)
> -{
> -       int nr;
> -
> -       for (nr = 0; addr != end; addr += PAGE_SIZE)
> -               pages[nr++] = page++;
> -
> -       return nr;
> -}
> -
>  #ifdef CONFIG_ARCH_HAS_HUGEPD
>  static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
>                                       unsigned long sz)
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for compound pagemaps
       [not found]   ` <CAPcyv4gtSqfmuAaX9cs63OvLkf-h4B_5fPiEnM9p9cqLZztXpg@mail.gmail.com>
@ 2021-06-07 13:48     ` Joao Martins
  2021-06-07 19:32       ` Dan Williams
  0 siblings, 1 reply; 23+ messages in thread
From: Joao Martins @ 2021-06-07 13:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Linux NVDIMM, Ira Weiny, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton



On 6/2/21 12:35 AM, Dan Williams wrote:
> Just like patch8 lets rename this "optimize memmap_init_zone_device()
> for compound page maps"
> 
> 
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> When a pgmap @align is set,
> 
> Rewrite this to talk about geometry rather than align per previous
> patch comments.
> 
OK.

>> all pages are mapped at a given huge page
>> alignment and thus uses compound pages to describe them as opposed to a
> 
> s/thus uses/use/
>
/me nods

>> struct per 4K.
>>
>> With @align > PAGE_SIZE and when struct pages are stored in ram
>> (!altmap) most tail pages are reused. Consequently, the amount of unique
>> struct pages is a lot smaller that the total amount of struct pages
>> being mapped.
>>
>> When struct pages are initialize in memmap_init_zone_device, make
>> sure that only unique struct pages are initialized i.e. the first 2
>> 4K pages per @align which means 128 struct pages, instead of 32768 for
>> 2M @align or 262144 for a 1G @align.
> 
> This is the important detail to put at the beginning and then follow
> with the detailed description, i.e. something like:
> 
> "Currently memmap_init_zone_device() ends up initializing 32768 pages
> when it only needs to initialize 128 given tail page reuse. That
> number is worse with 1GB compound page geometries, 262144 instead of
> 128. Update memmap_init_zone_device() to skip redundant
> initialization, details below:"
> 
/me nods

>> Keep old behaviour with altmap given that it doesn't support reusal
>> of tail vmemmap areas.
> 
> I think I like how you described this earlier as the (!altmap) path. So:
> 
> "The !altmap path is left alone since it does not support compound
> page map geometries."
> 
It's more accurate to say that "does not support memory savings with
compound page map geometries."

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/page_alloc.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3a77f9e43f3a..948dfad6754b 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6277,6 +6277,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>>         }
>>  }
>>
>> +#define MEMMAP_NR_PAGES        (2 * (PAGE_SIZE/sizeof(struct page)))
> 
> This seems too generic a name for something that is specific to the
> compound page init case... more below.
> 
>> +
>>  void __ref memmap_init_zone_device(struct zone *zone,
>>                                    unsigned long start_pfn,
>>                                    unsigned long nr_pages,
>> @@ -6287,6 +6289,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>         struct vmem_altmap *altmap = pgmap_altmap(pgmap);
>>         unsigned int pfn_align = pgmap_pfn_align(pgmap);
>>         unsigned int order_align = order_base_2(pfn_align);
>> +       unsigned long ntails = min_t(unsigned long, pfn_align, MEMMAP_NR_PAGES);
> 
> I'd rather delay this to a specific "if (compound)" branch below
> because compound init is so different from the base page case.
> 
Perhaps this is not obvious, but both altmap *and* non-altmap case would use compound pages.

This ntails default value difference refers to the number of *unique* pages (...)

>>         unsigned long zone_idx = zone_idx(zone);
>>         unsigned long start = jiffies;
>>         int nid = pgdat->node_id;
>> @@ -6302,6 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>         if (altmap) {
>>                 start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
>>                 nr_pages = end_pfn - start_pfn;
>> +               ntails = pfn_align;
> 
> Why would the altmap case need to init ntails, won't the inner loop of
> compound page setup be skipped?
> 
(...) Which means the answer here is no. The altmap case does initialize tail pages, but
contrary to the non-altmap, all tail struct pages are unique (no savings) and thus all
need to be initialized.

>>         }
> 
> Given all of the above I'm wondering if there should be a new
> "compound specific" flavor of this routine rather than trying to
> cleverly inter mingle the old path with the new. This is easier
> because you have already done the work in previous patches to break
> this into helpers. So just have memmap_init_zone_device() do it the
> "old" way and memmap_init_compound() handle all the tail page init +
> optimizations.
> 
I can separate it out, should be easier to follow.

Albeit just a note, I think memmap_init_compound() should be the new normal as metadata
more accurately represents what goes on the page tables. That's regardless of
vmemmap-based gains, and hence why my train of thought was to not separate it.

After this series, all devmap pages where @geometry matches @align will have compound
pages be used instead. And we enable that in device-dax as first user (in the next patch).
altmap or not so far just differentiates on the uniqueness of struct pages as the former
doesn't reuse base pages that only contain tail pages and consequently makes us initialize
all tail struct pages.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 10/11] device-dax: compound pagemap support
       [not found]   ` <CAPcyv4jeY0K7ciWeCLjxXmiWs7NNeM-_zEdZ2XAdYnyZc9PvWA@mail.gmail.com>
@ 2021-06-07 13:59     ` Joao Martins
  0 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-06-07 13:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, Matthew Wilcox, Jason Gunthorpe, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Linux NVDIMM



On 6/2/21 1:36 AM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> dax devices are created with a fixed @align (huge page size) which
>> is enforced through as well at mmap() of the device. Faults,
>> consequently happen too at the specified @align specified at the
>> creation, and those don't change through out dax device lifetime.
>> MCEs poisons a whole dax huge page, as well as splits occurring at
>> at the configured page size.
> 
> This paragraph last...
> 
/me nods

>>
>> Use the newly added compound pagemap facility which maps the
>> assigned dax ranges as compound pages at a page size of @align.
>> Currently, this means, that region/namespace bootstrap would take
>> considerably less, given that you would initialize considerably less
>> pages.
> 
> This paragraph should go first...
> 
/me nods

>>
>> On setups with 128G NVDIMMs the initialization with DRAM stored struct pages
>> improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less than
>> a 1msec with 1G pages.
> 
> This paragraph second...
> 
/me nods

> 
> The reason for this ordering is to have increasingly more detail as
> the changelog is read so that people that don't care about the details
> can get the main theme immediately, and others that wonder why
> device-dax is able to support this can read deeper.
> 
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  drivers/dax/device.c | 58 ++++++++++++++++++++++++++++++++++----------
>>  1 file changed, 45 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>> index db92573c94e8..e3dcc4ad1727 100644
>> --- a/drivers/dax/device.c
>> +++ b/drivers/dax/device.c
>> @@ -192,6 +192,43 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
>>  }
>>  #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>>
>> +static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
>> +                            unsigned long fault_size,
>> +                            struct address_space *f_mapping)
>> +{
>> +       unsigned long i;
>> +       pgoff_t pgoff;
>> +
>> +       pgoff = linear_page_index(vmf->vma, vmf->address
>> +                       & ~(fault_size - 1));
> 
> I know you are just copying this style from whomever wrote it this way
> originally, but that person (me) was wrong this should be:
> 
> pgoff = linear_page_index(vmf->vma, ALIGN(vmf->address, fault_size));
> 
> ...you might do a lead-in cleanup patch before this one.
> 
Yeap, will do.

> 
>> +
>> +       for (i = 0; i < fault_size / PAGE_SIZE; i++) {
>> +               struct page *page;
>> +
>> +               page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
>> +               if (page->mapping)
>> +                       continue;
>> +               page->mapping = f_mapping;
>> +               page->index = pgoff + i;
>> +       }
>> +}
>> +
>> +static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
>> +                                unsigned long fault_size,
>> +                                struct address_space *f_mapping)
>> +{
>> +       struct page *head;
>> +
>> +       head = pfn_to_page(pfn_t_to_pfn(pfn));
>> +       head = compound_head(head);
>> +       if (head->mapping)
>> +               return;
>> +
>> +       head->mapping = f_mapping;
>> +       head->index = linear_page_index(vmf->vma, vmf->address
>> +                       & ~(fault_size - 1));
>> +}
>> +
>>  static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>>                 enum page_entry_size pe_size)
>>  {
>> @@ -225,8 +262,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>>         }
>>
>>         if (rc == VM_FAULT_NOPAGE) {
>> -               unsigned long i;
>> -               pgoff_t pgoff;
>> +               struct dev_pagemap *pgmap = pfn_t_to_page(pfn)->pgmap;
> 
> The device should already know its pagemap...
> 
> There is a distinction in dev_dax_probe() for "static" vs "dynamic"
> pgmap, but once the pgmap is allocated it should be fine to assign it
> back to dev_dax->pgmap in the "dynamic" case. That could be a lead-in
> patch to make dev_dax->pgmap always valid.
> 
I suppose you mean to always set dev_dax->pgmap at the end of the
'if (!pgmap)' in dev_dax_probe() after we allocate the pgmap.

I will make this a separate cleanup patch as you suggested.

>>
>>                 /*
>>                  * In the device-dax case the only possibility for a
>> @@ -234,17 +270,10 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>>                  * mapped. No need to consider the zero page, or racing
>>                  * conflicting mappings.
>>                  */
>> -               pgoff = linear_page_index(vmf->vma, vmf->address
>> -                               & ~(fault_size - 1));
>> -               for (i = 0; i < fault_size / PAGE_SIZE; i++) {
>> -                       struct page *page;
>> -
>> -                       page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
>> -                       if (page->mapping)
>> -                               continue;
>> -                       page->mapping = filp->f_mapping;
>> -                       page->index = pgoff + i;
>> -               }
>> +               if (pgmap->align > PAGE_SIZE)
>> +                       set_compound_mapping(vmf, pfn, fault_size, filp->f_mapping);
>> +               else
>> +                       set_page_mapping(vmf, pfn, fault_size, filp->f_mapping);
>>         }
>>         dax_read_unlock(id);
>>
>> @@ -426,6 +455,9 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>>         }
>>
>>         pgmap->type = MEMORY_DEVICE_GENERIC;
>> +       if (dev_dax->align > PAGE_SIZE)
>> +               pgmap->align = dev_dax->align;
> 
> Just needs updates for whatever renames you do for the "compound
> geometry" terminology rather than subtle side effects of "align".
> 
> Other than that, looks good to me.
> 
OK, will do.

Thanks!

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages
  2021-06-02  1:05   ` Dan Williams
@ 2021-06-07 15:21     ` Joao Martins
  2021-06-07 19:22       ` Dan Williams
  0 siblings, 1 reply; 23+ messages in thread
From: Joao Martins @ 2021-06-07 15:21 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, Matthew Wilcox, Jason Gunthorpe, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, nvdimm

On 6/2/21 2:05 AM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Much like hugetlbfs or THPs, treat device pagemaps with
>> compound pages like the rest of GUP handling of compound pages.
>>
> 
> How about:
> 
> "Use try_grab_compound_head() for device-dax GUP when configured with
> a compound pagemap."
> 
Yeap, a bit clearer indeed.

>> Rather than incrementing the refcount every 4K, we record
>> all sub pages and increment by @refs amount *once*.
> 
> "Rather than incrementing the refcount for each page, do one atomic
> addition for all the pages to be pinned."
> 
ACK.

>>
>> Performance measured by gup_benchmark improves considerably
>> get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:
>>
>>  $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
>> (get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
>> (pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
>> [altmap]
>> (get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
>> (pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms
> 
> Hmm what is altmap representing here? The altmap case does not support
> compound geometry, 

It does support compound geometry and so we use compound pages with altmap case.
What altmap doesn't support is the memory savings in the vmemmap that can be
done when using compound pages. That's what is represented here.

> so this last test is comparing pinning this amount
> of memory without compound pages where the memmap is in PMEM to the
> speed *with* compound pages and the memmap in DRAM?
> 
The test compares pinning this amount of memory with compound pages placed
in PMEM and in DRAM. It just exposes just how ineficient this can get if huge pages aren't
represented with compound pages.

>>
>>  $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
>> (get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
>> (pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
>> [altmap with -m 127004]
>> (get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
>> (pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/gup.c | 52 ++++++++++++++++++++++++++++++++--------------------
>>  1 file changed, 32 insertions(+), 20 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index b3e647c8b7ee..514f12157a0f 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2159,31 +2159,54 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>  }
>>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>>
>> +
>> +static int record_subpages(struct page *page, unsigned long addr,
>> +                          unsigned long end, struct page **pages)
>> +{
>> +       int nr;
>> +
>> +       for (nr = 0; addr != end; addr += PAGE_SIZE)
>> +               pages[nr++] = page++;
>> +
>> +       return nr;
>> +}
>> +
>>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>                              unsigned long end, unsigned int flags,
>>                              struct page **pages, int *nr)
>>  {
>> -       int nr_start = *nr;
>> +       int refs, nr_start = *nr;
>>         struct dev_pagemap *pgmap = NULL;
>>
>>         do {
>> -               struct page *page = pfn_to_page(pfn);
>> +               struct page *head, *page = pfn_to_page(pfn);
>> +               unsigned long next;
>>
>>                 pgmap = get_dev_pagemap(pfn, pgmap);
>>                 if (unlikely(!pgmap)) {
>>                         undo_dev_pagemap(nr, nr_start, flags, pages);
>>                         return 0;
>>                 }
>> -               SetPageReferenced(page);
>> -               pages[*nr] = page;
>> -               if (unlikely(!try_grab_page(page, flags))) {
>> -                       undo_dev_pagemap(nr, nr_start, flags, pages);
>> +
>> +               head = compound_head(page);
>> +               next = PageCompound(head) ? end : addr + PAGE_SIZE;
> 
> This looks a tad messy, and makes assumptions that upper layers are
> not sending this routine multiple huge pages to map. next should be
> set to the next compound page, not end.

Although for devmap (and same could be said for hugetlbfs), __gup_device_huge() (as called
by __gup_device_huge_{pud,pmd}) would only ever be called on a compound page which
represents the same level, as opposed to many compound pages i.e. @end already represents
the next compound page of the PMD or PUD level.

But of course, should we represent devmap pages in geometries other than the values of
hpagesize/align other than PMD or PUD size then it's true that relying on @end value being
next compound page is fragile. But so as the rest of the surrounding code.

> 
>> +               refs = record_subpages(page, addr, next, pages + *nr);
>> +
>> +               SetPageReferenced(head);
>> +               head = try_grab_compound_head(head, refs, flags);
>> +               if (!head) {
>> +                       if (PageCompound(head)) {
> 
> @head is NULL here, I think you wanted to rename the result of
> try_grab_compound_head() to something like pinned_head so that you
> don't undo the work you did above. 

Yes. pinned_head is what I actually should have written. Let me fix that.

> However I feel like there's one too
> PageCompund() checks.
> 

I agree, but I am not fully sure how I can remove them :(

The previous approach was to separate the logic into two distinct helpers namely
__gup_device_huge() and __gup_device_compound_huge(). But that sort of special casing
wasn't a good idea, so I tried merging both cases in __gup_device_huge() solely
differentiating on PageCompound().

I could make this slightly less bad by moving the error case PageCompound checks to
undo_dev_pagemap() and record_subpages().

But we still have the pagemap refcount to be taken until your other series removes the
need for it. So perhaps I should place the remaining PageCompound based check inside
record_subpages to accomodate the PAGE_SIZE geometry case (similarly hinted by Jason in
the previous version but that I didn't fully address).

How does the above sound?

longterm once we stop having devmap use non compound struct pages on PMDs/PUDs and the
pgmap refcount on gup is removed then perhaps we can move to existing regular huge page
path that is not devmap specific.

> 
>> +                               ClearPageReferenced(head);
>> +                               put_dev_pagemap(pgmap);
>> +                       } else {
>> +                               undo_dev_pagemap(nr, nr_start, flags, pages);
>> +                       }
>>                         return 0;
>>                 }
>> -               (*nr)++;
>> -               pfn++;
>> -       } while (addr += PAGE_SIZE, addr != end);
>> +               *nr += refs;
>> +               pfn += refs;
>> +       } while (addr += (refs << PAGE_SHIFT), addr != end);
>>
>>         if (pgmap)
>>                 put_dev_pagemap(pgmap);
>> @@ -2243,17 +2266,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
>>  }
>>  #endif
>>
>> -static int record_subpages(struct page *page, unsigned long addr,
>> -                          unsigned long end, struct page **pages)
>> -{
>> -       int nr;
>> -
>> -       for (nr = 0; addr != end; addr += PAGE_SIZE)
>> -               pages[nr++] = page++;
>> -
>> -       return nr;
>> -}
>> -
>>  #ifdef CONFIG_ARCH_HAS_HUGEPD
>>  static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
>>                                       unsigned long sz)
>> --
>> 2.17.1
>>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages
  2021-06-07 15:21     ` Joao Martins
@ 2021-06-07 19:22       ` Dan Williams
  0 siblings, 0 replies; 23+ messages in thread
From: Dan Williams @ 2021-06-07 19:22 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, Matthew Wilcox, Jason Gunthorpe, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, nvdimm

On Mon, Jun 7, 2021 at 8:22 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 6/2/21 2:05 AM, Dan Williams wrote:
> > On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> Much like hugetlbfs or THPs, treat device pagemaps with
> >> compound pages like the rest of GUP handling of compound pages.
> >>
> >
> > How about:
> >
> > "Use try_grab_compound_head() for device-dax GUP when configured with
> > a compound pagemap."
> >
> Yeap, a bit clearer indeed.
>
> >> Rather than incrementing the refcount every 4K, we record
> >> all sub pages and increment by @refs amount *once*.
> >
> > "Rather than incrementing the refcount for each page, do one atomic
> > addition for all the pages to be pinned."
> >
> ACK.
>
> >>
> >> Performance measured by gup_benchmark improves considerably
> >> get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:
> >>
> >>  $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
> >> (get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
> >> (pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
> >> [altmap]
> >> (get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
> >> (pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms
> >
> > Hmm what is altmap representing here? The altmap case does not support
> > compound geometry,
>
> It does support compound geometry and so we use compound pages with altmap case.
> What altmap doesn't support is the memory savings in the vmemmap that can be
> done when using compound pages. That's what is represented here.

Ah, I missed that detail, might be good to mention this in the
Documentation/vm/ overview doc for this capability.

>
> > so this last test is comparing pinning this amount
> > of memory without compound pages where the memmap is in PMEM to the
> > speed *with* compound pages and the memmap in DRAM?
> >
> The test compares pinning this amount of memory with compound pages placed
> in PMEM and in DRAM. It just exposes just how ineficient this can get if huge pages aren't
> represented with compound pages.

Got it.

>
> >>
> >>  $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
> >> (get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
> >> (pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
> >> [altmap with -m 127004]
> >> (get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
> >> (pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms
> >>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >>  mm/gup.c | 52 ++++++++++++++++++++++++++++++++--------------------
> >>  1 file changed, 32 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index b3e647c8b7ee..514f12157a0f 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> >> @@ -2159,31 +2159,54 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >>  }
> >>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
> >>
> >> +
> >> +static int record_subpages(struct page *page, unsigned long addr,
> >> +                          unsigned long end, struct page **pages)
> >> +{
> >> +       int nr;
> >> +
> >> +       for (nr = 0; addr != end; addr += PAGE_SIZE)
> >> +               pages[nr++] = page++;
> >> +
> >> +       return nr;
> >> +}
> >> +
> >>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> >>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> >>                              unsigned long end, unsigned int flags,
> >>                              struct page **pages, int *nr)
> >>  {
> >> -       int nr_start = *nr;
> >> +       int refs, nr_start = *nr;
> >>         struct dev_pagemap *pgmap = NULL;
> >>
> >>         do {
> >> -               struct page *page = pfn_to_page(pfn);
> >> +               struct page *head, *page = pfn_to_page(pfn);
> >> +               unsigned long next;
> >>
> >>                 pgmap = get_dev_pagemap(pfn, pgmap);
> >>                 if (unlikely(!pgmap)) {
> >>                         undo_dev_pagemap(nr, nr_start, flags, pages);
> >>                         return 0;
> >>                 }
> >> -               SetPageReferenced(page);
> >> -               pages[*nr] = page;
> >> -               if (unlikely(!try_grab_page(page, flags))) {
> >> -                       undo_dev_pagemap(nr, nr_start, flags, pages);
> >> +
> >> +               head = compound_head(page);
> >> +               next = PageCompound(head) ? end : addr + PAGE_SIZE;
> >
> > This looks a tad messy, and makes assumptions that upper layers are
> > not sending this routine multiple huge pages to map. next should be
> > set to the next compound page, not end.
>
> Although for devmap (and same could be said for hugetlbfs), __gup_device_huge() (as called
> by __gup_device_huge_{pud,pmd}) would only ever be called on a compound page which
> represents the same level, as opposed to many compound pages i.e. @end already represents
> the next compound page of the PMD or PUD level.
>
> But of course, should we represent devmap pages in geometries other than the values of
> hpagesize/align other than PMD or PUD size then it's true that relying on @end value being
> next compound page is fragile. But so as the rest of the surrounding code.

Ok, for now maybe a:

/* @end is assumed to be limited at most 1 compound page */

...would remind whoever refactors this later about the assumption.

>
> >
> >> +               refs = record_subpages(page, addr, next, pages + *nr);
> >> +
> >> +               SetPageReferenced(head);
> >> +               head = try_grab_compound_head(head, refs, flags);
> >> +               if (!head) {
> >> +                       if (PageCompound(head)) {
> >
> > @head is NULL here, I think you wanted to rename the result of
> > try_grab_compound_head() to something like pinned_head so that you
> > don't undo the work you did above.
>
> Yes. pinned_head is what I actually should have written. Let me fix that.
>
> > However I feel like there's one too
> > PageCompund() checks.
> >
>
> I agree, but I am not fully sure how I can remove them :(

If you fix the bug above that's sufficient for me, I may be wishing
for something more pretty but is not possible in practice...

>
> The previous approach was to separate the logic into two distinct helpers namely
> __gup_device_huge() and __gup_device_compound_huge(). But that sort of special casing
> wasn't a good idea, so I tried merging both cases in __gup_device_huge() solely
> differentiating on PageCompound().
>
> I could make this slightly less bad by moving the error case PageCompound checks to
> undo_dev_pagemap() and record_subpages().
>
> But we still have the pagemap refcount to be taken until your other series removes the
> need for it. So perhaps I should place the remaining PageCompound based check inside
> record_subpages to accomodate the PAGE_SIZE geometry case (similarly hinted by Jason in
> the previous version but that I didn't fully address).
>
> How does the above sound?

Sounds worth a try, but not a hard requirement for this to move
forward from my perspective.

>
> longterm once we stop having devmap use non compound struct pages on PMDs/PUDs and the
> pgmap refcount on gup is removed then perhaps we can move to existing regular huge page
> path that is not devmap specific.
>

Ok.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-06-07 13:48     ` Joao Martins
@ 2021-06-07 19:32       ` Dan Williams
  2021-06-14 18:41         ` Joao Martins
  0 siblings, 1 reply; 23+ messages in thread
From: Dan Williams @ 2021-06-07 19:32 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Linux NVDIMM, Ira Weiny, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On Mon, Jun 7, 2021 at 6:49 AM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
> > Given all of the above I'm wondering if there should be a new
> > "compound specific" flavor of this routine rather than trying to
> > cleverly inter mingle the old path with the new. This is easier
> > because you have already done the work in previous patches to break
> > this into helpers. So just have memmap_init_zone_device() do it the
> > "old" way and memmap_init_compound() handle all the tail page init +
> > optimizations.
> >
> I can separate it out, should be easier to follow.
>
> Albeit just a note, I think memmap_init_compound() should be the new normal as metadata
> more accurately represents what goes on the page tables. That's regardless of
> vmemmap-based gains, and hence why my train of thought was to not separate it.
>
> After this series, all devmap pages where @geometry matches @align will have compound
> pages be used instead. And we enable that in device-dax as first user (in the next patch).
> altmap or not so far just differentiates on the uniqueness of struct pages as the former
> doesn't reuse base pages that only contain tail pages and consequently makes us initialize
> all tail struct pages.

I think combining the cases into a common central routine makes the
code that much harder to maintain. A small duplication cost is worth
it in my opinion to help readability / maintainability.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
       [not found]               ` <e22ef769-5eb2-1812-497f-6d069d632cd0@oracle.com>
@ 2021-06-07 21:00                 ` Joao Martins
  2021-06-07 21:57                   ` Dan Williams
  0 siblings, 1 reply; 23+ messages in thread
From: Joao Martins @ 2021-06-07 21:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Ira Weiny, Matthew Wilcox, Jason Gunthorpe, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Linux NVDIMM



On 6/7/21 9:47 PM, Joao Martins wrote:
> On 6/7/21 9:17 PM, Dan Williams wrote:
>> On Tue, May 18, 2021 at 10:28 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>
>>> On 5/5/21 11:36 PM, Joao Martins wrote:
>>>> On 5/5/21 11:20 PM, Dan Williams wrote:
>>>>> On Wed, May 5, 2021 at 12:50 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>> On 5/5/21 7:44 PM, Dan Williams wrote:
>>>>>>> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>>>>>>> index b46f63dcaed3..bb28d82dda5e 100644
>>>>>>>> --- a/include/linux/memremap.h
>>>>>>>> +++ b/include/linux/memremap.h
>>>>>>>> @@ -114,6 +114,7 @@ struct dev_pagemap {
>>>>>>>>         struct completion done;
>>>>>>>>         enum memory_type type;
>>>>>>>>         unsigned int flags;
>>>>>>>> +       unsigned long align;
>>>>>>>
>>>>>>> I think this wants some kernel-doc above to indicate that non-zero
>>>>>>> means "use compound pages with tail-page dedup" and zero / PAGE_SIZE
>>>>>>> means "use non-compound base pages".
>>>
>>> [...]
>>>
>>>>>>> The non-zero value must be
>>>>>>> PAGE_SIZE, PMD_PAGE_SIZE or PUD_PAGE_SIZE.
>>>>>>> Hmm, maybe it should be an
>>>>>>> enum:
>>>>>>>
>>>>>>> enum devmap_geometry {
>>>>>>>     DEVMAP_PTE,
>>>>>>>     DEVMAP_PMD,
>>>>>>>     DEVMAP_PUD,
>>>>>>> }
>>>>>>>
>>>>>> I suppose a converter between devmap_geometry and page_size would be needed too? And maybe
>>>>>> the whole dax/nvdimm align values change meanwhile (as a followup improvement)?
>>>>>
>>>>> I think it is ok for dax/nvdimm to continue to maintain their align
>>>>> value because it should be ok to have 4MB align if the device really
>>>>> wanted. However, when it goes to map that alignment with
>>>>> memremap_pages() it can pick a mode. For example, it's already the
>>>>> case that dax->align == 1GB is mapped with DEVMAP_PTE today, so
>>>>> they're already separate concepts that can stay separate.
>>>>>
>>>> Gotcha.
>>>
>>> I am reconsidering part of the above. In general, yes, the meaning of devmap @align
>>> represents a slightly different variation of the device @align i.e. how the metadata is
>>> laid out **but** regardless of what kind of page table entries we use vmemmap.
>>>
>>> By using DEVMAP_PTE/PMD/PUD we might end up 1) duplicating what nvdimm/dax already
>>> validates in terms of allowed device @align values (i.e. PAGE_SIZE, PMD_SIZE and PUD_SIZE)
>>> 2) the geometry of metadata is very much tied to the value we pick to @align at namespace
>>> provisioning -- not the "align" we might use at mmap() perhaps that's what you referred
>>> above? -- and 3) the value of geometry actually derives from dax device @align because we
>>> will need to create compound pages representing a page size of @align value.
>>>
>>> Using your example above: you're saying that dax->align == 1G is mapped with DEVMAP_PTEs,
>>> in reality the vmemmap is populated with PMDs/PUDs page tables (depending on what archs
>>> decide to do at vmemmap_populate()) and uses base pages as its metadata regardless of what
>>> device @align. In reality what we want to convey in @geometry is not page table sizes, but
>>> just the page size used for the vmemmap of the dax device.
>>
>> Good point, the names "PTE, PMD, PUD" imply the hardware mapping size,
>> not the software compound page size.
>>
>>> Additionally, limiting its
>>> value might not be desirable... if tomorrow Linux for some arch supports dax/nvdimm
>>> devices with 4M align or 64K align, the value of @geometry will have to reflect the 4M to
>>> create compound pages of order 10 for the said vmemmap.
>>>
>>> I am going to wait until you finish reviewing the remaining four patches of this series,
>>> but maybe this is a simple misnomer (s/align/geometry/) with a comment but without
>>> DEVMAP_{PTE,PMD,PUD} enum part? Or perhaps its own struct with a value and enum a
>>> setter/getter to audit its value? Thoughts?
>>
>> I do see what you mean about the confusion DEVMAP_{PTE,PMD,PUD}
>> introduces, but I still think the device-dax align and the
>> organization of the 'struct page' metadata are distinct concepts. So
>> I'm happy with any color of the bikeshed as long as the 2 concepts are
>> distinct. How about calling it  "compound_page_order"? Open to other
>> ideas...
>>
> I actually like the name of @geometry. The only thing better would be @vmemmap_geometry
> solely because it makes it clear that its the vmemmap that we are talking about -- but
> might be unnecssarily verbose. And I still agree that is separate concept that should be
> named differently *at least*.
> 
> But naming aside, I was trying to get at was to avoid a second geometry value validation
> i.e. to be validated the value and set with a value such as DEVMAP_PTE, DEVMAP_PMD and
> DEVMAP_PUD. 

Sorry my english keeps getting broken, I meant this instead:

But naming aside, what I am trying to get at is to remove the second geometry value
validation i.e. for @geometry to not be validated a second time to be set to DEVMAP_PTE,
DEVMAP_PMD or DEVMAP_PUD.

> That to me sounds a little redundant, when the geometry value depends on what
> align is going to be used from. Here my metnion of @align refers to what's used to create
> the dax device, not the mmap() align [which can be lower than the device one]. The dax
> device align is the one used to decide whether to use PTEs, PMDs or PUDs at dax fault handler.
> 
> So separate concepts, but still its value dependent on one another. At least unless we
> want to allow geometry values different than those set by --align as Jane suggested.
> 

And I should add:

I can maintain the DEVMAP_* enum values, but then these will need to be changed in tandem
anytime a new @align value is supported. Or instead we use the name @geometry albeit with
still as an unsigned long type . Or rather than an unsigned long perhaps making another
type and its value obtained/changed with getter/setter.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-06-07 21:00                 ` Joao Martins
@ 2021-06-07 21:57                   ` Dan Williams
  0 siblings, 0 replies; 23+ messages in thread
From: Dan Williams @ 2021-06-07 21:57 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Ira Weiny, Matthew Wilcox, Jason Gunthorpe, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Linux NVDIMM

On Mon, Jun 7, 2021 at 2:02 PM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
> > But naming aside, I was trying to get at was to avoid a second geometry value validation
> > i.e. to be validated the value and set with a value such as DEVMAP_PTE, DEVMAP_PMD and
> > DEVMAP_PUD.
>
> Sorry my english keeps getting broken, I meant this instead:
>
> But naming aside, what I am trying to get at is to remove the second geometry value
> validation i.e. for @geometry to not be validated a second time to be set to DEVMAP_PTE,
> DEVMAP_PMD or DEVMAP_PUD.
>
> > That to me sounds a little redundant, when the geometry value depends on what
> > align is going to be used from. Here my metnion of @align refers to what's used to create
> > the dax device, not the mmap() align [which can be lower than the device one]. The dax
> > device align is the one used to decide whether to use PTEs, PMDs or PUDs at dax fault handler.
> >
> > So separate concepts, but still its value dependent on one another. At least unless we
> > want to allow geometry values different than those set by --align as Jane suggested.
> >
>
> And I should add:
>
> I can maintain the DEVMAP_* enum values, but then these will need to be changed in tandem
> anytime a new @align value is supported. Or instead we use the name @geometry albeit with
> still as an unsigned long type . Or rather than an unsigned long perhaps making another
> type and its value obtained/changed with getter/setter.

No, feel free to drop the enum and use an explicit "geometry" value
for the vmemmap.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-06-07 19:32       ` Dan Williams
@ 2021-06-14 18:41         ` Joao Martins
  2021-06-14 23:07           ` Dan Williams
  0 siblings, 1 reply; 23+ messages in thread
From: Joao Martins @ 2021-06-14 18:41 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Linux NVDIMM, Ira Weiny, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton



On 6/7/21 8:32 PM, Dan Williams wrote:
> On Mon, Jun 7, 2021 at 6:49 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> [..]
>>> Given all of the above I'm wondering if there should be a new
>>> "compound specific" flavor of this routine rather than trying to
>>> cleverly inter mingle the old path with the new. This is easier
>>> because you have already done the work in previous patches to break
>>> this into helpers. So just have memmap_init_zone_device() do it the
>>> "old" way and memmap_init_compound() handle all the tail page init +
>>> optimizations.
>>>
>> I can separate it out, should be easier to follow.
>>
>> Albeit just a note, I think memmap_init_compound() should be the new normal as metadata
>> more accurately represents what goes on the page tables. That's regardless of
>> vmemmap-based gains, and hence why my train of thought was to not separate it.
>>
>> After this series, all devmap pages where @geometry matches @align will have compound
>> pages be used instead. And we enable that in device-dax as first user (in the next patch).
>> altmap or not so far just differentiates on the uniqueness of struct pages as the former
>> doesn't reuse base pages that only contain tail pages and consequently makes us initialize
>> all tail struct pages.
> 
> I think combining the cases into a common central routine makes the
> code that much harder to maintain. A small duplication cost is worth
> it in my opinion to help readability / maintainability.
> 
I am addressing this comment and taking a step back. By just moving the tail page init to
memmap_init_compound() this gets a lot more readable. Albeit now I think having separate
top-level loops over pfns, doesn't bring much improvement there.

Here's what I have by moving just tails init to a separate routine. See your original
suggestion after the scissors mark. I have a slight inclination towards the first one, but
no really strong preference. Thoughts?

[...]

static void __ref memmap_init_compound(struct page *page, unsigned long pfn,
                                        unsigned long zone_idx, int nid,
                                        struct dev_pagemap *pgmap,
                                        unsigned long nr_pages)
{
        unsigned int order_align = order_base_2(nr_pages);
        unsigned long i;

        __SetPageHead(page);

        for (i = 1; i < nr_pages; i++) {
                __init_zone_device_page(page + i, pfn + i, zone_idx,
                                        nid, pgmap);
                prep_compound_tail(page, i);

                /*
                 * The first and second tail pages need to
                 * initialized first, hence the head page is
                 * prepared last.
                 */
                if (i == 2)
                        prep_compound_head(page, order_align);
        }
}

void __ref memmap_init_zone_device(struct zone *zone,
                                   unsigned long start_pfn,
                                   unsigned long nr_pages,
                                   struct dev_pagemap *pgmap)
{
        unsigned long pfn, end_pfn = start_pfn + nr_pages;
        struct pglist_data *pgdat = zone->zone_pgdat;
        struct vmem_altmap *altmap = pgmap_altmap(pgmap);
        unsigned long pfns_per_compound = pgmap_pfn_geometry(pgmap);
        unsigned long zone_idx = zone_idx(zone);
        unsigned long start = jiffies;
        int nid = pgdat->node_id;

        if (WARN_ON_ONCE(!pgmap || zone_idx(zone) != ZONE_DEVICE))
                return;

        /*
         * The call to memmap_init_zone should have already taken care
         * of the pages reserved for the memmap, so we can just jump to
         * the end of that region and start processing the device pages.
         */
        if (altmap) {
                start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
                nr_pages = end_pfn - start_pfn;
        }

        for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
                struct page *page = pfn_to_page(pfn);

                __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);

                if (pfns_per_compound == 1)
                        continue;

                memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
                                     pfns_per_compound);
        }

        pr_info("%s initialised %lu pages in %ums\n", __func__,
                nr_pages, jiffies_to_msecs(jiffies - start));
}


[...]

--->8-----
Whereas your original suggestion would look more like this:

[...]

static void __ref memmap_init_compound(unsigned long zone_idx, int nid,
                                        struct dev_pagemap *pgmap,
                                        unsigned long start_pfn,
                                        unsigned long end_pfn)
{
        unsigned long pfns_per_compound = pgmap_pfn_geometry(pgmap);
        unsigned int order_align = order_base_2(pfns_per_compound);
        unsigned long i;

        for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
                struct page *page = pfn_to_page(pfn);

                __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);

                __SetPageHead(page);

                for (i = 1; i < pfns_per_compound; i++) {
                        __init_zone_device_page(page + i, pfn + i, zone_idx,
                                                nid, pgmap);
                        prep_compound_tail(page, i);

                        /*
                         * The first and second tail pages need to
                         * initialized first, hence the head page is
                         * prepared last.
                         */
                        if (i == 2)
                                prep_compound_head(page, order_align);
                }
        }
}

static void __ref memmap_init_base(unsigned long zone_idx, int nid,
                                         struct dev_pagemap *pgmap,
                                         unsigned long start_pfn,
                                         unsigned long end_pfn)
{
        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                struct page *page = pfn_to_page(pfn);

                __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
        }
}

void __ref memmap_init_zone_device(struct zone *zone,
                                   unsigned long start_pfn,
                                   unsigned long nr_pages,
                                   struct dev_pagemap *pgmap)
{
        unsigned long pfn, end_pfn = start_pfn + nr_pages;
        struct pglist_data *pgdat = zone->zone_pgdat;
        struct vmem_altmap *altmap = pgmap_altmap(pgmap);
        bool compound = pgmap_geometry(pgmap) > PAGE_SIZE;
        unsigned long zone_idx = zone_idx(zone);
        unsigned long start = jiffies;
        int nid = pgdat->node_id;

        if (WARN_ON_ONCE(!pgmap || zone_idx(zone) != ZONE_DEVICE))
                return;

        /*
         * The call to memmap_init_zone should have already taken care
         * of the pages reserved for the memmap, so we can just jump to
         * the end of that region and start processing the device pages.
         */
        if (altmap) {
                start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
                nr_pages = end_pfn - start_pfn;
        }

        if (compound)
                memmap_init_compound(zone_idx, nid, pgmap, start_pfn, end_pfn);
        else
                memmap_init_base(zone_idx, nid, pgmap, start_pfn, end_pfn);

        pr_info("%s initialised %lu pages in %ums\n", __func__,
                nr_pages, jiffies_to_msecs(jiffies - start));
}


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for compound pagemaps
  2021-06-14 18:41         ` Joao Martins
@ 2021-06-14 23:07           ` Dan Williams
  0 siblings, 0 replies; 23+ messages in thread
From: Dan Williams @ 2021-06-14 23:07 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Linux NVDIMM, Ira Weiny, Matthew Wilcox,
	Jason Gunthorpe, Jane Chu, Muchun Song, Mike Kravetz,
	Andrew Morton

On Mon, Jun 14, 2021 at 11:42 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
>
>
> On 6/7/21 8:32 PM, Dan Williams wrote:
> > On Mon, Jun 7, 2021 at 6:49 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> > [..]
> >>> Given all of the above I'm wondering if there should be a new
> >>> "compound specific" flavor of this routine rather than trying to
> >>> cleverly inter mingle the old path with the new. This is easier
> >>> because you have already done the work in previous patches to break
> >>> this into helpers. So just have memmap_init_zone_device() do it the
> >>> "old" way and memmap_init_compound() handle all the tail page init +
> >>> optimizations.
> >>>
> >> I can separate it out, should be easier to follow.
> >>
> >> Albeit just a note, I think memmap_init_compound() should be the new normal as metadata
> >> more accurately represents what goes on the page tables. That's regardless of
> >> vmemmap-based gains, and hence why my train of thought was to not separate it.
> >>
> >> After this series, all devmap pages where @geometry matches @align will have compound
> >> pages be used instead. And we enable that in device-dax as first user (in the next patch).
> >> altmap or not so far just differentiates on the uniqueness of struct pages as the former
> >> doesn't reuse base pages that only contain tail pages and consequently makes us initialize
> >> all tail struct pages.
> >
> > I think combining the cases into a common central routine makes the
> > code that much harder to maintain. A small duplication cost is worth
> > it in my opinion to help readability / maintainability.
> >
> I am addressing this comment and taking a step back. By just moving the tail page init to
> memmap_init_compound() this gets a lot more readable. Albeit now I think having separate
> top-level loops over pfns, doesn't bring much improvement there.
>
> Here's what I have by moving just tails init to a separate routine. See your original
> suggestion after the scissors mark. I have a slight inclination towards the first one, but
> no really strong preference. Thoughts?

I like your "first one" too.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-06-14 23:07 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
2021-03-25 23:09 ` [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
2021-03-25 23:09 ` [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
2021-03-25 23:09 ` [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
2021-03-25 23:09 ` [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
     [not found]   ` <CAPcyv4gs_rHL7FPqyQEb3yT4jrv8Wo_xA2ojKsppoBfmDocq8A@mail.gmail.com>
     [not found]     ` <cd1c9849-8660-dbdc-718a-aa4ba5d48c01@oracle.com>
     [not found]       ` <CAPcyv4jG8+S6xJyp=1S2=dpit0Hs2+HgGwpWeRROCRuJnQYAxQ@mail.gmail.com>
     [not found]         ` <56a3e271-4ef8-ba02-639e-fd7fe7de7e36@oracle.com>
     [not found]           ` <8c922a58-c901-1ad9-5d19-1182bd6dea1e@oracle.com>
     [not found]             ` <CAPcyv4j_PdzytEeabe95FrUiNVNobdJRvUE9M9j0krKQ1defBg@mail.gmail.com>
     [not found]               ` <e22ef769-5eb2-1812-497f-6d069d632cd0@oracle.com>
2021-06-07 21:00                 ` Joao Martins
2021-06-07 21:57                   ` Dan Williams
2021-03-25 23:09 ` [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
2021-03-25 23:09 ` [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages() Joao Martins
2021-03-25 23:09 ` [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps Joao Martins
2021-03-25 23:09 ` [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD " Joao Martins
2021-03-25 23:09 ` [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for " Joao Martins
     [not found]   ` <CAPcyv4gtSqfmuAaX9cs63OvLkf-h4B_5fPiEnM9p9cqLZztXpg@mail.gmail.com>
2021-06-07 13:48     ` Joao Martins
2021-06-07 19:32       ` Dan Williams
2021-06-14 18:41         ` Joao Martins
2021-06-14 23:07           ` Dan Williams
2021-03-25 23:09 ` [PATCH v1 10/11] device-dax: compound pagemap support Joao Martins
     [not found]   ` <CAPcyv4jeY0K7ciWeCLjxXmiWs7NNeM-_zEdZ2XAdYnyZc9PvWA@mail.gmail.com>
2021-06-07 13:59     ` Joao Martins
2021-03-25 23:09 ` [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages Joao Martins
2021-06-02  1:05   ` Dan Williams
2021-06-07 15:21     ` Joao Martins
2021-06-07 19:22       ` Dan Williams
2021-04-01  9:38 ` [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).