* [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
@ 2020-12-08 17:28 Joao Martins
  2020-12-08 17:28 ` [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages Joao Martins
                   ` (12 more replies)
  0 siblings, 13 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-08 17:28 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Hey,

This small series attempts to minimize 'struct page' overhead by
pursuing a similar approach to Muchun Song's series "Free some vmemmap
pages of hugetlb page"[0], but applied to devmap/ZONE_DEVICE.

[0] https://lore.kernel.org/linux-mm/20201130151838.11208-1-songmuchun@bytedance.com/

The link above describes it quite nicely, but the idea is to reuse tail
page vmemmap areas, in particular the areas which only describe tail
pages. A vmemmap page describes 64 struct pages, so the first vmemmap
page for a given ZONE_DEVICE range contains the head page and 63 tail
pages. The second vmemmap page contains only tail pages, and that's what
gets reused across the rest of the subsection/section. The bigger the
page size, the bigger the savings (2M hpage -> save 6 vmemmap pages;
1G hpage -> save 4094 vmemmap pages).

In terms of savings, per 1TB of memory, the struct page cost would go
down with a compound pagemap:

* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
* with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)
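
For reference, the arithmetic behind these numbers, assuming 4K base
pages and a 64-byte struct page (i.e. one 4K vmemmap page per 64 struct
pages):

	2M page: 512 struct pages * 64B = 32K  -> 8 vmemmap pages;
	         keeping the head vmemmap page plus one tail vmemmap page
	         and reusing the latter saves 6 pages (8K kept per 2M).
	1G page: 262144 struct pages * 64B = 16M -> 4096 vmemmap pages;
	         keeping the first two saves 4094 pages (8K kept per 1G).
	Per 1TB: 2M pages -> 524288 * 8K = 4G of memmap (vs 16G);
	         1G pages -> 1024 * 8K = 8M of memmap (vs 16G).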

Along the way I've extended it past 'struct page' overhead, *trying* to
address a few performance issues we knew about for pmem, specifically in
the {pin,get}_user_pages* function family with device-dax VMAs, which
are really slow even for the fast variants. THP is great on the -fast
variants, but everything except hugetlbfs performs rather poorly on
non-fast gup.

So to summarize what the series does:

Patches 1-5: Much like Muchun's series, we reuse tail page areas across a
given page size (namely @align, as it is referred to by the remaining
memremap/dax code) and enable memremap to initialize the ZONE_DEVICE
pages as compound pages of a given @align order. The main difference,
though, is that contrary to the hugetlbfs series there's no pre-existing
vmemmap for the area, because we are onlining it. IOW there's no freeing
of pages of an already initialized vmemmap as in the hugetlbfs case,
which simplifies the logic (besides not being arch-specific). After
these, there's a quite visible improvement in region bootstrap of the
pmem memmap, given that we initialize fewer struct pages depending on
the page size.

    NVDIMM namespace bootstrap improves from ~750ms to ~190ms/<=1ms on emulated NVDIMMs
    with 2M and 1G respectively. A proportionally similar gain is observed when running
    on actual NVDIMMs.

Patches 6-8: Optimize grabbing/releasing page refcounts given that we are
working with compound pages, i.e. we do 1 increment/decrement to the head
page for a given set of N subpages as opposed to N individual writes.
{get,pin}_user_pages_fast() for ZONE_DEVICE with a compound pagemap
consequently improves considerably, and unpin_user_pages() improves as
well when passed a set of consecutive pages:

                                           before          after
    (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
    (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
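
In essence (patch 6 below has the full change), the device fast-gup path
records the whole run of subpages first and then takes all the
references with a single head-page update, roughly:

	page = head + ((addr & (sz - 1)) >> PAGE_SHIFT);
	refs = record_subpages(page, addr, end, pages);   /* N entries */
	head = try_grab_compound_head(head, refs, flags); /* one update */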

The RDMA patch (patch 8/9) demonstrates the improvement for an existing
user. For unpin_user_pages() we have an additional test to demonstrate
the improvement. The test performs MR reg/unreg continuously and
measures its rate for a given period. So essentially ib_umem_get() and
__ib_umem_release() are being stress tested, which at the end of the day
means: pin_user_pages_longterm() and unpin_user_pages() for a
scatterlist:

    Before:
    159 rounds in 5.027 sec: 31617.923 usec / round (device-dax)
    466 rounds in 5.009 sec: 10748.456 usec / round (hugetlbfs)
	        
    After:
    305 rounds in 5.010 sec: 16426.047 usec / round (device-dax)
    1073 rounds in 5.004 sec: 4663.622 usec / round (hugetlbfs)
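
For reference, a minimal sketch of such a reg/unreg loop, assuming
libibverbs; this is not the actual test program, and device open, PD
allocation and the 1G device-dax/hugetlbfs mapping are elided:

	#include <infiniband/verbs.h>
	#include <stdio.h>
	#include <time.h>

	static double now(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return ts.tv_sec + ts.tv_nsec / 1e9;
	}

	/* @buf is the 1G region mmap()ed from device-dax or hugetlbfs */
	static void mr_reg_unreg_bench(struct ibv_pd *pd, void *buf, size_t len)
	{
		double start = now(), elapsed = 0;
		unsigned long rounds = 0;

		do {
			struct ibv_mr *mr;

			/* pins the whole region (pin_user_pages_longterm) */
			mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
			if (!mr)
				break;
			/* unpins it again (unpin_user_pages) */
			ibv_dereg_mr(mr);
			rounds++;
			elapsed = now() - start;
		} while (elapsed < 5.0);

		printf("%lu rounds in %.3f sec: %.3f usec / round\n",
		       rounds, elapsed, elapsed * 1e6 / rounds);
	}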

Patch 9: Improves {pin,get}_user_pages() and their longterm counterparts.
It is very experimental, and I imported most of follow_hugetlb_page(),
except that we do the same trick as gup-fast. While doing the patch I
felt this batching should live in follow_page_mask(), with that being
changed to return a set of pages/something-else when walking over
PMDs/PUDs for THP / devmap pages. This patch then brings the previous
MR reg/unreg test (above) to parity between device-dax and hugetlbfs.

Some of the patches are a little fresh/WIP (especially patches 3 and 9)
and we are still running tests. Hence the RFC, asking for comments and
general direction of the work before continuing.

Patches apply on top of linux-next tag next-20201208 (commit a9e26cb5f261).

Comments and suggestions very much appreciated!

Thanks,
	Joao

Joao Martins (9):
  memremap: add ZONE_DEVICE support for compound pages
  sparse-vmemmap: Consolidate arguments in vmemmap section populate
  sparse-vmemmap: Reuse vmemmap areas for a given page size
  mm/page_alloc: Reuse tail struct pages for compound pagemaps
  device-dax: Compound pagemap support
  mm/gup: Grab head page refcount once for group of subpages
  mm/gup: Decrement head page once for group of subpages
  RDMA/umem: batch page unpin in __ib_mem_release()
  mm: Add follow_devmap_page() for devdax vmas

 drivers/dax/device.c           |  54 ++++++---
 drivers/infiniband/core/umem.c |  25 +++-
 include/linux/huge_mm.h        |   4 +
 include/linux/memory_hotplug.h |  16 ++-
 include/linux/memremap.h       |   2 +
 include/linux/mm.h             |   6 +-
 mm/gup.c                       | 130 ++++++++++++++++-----
 mm/huge_memory.c               | 202 +++++++++++++++++++++++++++++++++
 mm/memory_hotplug.c            |  13 ++-
 mm/memremap.c                  |  13 ++-
 mm/page_alloc.c                |  28 ++++-
 mm/sparse-vmemmap.c            |  97 +++++++++++++---
 mm/sparse.c                    |  16 +--
 13 files changed, 531 insertions(+), 75 deletions(-)

-- 
2.17.1

* [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
  2020-12-08 17:28 [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
@ 2020-12-08 17:28 ` Joao Martins
  2020-12-09  5:59   ` John Hubbard
  2021-02-20  1:24   ` Dan Williams
  2020-12-08 17:28 ` [PATCH RFC 2/9] sparse-vmemmap: Consolidate arguments in vmemmap section populate Joao Martins
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-08 17:28 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Add a new flag for struct dev_pagemap which designates that a pagemap
is described as a set of compound pages or, in other words, that the way
pages are grouped together in the page tables is reflected in how we
describe struct pages. This means that rather than initializing
individual struct pages, we also initialize these struct pages as
compound pages (on x86: 2M or 1G compound pages).

For certain ZONE_DEVICE users, like device-dax, which have a fixed page
size, this creates an opportunity to optimize GUP and GUP-fast walkers,
thus playing the same tricks as hugetlb pages.
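
A sketch of how a ZONE_DEVICE user would opt in (this is essentially
what patch 5 does for device-dax; SZ_2M below just stands in for the
device's fixed page size):

	pgmap->type = MEMORY_DEVICE_GENERIC;
	pgmap->flags |= PGMAP_COMPOUND;
	pgmap->align = SZ_2M;	/* or 1G */
	addr = devm_memremap_pages(dev, pgmap);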

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/memremap.h | 2 ++
 mm/memremap.c            | 8 ++++++--
 mm/page_alloc.c          | 7 +++++++
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 79c49e7f5c30..f8f26b2cc3da 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -90,6 +90,7 @@ struct dev_pagemap_ops {
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
+#define PGMAP_COMPOUND		(1 << 1)
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
@@ -114,6 +115,7 @@ struct dev_pagemap {
 	struct completion done;
 	enum memory_type type;
 	unsigned int flags;
+	unsigned int align;
 	const struct dev_pagemap_ops *ops;
 	void *owner;
 	int nr_range;
diff --git a/mm/memremap.c b/mm/memremap.c
index 16b2fb482da1..287a24b7a65a 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
 				PHYS_PFN(range->start),
 				PHYS_PFN(range_len(range)), pgmap);
-	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
-			- pfn_first(pgmap, range_id));
+	if (pgmap->flags & PGMAP_COMPOUND)
+		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
+			- pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
+	else
+		percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
+				- pfn_first(pgmap, range_id));
 	return 0;
 
 err_add_memory:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eaa227a479e4..9716ecd58e29 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6116,6 +6116,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	unsigned long pfn, end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
+	bool compound = pgmap->flags & PGMAP_COMPOUND;
+	unsigned int align = PHYS_PFN(pgmap->align);
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
@@ -6171,6 +6173,11 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		}
 	}
 
+	if (compound) {
+		for (pfn = start_pfn; pfn < end_pfn; pfn += align)
+			prep_compound_page(pfn_to_page(pfn), order_base_2(align));
+	}
+
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
 		nr_pages, jiffies_to_msecs(jiffies - start));
 }
-- 
2.17.1

* [PATCH RFC 2/9] sparse-vmemmap: Consolidate arguments in vmemmap section populate
  2020-12-08 17:28 [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  2020-12-08 17:28 ` [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages Joao Martins
@ 2020-12-08 17:28 ` Joao Martins
  2020-12-09  6:16   ` John Hubbard
  2021-02-20  1:49   ` Dan Williams
  2020-12-08 17:28 ` [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given mhp_params::align Joao Martins
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-08 17:28 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Replace vmem_altmap with a vmem_context argument. That lets us
express how the vmemmap is going to be initialized, e.g. passing
flags and a page size for reusing pages upon initializing the
vmemmap.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/memory_hotplug.h |  6 +++++-
 include/linux/mm.h             |  2 +-
 mm/memory_hotplug.c            |  3 ++-
 mm/sparse-vmemmap.c            |  6 +++++-
 mm/sparse.c                    | 16 ++++++++--------
 5 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 551093b74596..73f8bcbb58a4 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -81,6 +81,10 @@ struct mhp_params {
 	pgprot_t pgprot;
 };
 
+struct vmem_context {
+	struct vmem_altmap *altmap;
+};
+
 /*
  * Zone resizing functions
  *
@@ -353,7 +357,7 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
 				       unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern int sparse_add_section(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap);
+		unsigned long nr_pages, struct vmem_context *ctx);
 extern void sparse_remove_section(struct mem_section *ms,
 		unsigned long pfn, unsigned long nr_pages,
 		unsigned long map_offset, struct vmem_altmap *altmap);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..2eb44318bb2d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3000,7 +3000,7 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
 
 void *sparse_buffer_alloc(unsigned long size);
 struct page * __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
+		unsigned long nr_pages, int nid, struct vmem_context *ctx);
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 63b2e46b6555..f8870c53fe5e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -313,6 +313,7 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 	unsigned long cur_nr_pages;
 	int err;
 	struct vmem_altmap *altmap = params->altmap;
+	struct vmem_context ctx = { .altmap = params->altmap };
 
 	if (WARN_ON_ONCE(!params->pgprot.pgprot))
 		return -EINVAL;
@@ -341,7 +342,7 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages = min(end_pfn - pfn,
 				   SECTION_ALIGN_UP(pfn + 1) - pfn);
-		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap);
+		err = sparse_add_section(nid, pfn, cur_nr_pages, &ctx);
 		if (err)
 			break;
 		cond_resched();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 16183d85a7d5..bcda68ba1381 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -249,15 +249,19 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 }
 
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_context *ctx)
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
+	struct vmem_altmap *altmap = NULL;
 
 	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
 		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
 		return NULL;
 
+	if (ctx)
+		altmap = ctx->altmap;
+
 	if (vmemmap_populate(start, end, nid, altmap))
 		return NULL;
 
diff --git a/mm/sparse.c b/mm/sparse.c
index 7bd23f9d6cef..47ca494398a7 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -443,7 +443,7 @@ static unsigned long __init section_map_size(void)
 }
 
 struct page __init *__populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_context *ctx)
 {
 	unsigned long size = section_map_size();
 	struct page *map = sparse_buffer_alloc(size);
@@ -648,9 +648,9 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 static struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_context *ctx)
 {
-	return __populate_section_memmap(pfn, nr_pages, nid, altmap);
+	return __populate_section_memmap(pfn, nr_pages, nid, ctx);
 }
 
 static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
@@ -842,7 +842,7 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 }
 
 static struct page * __meminit section_activate(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct vmem_context *ctx)
 {
 	struct mem_section *ms = __pfn_to_section(pfn);
 	struct mem_section_usage *usage = NULL;
@@ -874,9 +874,9 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
 	if (nr_pages < PAGES_PER_SECTION && early_section(ms))
 		return pfn_to_page(pfn);
 
-	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap);
+	memmap = populate_section_memmap(pfn, nr_pages, nid, ctx);
 	if (!memmap) {
-		section_deactivate(pfn, nr_pages, altmap);
+		section_deactivate(pfn, nr_pages, ctx->altmap);
 		return ERR_PTR(-ENOMEM);
 	}
 
@@ -902,7 +902,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  * * -ENOMEM	- Out of memory.
  */
 int __meminit sparse_add_section(int nid, unsigned long start_pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct vmem_context *ctx)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
 	struct mem_section *ms;
@@ -913,7 +913,7 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 	if (ret < 0)
 		return ret;
 
-	memmap = section_activate(nid, start_pfn, nr_pages, altmap);
+	memmap = section_activate(nid, start_pfn, nr_pages, ctx);
 	if (IS_ERR(memmap))
 		return PTR_ERR(memmap);
 
-- 
2.17.1

* [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given mhp_params::align
  2020-12-08 17:28 [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
  2020-12-08 17:28 ` [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages Joao Martins
  2020-12-08 17:28 ` [PATCH RFC 2/9] sparse-vmemmap: Consolidate arguments in vmemmap section populate Joao Martins
@ 2020-12-08 17:28 ` Joao Martins
  2020-12-08 17:38   ` Joao Martins
  2020-12-08 17:28 ` [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given page size Joao Martins
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 67+ messages in thread
From: Joao Martins @ 2020-12-08 17:28 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Introduce a new flag, MEMHP_REUSE_VMEMMAP, which signals that
struct pages are onlined with a given alignment, and should reuse the
tail pages' vmemmap areas. In that circumstance we reuse the PFNs
backing only the tail pages of the subsection, while letting the head
page PFN remain different. This presumes that the backing page structs
are compound pages, such as the case for compound pagemaps (i.e.
ZONE_DEVICE with PGMAP_COMPOUND set).

On 2M compound pagemaps, it lets us save 6 pages out of the 8 PFNs
necessary to describe the subsection's 32K of struct pages we are
onlining. On a 1G compound pagemap it lets us save 4096 pages.

Sections are 128M (or bigger/smaller), and as such when initializing a
compound memory map where we are initializing compound struct pages, we
need to preserve the tail page to be reused across the rest of the areas
for page sizes which are bigger than a section.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
I wonder, rather than separating vmem_context and mhp_params, whether
one should just pick the latter. Albeit semantically the ctx isn't
necessarily parameters, but rather context passed across the onlining of
multiple sections (i.e. multiple calls to populate_section_memmap). Also,
this is internal state which isn't passed to external modules, except
@align and @flags for the page size and for requesting whether to reuse
tail page areas.
---
 include/linux/memory_hotplug.h | 10 ++++
 include/linux/mm.h             |  2 +-
 mm/memory_hotplug.c            | 12 ++++-
 mm/memremap.c                  |  3 ++
 mm/sparse-vmemmap.c            | 93 ++++++++++++++++++++++++++++------
 5 files changed, 103 insertions(+), 17 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 73f8bcbb58a4..e15bb82805a3 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -70,6 +70,10 @@ typedef int __bitwise mhp_t;
  */
 #define MEMHP_MERGE_RESOURCE	((__force mhp_t)BIT(0))
 
+/*
+ */
+#define MEMHP_REUSE_VMEMMAP	((__force mhp_t)BIT(1))
+
 /*
  * Extended parameters for memory hotplug:
  * altmap: alternative allocator for memmap array (optional)
@@ -79,10 +83,16 @@ typedef int __bitwise mhp_t;
 struct mhp_params {
 	struct vmem_altmap *altmap;
 	pgprot_t pgprot;
+	unsigned int align;
+	mhp_t flags;
 };
 
 struct vmem_context {
 	struct vmem_altmap *altmap;
+	mhp_t flags;
+	unsigned int align;
+	void *block;
+	unsigned long block_page;
 };
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2eb44318bb2d..8b0155441835 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3006,7 +3006,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-			    struct vmem_altmap *altmap);
+			    struct vmem_altmap *altmap, void *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
 struct vmem_altmap;
 void *vmemmap_alloc_block_buf(unsigned long size, int node,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f8870c53fe5e..56121dfcc44b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -300,6 +300,14 @@ static int check_hotplug_memory_addressable(unsigned long pfn,
 	return 0;
 }
 
+static void vmem_context_init(struct vmem_context *ctx, struct mhp_params *params)
+{
+	memset(ctx, 0, sizeof(*ctx));
+	ctx->align = params->align;
+	ctx->altmap = params->altmap;
+	ctx->flags = params->flags;
+}
+
 /*
  * Reasonably generic function for adding memory.  It is
  * expected that archs that support memory hotplug will
@@ -313,7 +321,7 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 	unsigned long cur_nr_pages;
 	int err;
 	struct vmem_altmap *altmap = params->altmap;
-	struct vmem_context ctx = { .altmap = params->altmap };
+	struct vmem_context ctx;
 
 	if (WARN_ON_ONCE(!params->pgprot.pgprot))
 		return -EINVAL;
@@ -338,6 +346,8 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 	if (err)
 		return err;
 
+	vmem_context_init(&ctx, params);
+
 	for (; pfn < end_pfn; pfn += cur_nr_pages) {
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages = min(end_pfn - pfn,
diff --git a/mm/memremap.c b/mm/memremap.c
index 287a24b7a65a..ecfa74848ac6 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -253,6 +253,9 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 			goto err_kasan;
 		}
 
+		if (pgmap->flags & PGMAP_COMPOUND)
+			params->align = pgmap->align;
+
 		error = arch_add_memory(nid, range->start, range_len(range),
 					params);
 	}
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index bcda68ba1381..1679f36473ac 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 }
 
 pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-				       struct vmem_altmap *altmap)
+				       struct vmem_altmap *altmap, void *block)
 {
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte)) {
 		pte_t entry;
-		void *p;
-
-		p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
-		if (!p)
-			return NULL;
+		void *p = block;
+
+		if (!block) {
+			p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+			if (!p)
+				return NULL;
+		} else {
+			get_page(virt_to_page(block));
+		}
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
 		set_pte_at(&init_mm, addr, pte, entry);
 	}
@@ -216,8 +220,10 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
-					 int node, struct vmem_altmap *altmap)
+static void *__meminit __vmemmap_populate_basepages(unsigned long start,
+					   unsigned long end, int node,
+					   struct vmem_altmap *altmap,
+					   void *block)
 {
 	unsigned long addr = start;
 	pgd_t *pgd;
@@ -229,38 +235,95 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 	for (; addr < end; addr += PAGE_SIZE) {
 		pgd = vmemmap_pgd_populate(addr, node);
 		if (!pgd)
-			return -ENOMEM;
+			return NULL;
 		p4d = vmemmap_p4d_populate(pgd, addr, node);
 		if (!p4d)
-			return -ENOMEM;
+			return NULL;
 		pud = vmemmap_pud_populate(p4d, addr, node);
 		if (!pud)
-			return -ENOMEM;
+			return NULL;
 		pmd = vmemmap_pmd_populate(pud, addr, node);
 		if (!pmd)
-			return -ENOMEM;
-		pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+			return NULL;
+		pte = vmemmap_pte_populate(pmd, addr, node, altmap, block);
 		if (!pte)
-			return -ENOMEM;
+			return NULL;
 		vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
 	}
 
+	return __va(__pfn_to_phys(pte_pfn(*pte)));
+}
+
+int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
+					 int node, struct vmem_altmap *altmap)
+{
+	if (!__vmemmap_populate_basepages(start, end, node, altmap, NULL))
+		return -ENOMEM;
 	return 0;
 }
 
+static struct page * __meminit vmemmap_populate_reuse(unsigned long start,
+					unsigned long end, int node,
+					struct vmem_context *ctx)
+{
+	unsigned long size, addr = start;
+	unsigned long psize = PHYS_PFN(ctx->align) * sizeof(struct page);
+
+	size = min(psize, end - start);
+
+	for (; addr < end; addr += size) {
+		unsigned long head = addr + PAGE_SIZE;
+		unsigned long tail = addr;
+		unsigned long last = addr + size;
+		void *area;
+
+		if (ctx->block_page &&
+		    IS_ALIGNED((addr - ctx->block_page), psize))
+			ctx->block = NULL;
+
+		area  = ctx->block;
+		if (!area) {
+			if (!__vmemmap_populate_basepages(addr, head, node,
+							  ctx->altmap, NULL))
+				return NULL;
+
+			tail = head + PAGE_SIZE;
+			area = __vmemmap_populate_basepages(head, tail, node,
+							    ctx->altmap, NULL);
+			if (!area)
+				return NULL;
+
+			ctx->block = area;
+			ctx->block_page = addr;
+		}
+
+		if (!__vmemmap_populate_basepages(tail, last, node,
+						  ctx->altmap, area))
+			return NULL;
+	}
+
+	return (struct page*) start;
+}
+
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
 		unsigned long nr_pages, int nid, struct vmem_context *ctx)
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
 	struct vmem_altmap *altmap = NULL;
+	int flags = 0;
 
 	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
 		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
 		return NULL;
 
-	if (ctx)
+	if (ctx) {
 		altmap = ctx->altmap;
+		flags = ctx->flags;
+	}
+
+	if (flags & MEMHP_REUSE_VMEMMAP)
+		return vmemmap_populate_reuse(start, end, nid, ctx);
 
 	if (vmemmap_populate(start, end, nid, altmap))
 		return NULL;
-- 
2.17.1

* [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given page size
  2020-12-08 17:28 [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (2 preceding siblings ...)
  2020-12-08 17:28 ` [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given mhp_params::align Joao Martins
@ 2020-12-08 17:28 ` Joao Martins
  2021-02-20  3:34   ` Dan Williams
  2020-12-08 17:28 ` [PATCH RFC 4/9] mm/page_alloc: Reuse tail struct pages for compound pagemaps Joao Martins
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 67+ messages in thread
From: Joao Martins @ 2020-12-08 17:28 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Introduce a new flag, MEMHP_REUSE_VMEMMAP, which signals that
struct pages are onlined with a given alignment, and should reuse the
tail pages' vmemmap areas. In that circumstance we reuse the PFNs
backing only the tail pages of the subsection, while letting the head
page PFN remain different. This presumes that the backing page structs
are compound pages, such as the case for compound pagemaps (i.e.
ZONE_DEVICE with PGMAP_COMPOUND set).

On 2M compound pagemaps, it lets us save 6 pages out of the 8 PFNs
necessary to describe the subsection's 32K of struct pages we are
onlining. On a 1G compound pagemap it lets us save 4096 pages.

Sections are 128M (or bigger/smaller), and as such when initializing a
compound memory map where we are initializing compound struct pages, we
need to preserve the tail page to be reused across the rest of the areas
for page sizes which are bigger than a section.
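
For reference, the 2M subsection numbers above break down as follows
(assuming 4K base pages and a 64-byte struct page):

	subsection memmap: 512 struct pages * 64B = 32K -> 8 vmemmap PFNs
	kept unique:       the head vmemmap PFN + the first tail PFN -> 2
	reused:            the remaining 6 PFNs map to the same tail page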

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
I wonder, rather than separating vmem_context and mhp_params, whether
one should just pick the latter. Albeit semantically the ctx isn't
necessarily parameters, but rather context passed across the onlining of
multiple sections (i.e. multiple calls to populate_section_memmap). Also,
this is internal state which isn't passed to external modules, except
@align and @flags for the page size and for requesting whether to reuse
tail page areas.
---
 include/linux/memory_hotplug.h | 10 ++++
 include/linux/mm.h             |  2 +-
 mm/memory_hotplug.c            | 12 ++++-
 mm/memremap.c                  |  3 ++
 mm/sparse-vmemmap.c            | 93 ++++++++++++++++++++++++++++------
 5 files changed, 103 insertions(+), 17 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 73f8bcbb58a4..e15bb82805a3 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -70,6 +70,10 @@ typedef int __bitwise mhp_t;
  */
 #define MEMHP_MERGE_RESOURCE	((__force mhp_t)BIT(0))
 
+/*
+ */
+#define MEMHP_REUSE_VMEMMAP	((__force mhp_t)BIT(1))
+
 /*
  * Extended parameters for memory hotplug:
  * altmap: alternative allocator for memmap array (optional)
@@ -79,10 +83,16 @@ typedef int __bitwise mhp_t;
 struct mhp_params {
 	struct vmem_altmap *altmap;
 	pgprot_t pgprot;
+	unsigned int align;
+	mhp_t flags;
 };
 
 struct vmem_context {
 	struct vmem_altmap *altmap;
+	mhp_t flags;
+	unsigned int align;
+	void *block;
+	unsigned long block_page;
 };
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2eb44318bb2d..8b0155441835 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3006,7 +3006,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-			    struct vmem_altmap *altmap);
+			    struct vmem_altmap *altmap, void *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
 struct vmem_altmap;
 void *vmemmap_alloc_block_buf(unsigned long size, int node,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f8870c53fe5e..56121dfcc44b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -300,6 +300,14 @@ static int check_hotplug_memory_addressable(unsigned long pfn,
 	return 0;
 }
 
+static void vmem_context_init(struct vmem_context *ctx, struct mhp_params *params)
+{
+	memset(ctx, 0, sizeof(*ctx));
+	ctx->align = params->align;
+	ctx->altmap = params->altmap;
+	ctx->flags = params->flags;
+}
+
 /*
  * Reasonably generic function for adding memory.  It is
  * expected that archs that support memory hotplug will
@@ -313,7 +321,7 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 	unsigned long cur_nr_pages;
 	int err;
 	struct vmem_altmap *altmap = params->altmap;
-	struct vmem_context ctx = { .altmap = params->altmap };
+	struct vmem_context ctx;
 
 	if (WARN_ON_ONCE(!params->pgprot.pgprot))
 		return -EINVAL;
@@ -338,6 +346,8 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 	if (err)
 		return err;
 
+	vmem_context_init(&ctx, params);
+
 	for (; pfn < end_pfn; pfn += cur_nr_pages) {
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages = min(end_pfn - pfn,
diff --git a/mm/memremap.c b/mm/memremap.c
index 287a24b7a65a..ecfa74848ac6 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -253,6 +253,9 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 			goto err_kasan;
 		}
 
+		if (pgmap->flags & PGMAP_COMPOUND)
+			params->align = pgmap->align;
+
 		error = arch_add_memory(nid, range->start, range_len(range),
 					params);
 	}
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index bcda68ba1381..364c071350e8 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 }
 
 pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-				       struct vmem_altmap *altmap)
+				       struct vmem_altmap *altmap, void *block)
 {
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte)) {
 		pte_t entry;
-		void *p;
-
-		p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
-		if (!p)
-			return NULL;
+		void *p = block;
+
+		if (!block) {
+			p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+			if (!p)
+				return NULL;
+		} else {
+			get_page(virt_to_page(block));
+		}
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
 		set_pte_at(&init_mm, addr, pte, entry);
 	}
@@ -216,8 +220,10 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
-					 int node, struct vmem_altmap *altmap)
+static void *__meminit __vmemmap_populate_basepages(unsigned long start,
+					   unsigned long end, int node,
+					   struct vmem_altmap *altmap,
+					   void *block)
 {
 	unsigned long addr = start;
 	pgd_t *pgd;
@@ -229,38 +235,95 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 	for (; addr < end; addr += PAGE_SIZE) {
 		pgd = vmemmap_pgd_populate(addr, node);
 		if (!pgd)
-			return -ENOMEM;
+			return NULL;
 		p4d = vmemmap_p4d_populate(pgd, addr, node);
 		if (!p4d)
-			return -ENOMEM;
+			return NULL;
 		pud = vmemmap_pud_populate(p4d, addr, node);
 		if (!pud)
-			return -ENOMEM;
+			return NULL;
 		pmd = vmemmap_pmd_populate(pud, addr, node);
 		if (!pmd)
-			return -ENOMEM;
-		pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+			return NULL;
+		pte = vmemmap_pte_populate(pmd, addr, node, altmap, block);
 		if (!pte)
-			return -ENOMEM;
+			return NULL;
 		vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
 	}
 
+	return __va(__pfn_to_phys(pte_pfn(*pte)));
+}
+
+int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
+					 int node, struct vmem_altmap *altmap)
+{
+	if (!__vmemmap_populate_basepages(start, end, node, altmap, NULL))
+		return -ENOMEM;
 	return 0;
 }
 
+static struct page * __meminit vmemmap_populate_reuse(unsigned long start,
+					unsigned long end, int node,
+					struct vmem_context *ctx)
+{
+	unsigned long size, addr = start;
+	unsigned long psize = PHYS_PFN(ctx->align) * sizeof(struct page);
+
+	size = min(psize, end - start);
+
+	for (; addr < end; addr += size) {
+		unsigned long head = addr + PAGE_SIZE;
+		unsigned long tail = addr;
+		unsigned long last = addr + size;
+		void *area;
+
+		if (ctx->block_page &&
+		    IS_ALIGNED((addr - ctx->block_page), psize))
+			ctx->block = NULL;
+
+		area  = ctx->block;
+		if (!area) {
+			if (!__vmemmap_populate_basepages(addr, head, node,
+							  ctx->altmap, NULL))
+				return NULL;
+
+			tail = head + PAGE_SIZE;
+			area = __vmemmap_populate_basepages(head, tail, node,
+							    ctx->altmap, NULL);
+			if (!area)
+				return NULL;
+
+			ctx->block = area;
+			ctx->block_page = addr;
+		}
+
+		if (!__vmemmap_populate_basepages(tail, last, node,
+						  ctx->altmap, area))
+			return NULL;
+	}
+
+	return (struct page *) start;
+}
+
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
 		unsigned long nr_pages, int nid, struct vmem_context *ctx)
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
 	struct vmem_altmap *altmap = NULL;
+	int flags = 0;
 
 	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
 		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
 		return NULL;
 
-	if (ctx)
+	if (ctx) {
 		altmap = ctx->altmap;
+		flags = ctx->flags;
+	}
+
+	if (flags & MEMHP_REUSE_VMEMMAP)
+		return vmemmap_populate_reuse(start, end, nid, ctx);
 
 	if (vmemmap_populate(start, end, nid, altmap))
 		return NULL;
-- 
2.17.1

* [PATCH RFC 4/9] mm/page_alloc: Reuse tail struct pages for compound pagemaps
  2020-12-08 17:28 [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (3 preceding siblings ...)
  2020-12-08 17:28 ` [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given page size Joao Martins
@ 2020-12-08 17:28 ` Joao Martins
  2021-02-20  6:17   ` Dan Williams
  2020-12-08 17:28 ` [PATCH RFC 5/9] device-dax: Compound pagemap support Joao Martins
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 67+ messages in thread
From: Joao Martins @ 2020-12-08 17:28 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

When PGMAP_COMPOUND is set, all pages are onlined at a given huge page
alignment and compound pages are used to describe them, as opposed to a
struct page per 4K page.

To minimize struct page overhead, and given the usage of compound pages,
we take advantage of the fact that most tail pages look the same and
online the subsection while pointing to the same pages. Thus we request
MEMHP_REUSE_VMEMMAP in add_pages.

With MEMHP_REUSE_VMEMMAP, provided we reuse most tail pages, the number
of struct pages we need to initialize is a lot smaller than the total
amount of structs we would normally online. Thus allow an @init_order
to be passed to specify how many pages we want to prep upon creating a
compound page.

Finally, when onlining all struct pages in memmap_init_zone_device(),
make sure that we only initialize the unique struct pages, i.e. the
first two 4K pages' worth of struct pages from each @align, which means
128 struct pages out of 32768 for a 2M @align or 262144 for a 1G @align.
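
For reference, assuming a 64-byte struct page, the 128 figure above
corresponds to the MEMMAP_COMPOUND_SIZE used below:

	MEMMAP_COMPOUND_SIZE = 2 * (PAGE_SIZE / sizeof(struct page))
	                     = 2 * (4096 / 64)
	                     = 128 struct pages initialized per @align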

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/memremap.c   |  4 +++-
 mm/page_alloc.c | 23 ++++++++++++++++++++---
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index ecfa74848ac6..3eca07916b9d 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -253,8 +253,10 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 			goto err_kasan;
 		}
 
-		if (pgmap->flags & PGMAP_COMPOUND)
+		if (pgmap->flags & PGMAP_COMPOUND) {
 			params->align = pgmap->align;
+			params->flags = MEMHP_REUSE_VMEMMAP;
+		}
 
 		error = arch_add_memory(nid, range->start, range_len(range),
 					params);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9716ecd58e29..180a7d4e9285 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -691,10 +691,11 @@ void free_compound_page(struct page *page)
 	__free_pages_ok(page, compound_order(page), FPI_NONE);
 }
 
-void prep_compound_page(struct page *page, unsigned int order)
+static void __prep_compound_page(struct page *page, unsigned int order,
+				 unsigned int init_order)
 {
 	int i;
-	int nr_pages = 1 << order;
+	int nr_pages = 1 << init_order;
 
 	__SetPageHead(page);
 	for (i = 1; i < nr_pages; i++) {
@@ -711,6 +712,11 @@ void prep_compound_page(struct page *page, unsigned int order)
 		atomic_set(compound_pincount_ptr(page), 0);
 }
 
+void prep_compound_page(struct page *page, unsigned int order)
+{
+	__prep_compound_page(page, order, order);
+}
+
 #ifdef CONFIG_DEBUG_PAGEALLOC
 unsigned int _debug_guardpage_minorder;
 
@@ -6108,6 +6114,9 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 }
 
 #ifdef CONFIG_ZONE_DEVICE
+
+#define MEMMAP_COMPOUND_SIZE (2 * (PAGE_SIZE/sizeof(struct page)))
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6138,6 +6147,12 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
+		/* Skip already initialized pages. */
+		if (compound && (pfn % align >= MEMMAP_COMPOUND_SIZE)) {
+			pfn = ALIGN(pfn, align) - 1;
+			continue;
+		}
+
 		__init_single_page(page, pfn, zone_idx, nid);
 
 		/*
@@ -6175,7 +6190,9 @@ void __ref memmap_init_zone_device(struct zone *zone,
 
 	if (compound) {
 		for (pfn = start_pfn; pfn < end_pfn; pfn += align)
-			prep_compound_page(pfn_to_page(pfn), order_base_2(align));
+			__prep_compound_page(pfn_to_page(pfn),
+					   order_base_2(align),
+					   order_base_2(MEMMAP_COMPOUND_SIZE));
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1

* [PATCH RFC 5/9] device-dax: Compound pagemap support
  2020-12-08 17:28 [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (4 preceding siblings ...)
  2020-12-08 17:28 ` [PATCH RFC 4/9] mm/page_alloc: Reuse tail struct pages for compound pagemaps Joao Martins
@ 2020-12-08 17:28 ` Joao Martins
  2020-12-08 17:28 ` [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages Joao Martins
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-08 17:28 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

dax devices are created with a fixed @align (huge page size) which is
also enforced at mmap() of the device. Faults consequently happen at the
@align specified at creation time, and that doesn't change throughout
the dax device lifetime. MCEs poison a whole dax huge page, and splits
also occur at the configured page size.

As such, use the newly added compound pagemap facility which onlines
the assigned dax ranges as compound pages. Currently this means that
region/namespace bootstrap takes considerably less time, given that
considerably fewer pages are initialized.

On emulated NVDIMM guests this can be easily seen, e.g. on a setup
with an emulated NVDIMM 128G in size, seeing improvements from ~750ms
to ~190ms with 2M pages, and to less than 1 ms with 1G pages.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
Probably deserves its own sysfs attribute for enabling PGMAP_COMPOUND?
---
 drivers/dax/device.c | 54 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 41 insertions(+), 13 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 25e0b84a4296..9daec6e08efe 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -192,6 +192,39 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 }
 #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
+static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
+			     unsigned int fault_size,
+			     struct address_space *f_mapping)
+{
+	unsigned long i;
+	pgoff_t pgoff;
+
+	pgoff = linear_page_index(vmf->vma, vmf->address
+			& ~(fault_size - 1));
+
+	for (i = 0; i < fault_size / PAGE_SIZE; i++) {
+		struct page *page;
+
+		page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+		if (page->mapping)
+			continue;
+		page->mapping = f_mapping;
+		page->index = pgoff + i;
+	}
+}
+
+static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
+				 unsigned int fault_size,
+				 struct address_space *f_mapping)
+{
+	struct page *head;
+
+	head = pfn_to_page(pfn_t_to_pfn(pfn));
+	head->mapping = f_mapping;
+	head->index = linear_page_index(vmf->vma, vmf->address
+			& ~(fault_size - 1));
+}
+
 static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
@@ -225,8 +258,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 	}
 
 	if (rc == VM_FAULT_NOPAGE) {
-		unsigned long i;
-		pgoff_t pgoff;
+		struct dev_pagemap *pgmap = pfn_t_to_page(pfn)->pgmap;
 
 		/*
 		 * In the device-dax case the only possibility for a
@@ -234,17 +266,10 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		 * mapped. No need to consider the zero page, or racing
 		 * conflicting mappings.
 		 */
-		pgoff = linear_page_index(vmf->vma, vmf->address
-				& ~(fault_size - 1));
-		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
-			struct page *page;
-
-			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
-			if (page->mapping)
-				continue;
-			page->mapping = filp->f_mapping;
-			page->index = pgoff + i;
-		}
+		if (pgmap->flags & PGMAP_COMPOUND)
+			set_compound_mapping(vmf, pfn, fault_size, filp->f_mapping);
+		else
+			set_page_mapping(vmf, pfn, fault_size, filp->f_mapping);
 	}
 	dax_read_unlock(id);
 
@@ -426,6 +451,9 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 	}
 
 	pgmap->type = MEMORY_DEVICE_GENERIC;
+	pgmap->flags = PGMAP_COMPOUND;
+	pgmap->align = dev_dax->align;
+
 	addr = devm_memremap_pages(dev, pgmap);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
-- 
2.17.1

* [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages
  2020-12-08 17:28 [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (5 preceding siblings ...)
  2020-12-08 17:28 ` [PATCH RFC 5/9] device-dax: Compound pagemap support Joao Martins
@ 2020-12-08 17:28 ` Joao Martins
  2020-12-09  4:40   ` John Hubbard
       [not found]   ` <20201208194905.GQ5487@ziepe.ca>
  2020-12-08 17:28 ` [PATCH RFC 7/9] mm/gup: Decrement head page " Joao Martins
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-08 17:28 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Much like hugetlbfs or THPs, we treat device pagemaps with
compound pages like the rest of GUP handling of compound pages.

Rather than incrementing the refcount every 4K, we record
all subpages and increment the head refcount by @refs *once*.

Performance measured by gup_benchmark improves considerably for
get_user_pages_fast() and pin_user_pages_fast():

 $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-u,-a] -n 512 -w

(get_user_pages_fast 2M pages) ~75k us -> ~3.6k us
(pin_user_pages_fast 2M pages) ~125k us -> ~3.8k us

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/gup.c | 67 ++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 51 insertions(+), 16 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 98eb8e6d2609..194e6981eb03 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2250,22 +2250,68 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 }
 #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
 
+
+static int record_subpages(struct page *page, unsigned long addr,
+			   unsigned long end, struct page **pages)
+{
+	int nr;
+
+	for (nr = 0; addr != end; addr += PAGE_SIZE)
+		pages[nr++] = page++;
+
+	return nr;
+}
+
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static int __gup_device_huge(unsigned long pfn, unsigned long addr,
-			     unsigned long end, unsigned int flags,
-			     struct page **pages, int *nr)
+static int __gup_device_compound_huge(struct dev_pagemap *pgmap,
+				      struct page *head, unsigned long sz,
+				      unsigned long addr, unsigned long end,
+				      unsigned int flags, struct page **pages)
+{
+	struct page *page;
+	int refs;
+
+	if (!(pgmap->flags & PGMAP_COMPOUND))
+		return -1;
+
+	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
+	refs = record_subpages(page, addr, end, pages);
+
+	SetPageReferenced(page);
+	head = try_grab_compound_head(head, refs, flags);
+	if (!head) {
+		ClearPageReferenced(page);
+		return 0;
+	}
+
+	return refs;
+}
+
+static int __gup_device_huge(unsigned long pfn, unsigned long sz,
+			     unsigned long addr, unsigned long end,
+			     unsigned int flags, struct page **pages, int *nr)
 {
 	int nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
 
 	do {
 		struct page *page = pfn_to_page(pfn);
+		int refs;
 
 		pgmap = get_dev_pagemap(pfn, pgmap);
 		if (unlikely(!pgmap)) {
 			undo_dev_pagemap(nr, nr_start, flags, pages);
 			return 0;
 		}
+
+		refs = __gup_device_compound_huge(pgmap, page, sz, addr, end,
+						  flags, pages + *nr);
+		if (refs >= 0) {
+			*nr += refs;
+			put_dev_pagemap(pgmap);
+			return refs ? 1 : 0;
+		}
+
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		if (unlikely(!try_grab_page(page, flags))) {
@@ -2289,7 +2335,7 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	int nr_start = *nr;
 
 	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
+	if (!__gup_device_huge(fault_pfn, PMD_SHIFT, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
@@ -2307,7 +2353,7 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	int nr_start = *nr;
 
 	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
+	if (!__gup_device_huge(fault_pfn, PUD_SHIFT, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
@@ -2334,17 +2380,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
-static int record_subpages(struct page *page, unsigned long addr,
-			   unsigned long end, struct page **pages)
-{
-	int nr;
-
-	for (nr = 0; addr != end; addr += PAGE_SIZE)
-		pages[nr++] = page++;
-
-	return nr;
-}
-
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
-- 
2.17.1

* [PATCH RFC 7/9] mm/gup: Decrement head page once for group of subpages
  2020-12-08 17:28 [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (6 preceding siblings ...)
  2020-12-08 17:28 ` [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages Joao Martins
@ 2020-12-08 17:28 ` Joao Martins
       [not found]   ` <20201208193446.GP5487@ziepe.ca>
  2020-12-08 17:29 ` [PATCH RFC 8/9] RDMA/umem: batch page unpin in __ib_mem_release() Joao Martins
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 67+ messages in thread
From: Joao Martins @ 2020-12-08 17:28 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Rather than decrementing the refcount one page at a time, we
walk the page array and check which pages belong to the same
compound head. We then decrement the calculated number of
references in a single write to the head page.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/gup.c | 41 ++++++++++++++++++++++++++++++++---------
 1 file changed, 32 insertions(+), 9 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 194e6981eb03..3a9a7229f418 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -212,6 +212,18 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
+static int record_refs(struct page **pages, int npages)
+{
+	struct page *head = compound_head(pages[0]);
+	int refs = 1, index;
+
+	for (index = 1; index < npages; index++, refs++)
+		if (compound_head(pages[index]) != head)
+			break;
+
+	return refs;
+}
+
 /**
  * unpin_user_page() - release a dma-pinned page
  * @page:            pointer to page to be released
@@ -221,9 +233,9 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
  * that such pages can be separately tracked and uniquely handled. In
  * particular, interactions with RDMA and filesystems need special handling.
  */
-void unpin_user_page(struct page *page)
+static void __unpin_user_page(struct page *page, int refs)
 {
-	int refs = 1;
+	int orig_refs = refs;
 
 	page = compound_head(page);
 
@@ -237,14 +249,19 @@ void unpin_user_page(struct page *page)
 		return;
 
 	if (hpage_pincount_available(page))
-		hpage_pincount_sub(page, 1);
+		hpage_pincount_sub(page, refs);
 	else
-		refs = GUP_PIN_COUNTING_BIAS;
+		refs *= GUP_PIN_COUNTING_BIAS;
 
 	if (page_ref_sub_and_test(page, refs))
 		__put_page(page);
 
-	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
+	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, orig_refs);
+}
+
+void unpin_user_page(struct page *page)
+{
+	__unpin_user_page(page, 1);
 }
 EXPORT_SYMBOL(unpin_user_page);
 
@@ -274,6 +291,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 				 bool make_dirty)
 {
 	unsigned long index;
+	int refs = 1;
 
 	/*
 	 * TODO: this can be optimized for huge pages: if a series of pages is
@@ -286,8 +304,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 		return;
 	}
 
-	for (index = 0; index < npages; index++) {
+	for (index = 0; index < npages; index += refs) {
 		struct page *page = compound_head(pages[index]);
+
 		/*
 		 * Checking PageDirty at this point may race with
 		 * clear_page_dirty_for_io(), but that's OK. Two key
@@ -310,7 +329,8 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 		 */
 		if (!PageDirty(page))
 			set_page_dirty_lock(page);
-		unpin_user_page(page);
+		refs = record_refs(pages + index, npages - index);
+		__unpin_user_page(page, refs);
 	}
 }
 EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
@@ -327,6 +347,7 @@ EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
 void unpin_user_pages(struct page **pages, unsigned long npages)
 {
 	unsigned long index;
+	int refs = 1;
 
 	/*
 	 * If this WARN_ON() fires, then the system *might* be leaking pages (by
@@ -340,8 +361,10 @@ void unpin_user_pages(struct page **pages, unsigned long npages)
 	 * physically contiguous and part of the same compound page, then a
 	 * single operation to the head page should suffice.
 	 */
-	for (index = 0; index < npages; index++)
-		unpin_user_page(pages[index]);
+	for (index = 0; index < npages; index += refs) {
+		refs = record_refs(pages + index, npages - index);
+		__unpin_user_page(pages[index], refs);
+	}
 }
 EXPORT_SYMBOL(unpin_user_pages);
 
-- 
2.17.1

* [PATCH RFC 8/9] RDMA/umem: batch page unpin in __ib_mem_release()
  2020-12-08 17:28 [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (7 preceding siblings ...)
  2020-12-08 17:28 ` [PATCH RFC 7/9] mm/gup: Decrement head page " Joao Martins
@ 2020-12-08 17:29 ` Joao Martins
  2020-12-09  5:18   ` John Hubbard
       [not found]   ` <20201208192935.GA1908088@ziepe.ca>
  2020-12-08 17:29 ` [PATCH RFC 9/9] mm: Add follow_devmap_page() for devdax vmas Joao Martins
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-08 17:29 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Take advantage of the newly added unpin_user_pages() batched
refcount update, by building a page array from an SGL
(same size as the one used in ib_umem_get()) and calling
unpin_user_pages() with that.

unpin_user_pages() will check for consecutive pages that belong
to the same compound page and batch the refcount update in
a single write.

Running a test program which calls MR reg/unreg on a 1G region
and measures the cost of both operations together (in a guest
using rxe) with device-dax and hugetlbfs:

Before:
159 rounds in 5.027 sec: 31617.923 usec / round (device-dax)
466 rounds in 5.009 sec: 10748.456 usec / round (hugetlbfs)

After:
 305 rounds in 5.010 sec: 16426.047 usec / round (device-dax)
1073 rounds in 5.004 sec: 4663.622 usec / round (hugetlbfs)

We also see similar improvements on a setup with pmem and RDMA hardware.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/infiniband/core/umem.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index e9fecbdf391b..493cfdcf7381 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -44,20 +44,40 @@
 
 #include "uverbs.h"
 
+#define PAGES_PER_LIST (PAGE_SIZE / sizeof(struct page *))
+
 static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
 {
+	bool make_dirty = umem->writable && dirty;
+	struct page **page_list = NULL;
 	struct sg_page_iter sg_iter;
+	unsigned long nr = 0;
 	struct page *page;
 
+	page_list = (struct page **) __get_free_page(GFP_KERNEL);
+
 	if (umem->nmap > 0)
 		ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
 				DMA_BIDIRECTIONAL);
 
 	for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
 		page = sg_page_iter_page(&sg_iter);
-		unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
+		if (page_list)
+			page_list[nr++] = page;
+
+		if (!page_list) {
+			unpin_user_pages_dirty_lock(&page, 1, make_dirty);
+		} else if (nr == PAGES_PER_LIST) {
+			unpin_user_pages_dirty_lock(page_list, nr, make_dirty);
+			nr = 0;
+		}
 	}
 
+	if (nr)
+		unpin_user_pages_dirty_lock(page_list, nr, make_dirty);
+
+	if (page_list)
+		free_page((unsigned long) page_list);
 	sg_free_table(&umem->sg_head);
 }
 
@@ -212,8 +232,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
 		cond_resched();
 		ret = pin_user_pages_fast(cur_base,
 					  min_t(unsigned long, npages,
-						PAGE_SIZE /
-						sizeof(struct page *)),
+						PAGES_PER_LIST),
 					  gup_flags | FOLL_LONGTERM, page_list);
 		if (ret < 0)
 			goto umem_release;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH RFC 9/9] mm: Add follow_devmap_page() for devdax vmas
  2020-12-08 17:28 [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (8 preceding siblings ...)
  2020-12-08 17:29 ` [PATCH RFC 8/9] RDMA/umem: batch page unpin in __ib_mem_release() Joao Martins
@ 2020-12-08 17:29 ` Joao Martins
  2020-12-09  5:23   ` John Hubbard
       [not found]   ` <20201208195754.GR5487@ziepe.ca>
  2020-12-09  9:38 ` [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps David Hildenbrand
                   ` (2 subsequent siblings)
  12 siblings, 2 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-08 17:29 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton, Joao Martins

Similar to follow_hugetlb_page(), add a follow_devmap_page() which,
rather than calling follow_page() per 4K page in a PMD/PUD, does so
for the entire PMD/PUD: we lock the pmd/pud, get all the pages, unlock.

While doing so, we only change the refcount once when PGMAP_COMPOUND is
passed in.

This lets us improve {pin,get}_user_pages{,_longterm}() considerably:

$ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-U,-b,-L] -n 512 -w

(<test>) [before] -> [after]
(get_user_pages 2M pages) ~150k us -> ~8.9k us
(pin_user_pages 2M pages) ~192k us -> ~9k us
(pin_user_pages_longterm 2M pages) ~200k us -> ~19k us

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
I've special-cased this to device-dax vmas, given their page size
guarantees similar to hugetlbfs, but I feel this is a bit wrong. I am
replicating follow_hugetlb_page(), and as an RFC this ought to seek
feedback on whether it should be generalized if no fundamental issues
exist. In such a case, should I be changing follow_page_mask() to take
either an array of pages, or a function pointer and opaque arguments
which would let the caller pick its structure?
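
For illustration, the function-pointer variant floated above could look
roughly like this (hypothetical names and signature, just to make the
question concrete; none of this exists in the tree):

	typedef int (*gup_record_t)(struct page *head, unsigned int ntails,
				    unsigned long offset, void *data);

	long follow_page_range(struct mm_struct *mm, struct vm_area_struct *vma,
			       unsigned long start, unsigned long nr_pages,
			       unsigned int gup_flags, gup_record_t record,
			       void *data);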
---
 include/linux/huge_mm.h |   4 +
 include/linux/mm.h      |   2 +
 mm/gup.c                |  22 ++++-
 mm/huge_memory.c        | 202 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 227 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0365aa97f8e7..da87ecea19e6 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -293,6 +293,10 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 		pud_t *pud, int flags, struct dev_pagemap **pgmap);
+long follow_devmap_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			struct page **pages, struct vm_area_struct **vmas,
+			unsigned long *position, unsigned long *nr_pages,
+			long i, unsigned int flags, int *locked);
 
 extern vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b0155441835..466c88679628 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1164,6 +1164,8 @@ static inline void get_page(struct page *page)
 	page_ref_inc(page);
 }
 
+__maybe_unused struct page *try_grab_compound_head(struct page *page, int refs,
+						   unsigned int flags);
 bool __must_check try_grab_page(struct page *page, unsigned int flags);
 
 static inline __must_check bool try_get_page(struct page *page)
diff --git a/mm/gup.c b/mm/gup.c
index 3a9a7229f418..50effb9cc349 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -78,7 +78,7 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
  * considered failure, and furthermore, a likely bug in the caller, so a warning
  * is also emitted.
  */
-static __maybe_unused struct page *try_grab_compound_head(struct page *page,
+__maybe_unused struct page *try_grab_compound_head(struct page *page,
 							  int refs,
 							  unsigned int flags)
 {
@@ -880,8 +880,8 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
  * does not include FOLL_NOWAIT, the mmap_lock may be released.  If it
  * is, *@locked will be set to 0 and -EBUSY returned.
  */
-static int faultin_page(struct vm_area_struct *vma,
-		unsigned long address, unsigned int *flags, int *locked)
+int faultin_page(struct vm_area_struct *vma,
+		 unsigned long address, unsigned int *flags, int *locked)
 {
 	unsigned int fault_flags = 0;
 	vm_fault_t ret;
@@ -1103,6 +1103,22 @@ static long __get_user_pages(struct mm_struct *mm,
 				}
 				continue;
 			}
+			if (vma_is_dax(vma)) {
+				i = follow_devmap_page(mm, vma, pages, vmas,
+						       &start, &nr_pages, i,
+						       gup_flags, locked);
+				if (locked && *locked == 0) {
+					/*
+					 * We've got a VM_FAULT_RETRY
+					 * and we've lost mmap_lock.
+					 * We must stop here.
+					 */
+					BUG_ON(gup_flags & FOLL_NOWAIT);
+					BUG_ON(ret != 0);
+					goto out;
+				}
+				continue;
+			}
 		}
 retry:
 		/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ec2bb93f7431..20bfbf211dc3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1168,6 +1168,208 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	return page;
 }
 
+long follow_devmap_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			struct page **pages, struct vm_area_struct **vmas,
+			unsigned long *position, unsigned long *nr_pages,
+			long i, unsigned int flags, int *locked)
+{
+	unsigned long pfn_offset;
+	unsigned long vaddr = *position;
+	unsigned long remainder = *nr_pages;
+	unsigned long align = vma_kernel_pagesize(vma);
+	unsigned long align_nr_pages = align >> PAGE_SHIFT;
+	unsigned long mask = ~(align-1);
+	unsigned long nr_pages_hpage = 0;
+	struct dev_pagemap *pgmap = NULL;
+	int err = -EFAULT;
+
+	if (align == PAGE_SIZE)
+		return i;
+
+	while (vaddr < vma->vm_end && remainder) {
+		pte_t *pte;
+		spinlock_t *ptl = NULL;
+		int absent;
+		struct page *page;
+
+		/*
+		 * If we have a pending SIGKILL, don't keep faulting pages and
+		 * potentially allocating memory.
+		 */
+		if (fatal_signal_pending(current)) {
+			remainder = 0;
+			break;
+		}
+
+		/*
+		 * Some archs (sparc64, sh*) have multiple pte_ts to
+		 * each hugepage.  We have to make sure we get the
+		 * first, for the page indexing below to work.
+		 *
+		 * Note that page table lock is not held when pte is null.
+		 */
+		pte = huge_pte_offset(mm, vaddr & mask, align);
+		if (pte) {
+			if (align == PMD_SIZE)
+				ptl = pmd_lockptr(mm, (pmd_t *) pte);
+			else if (align == PUD_SIZE)
+				ptl = pud_lockptr(mm, (pud_t *) pte);
+			spin_lock(ptl);
+		}
+		absent = !pte || pte_none(ptep_get(pte));
+
+		if (absent && (flags & FOLL_DUMP)) {
+			if (pte)
+				spin_unlock(ptl);
+			remainder = 0;
+			break;
+		}
+
+		if (absent ||
+		    ((flags & FOLL_WRITE) &&
+		      !pte_write(ptep_get(pte)))) {
+			vm_fault_t ret;
+			unsigned int fault_flags = 0;
+
+			if (pte)
+				spin_unlock(ptl);
+			if (flags & FOLL_WRITE)
+				fault_flags |= FAULT_FLAG_WRITE;
+			if (locked)
+				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
+					FAULT_FLAG_KILLABLE;
+			if (flags & FOLL_NOWAIT)
+				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
+					FAULT_FLAG_RETRY_NOWAIT;
+			if (flags & FOLL_TRIED) {
+				/*
+				 * Note: FAULT_FLAG_ALLOW_RETRY and
+				 * FAULT_FLAG_TRIED can co-exist
+				 */
+				fault_flags |= FAULT_FLAG_TRIED;
+			}
+			ret = handle_mm_fault(vma, vaddr, flags, NULL);
+			if (ret & VM_FAULT_ERROR) {
+				err = vm_fault_to_errno(ret, flags);
+				remainder = 0;
+				break;
+			}
+			if (ret & VM_FAULT_RETRY) {
+				if (locked &&
+				    !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
+					*locked = 0;
+				*nr_pages = 0;
+				/*
+				 * VM_FAULT_RETRY must not return an
+				 * error, it will return zero
+				 * instead.
+				 *
+				 * No need to update "position" as the
+				 * caller will not check it after
+				 * *nr_pages is set to 0.
+				 */
+				return i;
+			}
+			continue;
+		}
+
+		pfn_offset = (vaddr & ~mask) >> PAGE_SHIFT;
+		page = pte_page(ptep_get(pte));
+
+		pgmap = get_dev_pagemap(page_to_pfn(page), pgmap);
+		if (!pgmap) {
+			spin_unlock(ptl);
+			remainder = 0;
+			err = -EFAULT;
+			break;
+		}
+
+		/*
+		 * If subpage information not requested, update counters
+		 * and skip the same_page loop below.
+		 */
+		if (!pages && !vmas && !pfn_offset &&
+		    (vaddr + align < vma->vm_end) &&
+		    (remainder >= (align_nr_pages))) {
+			vaddr += align;
+			remainder -= align_nr_pages;
+			i += align_nr_pages;
+			spin_unlock(ptl);
+			continue;
+		}
+
+		nr_pages_hpage = 0;
+
+same_page:
+		if (pages) {
+			pages[i] = mem_map_offset(page, pfn_offset);
+
+			/*
+			 * try_grab_page() should always succeed here, because:
+			 * a) we hold the ptl lock, and b) we've just checked
+			 * that the huge page is present in the page tables.
+			 */
+			if (!(pgmap->flags & PGMAP_COMPOUND) &&
+			    WARN_ON_ONCE(!try_grab_page(pages[i], flags))) {
+				spin_unlock(ptl);
+				remainder = 0;
+				err = -ENOMEM;
+				break;
+			}
+
+		}
+
+		if (vmas)
+			vmas[i] = vma;
+
+		vaddr += PAGE_SIZE;
+		++pfn_offset;
+		--remainder;
+		++i;
+		nr_pages_hpage++;
+		if (vaddr < vma->vm_end && remainder &&
+				pfn_offset < align_nr_pages) {
+			/*
+			 * We use pfn_offset to avoid touching the pageframes
+			 * of this compound page.
+			 */
+			goto same_page;
+		} else {
+			/*
+			 * try_grab_compound_head() should always succeed here,
+			 * because: a) we hold the ptl lock, and b) we've just
+			 * checked that the huge page is present in the page
+			 * tables. If the huge page is present, then the tail
+			 * pages must also be present. The ptl prevents the
+			 * head page and tail pages from being rearranged in
+			 * any way. So this page must be available at this
+			 * point, unless the page refcount overflowed:
+			 */
+			if ((pgmap->flags & PGMAP_COMPOUND) &&
+			    WARN_ON_ONCE(!try_grab_compound_head(pages[i-1],
+								 nr_pages_hpage,
+								 flags))) {
+				put_dev_pagemap(pgmap);
+				spin_unlock(ptl);
+				remainder = 0;
+				err = -ENOMEM;
+				break;
+			}
+			put_dev_pagemap(pgmap);
+		}
+		spin_unlock(ptl);
+	}
+	*nr_pages = remainder;
+	/*
+	 * setting position is actually required only if remainder is
+	 * not zero but it's faster not to add a "if (remainder)"
+	 * branch.
+	 */
+	*position = vaddr;
+
+	return i ? i : err;
+}
+
 int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
 		  struct vm_area_struct *vma)
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given mhp_params::align
  2020-12-08 17:28 ` [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given mhp_params::align Joao Martins
@ 2020-12-08 17:38   ` Joao Martins
  0 siblings, 0 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-08 17:38 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton

On 12/8/20 5:28 PM, Joao Martins wrote:
> Introduce a new flag, MEMHP_REUSE_VMEMMAP, which signals that that
> struct pages are onlined with a given alignment, and should reuse the
> tail pages vmemmap areas. On that circunstamce we reuse the PFN backing
> only the tail pages subsections, while letting the head page PFN remain
> different. This presumes that the backing page structs are compound
> pages, such as the case for compound pagemaps (i.e. ZONE_DEVICE with
> PGMAP_COMPOUND set)
> 
> On 2M compound pagemaps, it lets us save 6 pages out of the 8 necessary
> PFNs necessary to describe the subsection's 32K struct pages we are
> onlining. On a 1G compound pagemap it let us save 4096 pages.
> 
> Sections are 128M (or bigger/smaller), and such when initializing a
> compound memory map where we are initializing compound struct pages, we
> need to preserve the tail page to be reused across the rest of the areas
> for pagesizes which bigger than a section.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Sigh, ignore this one.

I mistakenly had stashed an old version of this patch, and wrongly sent it up.

	Joao

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages
  2020-12-08 17:28 ` [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages Joao Martins
@ 2020-12-09  4:40   ` John Hubbard
  2020-12-09 13:44     ` Joao Martins
       [not found]   ` <20201208194905.GQ5487@ziepe.ca>
  1 sibling, 1 reply; 67+ messages in thread
From: John Hubbard @ 2020-12-09  4:40 UTC (permalink / raw)
  To: Joao Martins, linux-mm
  Cc: linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe  <jgg@ziepe.ca>,
	Jane Chu <jane.chu@oracle.com>,
	Muchun Song, Mike Kravetz, Andrew

On 12/8/20 9:28 AM, Joao Martins wrote:
> Much like hugetlbfs or THPs, we treat device pagemaps with
> compound pages like the rest of GUP handling of compound pages.
> 
> Rather than incrementing the refcount every 4K, we record
> all sub pages and increment by @refs amount *once*.
> 
> Performance measured by gup_benchmark improves considerably
> get_user_pages_fast() and pin_user_pages_fast():
> 
>   $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-u,-a] -n 512 -w

"gup_test", now that you're in linux-next, actually.

(Maybe I'll retrofit that test with getopt_long(), those options are
getting more elaborate.)
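
A minimal getopt_long() skeleton, just to show the shape (the option names
and letters here are made up for illustration, not the actual gup_test
interface):

#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>

static const struct option long_opts[] = {
	{ "file",    required_argument, NULL, 'f' },
	{ "size-mb", required_argument, NULL, 'm' },
	{ "nr-reps", required_argument, NULL, 'r' },
	{ NULL, 0, NULL, 0 },
};

int main(int argc, char **argv)
{
	const char *file = NULL;
	long size_mb = 0, reps = 1;
	int c;

	while ((c = getopt_long(argc, argv, "f:m:r:", long_opts, NULL)) != -1) {
		switch (c) {
		case 'f': file = optarg; break;
		case 'm': size_mb = strtol(optarg, NULL, 0); break;
		case 'r': reps = strtol(optarg, NULL, 0); break;
		default:  exit(EXIT_FAILURE);
		}
	}
	printf("file=%s size=%ldMB reps=%ld\n",
	       file ? file : "(none)", size_mb, reps);
	return 0;
}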

> 
> (get_user_pages_fast 2M pages) ~75k us -> ~3.6k us
> (pin_user_pages_fast 2M pages) ~125k us -> ~3.8k us

That is a beautiful result! I'm very motivated to see if this patchset
can make it in, in some form.

> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   mm/gup.c | 67 ++++++++++++++++++++++++++++++++++++++++++--------------
>   1 file changed, 51 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 98eb8e6d2609..194e6981eb03 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2250,22 +2250,68 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>   }
>   #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>   
> +
> +static int record_subpages(struct page *page, unsigned long addr,
> +			   unsigned long end, struct page **pages)
> +{
> +	int nr;
> +
> +	for (nr = 0; addr != end; addr += PAGE_SIZE)
> +		pages[nr++] = page++;
> +
> +	return nr;
> +}
> +
>   #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> -static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> -			     unsigned long end, unsigned int flags,
> -			     struct page **pages, int *nr)
> +static int __gup_device_compound_huge(struct dev_pagemap *pgmap,
> +				      struct page *head, unsigned long sz,

If this variable survives (I see Jason requested a reorg of this math stuff,
and I also like that idea), then I'd like a slightly better name for "sz".

I was going to suggest one, but then realized that I can't understand how this
works. See below...

> +				      unsigned long addr, unsigned long end,
> +				      unsigned int flags, struct page **pages)
> +{
> +	struct page *page;
> +	int refs;
> +
> +	if (!(pgmap->flags & PGMAP_COMPOUND))
> +		return -1;

btw, I'm unhappy with returning -1 here and assigning it later to a refs variable.
(And that will show up even more clearly as an issue if you attempt to make
refs unsigned everywhere!)

I'm not going to suggest anything because there are a lot of ways to structure
these routines, and I don't want to overly constrain you. Just please don't assign
negative values to any refs variables.

> +
> +	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);

If you pass in PMD_SHIFT or PUD_SHIFT for sz, that's a number-of-bits, isn't it?
Not a size. And if it's not a size, then sz - 1 doesn't work, does it? If it
does work, then better naming might help. I'm probably missing a really
obvious math trick here.


thanks,
-- 
John Hubbard
NVIDIA

> +	refs = record_subpages(page, addr, end, pages);
> +
> +	SetPageReferenced(page);
> +	head = try_grab_compound_head(head, refs, flags);
> +	if (!head) {
> +		ClearPageReferenced(page);
> +		return 0;
> +	}
> +
> +	return refs;
> +}
> +
> +static int __gup_device_huge(unsigned long pfn, unsigned long sz,
> +			     unsigned long addr, unsigned long end,
> +			     unsigned int flags, struct page **pages, int *nr)
>   {
>   	int nr_start = *nr;
>   	struct dev_pagemap *pgmap = NULL;
>   
>   	do {
>   		struct page *page = pfn_to_page(pfn);
> +		int refs;
>   
>   		pgmap = get_dev_pagemap(pfn, pgmap);
>   		if (unlikely(!pgmap)) {
>   			undo_dev_pagemap(nr, nr_start, flags, pages);
>   			return 0;
>   		}
> +
> +		refs = __gup_device_compound_huge(pgmap, page, sz, addr, end,
> +						  flags, pages + *nr);
> +		if (refs >= 0) {
> +			*nr += refs;
> +			put_dev_pagemap(pgmap);
> +			return refs ? 1 : 0;
> +		}
> +
>   		SetPageReferenced(page);
>   		pages[*nr] = page;
>   		if (unlikely(!try_grab_page(page, flags))) {
> @@ -2289,7 +2335,7 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>   	int nr_start = *nr;
>   
>   	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> -	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
> +	if (!__gup_device_huge(fault_pfn, PMD_SHIFT, addr, end, flags, pages, nr))
>   		return 0;
>   
>   	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> @@ -2307,7 +2353,7 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>   	int nr_start = *nr;
>   
>   	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> -	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
> +	if (!__gup_device_huge(fault_pfn, PUD_SHIFT, addr, end, flags, pages, nr))
>   		return 0;
>   
>   	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> @@ -2334,17 +2380,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
>   }
>   #endif
>   
> -static int record_subpages(struct page *page, unsigned long addr,
> -			   unsigned long end, struct page **pages)
> -{
> -	int nr;
> -
> -	for (nr = 0; addr != end; addr += PAGE_SIZE)
> -		pages[nr++] = page++;
> -
> -	return nr;
> -}
> -
>   #ifdef CONFIG_ARCH_HAS_HUGEPD
>   static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
>   				      unsigned long sz)
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 7/9] mm/gup: Decrement head page once for group of subpages
       [not found]   ` <20201208193446.GP5487@ziepe.ca>
@ 2020-12-09  5:06     ` John Hubbard
  2020-12-09 12:17     ` Joao Martins
  2020-12-17 19:05     ` Joao Martins
  2 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2020-12-09  5:06 UTC (permalink / raw)
  To: Jason Gunthorpe, Joao Martins, Daniel Jordan
  Cc: linux-mm, linux-nvdimm, Matthew Wilcox, Muchun Song,
	Mike Kravetz, Andrew

On 12/8/20 11:34 AM, Jason Gunthorpe wrote:
> On Tue, Dec 08, 2020 at 05:28:59PM +0000, Joao Martins wrote:
>> Rather than decrementing the ref count one by one, we
>> walk the page array and checking which belong to the same
>> compound_head. Later on we decrement the calculated amount
>> of references in a single write to the head page.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>   mm/gup.c | 41 ++++++++++++++++++++++++++++++++---------
>>   1 file changed, 32 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 194e6981eb03..3a9a7229f418 100644
>> +++ b/mm/gup.c
>> @@ -212,6 +212,18 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
>>   }
>>   #endif /* CONFIG_DEV_PAGEMAP_OPS */
>>   
>> +static int record_refs(struct page **pages, int npages)
>> +{
>> +	struct page *head = compound_head(pages[0]);
>> +	int refs = 1, index;
>> +
>> +	for (index = 1; index < npages; index++, refs++)
>> +		if (compound_head(pages[index]) != head)
>> +			break;
>> +
>> +	return refs;
>> +}
>> +
>>   /**
>>    * unpin_user_page() - release a dma-pinned page
>>    * @page:            pointer to page to be released
>> @@ -221,9 +233,9 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
>>    * that such pages can be separately tracked and uniquely handled. In
>>    * particular, interactions with RDMA and filesystems need special handling.
>>    */
>> -void unpin_user_page(struct page *page)
>> +static void __unpin_user_page(struct page *page, int refs)
> 
> Refs should be unsigned everywhere.

That's fine (although, see my comments in the previous patch for
pitfalls). But it should be a preparatory patch, in order to avoid
clouding up this one and your others as well.


> 
> I suggest using clear language 'page' here should always be a compound
> head called 'head' (or do we have another common variable name for
> this?)
> 

Agreed. Matthew's struct folio upgrade will allow us to really make
things clear in a typesafe way, but meanwhile, it's probably good to use
one of the following patterns:

page = compound_head(page); // at the very beginning of a routine

or

do_things_to_this_single_page(page);

head = compound_head(page);
do_things_to_this_compound_page(head);


> 'refs' is number of tail pages within the compound, so 'ntails' or
> something
> 

I think it's OK to leave it as "refs", because within gup.c, refs has
a very particular meaning. But if you change to ntails or something, I'd
want to see a complete change: no leftovers of refs that are really ntails.

So far I'd rather leave it as refs, but it's not a big deal either way.

>>   {
>> -	int refs = 1;
>> +	int orig_refs = refs;
>>   
>>   	page = compound_head(page);
> 
> Caller should always do this
> 
>> @@ -237,14 +249,19 @@ void unpin_user_page(struct page *page)
>>   		return;
>>   
>>   	if (hpage_pincount_available(page))
>> -		hpage_pincount_sub(page, 1);
>> +		hpage_pincount_sub(page, refs);

Maybe a nice touch would be to pass in orig_refs, because there
is no intention to use a possibly modified refs. So:

		hpage_pincount_sub(page, orig_refs);

...obviously a fine point, I realize. :)

>>   	else
>> -		refs = GUP_PIN_COUNTING_BIAS;
>> +		refs *= GUP_PIN_COUNTING_BIAS;
>>   
>>   	if (page_ref_sub_and_test(page, refs))
>>   		__put_page(page);
>>   
>> -	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
>> +	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, orig_refs);
>> +}
> 
> And really this should be placed directly after
> try_grab_compound_head() and be given a similar name
> 'unpin_compound_head()'. Even better would be to split the FOLL_PIN
> part into a function so there was a clear logical pairing.
> 
> And reviewing it like that I want to ask if this unpin sequence is in
> the right order.. I would expect it to be the reverse order of the get
> 
> John?
> 
> Is it safe to call mod_node_page_state() after releasing the refcount?
> This could race with hot-unplugging the struct pages so I think it is
> wrong.

Yes, I think you are right! I wasn't in a hot unplug state of mind when I
thought about the ordering there, but I should have been. :)
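
In other words, roughly this ordering instead (an illustrative rearrangement
of the hunk quoted above, not a tested patch):

	if (hpage_pincount_available(page))
		hpage_pincount_sub(page, refs);
	else
		refs *= GUP_PIN_COUNTING_BIAS;

	/* account the unpin while the pages are still guaranteed to exist */
	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, orig_refs);

	if (page_ref_sub_and_test(page, refs))
		__put_page(page);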

> 
>> +void unpin_user_page(struct page *page)
>> +{
>> +	__unpin_user_page(page, 1);
> 
> Thus this is
> 
> 	__unpin_user_page(compound_head(page), 1);
> 
>> @@ -274,6 +291,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>   				 bool make_dirty)
>>   {
>>   	unsigned long index;
>> +	int refs = 1;
>>   
>>   	/*
>>   	 * TODO: this can be optimized for huge pages: if a series of pages is

I think you can delete this TODO block now, and the one in unpin_user_pages_dirty_lock(),
as a result of these changes.

>> @@ -286,8 +304,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>   		return;
>>   	}
>>   
>> -	for (index = 0; index < npages; index++) {
>> +	for (index = 0; index < npages; index += refs) {
>>   		struct page *page = compound_head(pages[index]);
>> +
> 
> I think this is really hard to read, it should end up as some:
> 
> for_each_compond_head(page_list, page_list_len, &head, &ntails) {
>         		if (!PageDirty(head))
> 			set_page_dirty_lock(head, ntails);
> 		unpin_user_page(head, ntails);
> }
> 
> And maybe you open code that iteration, but that basic idea to find a
> compound_head and ntails should be computational work performed.
> 
> No reason not to fix set_page_dirty_lock() too while you are here.

Eh? What's wrong with set_page_dirty_lock()?
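
For reference, an open-coded version of the head/ntails walk Jason sketches
above, in the context of unpin_user_pages_dirty_lock(), might look roughly
like this (illustrative only, untested; __unpin_user_page() is the helper
added earlier in this series):

	unsigned long index;
	unsigned int ntails;

	for (index = 0; index < npages; index += ntails) {
		struct page *head = compound_head(pages[index]);

		for (ntails = 1; index + ntails < npages; ntails++)
			if (compound_head(pages[index + ntails]) != head)
				break;

		if (make_dirty && !PageDirty(head))
			set_page_dirty_lock(head);
		__unpin_user_page(head, ntails);
	}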

> 
> Also, this patch and the next can be completely independent of the
> rest of the series, it is valuable regardless of the other tricks. You
> can split them and progress them independently.
> 
> .. and I was just talking about this with Daniel Jordan and some other
> people at your company :)
> 
> Thanks,
> Jason
> 

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 8/9] RDMA/umem: batch page unpin in __ib_mem_release()
  2020-12-08 17:29 ` [PATCH RFC 8/9] RDMA/umem: batch page unpin in __ib_mem_release() Joao Martins
@ 2020-12-09  5:18   ` John Hubbard
       [not found]   ` <20201208192935.GA1908088@ziepe.ca>
  1 sibling, 0 replies; 67+ messages in thread
From: John Hubbard @ 2020-12-09  5:18 UTC (permalink / raw)
  To: Joao Martins, linux-mm
  Cc: linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe  <jgg@ziepe.ca>,
	Jane Chu <jane.chu@oracle.com>,
	Muchun Song, Mike Kravetz, Andrew

On 12/8/20 9:29 AM, Joao Martins wrote:
> Take advantage of the newly added unpin_user_pages() batched
> refcount update, by building a page array from an SGL
> (same size as the one used in ib_umem_get()) and calling
> unpin_user_pages() with that.
> 
> unpin_user_pages() will check for consecutive pages that belong
> to the same compound page and batch the refcount update in
> a single write.
> 
> Running a test program which calls MR reg/unreg on a 1G region
> and measures the cost of both operations together (in a guest
> using rxe) with device-dax and hugetlbfs:
> 
> Before:
> 159 rounds in 5.027 sec: 31617.923 usec / round (device-dax)
> 466 rounds in 5.009 sec: 10748.456 usec / round (hugetlbfs)
> 
> After:
>   305 rounds in 5.010 sec: 16426.047 usec / round (device-dax)
> 1073 rounds in 5.004 sec: 4663.622 usec / round (hugetlbfs)
> 
> We also see similar improvements on a setup with pmem and RDMA hardware.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   drivers/infiniband/core/umem.c | 25 ++++++++++++++++++++++---
>   1 file changed, 22 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index e9fecbdf391b..493cfdcf7381 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -44,20 +44,40 @@
>   
>   #include "uverbs.h"
>   
> +#define PAGES_PER_LIST (PAGE_SIZE / sizeof(struct page *))

I was going to maybe suggest that this item, and the "bool make_dirty" cleanup,
be a separate patch, because they are just cleanups. But the memory allocation issue
below might make that whole (minor) point obsolete.

> +
>   static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
>   {
> +	bool make_dirty = umem->writable && dirty;
> +	struct page **page_list = NULL;
>   	struct sg_page_iter sg_iter;
> +	unsigned long nr = 0;
>   	struct page *page;
>   
> +	page_list = (struct page **) __get_free_page(GFP_KERNEL);

Yeah, allocating memory in a free/release path is not good. btw, for future use,
I see that kmalloc() is generally recommended these days (that's a change), when
you want a pointer to storage, as opposed to wanting struct pages:

https://lore.kernel.org/lkml/CA+55aFwyxJ+TOpaJZnC5MPJ-25xbLAEu8iJP8zTYhmA3LXFF8Q@mail.gmail.com/
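
A minimal sketch of that shape (illustrative only; same PAGE_SIZE worth of
pointers as in the patch, error handling elided):

	struct page **page_list;

	page_list = kmalloc_array(PAGES_PER_LIST, sizeof(*page_list), GFP_KERNEL);
	...
	kfree(page_list);	/* rather than free_page() */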

> +
>   	if (umem->nmap > 0)
>   		ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
>   				DMA_BIDIRECTIONAL);
>   
>   	for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
>   		page = sg_page_iter_page(&sg_iter);
> -		unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
> +		if (page_list)
> +			page_list[nr++] = page;
> +
> +		if (!page_list) {
> +			unpin_user_pages_dirty_lock(&page, 1, make_dirty);
> +		} else if (nr == PAGES_PER_LIST) {
> +			unpin_user_pages_dirty_lock(page_list, nr, make_dirty);
> +			nr = 0;
> +		}
>   	}
>   
> +	if (nr)
> +		unpin_user_pages_dirty_lock(page_list, nr, make_dirty);
> +
> +	if (page_list)
> +		free_page((unsigned long) page_list);
>   	sg_free_table(&umem->sg_head);
>   }
>   
> @@ -212,8 +232,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
>   		cond_resched();
>   		ret = pin_user_pages_fast(cur_base,
>   					  min_t(unsigned long, npages,
> -						PAGE_SIZE /
> -						sizeof(struct page *)),
> +						PAGES_PER_LIST),
>   					  gup_flags | FOLL_LONGTERM, page_list);
>   		if (ret < 0)
>   			goto umem_release;
> 

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 9/9] mm: Add follow_devmap_page() for devdax vmas
  2020-12-08 17:29 ` [PATCH RFC 9/9] mm: Add follow_devmap_page() for devdax vmas Joao Martins
@ 2020-12-09  5:23   ` John Hubbard
       [not found]   ` <20201208195754.GR5487@ziepe.ca>
  1 sibling, 0 replies; 67+ messages in thread
From: John Hubbard @ 2020-12-09  5:23 UTC (permalink / raw)
  To: Joao Martins, linux-mm
  Cc: linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe  <jgg@ziepe.ca>,
	Jane Chu <jane.chu@oracle.com>,
	Muchun Song, Mike Kravetz, Andrew

On 12/8/20 9:29 AM, Joao Martins wrote:
> Similar to follow_hugetlb_page(), add a follow_devmap_page() which,
> rather than calling follow_page() per 4K page in a PMD/PUD, does so
> for the entire PMD/PUD: we lock the pmd/pud, get all the pages, unlock.
> 
> While doing so, we only change the refcount once when PGMAP_COMPOUND is
> passed in.
> 
> This lets us improve {pin,get}_user_pages{,_longterm}() considerably:
> 
> $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-U,-b,-L] -n 512 -w
> 
> (<test>) [before] -> [after]
> (get_user_pages 2M pages) ~150k us -> ~8.9k us
> (pin_user_pages 2M pages) ~192k us -> ~9k us
> (pin_user_pages_longterm 2M pages) ~200k us -> ~19k us
>

Yes, this is a massive improvement.


> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
> I've special-cased this to device-dax vmas, given their page size
> guarantees similar to hugetlbfs, but I feel this is a bit wrong. I am
> replicating follow_hugetlb_page(), and as an RFC this ought to seek
> feedback on whether it should be generalized if no fundamental issues
> exist. In such a case, should I be changing follow_page_mask() to take
> either an array of pages, or a function pointer and opaque arguments
> which would let the caller pick its structure?

I don't know which approach is better because I haven't yet attempted to
find the common elements in these routines. But if there is *any way* to
avoid this copy-paste creation of yet more following of pages, then it's
*very* good to do.

thanks,
-- 
John Hubbard
NVIDIA

> ---
>   include/linux/huge_mm.h |   4 +
>   include/linux/mm.h      |   2 +
>   mm/gup.c                |  22 ++++-
>   mm/huge_memory.c        | 202 ++++++++++++++++++++++++++++++++++++++++
>   4 files changed, 227 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 0365aa97f8e7..da87ecea19e6 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -293,6 +293,10 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>   		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
>   struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>   		pud_t *pud, int flags, struct dev_pagemap **pgmap);
> +long follow_devmap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			struct page **pages, struct vm_area_struct **vmas,
> +			unsigned long *position, unsigned long *nr_pages,
> +			long i, unsigned int flags, int *locked);
>   
>   extern vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd);
>   
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8b0155441835..466c88679628 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1164,6 +1164,8 @@ static inline void get_page(struct page *page)
>   	page_ref_inc(page);
>   }
>   
> +__maybe_unused struct page *try_grab_compound_head(struct page *page, int refs,
> +						   unsigned int flags);
>   bool __must_check try_grab_page(struct page *page, unsigned int flags);
>   
>   static inline __must_check bool try_get_page(struct page *page)
> diff --git a/mm/gup.c b/mm/gup.c
> index 3a9a7229f418..50effb9cc349 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -78,7 +78,7 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
>    * considered failure, and furthermore, a likely bug in the caller, so a warning
>    * is also emitted.
>    */
> -static __maybe_unused struct page *try_grab_compound_head(struct page *page,
> +__maybe_unused struct page *try_grab_compound_head(struct page *page,
>   							  int refs,
>   							  unsigned int flags)
>   {
> @@ -880,8 +880,8 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
>    * does not include FOLL_NOWAIT, the mmap_lock may be released.  If it
>    * is, *@locked will be set to 0 and -EBUSY returned.
>    */
> -static int faultin_page(struct vm_area_struct *vma,
> -		unsigned long address, unsigned int *flags, int *locked)
> +int faultin_page(struct vm_area_struct *vma,
> +		 unsigned long address, unsigned int *flags, int *locked)
>   {
>   	unsigned int fault_flags = 0;
>   	vm_fault_t ret;
> @@ -1103,6 +1103,22 @@ static long __get_user_pages(struct mm_struct *mm,
>   				}
>   				continue;
>   			}
> +			if (vma_is_dax(vma)) {
> +				i = follow_devmap_page(mm, vma, pages, vmas,
> +						       &start, &nr_pages, i,
> +						       gup_flags, locked);
> +				if (locked && *locked == 0) {
> +					/*
> +					 * We've got a VM_FAULT_RETRY
> +					 * and we've lost mmap_lock.
> +					 * We must stop here.
> +					 */
> +					BUG_ON(gup_flags & FOLL_NOWAIT);
> +					BUG_ON(ret != 0);
> +					goto out;
> +				}
> +				continue;
> +			}
>   		}
>   retry:
>   		/*
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ec2bb93f7431..20bfbf211dc3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1168,6 +1168,208 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>   	return page;
>   }
>   
> +long follow_devmap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			struct page **pages, struct vm_area_struct **vmas,
> +			unsigned long *position, unsigned long *nr_pages,
> +			long i, unsigned int flags, int *locked)
> +{
> +	unsigned long pfn_offset;
> +	unsigned long vaddr = *position;
> +	unsigned long remainder = *nr_pages;
> +	unsigned long align = vma_kernel_pagesize(vma);
> +	unsigned long align_nr_pages = align >> PAGE_SHIFT;
> +	unsigned long mask = ~(align-1);
> +	unsigned long nr_pages_hpage = 0;
> +	struct dev_pagemap *pgmap = NULL;
> +	int err = -EFAULT;
> +
> +	if (align == PAGE_SIZE)
> +		return i;
> +
> +	while (vaddr < vma->vm_end && remainder) {
> +		pte_t *pte;
> +		spinlock_t *ptl = NULL;
> +		int absent;
> +		struct page *page;
> +
> +		/*
> +		 * If we have a pending SIGKILL, don't keep faulting pages and
> +		 * potentially allocating memory.
> +		 */
> +		if (fatal_signal_pending(current)) {
> +			remainder = 0;
> +			break;
> +		}
> +
> +		/*
> +		 * Some archs (sparc64, sh*) have multiple pte_ts to
> +		 * each hugepage.  We have to make sure we get the
> +		 * first, for the page indexing below to work.
> +		 *
> +		 * Note that page table lock is not held when pte is null.
> +		 */
> +		pte = huge_pte_offset(mm, vaddr & mask, align);
> +		if (pte) {
> +			if (align == PMD_SIZE)
> +				ptl = pmd_lockptr(mm, (pmd_t *) pte);
> +			else if (align == PUD_SIZE)
> +				ptl = pud_lockptr(mm, (pud_t *) pte);
> +			spin_lock(ptl);
> +		}
> +		absent = !pte || pte_none(ptep_get(pte));
> +
> +		if (absent && (flags & FOLL_DUMP)) {
> +			if (pte)
> +				spin_unlock(ptl);
> +			remainder = 0;
> +			break;
> +		}
> +
> +		if (absent ||
> +		    ((flags & FOLL_WRITE) &&
> +		      !pte_write(ptep_get(pte)))) {
> +			vm_fault_t ret;
> +			unsigned int fault_flags = 0;
> +
> +			if (pte)
> +				spin_unlock(ptl);
> +			if (flags & FOLL_WRITE)
> +				fault_flags |= FAULT_FLAG_WRITE;
> +			if (locked)
> +				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
> +					FAULT_FLAG_KILLABLE;
> +			if (flags & FOLL_NOWAIT)
> +				fault_flags |= FAULT_FLAG_ALLOW_RETRY |
> +					FAULT_FLAG_RETRY_NOWAIT;
> +			if (flags & FOLL_TRIED) {
> +				/*
> +				 * Note: FAULT_FLAG_ALLOW_RETRY and
> +				 * FAULT_FLAG_TRIED can co-exist
> +				 */
> +				fault_flags |= FAULT_FLAG_TRIED;
> +			}
> +			ret = handle_mm_fault(vma, vaddr, flags, NULL);
> +			if (ret & VM_FAULT_ERROR) {
> +				err = vm_fault_to_errno(ret, flags);
> +				remainder = 0;
> +				break;
> +			}
> +			if (ret & VM_FAULT_RETRY) {
> +				if (locked &&
> +				    !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
> +					*locked = 0;
> +				*nr_pages = 0;
> +				/*
> +				 * VM_FAULT_RETRY must not return an
> +				 * error, it will return zero
> +				 * instead.
> +				 *
> +				 * No need to update "position" as the
> +				 * caller will not check it after
> +				 * *nr_pages is set to 0.
> +				 */
> +				return i;
> +			}
> +			continue;
> +		}
> +
> +		pfn_offset = (vaddr & ~mask) >> PAGE_SHIFT;
> +		page = pte_page(ptep_get(pte));
> +
> +		pgmap = get_dev_pagemap(page_to_pfn(page), pgmap);
> +		if (!pgmap) {
> +			spin_unlock(ptl);
> +			remainder = 0;
> +			err = -EFAULT;
> +			break;
> +		}
> +
> +		/*
> +		 * If subpage information not requested, update counters
> +		 * and skip the same_page loop below.
> +		 */
> +		if (!pages && !vmas && !pfn_offset &&
> +		    (vaddr + align < vma->vm_end) &&
> +		    (remainder >= (align_nr_pages))) {
> +			vaddr += align;
> +			remainder -= align_nr_pages;
> +			i += align_nr_pages;
> +			spin_unlock(ptl);
> +			continue;
> +		}
> +
> +		nr_pages_hpage = 0;
> +
> +same_page:
> +		if (pages) {
> +			pages[i] = mem_map_offset(page, pfn_offset);
> +
> +			/*
> +			 * try_grab_page() should always succeed here, because:
> +			 * a) we hold the ptl lock, and b) we've just checked
> +			 * that the huge page is present in the page tables.
> +			 */
> +			if (!(pgmap->flags & PGMAP_COMPOUND) &&
> +			    WARN_ON_ONCE(!try_grab_page(pages[i], flags))) {
> +				spin_unlock(ptl);
> +				remainder = 0;
> +				err = -ENOMEM;
> +				break;
> +			}
> +
> +		}
> +
> +		if (vmas)
> +			vmas[i] = vma;
> +
> +		vaddr += PAGE_SIZE;
> +		++pfn_offset;
> +		--remainder;
> +		++i;
> +		nr_pages_hpage++;
> +		if (vaddr < vma->vm_end && remainder &&
> +				pfn_offset < align_nr_pages) {
> +			/*
> +			 * We use pfn_offset to avoid touching the pageframes
> +			 * of this compound page.
> +			 */
> +			goto same_page;
> +		} else {
> +			/*
> +			 * try_grab_compound_head() should always succeed here,
> +			 * because: a) we hold the ptl lock, and b) we've just
> +			 * checked that the huge page is present in the page
> +			 * tables. If the huge page is present, then the tail
> +			 * pages must also be present. The ptl prevents the
> +			 * head page and tail pages from being rearranged in
> +			 * any way. So this page must be available at this
> +			 * point, unless the page refcount overflowed:
> +			 */
> +			if ((pgmap->flags & PGMAP_COMPOUND) &&
> +			    WARN_ON_ONCE(!try_grab_compound_head(pages[i-1],
> +								 nr_pages_hpage,
> +								 flags))) {
> +				put_dev_pagemap(pgmap);
> +				spin_unlock(ptl);
> +				remainder = 0;
> +				err = -ENOMEM;
> +				break;
> +			}
> +			put_dev_pagemap(pgmap);
> +		}
> +		spin_unlock(ptl);
> +	}
> +	*nr_pages = remainder;
> +	/*
> +	 * setting position is actually required only if remainder is
> +	 * not zero but it's faster not to add a "if (remainder)"
> +	 * branch.
> +	 */
> +	*position = vaddr;
> +
> +	return i ? i : err;
> +}
> +
>   int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   		  pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
>   		  struct vm_area_struct *vma)
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
  2020-12-08 17:28 ` [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages Joao Martins
@ 2020-12-09  5:59   ` John Hubbard
  2020-12-09  6:33     ` Matthew Wilcox
  2021-02-20  1:43     ` Dan Williams
  2021-02-20  1:24   ` Dan Williams
  1 sibling, 2 replies; 67+ messages in thread
From: John Hubbard @ 2020-12-09  5:59 UTC (permalink / raw)
  To: Joao Martins, linux-mm
  Cc: linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe  <jgg@ziepe.ca>,
	Jane Chu <jane.chu@oracle.com>,
	Muchun Song, Mike Kravetz, Andrew

On 12/8/20 9:28 AM, Joao Martins wrote:
> Add a new flag for struct dev_pagemap which designates that a a pagemap

a a

> is described as a set of compound pages or in other words, that how
> pages are grouped together in the page tables are reflected in how we
> describe struct pages. This means that rather than initializing
> individual struct pages, we also initialize these struct pages, as

Let's not say "rather than x, we also do y", because it's self-contradictory.
I think you want to just leave out the "also", like this:

"This means that rather than initializing> individual struct pages, we
initialize these struct pages ..."

Is that right?

> compound pages (on x86: 2M or 1G compound pages)
> 
> For certain ZONE_DEVICE users, like device-dax, which have a fixed page
> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
> thus playing the same tricks as hugetlb pages.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   include/linux/memremap.h | 2 ++
>   mm/memremap.c            | 8 ++++++--
>   mm/page_alloc.c          | 7 +++++++
>   3 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 79c49e7f5c30..f8f26b2cc3da 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -90,6 +90,7 @@ struct dev_pagemap_ops {
>   };
>   
>   #define PGMAP_ALTMAP_VALID	(1 << 0)
> +#define PGMAP_COMPOUND		(1 << 1)
>   
>   /**
>    * struct dev_pagemap - metadata for ZONE_DEVICE mappings
> @@ -114,6 +115,7 @@ struct dev_pagemap {
>   	struct completion done;
>   	enum memory_type type;
>   	unsigned int flags;
> +	unsigned int align;

This also needs an "@align" entry in the comment block above.

>   	const struct dev_pagemap_ops *ops;
>   	void *owner;
>   	int nr_range;
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 16b2fb482da1..287a24b7a65a 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>   	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>   				PHYS_PFN(range->start),
>   				PHYS_PFN(range_len(range)), pgmap);
> -	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> -			- pfn_first(pgmap, range_id));
> +	if (pgmap->flags & PGMAP_COMPOUND)
> +		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> +			- pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));

Is there some reason that we cannot use range_len(), instead of pfn_end() minus
pfn_first()? (Yes, this is more about the pre-existing code than about your change.)

And if not, then why are the nearby range_len() uses OK? I realize that range_len()
is simpler and skips a case, but it's not clear that it's required here. But I'm
new to this area so be warned. :)

Also, dividing by PHYS_PFN() feels quite misleading: that function does what you
happen to want, but is not named accordingly. Can you use or create something
more accurately named? Like "number of pages in this large page"?
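
For example, something along these lines (hypothetical helper, name picked
purely for illustration):

	static inline unsigned long pgmap_pfns_per_compound(struct dev_pagemap *pgmap)
	{
		/* number of base pages in each compound page of this pgmap */
		return PHYS_PFN(pgmap->align);
	}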

> +	else
> +		percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> +				- pfn_first(pgmap, range_id));
>   	return 0;
>   
>   err_add_memory:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index eaa227a479e4..9716ecd58e29 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6116,6 +6116,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
>   	unsigned long pfn, end_pfn = start_pfn + nr_pages;
>   	struct pglist_data *pgdat = zone->zone_pgdat;
>   	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
> +	bool compound = pgmap->flags & PGMAP_COMPOUND;
> +	unsigned int align = PHYS_PFN(pgmap->align);

Maybe align_pfn or pfn_align? Don't want the same name for things that are actually
different types, in meaning anyway.


>   	unsigned long zone_idx = zone_idx(zone);
>   	unsigned long start = jiffies;
>   	int nid = pgdat->node_id;
> @@ -6171,6 +6173,11 @@ void __ref memmap_init_zone_device(struct zone *zone,
>   		}
>   	}
>   
> +	if (compound) {
> +		for (pfn = start_pfn; pfn < end_pfn; pfn += align)
> +			prep_compound_page(pfn_to_page(pfn), order_base_2(align));
> +	}
> +
>   	pr_info("%s initialised %lu pages in %ums\n", __func__,
>   		nr_pages, jiffies_to_msecs(jiffies - start));
>   }
> 

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 2/9] sparse-vmemmap: Consolidate arguments in vmemmap section populate
  2020-12-08 17:28 ` [PATCH RFC 2/9] sparse-vmemmap: Consolidate arguments in vmemmap section populate Joao Martins
@ 2020-12-09  6:16   ` John Hubbard
  2020-12-09 13:51     ` Joao Martins
  2021-02-20  1:49   ` Dan Williams
  1 sibling, 1 reply; 67+ messages in thread
From: John Hubbard @ 2020-12-09  6:16 UTC (permalink / raw)
  To: Joao Martins, linux-mm
  Cc: linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe  <jgg@ziepe.ca>,
	Jane Chu <jane.chu@oracle.com>,
	Muchun Song, Mike Kravetz, Andrew

On 12/8/20 9:28 AM, Joao Martins wrote:
> Replace vmem_altmap with an vmem_context argument. That let us
> express how the vmemmap is gonna be initialized e.g. passing
> flags and a page size for reusing pages upon initializing the
> vmemmap.

How about this instead:

Replace the vmem_altmap argument with a vmem_context argument that
contains vmem_altmap for now. Subsequent patches will add additional
member elements to vmem_context, such as flags and page size.

No behavior changes are intended.

?

> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   include/linux/memory_hotplug.h |  6 +++++-
>   include/linux/mm.h             |  2 +-
>   mm/memory_hotplug.c            |  3 ++-
>   mm/sparse-vmemmap.c            |  6 +++++-
>   mm/sparse.c                    | 16 ++++++++--------
>   5 files changed, 21 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 551093b74596..73f8bcbb58a4 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -81,6 +81,10 @@ struct mhp_params {
>   	pgprot_t pgprot;
>   };
>   
> +struct vmem_context {
> +	struct vmem_altmap *altmap;
> +};
> +
>   /*
>    * Zone resizing functions
>    *
> @@ -353,7 +357,7 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
>   				       unsigned long nr_pages);
>   extern bool is_memblock_offlined(struct memory_block *mem);
>   extern int sparse_add_section(int nid, unsigned long pfn,
> -		unsigned long nr_pages, struct vmem_altmap *altmap);
> +		unsigned long nr_pages, struct vmem_context *ctx);
>   extern void sparse_remove_section(struct mem_section *ms,
>   		unsigned long pfn, unsigned long nr_pages,
>   		unsigned long map_offset, struct vmem_altmap *altmap);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index db6ae4d3fb4e..2eb44318bb2d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3000,7 +3000,7 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
>   
>   void *sparse_buffer_alloc(unsigned long size);
>   struct page * __populate_section_memmap(unsigned long pfn,
> -		unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
> +		unsigned long nr_pages, int nid, struct vmem_context *ctx);
>   pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
>   p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>   pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 63b2e46b6555..f8870c53fe5e 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -313,6 +313,7 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
>   	unsigned long cur_nr_pages;
>   	int err;
>   	struct vmem_altmap *altmap = params->altmap;
> +	struct vmem_context ctx = { .altmap = params->altmap };

OK, so this is the one place I can see where ctx is set up. And it's never null.
Let's remember that point...

>   
>   	if (WARN_ON_ONCE(!params->pgprot.pgprot))
>   		return -EINVAL;
> @@ -341,7 +342,7 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
>   		/* Select all remaining pages up to the next section boundary */
>   		cur_nr_pages = min(end_pfn - pfn,
>   				   SECTION_ALIGN_UP(pfn + 1) - pfn);
> -		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap);
> +		err = sparse_add_section(nid, pfn, cur_nr_pages, &ctx);
>   		if (err)
>   			break;
>   		cond_resched();
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 16183d85a7d5..bcda68ba1381 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -249,15 +249,19 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>   }
>   
>   struct page * __meminit __populate_section_memmap(unsigned long pfn,
> -		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
> +		unsigned long nr_pages, int nid, struct vmem_context *ctx)
>   {
>   	unsigned long start = (unsigned long) pfn_to_page(pfn);
>   	unsigned long end = start + nr_pages * sizeof(struct page);
> +	struct vmem_altmap *altmap = NULL;
>   
>   	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
>   		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>   		return NULL;
>   
> +	if (ctx)

But...ctx can never be null, right?

I didn't spot any other issues, though.

thanks,
-- 
John Hubbard
NVIDIA

> +		altmap = ctx->altmap;
> +
>   	if (vmemmap_populate(start, end, nid, altmap))
>   		return NULL;
>   
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 7bd23f9d6cef..47ca494398a7 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -443,7 +443,7 @@ static unsigned long __init section_map_size(void)
>   }
>   
>   struct page __init *__populate_section_memmap(unsigned long pfn,
> -		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
> +		unsigned long nr_pages, int nid, struct vmem_context *ctx)
>   {
>   	unsigned long size = section_map_size();
>   	struct page *map = sparse_buffer_alloc(size);
> @@ -648,9 +648,9 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
>   
>   #ifdef CONFIG_SPARSEMEM_VMEMMAP
>   static struct page * __meminit populate_section_memmap(unsigned long pfn,
> -		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
> +		unsigned long nr_pages, int nid, struct vmem_context *ctx)
>   {
> -	return __populate_section_memmap(pfn, nr_pages, nid, altmap);
> +	return __populate_section_memmap(pfn, nr_pages, nid, ctx);
>   }
>   
>   static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
> @@ -842,7 +842,7 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>   }
>   
>   static struct page * __meminit section_activate(int nid, unsigned long pfn,
> -		unsigned long nr_pages, struct vmem_altmap *altmap)
> +		unsigned long nr_pages, struct vmem_context *ctx)
>   {
>   	struct mem_section *ms = __pfn_to_section(pfn);
>   	struct mem_section_usage *usage = NULL;
> @@ -874,9 +874,9 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
>   	if (nr_pages < PAGES_PER_SECTION && early_section(ms))
>   		return pfn_to_page(pfn);
>   
> -	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap);
> +	memmap = populate_section_memmap(pfn, nr_pages, nid, ctx);
>   	if (!memmap) {
> -		section_deactivate(pfn, nr_pages, altmap);
> +		section_deactivate(pfn, nr_pages, ctx->altmap);
>   		return ERR_PTR(-ENOMEM);
>   	}
>   
> @@ -902,7 +902,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
>    * * -ENOMEM	- Out of memory.
>    */
>   int __meminit sparse_add_section(int nid, unsigned long start_pfn,
> -		unsigned long nr_pages, struct vmem_altmap *altmap)
> +		unsigned long nr_pages, struct vmem_context *ctx)
>   {
>   	unsigned long section_nr = pfn_to_section_nr(start_pfn);
>   	struct mem_section *ms;
> @@ -913,7 +913,7 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
>   	if (ret < 0)
>   		return ret;
>   
> -	memmap = section_activate(nid, start_pfn, nr_pages, altmap);
> +	memmap = section_activate(nid, start_pfn, nr_pages, ctx);
>   	if (IS_ERR(memmap))
>   		return PTR_ERR(memmap);
>   
> 

* Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
  2020-12-09  5:59   ` John Hubbard
@ 2020-12-09  6:33     ` Matthew Wilcox
  2020-12-09 13:12       ` Joao Martins
  2021-02-20  1:43     ` Dan Williams
  1 sibling, 1 reply; 67+ messages in thread
From: Matthew Wilcox @ 2020-12-09  6:33 UTC (permalink / raw)
  To: John Hubbard
  Cc: Joao Martins, linux-mm, linux-nvdimm, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton

On Tue, Dec 08, 2020 at 09:59:19PM -0800, John Hubbard wrote:
> On 12/8/20 9:28 AM, Joao Martins wrote:
> > Add a new flag for struct dev_pagemap which designates that a a pagemap
> 
> a a
> 
> > is described as a set of compound pages or in other words, that how
> > pages are grouped together in the page tables are reflected in how we
> > describe struct pages. This means that rather than initializing
> > individual struct pages, we also initialize these struct pages, as
> 
> Let's not say "rather than x, we also do y", because it's self-contradictory.
> I think you want to just leave out the "also", like this:
> 
> "This means that rather than initializing> individual struct pages, we
> initialize these struct pages ..."
> 
> Is that right?

I'd phrase it as:

Add a new flag for struct dev_pagemap which specifies that a pagemap is
composed of a set of compound pages instead of individual pages.  When
these pages are initialised, most are initialised as tail pages
instead of order-0 pages.

> > For certain ZONE_DEVICE users, like device-dax, which have a fixed page
> > size, this creates an opportunity to optimize GUP and GUP-fast walkers,
> > thus playing the same tricks as hugetlb pages.

Rather than "playing the same tricks", how about "are treated the same
way as THP or hugetlb pages"?

> > +	if (pgmap->flags & PGMAP_COMPOUND)
> > +		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> > +			- pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
> 
> Is there some reason that we cannot use range_len(), instead of pfn_end() minus
> pfn_first()? (Yes, this more about the pre-existing code than about your change.)
> 
> And if not, then why are the nearby range_len() uses OK? I realize that range_len()
> is simpler and skips a case, but it's not clear that it's required here. But I'm
> new to this area so be warned. :)
> 
> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you
> happen to want, but is not named accordingly. Can you use or create something
> more accurately named? Like "number of pages in this large page"?

We have compound_nr(), but that takes a struct page as an argument.
We also have HPAGE_NR_PAGES.  I'm not quite clear what you want.

* Re: [PATCH RFC 9/9] mm: Add follow_devmap_page() for devdax vmas
       [not found]   ` <20201208195754.GR5487@ziepe.ca>
@ 2020-12-09  8:05     ` Christoph Hellwig
  2020-12-09 11:19     ` Joao Martins
  1 sibling, 0 replies; 67+ messages in thread
From: Christoph Hellwig @ 2020-12-09  8:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joao Martins, linux-mm, linux-nvdimm, Matthew Wilcox,
	Muchun Song, Mike Kravetz, Andrew Morton

On Tue, Dec 08, 2020 at 03:57:54PM -0400, Jason Gunthorpe wrote:
> What we've talked about is changing the calling convention across all
> of this to something like:
> 
> struct gup_output {
>    struct page **cur;
>    struct page **end;
>    unsigned long vaddr;
>    [..]
> }
> 
> And making the manipulator like you saw for GUP common:
> 
> gup_output_single_page()
> gup_output_pages()
> 
> Then putting this eveywhere. This is the pattern that we ended up with
> in hmm_range_fault, and it seems to be working quite well.
> 
> fast/slow should be much more symmetric in code than they are today,
> IMHO.. I think those differences mainly exist because it used to be
> siloed in arch code. Some of the differences might be bugs, we've seen
> that a few times at least..

something like this:

http://git.infradead.org/users/hch/misc.git/commitdiff/c3d019802dbde5a4cc4160e7ec8ccba479b19f97

from this old and not fully working series:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/gup-bvec

* Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
  2020-12-08 17:28 [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (9 preceding siblings ...)
  2020-12-08 17:29 ` [PATCH RFC 9/9] mm: Add follow_devmap_page() for devdax vmas Joao Martins
@ 2020-12-09  9:38 ` David Hildenbrand
  2020-12-09  9:52 ` [External] " Muchun Song
  2021-02-20  1:18 ` Dan Williams
  12 siblings, 0 replies; 67+ messages in thread
From: David Hildenbrand @ 2020-12-09  9:38 UTC (permalink / raw)
  To: Joao Martins, linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton

On 08.12.20 18:28, Joao Martins wrote:
> Hey,
> 
> This small series, attempts at minimizing 'struct page' overhead by
> pursuing a similar approach as Muchun Song series "Free some vmemmap
> pages of hugetlb page"[0] but applied to devmap/ZONE_DEVICE. 
> 
> [0] https://lore.kernel.org/linux-mm/20201130151838.11208-1-songmuchun@bytedance.com/
> 
> The link above describes it quite nicely, but the idea is to reuse tail
> page vmemmap areas, particular the area which only describes tail pages.
> So a vmemmap page describes 64 struct pages, and the first page for a given
> ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
> vmemmap page would contain only tail pages, and that's what gets reused across
> the rest of the subsection/section. The bigger the page size, the bigger the
> savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
> 
> In terms of savings, per 1Tb of memory, the struct page cost would go down
> with compound pagemap:
> 
> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)
> 

That's the dream :)

> Along the way I've extended it past 'struct page' overhead *trying* to address a
> few performance issues we knew about for pmem, specifically on the
> {pin,get}_user_pages* function family with device-dax vmas which are really
> slow even of the fast variants. THP is great on -fast variants but all except
> hugetlbfs perform rather poorly on non-fast gup.
> 
> So to summarize what the series does:
> 
> Patches 1-5: Much like Muchun series, we reuse tail page areas across a given
> page size (namely @align was referred by remaining memremap/dax code) and
> enabling of memremap to initialize the ZONE_DEVICE pages as compound pages or a
> given @align order. The main difference though, is that contrary to the hugetlbfs
> series, there's no vmemmap for the area, because we are onlining it.

Yeah, I'd argue that this case is a lot easier to handle. When the buddy
is involved, things get more complicated.

-- 
Thanks,

David / dhildenb

* Re: [External] [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
  2020-12-08 17:28 [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (10 preceding siblings ...)
  2020-12-09  9:38 ` [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps David Hildenbrand
@ 2020-12-09  9:52 ` Muchun Song
  2021-02-20  1:18 ` Dan Williams
  12 siblings, 0 replies; 67+ messages in thread
From: Muchun Song @ 2020-12-09  9:52 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux Memory Management List, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Mike Kravetz, Andrew Morton

On Wed, Dec 9, 2020 at 1:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Hey,
>
> This small series, attempts at minimizing 'struct page' overhead by
> pursuing a similar approach as Muchun Song series "Free some vmemmap
> pages of hugetlb page"[0] but applied to devmap/ZONE_DEVICE.
>
> [0] https://lore.kernel.org/linux-mm/20201130151838.11208-1-songmuchun@bytedance.com/
>

Oh, well. It looks like you agree with my optimization approach
and have fully understood it. You are also welcome to help review that
series if you have time. :)

> The link above describes it quite nicely, but the idea is to reuse tail
> page vmemmap areas, particular the area which only describes tail pages.
> So a vmemmap page describes 64 struct pages, and the first page for a given
> ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
> vmemmap page would contain only tail pages, and that's what gets reused across
> the rest of the subsection/section. The bigger the page size, the bigger the
> savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
>
> In terms of savings, per 1Tb of memory, the struct page cost would go down
> with compound pagemap:
>
> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)
>
> Along the way I've extended it past 'struct page' overhead *trying* to address a
> few performance issues we knew about for pmem, specifically on the
> {pin,get}_user_pages* function family with device-dax vmas which are really
> slow even of the fast variants. THP is great on -fast variants but all except
> hugetlbfs perform rather poorly on non-fast gup.
>
> So to summarize what the series does:
>
> Patches 1-5: Much like Muchun series, we reuse tail page areas across a given
> page size (namely @align was referred by remaining memremap/dax code) and
> enabling of memremap to initialize the ZONE_DEVICE pages as compound pages or a
> given @align order. The main difference though, is that contrary to the hugetlbfs
> series, there's no vmemmap for the area, because we are onlining it. IOW no
> freeing of pages of already initialized vmemmap like the case for hugetlbfs,
> which simplifies the logic (besides not being arch-specific). After these,
> there's quite visible region bootstrap of pmem memmap given that we would
> initialize fewer struct pages depending on the page size.
>
>     NVDIMM namespace bootstrap improves from ~750ms to ~190ms/<=1ms on emulated NVDIMMs
>     with 2M and 1G respectivally. The net gain in improvement is similarly observed
>     in proportion when running on actual NVDIMMs.
>
> Patch 6 - 8: Optimize grabbing/release a page refcount changes given that we
> are working with compound pages i.e. we do 1 increment/decrement to the head
> page for a given set of N subpages compared as opposed to N individual writes.
> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
> improves considerably, and unpin_user_pages() improves as well when passed a
> set of consecutive pages:
>
>                                            before          after
>     (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
>     (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
>
> The RDMA patch (patch 8/9) is to demonstrate the improvement for an existing
> user. For unpin_user_pages() we have an additional test to demonstrate the
> improvement.  The test performs MR reg/unreg continuously and measuring its
> rate for a given period. So essentially ib_mem_get and ib_mem_release being
> stress tested which at the end of day means: pin_user_pages_longterm() and
> unpin_user_pages() for a scatterlist:
>
>     Before:
>     159 rounds in 5.027 sec: 31617.923 usec / round (device-dax)
>     466 rounds in 5.009 sec: 10748.456 usec / round (hugetlbfs)
>
>     After:
>     305 rounds in 5.010 sec: 16426.047 usec / round (device-dax)
>     1073 rounds in 5.004 sec: 4663.622 usec / round (hugetlbfs)
>
> Patch 9: Improves {pin,get}_user_pages() and its longterm counterpart. It
> is very experimental, and I imported most of follow_hugetlb_page(), except
> that we do the same trick as gup-fast. In doing the patch I feel this batching
> should live in follow_page_mask() and having that being changed to return a set
> of pages/something-else when walking over PMD/PUDs for THP / devmap pages. This
> patch then brings the previous test of mr reg/unreg (above) on parity
> between device-dax and hugetlbfs.
>
> Some of the patches are a little fresh/WIP (specially patch 3 and 9) and we are
> still running tests. Hence the RFC, asking for comments and general direction
> of the work before continuing.
>
> Patches apply on top of linux-next tag next-20201208 (commit a9e26cb5f261).
>
> Comments and suggestions very much appreciated!
>
> Thanks,
>         Joao
>
> Joao Martins (9):
>   memremap: add ZONE_DEVICE support for compound pages
>   sparse-vmemmap: Consolidate arguments in vmemmap section populate
>   sparse-vmemmap: Reuse vmemmap areas for a given page size
>   mm/page_alloc: Reuse tail struct pages for compound pagemaps
>   device-dax: Compound pagemap support
>   mm/gup: Grab head page refcount once for group of subpages
>   mm/gup: Decrement head page once for group of subpages
>   RDMA/umem: batch page unpin in __ib_mem_release()
>   mm: Add follow_devmap_page() for devdax vmas
>
>  drivers/dax/device.c           |  54 ++++++---
>  drivers/infiniband/core/umem.c |  25 +++-
>  include/linux/huge_mm.h        |   4 +
>  include/linux/memory_hotplug.h |  16 ++-
>  include/linux/memremap.h       |   2 +
>  include/linux/mm.h             |   6 +-
>  mm/gup.c                       | 130 ++++++++++++++++-----
>  mm/huge_memory.c               | 202 +++++++++++++++++++++++++++++++++
>  mm/memory_hotplug.c            |  13 ++-
>  mm/memremap.c                  |  13 ++-
>  mm/page_alloc.c                |  28 ++++-
>  mm/sparse-vmemmap.c            |  97 +++++++++++++---
>  mm/sparse.c                    |  16 +--
>  13 files changed, 531 insertions(+), 75 deletions(-)
>
> --
> 2.17.1
>


-- 
Yours,
Muchun

* Re: [PATCH RFC 8/9] RDMA/umem: batch page unpin in __ib_mem_release()
       [not found]   ` <20201208192935.GA1908088@ziepe.ca>
@ 2020-12-09 10:59     ` Joao Martins
  2020-12-19 13:15       ` Joao Martins
  0 siblings, 1 reply; 67+ messages in thread
From: Joao Martins @ 2020-12-09 10:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, linux-nvdimm, Matthew Wilcox, Muchun Song,
	Mike Kravetz, Andrew Morton



On 12/8/20 7:29 PM, Jason Gunthorpe wrote:
> On Tue, Dec 08, 2020 at 05:29:00PM +0000, Joao Martins wrote:
> 
>>  static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
>>  {
>> +	bool make_dirty = umem->writable && dirty;
>> +	struct page **page_list = NULL;
>>  	struct sg_page_iter sg_iter;
>> +	unsigned long nr = 0;
>>  	struct page *page;
>>  
>> +	page_list = (struct page **) __get_free_page(GFP_KERNEL);
> 
> Gah, no, don't do it like this!
> 
> Instead something like:
> 
> 	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)
> 	      unpin_use_pages_range_dirty_lock(sg_page(sg), sg->length/PAGE_SIZE,
>                                                umem->writable && dirty);
> 
> And have the mm implementation split the contiguous range of pages into
> pairs of (compound head, ntails) with a bit of maths.
> 
Got it :)

I was trying to avoid another exported symbol.

Although, given your suggestion below, it doesn't justify the efficiency/clarity lost.
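
Something along these lines is what I understood for the maths (rough sketch,
untested, name made up; the ntails-aware unpin itself is left out):

/*
 * Rough sketch: split a physically contiguous run of pages into
 * (compound head, ntails) pairs; the ntails-aware unpin is elided.
 */
static void unpin_user_page_range(struct page *page, unsigned long npages,
                                  bool make_dirty)
{
        while (npages) {
                struct page *head = compound_head(page);
                unsigned long ntails = min_t(unsigned long, npages,
                                compound_nr(head) - (page - head));

                if (make_dirty && !PageDirty(head))
                        set_page_dirty_lock(head);
                /* ...drop @ntails pin references on @head here... */

                page += ntails;
                npages -= ntails;
        }
}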

	Joao

* Re: [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages
       [not found]   ` <20201208194905.GQ5487@ziepe.ca>
@ 2020-12-09 11:05     ` Joao Martins
       [not found]       ` <20201209151505.GV5487@ziepe.ca>
  0 siblings, 1 reply; 67+ messages in thread
From: Joao Martins @ 2020-12-09 11:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, linux-nvdimm, Matthew Wilcox, Muchun Song,
	Mike Kravetz, Andrew Morton



On 12/8/20 7:49 PM, Jason Gunthorpe wrote:
> On Tue, Dec 08, 2020 at 05:28:58PM +0000, Joao Martins wrote:
>> Much like hugetlbfs or THPs, we treat device pagemaps with
>> compound pages like the rest of GUP handling of compound pages.
>>
>> Rather than incrementing the refcount every 4K, we record
>> all sub pages and increment by @refs amount *once*.
>>
>> Performance measured by gup_benchmark improves considerably
>> get_user_pages_fast() and pin_user_pages_fast():
>>
>>  $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-u,-a] -n 512 -w
>>
>> (get_user_pages_fast 2M pages) ~75k us -> ~3.6k us
>> (pin_user_pages_fast 2M pages) ~125k us -> ~3.8k us
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>  mm/gup.c | 67 ++++++++++++++++++++++++++++++++++++++++++--------------
>>  1 file changed, 51 insertions(+), 16 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 98eb8e6d2609..194e6981eb03 100644
>> +++ b/mm/gup.c
>> @@ -2250,22 +2250,68 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>  }
>>  #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>>  
>> +
>> +static int record_subpages(struct page *page, unsigned long addr,
>> +			   unsigned long end, struct page **pages)
>> +{
>> +	int nr;
>> +
>> +	for (nr = 0; addr != end; addr += PAGE_SIZE)
>> +		pages[nr++] = page++;
>> +
>> +	return nr;
>> +}
>> +
>>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>> -static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>> -			     unsigned long end, unsigned int flags,
>> -			     struct page **pages, int *nr)
>> +static int __gup_device_compound_huge(struct dev_pagemap *pgmap,
>> +				      struct page *head, unsigned long sz,
>> +				      unsigned long addr, unsigned long end,
>> +				      unsigned int flags, struct page **pages)
>> +{
>> +	struct page *page;
>> +	int refs;
>> +
>> +	if (!(pgmap->flags & PGMAP_COMPOUND))
>> +		return -1;
>> +
>> +	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
> 
> All the places that call record_subpages do some kind of maths like
> this, it should be placed inside record_subpages and not opencoded
> everywhere.
> 
Makes sense.

>> +	refs = record_subpages(page, addr, end, pages);
>> +
>> +	SetPageReferenced(page);
>> +	head = try_grab_compound_head(head, refs, flags);
>> +	if (!head) {
>> +		ClearPageReferenced(page);
>> +		return 0;
>> +	}
>> +
>> +	return refs;
>> +}
> 
> Why is all of this special? Any time we see a PMD/PGD/etc pointing to
> PFN we can apply this optimization. How come device has its own
> special path to do this?? 
> 
I think the reason is that zone_device struct pages have no relationship to one other. So
you anyways need to change individual pages, as opposed to just the head page.

I made it special to avoid breaking other ZONE_DEVICE users (and gating that with
PGMAP_COMPOUND). But if there's no concerns with that, I can unilaterally enable it.

> Why do we need to check PGMAP_COMPOUND? Why do we need to get pgmap?
> (We already removed that from the hmm version of this, was that wrong?
> Is this different?) Dan?
> 
> Also undo_dev_pagemap() is now out of date, we have unpin_user_pages()
> for that and no other error unwind touches ClearPageReferenced..
> 
/me nods. Yeap, I saw that too.

> Basic idea is good though!
> 
Cool, thanks!

	Joao

* Re: [PATCH RFC 9/9] mm: Add follow_devmap_page() for devdax vmas
       [not found]   ` <20201208195754.GR5487@ziepe.ca>
  2020-12-09  8:05     ` Christoph Hellwig
@ 2020-12-09 11:19     ` Joao Martins
  1 sibling, 0 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-09 11:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, linux-nvdimm, Matthew Wilcox, Muchun Song,
	Mike Kravetz, Andrew Morton

On 12/8/20 7:57 PM, Jason Gunthorpe wrote:
> On Tue, Dec 08, 2020 at 05:29:01PM +0000, Joao Martins wrote:
>> Similar to follow_hugetlb_page() add a follow_devmap_page which rather
>> than calling follow_page() per 4K page in a PMD/PUD it does so for the
>> entire PMD, where we lock the pmd/pud, get all pages , unlock.
>>
>> While doing so, we only change the refcount once when PGMAP_COMPOUND is
>> passed in.
>>
>> This let us improve {pin,get}_user_pages{,_longterm}() considerably:
>>
>> $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-U,-b,-L] -n 512 -w
>>
>> (<test>) [before] -> [after]
>> (get_user_pages 2M pages) ~150k us -> ~8.9k us
>> (pin_user_pages 2M pages) ~192k us -> ~9k us
>> (pin_user_pages_longterm 2M pages) ~200k us -> ~19k us
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> I've special-cased this to device-dax vmas given its similar page size
>> guarantees as hugetlbfs, but I feel this is a bit wrong. I am
>> replicating follow_hugetlb_page() as RFC ought to seek feedback whether
>> this should be generalized if no fundamental issues exist. In such case,
>> should I be changing follow_page_mask() to take either an array of pages
>> or a function pointer and opaque arguments which would let caller pick
>> its structure?
> 
> I would be extremely sad if this was the only way to do this :(
> 
> We should be trying to make things more general. 

Yeap, indeed.

Especially when a similar problem is observed for THP, at least from the
measurements I saw. It is all slow, except for hugetlbfs.

> The
> hmm_range_fault_path() doesn't have major special cases for device, I
> am struggling to understand why gup fast and slow do.
> 
> What we've talked about is changing the calling convention across all
> of this to something like:
> 
> struct gup_output {
>    struct page **cur;
>    struct page **end;
>    unsigned long vaddr;
>    [..]
> }
> 
> And making the manipulator like you saw for GUP common:
> 
> gup_output_single_page()
> gup_output_pages()
> 
> Then putting this eveywhere. This is the pattern that we ended up with
> in hmm_range_fault, and it seems to be working quite well.
> 
> fast/slow should be much more symmetric in code than they are today,
> IMHO.. 

Thanks for the suggestions above.

> I think those differences mainly exist because it used to be
> siloed in arch code. Some of the differences might be bugs, we've seen
> that a few times at least..
Interesting, wasn't aware of the siloing.

I'll go investigate how this all refactoring goes together, at the point
of which a future iteration of this particular patch probably needs to
move independently from this series.

	Joao

* Re: [PATCH RFC 7/9] mm/gup: Decrement head page once for group of subpages
       [not found]   ` <20201208193446.GP5487@ziepe.ca>
  2020-12-09  5:06     ` John Hubbard
@ 2020-12-09 12:17     ` Joao Martins
  2020-12-17 19:05     ` Joao Martins
  2 siblings, 0 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-09 12:17 UTC (permalink / raw)
  To: Jason Gunthorpe, John Hubbard, Daniel Jordan
  Cc: linux-mm, linux-nvdimm, Matthew Wilcox, Muchun Song,
	Mike Kravetz, Andrew Morton

On 12/8/20 7:34 PM, Jason Gunthorpe wrote:
> On Tue, Dec 08, 2020 at 05:28:59PM +0000, Joao Martins wrote:
>> Rather than decrementing the ref count one by one, we
>> walk the page array and checking which belong to the same
>> compound_head. Later on we decrement the calculated amount
>> of references in a single write to the head page.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>  mm/gup.c | 41 ++++++++++++++++++++++++++++++++---------
>>  1 file changed, 32 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 194e6981eb03..3a9a7229f418 100644
>> +++ b/mm/gup.c
>> @@ -212,6 +212,18 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
>>  }
>>  #endif /* CONFIG_DEV_PAGEMAP_OPS */
>>  
>> +static int record_refs(struct page **pages, int npages)
>> +{
>> +	struct page *head = compound_head(pages[0]);
>> +	int refs = 1, index;
>> +
>> +	for (index = 1; index < npages; index++, refs++)
>> +		if (compound_head(pages[index]) != head)
>> +			break;
>> +
>> +	return refs;
>> +}
>> +
>>  /**
>>   * unpin_user_page() - release a dma-pinned page
>>   * @page:            pointer to page to be released
>> @@ -221,9 +233,9 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
>>   * that such pages can be separately tracked and uniquely handled. In
>>   * particular, interactions with RDMA and filesystems need special handling.
>>   */
>> -void unpin_user_page(struct page *page)
>> +static void __unpin_user_page(struct page *page, int refs)
> 
> Refs should be unsigned everywhere.
> 
/me nods

> I suggest using clear language 'page' here should always be a compound
> head called 'head' (or do we have another common variable name for
> this?)
> 
> 'refs' is number of tail pages within the compound, so 'ntails' or
> something
> 
The usage of 'refs' seems to align with the rest of the GUP code. It's always referring to
tail pages, and the unpin case isn't any different IIUC.

I suppose we can always change that, but maybe it's better to do that renaming in one shot
as a follow-up cleanup?

>>  {
>> -	int refs = 1;
>> +	int orig_refs = refs;
>>  
>>  	page = compound_head(page);
> 
> Caller should always do this
> 
/me nods

>> @@ -237,14 +249,19 @@ void unpin_user_page(struct page *page)
>>  		return;
>>  
>>  	if (hpage_pincount_available(page))
>> -		hpage_pincount_sub(page, 1);
>> +		hpage_pincount_sub(page, refs);
>>  	else
>> -		refs = GUP_PIN_COUNTING_BIAS;
>> +		refs *= GUP_PIN_COUNTING_BIAS;
>>  
>>  	if (page_ref_sub_and_test(page, refs))
>>  		__put_page(page);
>>  
>> -	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
>> +	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, orig_refs);
>> +}
> 
> And really this should be placed directly after
> try_grab_compound_head() and be given a similar name
> 'unpin_compound_head()'. Even better would be to split the FOLL_PIN
> part into a function so there was a clear logical pairing.
> 
> And reviewing it like that I want to ask if this unpin sequence is in
> the right order.. I would expect it to be the reverse order of the get
> 
> John?
> 
> Is it safe to call mod_node_page_state() after releasing the refcount?
> This could race with hot-unplugging the struct pages so I think it is
> wrong.
> 
It appears to be the case, based on John's follow-up comment.

>> +void unpin_user_page(struct page *page)
>> +{
>> +	__unpin_user_page(page, 1);
> 
> Thus this is
> 
> 	__unpin_user_page(compound_head(page), 1);
> 
Got it.

>> @@ -274,6 +291,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>  				 bool make_dirty)
>>  {
>>  	unsigned long index;
>> +	int refs = 1;
>>  
>>  	/*
>>  	 * TODO: this can be optimized for huge pages: if a series of pages is
>> @@ -286,8 +304,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>  		return;
>>  	}
>>  
>> -	for (index = 0; index < npages; index++) {
>> +	for (index = 0; index < npages; index += refs) {
>>  		struct page *page = compound_head(pages[index]);
>> +
> 
> I think this is really hard to read, it should end up as some:
> 
> for_each_compond_head(page_list, page_list_len, &head, &ntails) {
>        		if (!PageDirty(head))
> 			set_page_dirty_lock(head, ntails);
> 		unpin_user_page(head, ntails);
> }
> 
/me nods. Let me attempt that.

> And maybe you open code that iteration, but that basic idea to find a
> compound_head and ntails should be computational work performed.
> 
I like the idea of a page range API as an alternative to unpin_user_pages(), but
improving the current unpin_user_pages() would improve other unpin users too.

Perhaps the logic can be common, and the current unpin_user_pages() would keep
the second iteration part, while the new (faster) API would be based on computation.
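
For the page-array case, the open-coded iteration inside unpin_user_pages_dirty_lock()
could then be something like this (rough sketch, untested; put_compound_head() as used
in the unpin path):

        for (index = 0; index < npages; index += ntails) {
                struct page *head = compound_head(pages[index]);

                /* count how many consecutive entries share the same head */
                for (ntails = 1; index + ntails < npages; ntails++)
                        if (compound_head(pages[index + ntails]) != head)
                                break;

                if (make_dirty && !PageDirty(head))
                        set_page_dirty_lock(head);
                put_compound_head(head, ntails, FOLL_PIN);
        }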

> No reason not to fix set_page_dirty_lock() too while you are here.
> 
OK.

> Also, this patch and the next can be completely independent of the
> rest of the series, it is valuable regardless of the other tricks. You
> can split them and progress them independently.
> 
Yeap, let me do that.

> .. and I was just talking about this with Daniel Jordan and some other
> people at your company :)
> 

:)

* Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
  2020-12-09  6:33     ` Matthew Wilcox
@ 2020-12-09 13:12       ` Joao Martins
  0 siblings, 0 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-09 13:12 UTC (permalink / raw)
  To: Matthew Wilcox, John Hubbard
  Cc: linux-mm, linux-nvdimm, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton

On 12/9/20 6:33 AM, Matthew Wilcox wrote:
> On Tue, Dec 08, 2020 at 09:59:19PM -0800, John Hubbard wrote:
>> On 12/8/20 9:28 AM, Joao Martins wrote:
>>> Add a new flag for struct dev_pagemap which designates that a a pagemap
>>
>> a a
>>
Ugh. Yeah will fix.

>>> is described as a set of compound pages or in other words, that how
>>> pages are grouped together in the page tables are reflected in how we
>>> describe struct pages. This means that rather than initializing
>>> individual struct pages, we also initialize these struct pages, as
>>
>> Let's not say "rather than x, we also do y", because it's self-contradictory.
>> I think you want to just leave out the "also", like this:
>>
>> "This means that rather than initializing> individual struct pages, we
>> initialize these struct pages ..."
>>
>> Is that right?
> 
Nope, my previous text was broken.

> I'd phrase it as:
> 
> Add a new flag for struct dev_pagemap which specifies that a pagemap is
> composed of a set of compound pages instead of individual pages.  When
> these pages are initialised, most are initialised as tail pages
> instead of order-0 pages.
> 
Thanks, I will use this instead.

>>> For certain ZONE_DEVICE users, like device-dax, which have a fixed page
>>> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
>>> thus playing the same tricks as hugetlb pages.
> 
> Rather than "playing the same tricks", how about "are treated the same
> way as THP or hugetlb pages"?
> 
>>> +	if (pgmap->flags & PGMAP_COMPOUND)
>>> +		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
>>> +			- pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
>>
>> Is there some reason that we cannot use range_len(), instead of pfn_end() minus
>> pfn_first()? (Yes, this more about the pre-existing code than about your change.)
>>
Indeed one could use range_len() / pgmap->align and it would work. But (...)

>> And if not, then why are the nearby range_len() uses OK? I realize that range_len()
>> is simpler and skips a case, but it's not clear that it's required here. But I'm
>> new to this area so be warned. :)
>>
My use of PFNs to calculate the number of pages was to remain consistent with the rest of
the code in the function that takes references on pgmap->ref. The usages of range_len()
one sees are when the hotplug takes place, which works with addresses and not PFNs.

>> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you
>> happen to want, but is not named accordingly. Can you use or create something
>> more accurately named? Like "number of pages in this large page"?
> 
> We have compound_nr(), but that takes a struct page as an argument.
> We also have HPAGE_NR_PAGES.  I'm not quite clear what you want.
> 
If possible I would rather keep the PFNs, as with the rest of the code. Another alternative
is something like a range_nr_pages() helper, but I am not sure it's worth the trouble for one caller.
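
For reference, what I had in mind was roughly this (sketch only; @align is the
field this series adds to struct dev_pagemap):

/* Sketch: number of compound pages covered by a pgmap range */
static unsigned long range_nr_pages(struct dev_pagemap *pgmap, int range_id)
{
        struct range *range = &pgmap->ranges[range_id];

        return range_len(range) / pgmap->align;
}

i.e. the same count as (pfn_end() - pfn_first()) / PHYS_PFN(pgmap->align), just
expressed in bytes.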

	Joao

* Re: [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages
  2020-12-09  4:40   ` John Hubbard
@ 2020-12-09 13:44     ` Joao Martins
  0 siblings, 0 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-09 13:44 UTC (permalink / raw)
  To: John Hubbard, linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton

On 12/9/20 4:40 AM, John Hubbard wrote:
> On 12/8/20 9:28 AM, Joao Martins wrote:
>> Much like hugetlbfs or THPs, we treat device pagemaps with
>> compound pages like the rest of GUP handling of compound pages.
>>
>> Rather than incrementing the refcount every 4K, we record
>> all sub pages and increment by @refs amount *once*.
>>
>> Performance measured by gup_benchmark improves considerably
>> get_user_pages_fast() and pin_user_pages_fast():
>>
>>   $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-u,-a] -n 512 -w
> 
> "gup_test", now that you're in linux-next, actually.
> 
> (Maybe I'll retrofit that test with getopt_long(), those options are
> getting more elaborate.)
> 
:)

>>
>> (get_user_pages_fast 2M pages) ~75k us -> ~3.6k us
>> (pin_user_pages_fast 2M pages) ~125k us -> ~3.8k us
> 
> That is a beautiful result! I'm very motivated to see if this patchset
> can make it in, in some form.
> 
Cool!

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   mm/gup.c | 67 ++++++++++++++++++++++++++++++++++++++++++--------------
>>   1 file changed, 51 insertions(+), 16 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 98eb8e6d2609..194e6981eb03 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2250,22 +2250,68 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>>   }
>>   #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>>   
>> +
>> +static int record_subpages(struct page *page, unsigned long addr,
>> +			   unsigned long end, struct page **pages)
>> +{
>> +	int nr;
>> +
>> +	for (nr = 0; addr != end; addr += PAGE_SIZE)
>> +		pages[nr++] = page++;
>> +
>> +	return nr;
>> +}
>> +
>>   #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>> -static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>> -			     unsigned long end, unsigned int flags,
>> -			     struct page **pages, int *nr)
>> +static int __gup_device_compound_huge(struct dev_pagemap *pgmap,
>> +				      struct page *head, unsigned long sz,
> 
> If this variable survives (I see Jason requested a reorg of this math stuff,
> and I also like that idea), then I'd like a slightly better name for "sz".
> 
Yeap.

> I was going to suggest one, but then realized that I can't understand how this
> works. See below...
> 
>> +				      unsigned long addr, unsigned long end,
>> +				      unsigned int flags, struct page **pages)
>> +{
>> +	struct page *page;
>> +	int refs;
>> +
>> +	if (!(pgmap->flags & PGMAP_COMPOUND))
>> +		return -1;
> 
> btw, I'm unhappy with returning -1 here and assigning it later to a refs variable.
> (And that will show up even more clearly as an issue if you attempt to make
> refs unsigned everywhere!)
> 
Yes true.

The usage of @refs = -1 (hence an int) was to differentiate the case where we are not in a
PGMAP_COMPOUND pgmap (so that the logic stays as it is today).

Notice that in the PGMAP_COMPOUND case, if we fail to grab the head compound page we return 0.

> I'm not going to suggest anything because there are a lot of ways to structure
> these routines, and I don't want to overly constrain you. Just please don't assign
> negative values to any refs variables.
> 
OK.

TBH I'm a little afraid this can turn into further complexity if I have to keep the
non-compound pgmap around. But I will see how I can adjust this.
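
Maybe something along these lines, to keep @refs unsigned (rough sketch, untested;
the bool return plus out-parameter shape is just one option):

static bool __gup_device_compound_huge(struct dev_pagemap *pgmap,
                                       struct page *head, unsigned long sz,
                                       unsigned long addr, unsigned long end,
                                       unsigned int flags, struct page **pages,
                                       unsigned int *refs)
{
        struct page *page;

        /* not a compound pgmap: caller falls back to the per-page path */
        if (!(pgmap->flags & PGMAP_COMPOUND))
                return false;

        /* @sz is the compound page size in bytes in this sketch */
        page = head + ((addr & (sz - 1)) >> PAGE_SHIFT);
        *refs = record_subpages(page, addr, end, pages);

        SetPageReferenced(page);
        if (!try_grab_compound_head(head, *refs, flags)) {
                ClearPageReferenced(page);
                *refs = 0;
        }

        return true;
}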

>> +
>> +	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
> 
> If you pass in PMD_SHIFT or PUD_SHIFT for, that's a number-of-bits, isn't it?
> Not a size. And if it's not a size, then sz - 1 doesn't work, does it? If it
> does work, then better naming might help. I'm probably missing a really
> obvious math trick here.

You're right, that was a mistake on my end, indeed. But the mistake wouldn't change the
logic, as the PageReferenced bit only applies to the head page.

	Joao

* Re: [PATCH RFC 2/9] sparse-vmemmap: Consolidate arguments in vmemmap section populate
  2020-12-09  6:16   ` John Hubbard
@ 2020-12-09 13:51     ` Joao Martins
  0 siblings, 0 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-09 13:51 UTC (permalink / raw)
  To: John Hubbard, linux-mm
  Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
	Mike Kravetz, Andrew Morton



On 12/9/20 6:16 AM, John Hubbard wrote:
> On 12/8/20 9:28 AM, Joao Martins wrote:
>> Replace vmem_altmap with an vmem_context argument. That let us
>> express how the vmemmap is gonna be initialized e.g. passing
>> flags and a page size for reusing pages upon initializing the
>> vmemmap.
> 
> How about this instead:
> 
> Replace the vmem_altmap argument with a vmem_context argument that
> contains vmem_altmap for now. Subsequent patches will add additional
> member elements to vmem_context, such as flags and page size.
> 
> No behavior changes are intended.
> 
> ?
> 
Yeap, it's better that way. Thanks.

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   include/linux/memory_hotplug.h |  6 +++++-
>>   include/linux/mm.h             |  2 +-
>>   mm/memory_hotplug.c            |  3 ++-
>>   mm/sparse-vmemmap.c            |  6 +++++-
>>   mm/sparse.c                    | 16 ++++++++--------
>>   5 files changed, 21 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>> index 551093b74596..73f8bcbb58a4 100644
>> --- a/include/linux/memory_hotplug.h
>> +++ b/include/linux/memory_hotplug.h
>> @@ -81,6 +81,10 @@ struct mhp_params {
>>   	pgprot_t pgprot;
>>   };
>>   
>> +struct vmem_context {
>> +	struct vmem_altmap *altmap;
>> +};
>> +
>>   /*
>>    * Zone resizing functions
>>    *
>> @@ -353,7 +357,7 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
>>   				       unsigned long nr_pages);
>>   extern bool is_memblock_offlined(struct memory_block *mem);
>>   extern int sparse_add_section(int nid, unsigned long pfn,
>> -		unsigned long nr_pages, struct vmem_altmap *altmap);
>> +		unsigned long nr_pages, struct vmem_context *ctx);
>>   extern void sparse_remove_section(struct mem_section *ms,
>>   		unsigned long pfn, unsigned long nr_pages,
>>   		unsigned long map_offset, struct vmem_altmap *altmap);
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index db6ae4d3fb4e..2eb44318bb2d 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3000,7 +3000,7 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
>>   
>>   void *sparse_buffer_alloc(unsigned long size);
>>   struct page * __populate_section_memmap(unsigned long pfn,
>> -		unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
>> +		unsigned long nr_pages, int nid, struct vmem_context *ctx);
>>   pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
>>   p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>>   pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 63b2e46b6555..f8870c53fe5e 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -313,6 +313,7 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
>>   	unsigned long cur_nr_pages;
>>   	int err;
>>   	struct vmem_altmap *altmap = params->altmap;
>> +	struct vmem_context ctx = { .altmap = params->altmap };
> 
> OK, so this is the one place I can see where ctx is set up. And it's never null.
> Let's remember that point...
> 

(...)

>>   
>>   	if (WARN_ON_ONCE(!params->pgprot.pgprot))
>>   		return -EINVAL;
>> @@ -341,7 +342,7 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
>>   		/* Select all remaining pages up to the next section boundary */
>>   		cur_nr_pages = min(end_pfn - pfn,
>>   				   SECTION_ALIGN_UP(pfn + 1) - pfn);
>> -		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap);
>> +		err = sparse_add_section(nid, pfn, cur_nr_pages, &ctx);
>>   		if (err)
>>   			break;
>>   		cond_resched();
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index 16183d85a7d5..bcda68ba1381 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -249,15 +249,19 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>>   }
>>   
>>   struct page * __meminit __populate_section_memmap(unsigned long pfn,
>> -		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
>> +		unsigned long nr_pages, int nid, struct vmem_context *ctx)
>>   {
>>   	unsigned long start = (unsigned long) pfn_to_page(pfn);
>>   	unsigned long end = start + nr_pages * sizeof(struct page);
>> +	struct vmem_altmap *altmap = NULL;
>>   
>>   	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
>>   		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>>   		return NULL;
>>   
>> +	if (ctx)
> 
> But...ctx can never be null, right?
> 
Indeed.

This is an artifact of an old version of this where the passed parameter
could be null.

> I didn't spot any other issues, though.
> 
> thanks,
> 

* Re: [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages
       [not found]       ` <20201209151505.GV5487@ziepe.ca>
@ 2020-12-09 16:02         ` Joao Martins
       [not found]           ` <20201209162438.GW5487@ziepe.ca>
  0 siblings, 1 reply; 67+ messages in thread
From: Joao Martins @ 2020-12-09 16:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, linux-nvdimm, Matthew Wilcox, Muchun Song,
	Mike Kravetz, Andrew Morton

On 12/9/20 3:15 PM, Jason Gunthorpe wrote:
> On Wed, Dec 09, 2020 at 11:05:39AM +0000, Joao Martins wrote:
>>> Why is all of this special? Any time we see a PMD/PGD/etc pointing to
>>> PFN we can apply this optimization. How come device has its own
>>> special path to do this?? 
>>
>> I think the reason is that zone_device struct pages have no
>> relationship to one other. So you anyways need to change individual
>> pages, as opposed to just the head page.
> 
> Huh? That can't be, unpin doesn't know the memory type when it unpins
> it, and as your series shows unpin always operates on the compound
> head. Thus pinning must also operate on compound heads
> 
I was referring to the code without this series, in the paragraph above.
Meaning that today zone_device pages are *not* represented as compound pages. And so
compound_head(page) on a non-compound page just returns the page itself.

Otherwise, try_grab_page() (e.g. when pinning pages) would be broken.
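
(For reference, compound_head() today is roughly:

static inline struct page *compound_head(struct page *page)
{
        unsigned long head = READ_ONCE(page->compound_head);

        if (unlikely(head & 1))
                return (struct page *) (head - 1);
        return page;
}

so without a compound_head encoded in the tail struct pages it is a no-op for
these ZONE_DEVICE pages.)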

>> I made it special to avoid breaking other ZONE_DEVICE users (and
>> gating that with PGMAP_COMPOUND). But if there's no concerns with
>> that, I can unilaterally enable it.
> 
> I didn't understand what PGMAP_COMPOUND was supposed to be for..
>  
PGMAP_COMPOUND's purpose is to online these pages as compound pages (so a head
and tails).

Today (without the series) struct pages are not represented the way they
are expressed in the page tables, which is what I am hoping to fix in this
series thus initializing these as compound pages of a given order. But me
introducing PGMAP_COMPOUND was to conservatively keep both old (non-compound)
and new (compound pages) co-exist.

I wasn't sure I could just enable it regardless, worried that I would be breaking
other ZONE_DEVICE/memremap_pages users.

>>> Why do we need to check PGMAP_COMPOUND? Why do we need to get pgmap?
>>> (We already removed that from the hmm version of this, was that wrong?
>>> Is this different?) Dan?
> 
> And this is the key question - why do we need to get a pgmap here?
> 
> I'm going to assert that a pgmap cannot be destroyed concurrently with
> fast gup running. This is surely true on x86 as the TLB flush that
> must have preceeded a pgmap destroy excludes fast gup. Other arches
> must emulate this in their pgmap implementations.
> 
> So, why do we need pgmap here? Hoping Dan might know
> 
> If we delete the pgmap then the devmap stop being special.
>
I will let Dan chip in.

> CH and I looked at this and deleted it from the hmm side:
> 
> commit 068354ade5dd9e2b07d9b0c57055a681db6f4e37
> Author: Jason Gunthorpe <jgg@ziepe.ca>
> Date:   Fri Mar 27 17:00:13 2020 -0300
> 
>     mm/hmm: remove pgmap checking for devmap pages
>     
>     The checking boils down to some racy check if the pagemap is still
>     available or not. Instead of checking this, rely entirely on the
>     notifiers, if a pagemap is destroyed then all pages that belong to it must
>     be removed from the tables and the notifiers triggered.
>     
>     Link: https://lore.kernel.org/r/20200327200021.29372-2-jgg@ziepe.ca
> 
> Though I am wondering if this whole hmm thing is racy with memory
> unplug. Hmm.

	Joao

* Re: [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages
       [not found]           ` <20201209162438.GW5487@ziepe.ca>
@ 2020-12-09 17:27             ` Joao Martins
  2020-12-09 18:14             ` Matthew Wilcox
  1 sibling, 0 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-09 17:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, linux-nvdimm, Matthew Wilcox, Muchun Song,
	Mike Kravetz, Andrew Morton

On 12/9/20 4:24 PM, Jason Gunthorpe wrote:
> On Wed, Dec 09, 2020 at 04:02:05PM +0000, Joao Martins wrote:
> 
>> Today (without the series) struct pages are not represented the way they
>> are expressed in the page tables, which is what I am hoping to fix in this
>> series thus initializing these as compound pages of a given order. But me
>> introducing PGMAP_COMPOUND was to conservatively keep both old (non-compound)
>> and new (compound pages) co-exist.
> 
> Oooh, that I didn't know.. That is kind of horrible to have a PMD
> pointing at an order 0 page only in this one special case.
> 
> Still, I think it would be easier to teach record_subpages() that a
> PMD doesn't necessarily point to a high order page, eg do something
> like I suggested for the SGL where it extracts the page order and
> iterates over the contiguous range of pfns.
> 
/me nods

> This way it can remain general with no particularly special path for
> devmap or a special PGMAP_COMPOUND check here.

The less special paths the better, indeed.
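
Roughly what I will try, then, is something like this (sketch, untested; error
unwind and the pgmap handling elided):

static int __gup_device_huge(unsigned long pfn, unsigned long addr,
                             unsigned long end, unsigned int flags,
                             struct page **pages, int *nr)
{
        do {
                struct page *head = compound_head(pfn_to_page(pfn));
                unsigned int refs = 0;

                /* record subpages, stopping at the end of this compound page */
                do {
                        pages[*nr + refs] = pfn_to_page(pfn);
                        refs++;
                        pfn++;
                        addr += PAGE_SIZE;
                } while (addr != end &&
                         compound_head(pfn_to_page(pfn)) == head);

                /* references taken on the head cover all @refs subpages */
                if (!try_grab_compound_head(head, refs, flags))
                        return 0;
                *nr += refs;
        } while (addr != end);

        return 1;
}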

	Joao

* Re: [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages
       [not found]           ` <20201209162438.GW5487@ziepe.ca>
  2020-12-09 17:27             ` Joao Martins
@ 2020-12-09 18:14             ` Matthew Wilcox
  2020-12-10 15:43               ` Joao Martins
  1 sibling, 1 reply; 67+ messages in thread
From: Matthew Wilcox @ 2020-12-09 18:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joao Martins, linux-mm, linux-nvdimm, Muchun Song, Mike Kravetz,
	Andrew Morton

On Wed, Dec 09, 2020 at 12:24:38PM -0400, Jason Gunthorpe wrote:
> On Wed, Dec 09, 2020 at 04:02:05PM +0000, Joao Martins wrote:
> 
> > Today (without the series) struct pages are not represented the way they
> > are expressed in the page tables, which is what I am hoping to fix in this
> > series thus initializing these as compound pages of a given order. But me
> > introducing PGMAP_COMPOUND was to conservatively keep both old (non-compound)
> > and new (compound pages) co-exist.
> 
> Oooh, that I didn't know.. That is kind of horrible to have a PMD
> pointing at an order 0 page only in this one special case.

Uh, yes.  I'm surprised it hasn't caused more problems.

> Still, I think it would be easier to teach record_subpages() that a
> PMD doesn't necessarily point to a high order page, eg do something
> like I suggested for the SGL where it extracts the page order and
> iterates over the contiguous range of pfns.

But we also see good performance improvements from doing all reference
counts on the head page instead of spread throughout the pages, so we
really want compound pages.

* Re: [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages
  2020-12-09 18:14             ` Matthew Wilcox
@ 2020-12-10 15:43               ` Joao Martins
  0 siblings, 0 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-10 15:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, linux-nvdimm, Muchun Song, Mike Kravetz, Andrew Morton,
	Jason Gunthorpe



On 12/9/20 6:14 PM, Matthew Wilcox wrote:
> On Wed, Dec 09, 2020 at 12:24:38PM -0400, Jason Gunthorpe wrote:
>> On Wed, Dec 09, 2020 at 04:02:05PM +0000, Joao Martins wrote:
>>
>>> Today (without the series) struct pages are not represented the way they
>>> are expressed in the page tables, which is what I am hoping to fix in this
>>> series thus initializing these as compound pages of a given order. But me
>>> introducing PGMAP_COMPOUND was to conservatively keep both old (non-compound)
>>> and new (compound pages) co-exist.
>>
>> Oooh, that I didn't know.. That is kind of horrible to have a PMD
>> pointing at an order 0 page only in this one special case.
> 
> Uh, yes.  I'm surprised it hasn't caused more problems.
> 
There were one or two problems in the KVM MMU related to zone device pages.

See commit e851265a816f ("KVM: x86/mmu: Use huge pages for DAX-backed files"),
which eventually led to commit db5432165e9b5 ("KVM: x86/mmu: Walk host page
tables to find THP mappings") so as to be less susceptible to metadata changes.

>> Still, I think it would be easier to teach record_subpages() that a
>> PMD doesn't necessarily point to a high order page, eg do something
>> like I suggested for the SGL where it extracts the page order and
>> iterates over the contiguous range of pfns.
> 
> But we also see good performance improvements from doing all reference
> counts on the head page instead of spread throughout the pages, so we
> really want compound pages.

Going further than just refcounts, and borrowing your (or someone else's?)
idea, perhaps also a FOLL_HEAD gup flag that would let us work only with
head pages (or folios). That would consequently let us pin/grab bigger
swathes of memory, e.g. 1G (in 2M head pages) or 512G (in 1G head pages),
with just one page for storing the struct page pointers[*]. Although I suspect
the numbers would have to justify it.
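
Purely to illustrate the idea (FOLL_HEAD is hypothetical and does not exist; @addr
is assumed to be a device-dax mapping backed by 2M or 1G compound pages):

        /* Hypothetical flag, just sketching the idea */
        struct page *heads[512];        /* one 4K page worth of struct page pointers */
        long pinned;

        pinned = pin_user_pages_fast(addr, 512,
                                     FOLL_WRITE | FOLL_HEAD, heads);
        /* with 2M compound pages, 512 heads cover 1G; with 1G pages, 512G */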

	Joao

[*] One page happens to be what's used for RDMA/umem and vdpa as callers
of pin_user_pages*()

* Re: [PATCH RFC 7/9] mm/gup: Decrement head page once for group of subpages
       [not found]   ` <20201208193446.GP5487@ziepe.ca>
  2020-12-09  5:06     ` John Hubbard
  2020-12-09 12:17     ` Joao Martins
@ 2020-12-17 19:05     ` Joao Martins
       [not found]       ` <20201217200530.GK5487@ziepe.ca>
  2 siblings, 1 reply; 67+ messages in thread
From: Joao Martins @ 2020-12-17 19:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, linux-nvdimm, Matthew Wilcox, Muchun Song,
	Mike Kravetz, Andrew Morton, Daniel Jordan, John Hubbard

On 12/8/20 7:34 PM, Jason Gunthorpe wrote:
>> @@ -274,6 +291,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>  				 bool make_dirty)
>>  {
>>  	unsigned long index;
>> +	int refs = 1;
>>  
>>  	/*
>>  	 * TODO: this can be optimized for huge pages: if a series of pages is
>> @@ -286,8 +304,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>  		return;
>>  	}
>>  
>> -	for (index = 0; index < npages; index++) {
>> +	for (index = 0; index < npages; index += refs) {
>>  		struct page *page = compound_head(pages[index]);
>> +
> 
> I think this is really hard to read, it should end up as some:
> 
> for_each_compond_head(page_list, page_list_len, &head, &ntails) {
>        		if (!PageDirty(head))
> 			set_page_dirty_lock(head, ntails);
> 		unpin_user_page(head, ntails);
> }
> 
> And maybe you open code that iteration, but that basic idea to find a
> compound_head and ntails should be computational work performed.
> 
> No reason not to fix set_page_dirty_lock() too while you are here.
> 

The wack of atomics you mentioned earlier, I suppose it
ends up being account_page_dirtied(). See the partial diff at the end.

I was looking at the latter part and renaming all the fs that supply
set_page_dirty()... But now my concern is whether it's really safe to
assume that filesystems that supply it ... indeed have the ability to dirty
@ntails pages. Functionally, fixing set_page_dirty_lock() means we don't call
set_page_dirty(head) @ntails times as happens today; we would only call it once,
with ntails as an argument.

Perhaps the safest thing to do is still to iterate over
@ntails and call ->set_page_dirty(page), and instead introduce
a set_page_range_dirty() which individual filesystems can separately
supply, giving precedence to ->set_page_range_dirty() over
->set_page_dirty()?
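
Concretely, something like this is what I mean (the ->set_page_range_dirty() aop
is hypothetical, only to illustrate the question):

int set_page_range_dirty(struct page *page, unsigned int ntails)
{
        struct address_space *mapping = page_mapping(page);
        int ret = 0;

        /* hypothetical aop: filesystems opt in to dirtying @ntails subpages at once */
        if (mapping && mapping->a_ops->set_page_range_dirty)
                return mapping->a_ops->set_page_range_dirty(page, ntails);

        /* fallback: same as today, set_page_dirty() on the head once per subpage */
        while (ntails--)
                ret = set_page_dirty(page);

        return ret;
}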

	Joao

--------------------->8------------------------------

diff --git a/mm/gup.c b/mm/gup.c
index 41ab3d48e1bb..5f8a0f16ab62 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -295,7 +295,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
                 * next writeback cycle. This is harmless.
                 */
                if (!PageDirty(head))
-                       set_page_dirty_lock(head);
+                       set_page_range_dirty_lock(head, ntails);
                put_compound_head(head, ntails, FOLL_PIN);
        }
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 088729ea80b2..4642d037f657 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2417,7 +2417,8 @@ int __set_page_dirty_no_writeback(struct page *page, unsigned int
ntails)
  *
  * NOTE: This relies on being atomic wrt interrupts.
  */
-void account_page_dirtied(struct page *page, struct address_space *mapping)
+void account_page_dirtied(struct page *page, struct address_space *mapping,
+                         unsigned int ntails)
 {
        struct inode *inode = mapping->host;

@@ -2425,17 +2426,18 @@ void account_page_dirtied(struct page *page, struct address_space
*mapping)

        if (mapping_can_writeback(mapping)) {
                struct bdi_writeback *wb;
+               int nr = ntails + 1;

                inode_attach_wb(inode, page);
                wb = inode_to_wb(inode);

-               __inc_lruvec_page_state(page, NR_FILE_DIRTY);
-               __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
-               __inc_node_page_state(page, NR_DIRTIED);
-               inc_wb_stat(wb, WB_RECLAIMABLE);
-               inc_wb_stat(wb, WB_DIRTIED);
-               task_io_account_write(PAGE_SIZE);
-               current->nr_dirtied++;
+               mod_lruvec_page_state(page, NR_FILE_DIRTY, nr);
+               mod_zone_page_state(page_zone(page), NR_ZONE_WRITE_PENDING, nr);
+               mod_node_page_state(page_pgdat(page), NR_DIRTIED, nr);
+               __add_wb_stat(wb, WB_RECLAIMABLE, nr);
+               __add_wb_stat(wb, WB_DIRTIED, nr);
+               task_io_account_write(nr * PAGE_SIZE);
+               current->nr_dirtied += nr;
                this_cpu_inc(bdp_ratelimits);

                mem_cgroup_track_foreign_dirty(page, wb);
@@ -2485,7 +2487,7 @@ int __set_page_dirty_nobuffers(struct page *page, unsigned int ntails)
                xa_lock_irqsave(&mapping->i_pages, flags);
                BUG_ON(page_mapping(page) != mapping);
                WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
-               account_page_dirtied(page, mapping);
+               account_page_dirtied(page, mapping, ntails);
                __xa_set_mark(&mapping->i_pages, page_index(page),
                                   PAGECACHE_TAG_DIRTY);
                xa_unlock_irqrestore(&mapping->i_pages, flags);
@@ -2624,6 +2626,27 @@ int set_page_dirty_lock(struct page *page)
 }
 EXPORT_SYMBOL(set_page_dirty_lock);

+/*
+ * set_page_range_dirty() is racy if the caller has no reference against
+ * page->mapping->host, and if the page is unlocked.  This is because another
+ * CPU could truncate the page off the mapping and then free the mapping.
+ *
+ * Usually, the page _is_ locked, or the caller is a user-space process which
+ * holds a reference on the inode by having an open file.
+ *
+ * In other cases, the page should be locked before running set_page_range_dirty().
+ */
+int set_page_range_dirty_lock(struct page *page, unsigned int ntails)
+{
+       int ret;
+
+       lock_page(page);
+       ret = set_page_range_dirty(page, ntails);
+       unlock_page(page);
+       return ret;
+}
+EXPORT_SYMBOL(set_page_range_dirty_lock);
+

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 7/9] mm/gup: Decrement head page once for group of subpages
       [not found]       ` <20201217200530.GK5487@ziepe.ca>
@ 2020-12-17 22:34         ` Joao Martins
  2020-12-19  2:06         ` John Hubbard
  1 sibling, 0 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-17 22:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, linux-nvdimm, Matthew Wilcox, Muchun Song,
	Mike Kravetz, Andrew Morton, Daniel Jordan, John Hubbard

On 12/17/20 8:05 PM, Jason Gunthorpe wrote:
> On Thu, Dec 17, 2020 at 07:05:37PM +0000, Joao Martins wrote:
>>> No reason not to fix set_page_dirty_lock() too while you are here.
>>
>> The whack of atomics you mentioned earlier, I suppose, ends up being
>> account_page_dirtied(). See the partial diff at the end.
> 
> Well, even just eliminating the lock_page, page_mapping, PageDirty,
> etc is already a big win.
> 
> If mapping->a_ops->set_page_dirty() needs to be called multiple times
> on the head page I'd probably just suggest:
> 
>   while (ntails--)
>         rc |= (*spd)(head);
> 
> At least as a start.
> 
/me nods

> If you have workloads that have page_mapping != NULL then look at
> another series to optimize that. Looks a bit large though given the
> number of places implementing set_page_dirty
> 
Yes. I don't have a particular workload; I was just wondering what you had in
mind, as at a glance, changing all the places without messing with filesystems
looks like the subject of a separate series.

> I think the current reality is calling set_page_dirty on an actual
> file system is busted anyhow, so I think mapping is generally going to
> be NULL here?

Perhaps -- I'll have to check.

	Joao

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 7/9] mm/gup: Decrement head page once for group of subpages
       [not found]       ` <20201217200530.GK5487@ziepe.ca>
  2020-12-17 22:34         ` Joao Martins
@ 2020-12-19  2:06         ` John Hubbard
  2020-12-19 13:10           ` Joao Martins
  1 sibling, 1 reply; 67+ messages in thread
From: John Hubbard @ 2020-12-19  2:06 UTC (permalink / raw)
  To: Jason Gunthorpe, Joao Martins
  Cc: linux-mm, linux-nvdimm, Matthew Wilcox, Muchun Song,
	Mike Kravetz, Andrew Morton, Daniel Jordan

On 12/17/20 12:05 PM, Jason Gunthorpe wrote:
> On Thu, Dec 17, 2020 at 07:05:37PM +0000, Joao Martins wrote:
>>> No reason not to fix set_page_dirty_lock() too while you are here.
>>
>> The whack of atomics you mentioned earlier, I suppose, ends up being
>> account_page_dirtied(). See the partial diff at the end.
> 
> Well, even just eliminating the lock_page, page_mapping, PageDirty,
> etc is already a big win.
> 
> If mapping->a_ops->set_page_dirty() needs to be called multiple times
> on the head page I'd probably just suggest:
> 
>    while (ntails--)
>          rc |= (*spd)(head);

I think once should be enough. There is no counter for page dirtiness,
and this kind of accounting is always tracked in the head page, so there
is no reason to repeatedly call set_page_dirty() from the same
spot.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 7/9] mm/gup: Decrement head page once for group of subpages
  2020-12-19  2:06         ` John Hubbard
@ 2020-12-19 13:10           ` Joao Martins
  0 siblings, 0 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-19 13:10 UTC (permalink / raw)
  To: John Hubbard, Jason Gunthorpe
  Cc: linux-mm, linux-nvdimm, Matthew Wilcox, Muchun Song,
	Mike Kravetz, Andrew Morton, Daniel Jordan

On 12/19/20 2:06 AM, John Hubbard wrote:
> On 12/17/20 12:05 PM, Jason Gunthorpe wrote:
>> On Thu, Dec 17, 2020 at 07:05:37PM +0000, Joao Martins wrote:
>>>> No reason not to fix set_page_dirty_lock() too while you are here.
>>>
>>> The whack of atomics you mentioned earlier, I suppose, ends up being
>>> account_page_dirtied(). See the partial diff at the end.
>>
>> Well, even just eliminating the lock_page, page_mapping, PageDirty,
>> etc is already a big win.
>>
>> If mapping->a_ops->set_page_dirty() needs to be called multiple times
>> on the head page I'd probably just suggest:
>>
>>    while (ntails--)
>>          rc |= (*spd)(head);
> 
> I think once should be enough. There is no counter for page dirtiness,
> and this kind of accounting is always tracked in the head page, so there
> is no reason to repeatedly call set_page_dirty() from the same
> spot.
> 
I think that's what we do even today, considering the Dirty bit is only set on the
compound head (regardless of accounting). Even without this patch,
IIUC we don't call a second set_page_dirty(head) after the first time
we dirty it. So probably there's no optimization to do here, as you say.

	Joao

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 8/9] RDMA/umem: batch page unpin in __ib_mem_release()
  2020-12-09 10:59     ` Joao Martins
@ 2020-12-19 13:15       ` Joao Martins
  0 siblings, 0 replies; 67+ messages in thread
From: Joao Martins @ 2020-12-19 13:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, linux-nvdimm, Matthew Wilcox, Muchun Song,
	Mike Kravetz, Andrew Morton

On 12/9/20 10:59 AM, Joao Martins wrote:
> On 12/8/20 7:29 PM, Jason Gunthorpe wrote:
>> On Tue, Dec 08, 2020 at 05:29:00PM +0000, Joao Martins wrote:
>>
>>>  static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
>>>  {
>>> +	bool make_dirty = umem->writable && dirty;
>>> +	struct page **page_list = NULL;
>>>  	struct sg_page_iter sg_iter;
>>> +	unsigned long nr = 0;
>>>  	struct page *page;
>>>  
>>> +	page_list = (struct page **) __get_free_page(GFP_KERNEL);
>>
>> Gah, no, don't do it like this!
>>
>> Instead something like:
>>
>> 	for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)
>> 	      unpin_user_pages_range_dirty_lock(sg_page(sg), sg->length/PAGE_SIZE,
>>                                                umem->writable && dirty);
>>
>> And have the mm implementation split the contiguous range of pages into
>> pairs of (compound head, ntails) with a bit of maths.
>>
> Got it :)
> 
> I was trying to avoid another exported symbol.
> 
> Albeit upon your suggestion below, it doesn't justify the efficiency/clarity lost.
> 
This more efficient suggestion of yours leads to a further speed up from:

	1073 rounds in 5.004 sec: 4663.622 usec / round (hugetlbfs)

to

	1370 rounds in 5.003 sec: 3651.562 usec / round (hugetlbfs)
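
For reference, the (compound head, ntails) splitting boils down to roughly the
sketch below on the mm side (illustrative only; it leans on gup-internal
helpers like put_compound_head() and assumes the struct pages in the range are
contiguous):

static void unpin_user_pages_range_dirty_lock(struct page *page,
					      unsigned long npages,
					      bool make_dirty)
{
	unsigned long i = 0;

	while (i < npages) {
		struct page *head = compound_head(page + i);
		/* how many of the remaining pages share this compound head */
		unsigned long ntails = min_t(unsigned long, npages - i,
				compound_nr(head) - (page + i - head));

		if (make_dirty && !PageDirty(head))
			set_page_dirty_lock(head);
		put_compound_head(head, ntails, FOLL_PIN);
		i += ntails;
	}
}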

Right after I come back from holidays I will follow up on this in a separate series.

	Joao

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
  2020-12-08 17:28 [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
                   ` (11 preceding siblings ...)
  2020-12-09  9:52 ` [External] " Muchun Song
@ 2021-02-20  1:18 ` Dan Williams
  2021-02-22 11:06   ` Joao Martins
  2021-02-23 16:28   ` Joao Martins
  12 siblings, 2 replies; 67+ messages in thread
From: Dan Williams @ 2021-02-20  1:18 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton

On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Hey,
>
> This small series, attempts at minimizing 'struct page' overhead by
> pursuing a similar approach as Muchun Song series "Free some vmemmap
> pages of hugetlb page"[0] but applied to devmap/ZONE_DEVICE.
>
> [0] https://lore.kernel.org/linux-mm/20201130151838.11208-1-songmuchun@bytedance.com/

Clever!

>
> The link above describes it quite nicely, but the idea is to reuse tail
> page vmemmap areas, particular the area which only describes tail pages.
> So a vmemmap page describes 64 struct pages, and the first page for a given
> ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
> vmemmap page would contain only tail pages, and that's what gets reused across
> the rest of the subsection/section. The bigger the page size, the bigger the
> savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
>
> In terms of savings, per 1Tb of memory, the struct page cost would go down
> with compound pagemap:
>
> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)

Nice!

>
> Along the way I've extended it past 'struct page' overhead *trying* to address a
> few performance issues we knew about for pmem, specifically on the
> {pin,get}_user_pages* function family with device-dax vmas which are really
> slow even of the fast variants. THP is great on -fast variants but all except
> hugetlbfs perform rather poorly on non-fast gup.
>
> So to summarize what the series does:
>
> Patches 1-5: Much like Muchun series, we reuse tail page areas across a given
> page size (namely @align was referred by remaining memremap/dax code) and
> enabling of memremap to initialize the ZONE_DEVICE pages as compound pages or a
> given @align order. The main difference though, is that contrary to the hugetlbfs
> series, there's no vmemmap for the area, because we are onlining it. IOW no
> freeing of pages of already initialized vmemmap like the case for hugetlbfs,
> which simplifies the logic (besides not being arch-specific). After these,
> there's quite visible region bootstrap of pmem memmap given that we would
> initialize fewer struct pages depending on the page size.
>
>     NVDIMM namespace bootstrap improves from ~750ms to ~190ms/<=1ms on emulated NVDIMMs
>     with 2M and 1G respectivally. The net gain in improvement is similarly observed
>     in proportion when running on actual NVDIMMs.

>
> Patch 6 - 8: Optimize grabbing/release a page refcount changes given that we
> are working with compound pages i.e. we do 1 increment/decrement to the head
> page for a given set of N subpages compared as opposed to N individual writes.
> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
> improves considerably, and unpin_user_pages() improves as well when passed a
> set of consecutive pages:
>
>                                            before          after
>     (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
>     (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us

Compelling!

>
> The RDMA patch (patch 8/9) is to demonstrate the improvement for an existing
> user. For unpin_user_pages() we have an additional test to demonstrate the
> improvement.  The test performs MR reg/unreg continuously and measuring its
> rate for a given period. So essentially ib_mem_get and ib_mem_release being
> stress tested which at the end of day means: pin_user_pages_longterm() and
> unpin_user_pages() for a scatterlist:
>
>     Before:
>     159 rounds in 5.027 sec: 31617.923 usec / round (device-dax)
>     466 rounds in 5.009 sec: 10748.456 usec / round (hugetlbfs)
>
>     After:
>     305 rounds in 5.010 sec: 16426.047 usec / round (device-dax)
>     1073 rounds in 5.004 sec: 4663.622 usec / round (hugetlbfs)

Why does hugetlbfs get faster for a ZONE_DEVICE change? Might answer
that question myself when I get to patch 8.

>
> Patch 9: Improves {pin,get}_user_pages() and its longterm counterpart. It
> is very experimental, and I imported most of follow_hugetlb_page(), except
> that we do the same trick as gup-fast. In doing the patch I feel this batching
> should live in follow_page_mask() and having that being changed to return a set
> of pages/something-else when walking over PMD/PUDs for THP / devmap pages. This
> patch then brings the previous test of mr reg/unreg (above) on parity
> between device-dax and hugetlbfs.
>
> Some of the patches are a little fresh/WIP (specially patch 3 and 9) and we are
> still running tests. Hence the RFC, asking for comments and general direction
> of the work before continuing.

Will go look at the code, but I don't see anything scary conceptually
here. The fact that pfn_to_page() does not need to change is among the
most compelling features of this approach.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
  2020-12-08 17:28 ` [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages Joao Martins
  2020-12-09  5:59   ` John Hubbard
@ 2021-02-20  1:24   ` Dan Williams
  2021-02-22 11:09     ` Joao Martins
  1 sibling, 1 reply; 67+ messages in thread
From: Dan Williams @ 2021-02-20  1:24 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton

On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Add a new flag for struct dev_pagemap which designates that a a pagemap
> is described as a set of compound pages or in other words, that how
> pages are grouped together in the page tables are reflected in how we
> describe struct pages. This means that rather than initializing
> individual struct pages, we also initialize these struct pages, as
> compound pages (on x86: 2M or 1G compound pages)
>
> For certain ZONE_DEVICE users, like device-dax, which have a fixed page
> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
> thus playing the same tricks as hugetlb pages.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/linux/memremap.h | 2 ++
>  mm/memremap.c            | 8 ++++++--
>  mm/page_alloc.c          | 7 +++++++
>  3 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 79c49e7f5c30..f8f26b2cc3da 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -90,6 +90,7 @@ struct dev_pagemap_ops {
>  };
>
>  #define PGMAP_ALTMAP_VALID     (1 << 0)
> +#define PGMAP_COMPOUND         (1 << 1)

Why is a new flag needed versus just the align attribute? In other
words there should be no need to go back to the old/slow days of
'struct page' per pfn after compound support is added.
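
In other words, something like the sketch below, assuming @align stays in
struct dev_pagemap as in this series:

static bool pgmap_compound(struct dev_pagemap *pgmap)
{
	/* any alignment above the base page size implies compound pages */
	return pgmap->align > PAGE_SIZE;
}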

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
  2020-12-09  5:59   ` John Hubbard
  2020-12-09  6:33     ` Matthew Wilcox
@ 2021-02-20  1:43     ` Dan Williams
  2021-02-22 11:24       ` Joao Martins
  1 sibling, 1 reply; 67+ messages in thread
From: Dan Williams @ 2021-02-20  1:43 UTC (permalink / raw)
  To: John Hubbard
  Cc: Joao Martins, Linux MM, linux-nvdimm, Matthew Wilcox,
	Jason Gunthorpe, Muchun Song, Mike Kravetz, Andrew Morton

On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 12/8/20 9:28 AM, Joao Martins wrote:
> > Add a new flag for struct dev_pagemap which designates that a a pagemap
>
> a a
>
> > is described as a set of compound pages or in other words, that how
> > pages are grouped together in the page tables are reflected in how we
> > describe struct pages. This means that rather than initializing
> > individual struct pages, we also initialize these struct pages, as
>
> Let's not say "rather than x, we also do y", because it's self-contradictory.
> I think you want to just leave out the "also", like this:
>
> "This means that rather than initializing> individual struct pages, we
> initialize these struct pages ..."
>
> Is that right?
>
> > compound pages (on x86: 2M or 1G compound pages)
> >
> > For certain ZONE_DEVICE users, like device-dax, which have a fixed page
> > size, this creates an opportunity to optimize GUP and GUP-fast walkers,
> > thus playing the same tricks as hugetlb pages.
> >
> > Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> > ---
> >   include/linux/memremap.h | 2 ++
> >   mm/memremap.c            | 8 ++++++--
> >   mm/page_alloc.c          | 7 +++++++
> >   3 files changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > index 79c49e7f5c30..f8f26b2cc3da 100644
> > --- a/include/linux/memremap.h
> > +++ b/include/linux/memremap.h
> > @@ -90,6 +90,7 @@ struct dev_pagemap_ops {
> >   };
> >
> >   #define PGMAP_ALTMAP_VALID  (1 << 0)
> > +#define PGMAP_COMPOUND               (1 << 1)
> >
> >   /**
> >    * struct dev_pagemap - metadata for ZONE_DEVICE mappings
> > @@ -114,6 +115,7 @@ struct dev_pagemap {
> >       struct completion done;
> >       enum memory_type type;
> >       unsigned int flags;
> > +     unsigned int align;
>
> This also needs an "@align" entry in the comment block above.
>
> >       const struct dev_pagemap_ops *ops;
> >       void *owner;
> >       int nr_range;
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index 16b2fb482da1..287a24b7a65a 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> >       memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> >                               PHYS_PFN(range->start),
> >                               PHYS_PFN(range_len(range)), pgmap);
> > -     percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> > -                     - pfn_first(pgmap, range_id));
> > +     if (pgmap->flags & PGMAP_COMPOUND)
> > +             percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> > +                     - pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
>
> Is there some reason that we cannot use range_len(), instead of pfn_end() minus
> pfn_first()? (Yes, this more about the pre-existing code than about your change.)
>
> And if not, then why are the nearby range_len() uses OK? I realize that range_len()
> is simpler and skips a case, but it's not clear that it's required here. But I'm
> new to this area so be warned. :)

There's a subtle distinction between the range that was passed in and
the pfns that are activated inside of it. See the offset trickery in
pfn_first().

> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you
> happen to want, but is not named accordingly. Can you use or create something
> more accurately named? Like "number of pages in this large page"?

It's not the number of pages in a large page, it's converting bytes to
pages. Other places in the kernel write it as (x >> PAGE_SHIFT), but my
thought process was that if I'm going to add () I might as well use a
macro that already does this.

That said I think this calculation is broken precisely because
pfn_first() makes the result unaligned.

Rather than fix the unaligned pfn_first() problem I would use this
support as an opportunity to revisit the option of storing pages in
the vmem_altmap reserve space. The altmap's whole reason for existence
was that 1.5% of large PMEM might completely swamp DRAM. However if
that overhead is reduced by an order (or orders) of magnitude the
primary need for vmem_altmap vanishes.

Now, we'll still need to keep it around for the ->align == PAGE_SIZE
case, but for the most part existing deployments that specify page
map on PMEM with an align > PAGE_SIZE can instead just be transparently
upgraded to page map on a smaller amount of DRAM.

>
> > +     else
> > +             percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> > +                             - pfn_first(pgmap, range_id));
> >       return 0;
> >
> >   err_add_memory:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index eaa227a479e4..9716ecd58e29 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6116,6 +6116,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
> >       unsigned long pfn, end_pfn = start_pfn + nr_pages;
> >       struct pglist_data *pgdat = zone->zone_pgdat;
> >       struct vmem_altmap *altmap = pgmap_altmap(pgmap);
> > +     bool compound = pgmap->flags & PGMAP_COMPOUND;
> > +     unsigned int align = PHYS_PFN(pgmap->align);
>
> Maybe align_pfn or pfn_align? Don't want the same name for things that are actually
> different types, in meaning anyway.

Good catch.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 2/9] sparse-vmemmap: Consolidate arguments in vmemmap section populate
  2020-12-08 17:28 ` [PATCH RFC 2/9] sparse-vmemmap: Consolidate arguments in vmemmap section populate Joao Martins
  2020-12-09  6:16   ` John Hubbard
@ 2021-02-20  1:49   ` Dan Williams
  2021-02-22 11:26     ` Joao Martins
  1 sibling, 1 reply; 67+ messages in thread
From: Dan Williams @ 2021-02-20  1:49 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton

On Tue, Dec 8, 2020 at 9:31 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Replace vmem_altmap with a vmem_context argument. That lets us
> express how the vmemmap is going to be initialized, e.g. passing
> flags and a page size for reusing pages upon initializing the
> vmemmap.
>

Per the comment on the last patch, if compound dev_pagemap never
collides with vmem_altmap then I don't think this patch is needed.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given page size
  2020-12-08 17:28 ` [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given page size Joao Martins
@ 2021-02-20  3:34   ` Dan Williams
  2021-02-22 11:42     ` Joao Martins
  0 siblings, 1 reply; 67+ messages in thread
From: Dan Williams @ 2021-02-20  3:34 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton

On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Introduce a new flag, MEMHP_REUSE_VMEMMAP, which signals that
> struct pages are onlined with a given alignment, and should reuse the
> tail pages vmemmap areas. On that circunstamce we reuse the PFN backing

s/On that circunstamce we reuse/Reuse/

Kills a "we" and switches to imperative tense. I noticed a couple
other "we"s in the previous patches, but this crossed my threshold to
make a comment.

> only the tail pages subsections, while letting the head page PFN remain
> different. This presumes that the backing page structs are compound
> pages, such as the case for compound pagemaps (i.e. ZONE_DEVICE with
> PGMAP_COMPOUND set)
>
> On 2M compound pagemaps, it lets us save 6 pages out of the 8 necessary

s/lets us save/saves/

> PFNs necessary

s/8 necessary PFNs necessary/8 PFNs necessary/

> to describe the subsection's 32K struct pages we are
> onlining.

s/we are onlining/being mapped/

...because ZONE_DEVICE pages are never "onlined".

> On a 1G compound pagemap it let us save 4096 pages.

s/lets us save/saves/

>
> Sections are 128M (or bigger/smaller),

Huh?

> and such when initializing a
> compound memory map where we are initializing compound struct pages, we
> need to preserve the tail page to be reused across the rest of the areas
> for pagesizes which bigger than a section.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
> I wonder, rather than separating vmem_context and mhp_params, that
> one would just pick the latter. Albeit  semantically the ctx aren't
> necessarily paramters, context passed from multiple sections onlining
> (i.e. multiple calls to populate_section_memmap). Also provided that
> this is internal state, which isn't passed to external modules, except
>  @align and @flags for page size and requesting whether to reuse tail
> page areas.
> ---
>  include/linux/memory_hotplug.h | 10 ++++
>  include/linux/mm.h             |  2 +-
>  mm/memory_hotplug.c            | 12 ++++-
>  mm/memremap.c                  |  3 ++
>  mm/sparse-vmemmap.c            | 93 ++++++++++++++++++++++++++++------
>  5 files changed, 103 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 73f8bcbb58a4..e15bb82805a3 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -70,6 +70,10 @@ typedef int __bitwise mhp_t;
>   */
>  #define MEMHP_MERGE_RESOURCE   ((__force mhp_t)BIT(0))
>
> +/*
> + */
> +#define MEMHP_REUSE_VMEMMAP    ((__force mhp_t)BIT(1))
> +
>  /*
>   * Extended parameters for memory hotplug:
>   * altmap: alternative allocator for memmap array (optional)
> @@ -79,10 +83,16 @@ typedef int __bitwise mhp_t;
>  struct mhp_params {
>         struct vmem_altmap *altmap;
>         pgprot_t pgprot;
> +       unsigned int align;
> +       mhp_t flags;
>  };
>
>  struct vmem_context {
>         struct vmem_altmap *altmap;
> +       mhp_t flags;
> +       unsigned int align;
> +       void *block;
> +       unsigned long block_page;
>  };
>
>  /*
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2eb44318bb2d..8b0155441835 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3006,7 +3006,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>  pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> -                           struct vmem_altmap *altmap);
> +                           struct vmem_altmap *altmap, void *block);
>  void *vmemmap_alloc_block(unsigned long size, int node);
>  struct vmem_altmap;
>  void *vmemmap_alloc_block_buf(unsigned long size, int node,
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f8870c53fe5e..56121dfcc44b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -300,6 +300,14 @@ static int check_hotplug_memory_addressable(unsigned long pfn,
>         return 0;
>  }
>
> +static void vmem_context_init(struct vmem_context *ctx, struct mhp_params *params)
> +{
> +       memset(ctx, 0, sizeof(*ctx));
> +       ctx->align = params->align;
> +       ctx->altmap = params->altmap;
> +       ctx->flags = params->flags;
> +}
> +
>  /*
>   * Reasonably generic function for adding memory.  It is
>   * expected that archs that support memory hotplug will
> @@ -313,7 +321,7 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
>         unsigned long cur_nr_pages;
>         int err;
>         struct vmem_altmap *altmap = params->altmap;
> -       struct vmem_context ctx = { .altmap = params->altmap };
> +       struct vmem_context ctx;
>
>         if (WARN_ON_ONCE(!params->pgprot.pgprot))
>                 return -EINVAL;
> @@ -338,6 +346,8 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
>         if (err)
>                 return err;
>
> +       vmem_context_init(&ctx, params);
> +
>         for (; pfn < end_pfn; pfn += cur_nr_pages) {
>                 /* Select all remaining pages up to the next section boundary */
>                 cur_nr_pages = min(end_pfn - pfn,
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 287a24b7a65a..ecfa74848ac6 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -253,6 +253,9 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>                         goto err_kasan;
>                 }
>
> +               if (pgmap->flags & PGMAP_COMPOUND)
> +                       params->align = pgmap->align;
> +
>                 error = arch_add_memory(nid, range->start, range_len(range),
>                                         params);
>         }
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index bcda68ba1381..364c071350e8 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
>  }
>
>  pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> -                                      struct vmem_altmap *altmap)
> +                                      struct vmem_altmap *altmap, void *block)
>  {
>         pte_t *pte = pte_offset_kernel(pmd, addr);
>         if (pte_none(*pte)) {
>                 pte_t entry;
> -               void *p;
> -
> -               p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> -               if (!p)
> -                       return NULL;
> +               void *p = block;
> +
> +               if (!block) {
> +                       p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> +                       if (!p)
> +                               return NULL;
> +               } else {
> +                       get_page(virt_to_page(block));
> +               }
>                 entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
>                 set_pte_at(&init_mm, addr, pte, entry);
>         }
> @@ -216,8 +220,10 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>         return pgd;
>  }
>
> -int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> -                                        int node, struct vmem_altmap *altmap)
> +static void *__meminit __vmemmap_populate_basepages(unsigned long start,
> +                                          unsigned long end, int node,
> +                                          struct vmem_altmap *altmap,
> +                                          void *block)
>  {
>         unsigned long addr = start;
>         pgd_t *pgd;
> @@ -229,38 +235,95 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>         for (; addr < end; addr += PAGE_SIZE) {
>                 pgd = vmemmap_pgd_populate(addr, node);
>                 if (!pgd)
> -                       return -ENOMEM;
> +                       return NULL;
>                 p4d = vmemmap_p4d_populate(pgd, addr, node);
>                 if (!p4d)
> -                       return -ENOMEM;
> +                       return NULL;
>                 pud = vmemmap_pud_populate(p4d, addr, node);
>                 if (!pud)
> -                       return -ENOMEM;
> +                       return NULL;
>                 pmd = vmemmap_pmd_populate(pud, addr, node);
>                 if (!pmd)
> -                       return -ENOMEM;
> -               pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> +                       return NULL;
> +               pte = vmemmap_pte_populate(pmd, addr, node, altmap, block);
>                 if (!pte)
> -                       return -ENOMEM;
> +                       return NULL;
>                 vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
>         }
>
> +       return __va(__pfn_to_phys(pte_pfn(*pte)));
> +}
> +
> +int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> +                                        int node, struct vmem_altmap *altmap)
> +{
> +       if (!__vmemmap_populate_basepages(start, end, node, altmap, NULL))
> +               return -ENOMEM;
>         return 0;
>  }
>
> +static struct page * __meminit vmemmap_populate_reuse(unsigned long start,
> +                                       unsigned long end, int node,
> +                                       struct vmem_context *ctx)
> +{
> +       unsigned long size, addr = start;
> +       unsigned long psize = PHYS_PFN(ctx->align) * sizeof(struct page);
> +
> +       size = min(psize, end - start);
> +
> +       for (; addr < end; addr += size) {
> +               unsigned long head = addr + PAGE_SIZE;
> +               unsigned long tail = addr;
> +               unsigned long last = addr + size;
> +               void *area;
> +
> +               if (ctx->block_page &&
> +                   IS_ALIGNED((addr - ctx->block_page), psize))
> +                       ctx->block = NULL;
> +
> +               area  = ctx->block;
> +               if (!area) {
> +                       if (!__vmemmap_populate_basepages(addr, head, node,
> +                                                         ctx->altmap, NULL))
> +                               return NULL;
> +
> +                       tail = head + PAGE_SIZE;
> +                       area = __vmemmap_populate_basepages(head, tail, node,
> +                                                           ctx->altmap, NULL);
> +                       if (!area)
> +                               return NULL;
> +
> +                       ctx->block = area;
> +                       ctx->block_page = addr;
> +               }
> +
> +               if (!__vmemmap_populate_basepages(tail, last, node,
> +                                                 ctx->altmap, area))
> +                       return NULL;
> +       }

I think that compound page accounting and combined altmap accounting
make this difficult to read, and I think the compound page case
deserves its own first-class loop rather than reusing
vmemmap_populate_basepages(). With the suggestion to drop altmap
support I'd expect a vmemmap_populate_compound() that takes a compound
page size and does the right thing with respect to mapping all the
tail pages to the same pfn.
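
Roughly along these lines, say (a sketch only: vmemmap_populate_compound() is
hypothetical, it still leans on this series' __vmemmap_populate_basepages()
for brevity, and it ignores the altmap and section-crossing details discussed
elsewhere in the thread):

static int __meminit vmemmap_populate_compound(unsigned long start,
				unsigned long end, int node,
				unsigned long pfns_per_compound)
{
	unsigned long size = pfns_per_compound * sizeof(struct page);
	unsigned long addr;

	for (addr = start; addr < end; addr += size) {
		void *block;

		/* the head page's vmemmap page: a real allocation */
		if (!__vmemmap_populate_basepages(addr, addr + PAGE_SIZE,
						  node, NULL, NULL))
			return -ENOMEM;

		/* the first tail vmemmap page: allocated once, then reused */
		block = __vmemmap_populate_basepages(addr + PAGE_SIZE,
				addr + 2 * PAGE_SIZE, node, NULL, NULL);
		if (!block)
			return -ENOMEM;

		/* every remaining tail vmemmap page maps to the same pfn */
		if (!__vmemmap_populate_basepages(addr + 2 * PAGE_SIZE,
						  addr + size, node, NULL, block))
			return -ENOMEM;
	}

	return 0;
}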

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 4/9] mm/page_alloc: Reuse tail struct pages for compound pagemaps
  2020-12-08 17:28 ` [PATCH RFC 4/9] mm/page_alloc: Reuse tail struct pages for compound pagemaps Joao Martins
@ 2021-02-20  6:17   ` Dan Williams
  2021-02-22 12:01     ` Joao Martins
  0 siblings, 1 reply; 67+ messages in thread
From: Dan Williams @ 2021-02-20  6:17 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton

On Tue, Dec 8, 2020 at 9:31 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> When PGMAP_COMPOUND is set, all pages are onlined at a given huge page
> alignment and using compound pages to describe them as opposed to a
> struct per 4K.
>

Same s/online/mapped/ comment as other changelogs.

> To minimize struct page overhead and given the usage of compound pages we
> utilize the fact that most tail pages look the same, we online the
> subsection while pointing to the same pages. Thus request VMEMMAP_REUSE
> in add_pages.
>
> With VMEMMAP_REUSE, provided we reuse most tail pages the amount of
> struct pages we need to initialize is a lot smaller than the total
> amount of structs we would normally online. Thus allow an @init_order
> to be passed to specify how many pages we want to prep upon creating a
> compound page.
>
> Finally when onlining all struct pages in memmap_init_zone_device, make
> sure that we only initialize the unique struct pages i.e. the first 2
> 4K pages from @align which means 128 struct pages out of 32768 for 2M
> @align or 262144 for a 1G @align.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/memremap.c   |  4 +++-
>  mm/page_alloc.c | 23 ++++++++++++++++++++---
>  2 files changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index ecfa74848ac6..3eca07916b9d 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -253,8 +253,10 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>                         goto err_kasan;
>                 }
>
> -               if (pgmap->flags & PGMAP_COMPOUND)
> +               if (pgmap->flags & PGMAP_COMPOUND) {
>                         params->align = pgmap->align;
> +                       params->flags = MEMHP_REUSE_VMEMMAP;

The "reuse" naming is not my favorite. Yes, page reuse is happening,
but what is more relevant is that the vmemmap is in a given minimum
page size mode. So it's less of a flag and more of an enum that selects
between PAGE_SIZE, HPAGE_SIZE, and PUD_PAGE_SIZE (GPAGE_SIZE?).
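
Something like this, say (a sketch, names made up):

enum memmap_geometry {
	MEMMAP_GEOMETRY_PAGE,	/* vmemmap in PAGE_SIZE units, the default */
	MEMMAP_GEOMETRY_PMD,	/* compound pages of HPAGE_SIZE */
	MEMMAP_GEOMETRY_PUD,	/* compound pages of PUD_SIZE */
};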

> +               }
>
>                 error = arch_add_memory(nid, range->start, range_len(range),
>                                         params);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9716ecd58e29..180a7d4e9285 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -691,10 +691,11 @@ void free_compound_page(struct page *page)
>         __free_pages_ok(page, compound_order(page), FPI_NONE);
>  }
>
> -void prep_compound_page(struct page *page, unsigned int order)
> +static void __prep_compound_page(struct page *page, unsigned int order,
> +                                unsigned int init_order)
>  {
>         int i;
> -       int nr_pages = 1 << order;
> +       int nr_pages = 1 << init_order;
>
>         __SetPageHead(page);
>         for (i = 1; i < nr_pages; i++) {
> @@ -711,6 +712,11 @@ void prep_compound_page(struct page *page, unsigned int order)
>                 atomic_set(compound_pincount_ptr(page), 0);
>  }
>
> +void prep_compound_page(struct page *page, unsigned int order)
> +{
> +       __prep_compound_page(page, order, order);
> +}
> +
>  #ifdef CONFIG_DEBUG_PAGEALLOC
>  unsigned int _debug_guardpage_minorder;
>
> @@ -6108,6 +6114,9 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>  }
>
>  #ifdef CONFIG_ZONE_DEVICE
> +
> +#define MEMMAP_COMPOUND_SIZE (2 * (PAGE_SIZE/sizeof(struct page)))
> +
>  void __ref memmap_init_zone_device(struct zone *zone,
>                                    unsigned long start_pfn,
>                                    unsigned long nr_pages,
> @@ -6138,6 +6147,12 @@ void __ref memmap_init_zone_device(struct zone *zone,
>         for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>                 struct page *page = pfn_to_page(pfn);
>
> +               /* Skip already initialized pages. */
> +               if (compound && (pfn % align >= MEMMAP_COMPOUND_SIZE)) {
> +                       pfn = ALIGN(pfn, align) - 1;
> +                       continue;
> +               }
> +
>                 __init_single_page(page, pfn, zone_idx, nid);
>
>                 /*
> @@ -6175,7 +6190,9 @@ void __ref memmap_init_zone_device(struct zone *zone,
>
>         if (compound) {
>                 for (pfn = start_pfn; pfn < end_pfn; pfn += align)
> -                       prep_compound_page(pfn_to_page(pfn), order_base_2(align));
> +                       __prep_compound_page(pfn_to_page(pfn),
> +                                          order_base_2(align),
> +                                          order_base_2(MEMMAP_COMPOUND_SIZE));
>         }

Alex did quite a bit of work to optimize this path, and this
organization appears to undo it. I'd prefer to keep it all in one loop
so a 'struct page' is only initialized once. Otherwise by the time the
above loop finishes and this one starts the 'struct page's are
probably cache cold again.

So I'd break prep_compound_page into separate head and tail init and
call them at the right time in one loop.
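
E.g. a sketch of the single pass, based on the memmap_init_zone_device() loop
quoted above; prep_compound_head()/prep_compound_tail() are hypothetical
split-outs of __prep_compound_page():

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_page(pfn);

		__init_single_page(page, pfn, zone_idx, nid);
		/* ... existing ZONE_DEVICE setup ... */

		if (!compound)
			continue;

		/* do head/tail setup while the struct page is still hot */
		if (pfn % align == 0)
			prep_compound_head(page, order_base_2(align));
		else
			prep_compound_tail(page, pfn % align);
	}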

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
  2021-02-20  1:18 ` Dan Williams
@ 2021-02-22 11:06   ` Joao Martins
  2021-02-22 14:32     ` Joao Martins
  2021-02-23 16:28   ` Joao Martins
  1 sibling, 1 reply; 67+ messages in thread
From: Joao Martins @ 2021-02-22 11:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton

On 2/20/21 1:18 AM, Dan Williams wrote:
> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> The link above describes it quite nicely, but the idea is to reuse tail
>> page vmemmap areas, particular the area which only describes tail pages.
>> So a vmemmap page describes 64 struct pages, and the first page for a given
>> ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
>> vmemmap page would contain only tail pages, and that's what gets reused across
>> the rest of the subsection/section. The bigger the page size, the bigger the
>> savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
>>
>> In terms of savings, per 1Tb of memory, the struct page cost would go down
>> with compound pagemap:
>>
>> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
>> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)
> 
> Nice!
> 

I failed to mention this in the cover letter, but I should say that with this trick we
will need to build the vmemmap page tables with base pages for 2M align, as opposed to
hugepages in the vmemmap page tables (as you probably could tell from the patches). This
means that we have to allocate PMD pages, and that costs 2GB per 1Tb (as opposed to 4M).
This is fixable for 1G align by reusing PMD pages (albeit I haven't done that in this RFC
series).

The footprint reduction is still big, so to reiterate the numbers above (which I will fix
in the v2 cover letter):

* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
* with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)

For vmemmap page tables, we need to use base pages for 2M pages. So taking that into
account, in this RFC series:

* with 2M pages we lose 6G instead of 16G (0.586% instead of 1.5% of total memory)
* with 1G pages we lose ~2GB instead of 16G (0.19% instead of 1.5% of total memory)

For 1G align, we are able to reuse vmemmap PMDs that only point to tail pages, so
ultimately we can get the page table overhead from 2GB to 12MB:

* with 1G pages we lose 20MB instead of 16G (0.0019% instead of 1.5% of total memory)

>>
>> The RDMA patch (patch 8/9) is to demonstrate the improvement for an existing
>> user. For unpin_user_pages() we have an additional test to demonstrate the
>> improvement.  The test performs MR reg/unreg continuously and measuring its
>> rate for a given period. So essentially ib_mem_get and ib_mem_release being
>> stress tested which at the end of day means: pin_user_pages_longterm() and
>> unpin_user_pages() for a scatterlist:
>>
>>     Before:
>>     159 rounds in 5.027 sec: 31617.923 usec / round (device-dax)
>>     466 rounds in 5.009 sec: 10748.456 usec / round (hugetlbfs)
>>
>>     After:
>>     305 rounds in 5.010 sec: 16426.047 usec / round (device-dax)
>>     1073 rounds in 5.004 sec: 4663.622 usec / round (hugetlbfs)
> 
> Why does hugetlbfs get faster for a ZONE_DEVICE change? Might answer
> that question myself when I get to patch 8.
> 
Because the unpinning improvements aren't ZONE_DEVICE specific.

FWIW, I moved those two offending patches outside of this series:

  https://lore.kernel.org/linux-mm/20210212130843.13865-1-joao.m.martins@oracle.com/

>>
>> Patch 9: Improves {pin,get}_user_pages() and its longterm counterpart. It
>> is very experimental, and I imported most of follow_hugetlb_page(), except
>> that we do the same trick as gup-fast. In doing the patch I feel this batching
>> should live in follow_page_mask() and having that being changed to return a set
>> of pages/something-else when walking over PMD/PUDs for THP / devmap pages. This
>> patch then brings the previous test of mr reg/unreg (above) on parity
>> between device-dax and hugetlbfs.
>>
>> Some of the patches are a little fresh/WIP (specially patch 3 and 9) and we are
>> still running tests. Hence the RFC, asking for comments and general direction
>> of the work before continuing.
> 
> Will go look at the code, but I don't see anything scary conceptually
> here. The fact that pfn_to_page() does not need to change is among the
> most compelling features of this approach.
> 
Glad to hear that :D

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
  2021-02-20  1:24   ` Dan Williams
@ 2021-02-22 11:09     ` Joao Martins
  0 siblings, 0 replies; 67+ messages in thread
From: Joao Martins @ 2021-02-22 11:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton

On 2/20/21 1:24 AM, Dan Williams wrote:
> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Add a new flag for struct dev_pagemap which designates that a a pagemap
>> is described as a set of compound pages or in other words, that how
>> pages are grouped together in the page tables are reflected in how we
>> describe struct pages. This means that rather than initializing
>> individual struct pages, we also initialize these struct pages, as
>> compound pages (on x86: 2M or 1G compound pages)
>>
>> For certain ZONE_DEVICE users, like device-dax, which have a fixed page
>> size, this creates an opportunity to optimize GUP and GUP-fast walkers,
>> thus playing the same tricks as hugetlb pages.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  include/linux/memremap.h | 2 ++
>>  mm/memremap.c            | 8 ++++++--
>>  mm/page_alloc.c          | 7 +++++++
>>  3 files changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index 79c49e7f5c30..f8f26b2cc3da 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -90,6 +90,7 @@ struct dev_pagemap_ops {
>>  };
>>
>>  #define PGMAP_ALTMAP_VALID     (1 << 0)
>> +#define PGMAP_COMPOUND         (1 << 1)
> 
> Why is a new flag needed versus just the align attribute? In other
> words there should be no need to go back to the old/slow days of
> 'struct page' per pfn after compound support is added.
> 
Ack, I suppose I could just use pgmap @align attribute as you mentioned.

	Joao

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
  2021-02-20  1:43     ` Dan Williams
@ 2021-02-22 11:24       ` Joao Martins
  2021-02-22 20:37         ` Dan Williams
  0 siblings, 1 reply; 67+ messages in thread
From: Joao Martins @ 2021-02-22 11:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton, John Hubbard

On 2/20/21 1:43 AM, Dan Williams wrote:
> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote:
>> On 12/8/20 9:28 AM, Joao Martins wrote:
>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>> index 16b2fb482da1..287a24b7a65a 100644
>>> --- a/mm/memremap.c
>>> +++ b/mm/memremap.c
>>> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>>>       memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>>>                               PHYS_PFN(range->start),
>>>                               PHYS_PFN(range_len(range)), pgmap);
>>> -     percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>>> -                     - pfn_first(pgmap, range_id));
>>> +     if (pgmap->flags & PGMAP_COMPOUND)
>>> +             percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
>>> +                     - pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
>>
>> Is there some reason that we cannot use range_len(), instead of pfn_end() minus
>> pfn_first()? (Yes, this more about the pre-existing code than about your change.)
>>
>> And if not, then why are the nearby range_len() uses OK? I realize that range_len()
>> is simpler and skips a case, but it's not clear that it's required here. But I'm
>> new to this area so be warned. :)
> 
> There's a subtle distinction between the range that was passed in and
> the pfns that are activated inside of it. See the offset trickery in
> pfn_first().
> 
>> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you
>> happen to want, but is not named accordingly. Can you use or create something
>> more accurately named? Like "number of pages in this large page"?
> 
> It's not the number of pages in a large page it's converting bytes to
> pages. Other place in the kernel write it as (x >> PAGE_SHIFT), but my
> though process was if I'm going to add () might as well use a macro
> that already does this.
> 
> That said I think this calculation is broken precisely because
> pfn_first() makes the result unaligned.
> 
> Rather than fix the unaligned pfn_first() problem I would use this
> support as an opportunity to revisit the option of storing pages in
> the vmem_altmap reserve soace. The altmap's whole reason for existence
> was that 1.5% of large PMEM might completely swamp DRAM. However if
> that overhead is reduced by an order (or orders) of magnitude the
> primary need for vmem_altmap vanishes.
> 
> Now, we'll still need to keep it around for the ->align == PAGE_SIZE
> case, but for most part existing deployments that are specifying page
> map on PMEM and an align > PAGE_SIZE can instead just transparently be
> upgraded to page map on a smaller amount of DRAM.
> 
I feel the altmap is still relevant. Even with the struct page reuse for
tail pages, the overhead for 2M align is still non-negligible, i.e. 4G per
1Tb (strictly speaking about what's stored in the altmap). Muchun and
Matthew were thinking (in another thread) about compound_head() adjustments
that can probably bring this overhead down to 2G (if we learn to differentiate
the reused head page from the real head page). But even then it's still
2G per 1Tb. 1G pages, though, have a better story for removing the need for
the altmap.

One thing to point out about the altmap is that the degradation (in pinning
and unpinning) we observed with struct pages in device memory is no longer
observed once 1) we batch refcount updates as we move to compound pages and
2) reusing tail pages seems to keep these struct pages in cache more often,
which perhaps contributes to dirtying a lot fewer cachelines.

	Joao

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 2/9] sparse-vmemmap: Consolidate arguments in vmemmap section populate
  2021-02-20  1:49   ` Dan Williams
@ 2021-02-22 11:26     ` Joao Martins
  0 siblings, 0 replies; 67+ messages in thread
From: Joao Martins @ 2021-02-22 11:26 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton

On 2/20/21 1:49 AM, Dan Williams wrote:
> On Tue, Dec 8, 2020 at 9:31 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Replace vmem_altmap with a vmem_context argument. That lets us
>> express how the vmemmap is going to be initialized, e.g. passing
>> flags and a page size for reusing pages upon initializing the
>> vmemmap.
>>
> 
> Per the comment on the last patch, if compound dev_pagemap never
> collides with vmem_altmap then I don't think this patch is needed.
> 
See my previous patch reply. It *might* be worth keeping that around.

And since the RFC, nvdimm is going to need a slight adjustment for the
altmap reserve pfn range, should we keep the altmap around.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given page size
  2021-02-20  3:34   ` Dan Williams
@ 2021-02-22 11:42     ` Joao Martins
  2021-02-22 22:40       ` Dan Williams
  0 siblings, 1 reply; 67+ messages in thread
From: Joao Martins @ 2021-02-22 11:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton



On 2/20/21 3:34 AM, Dan Williams wrote:
> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Introduce a new flag, MEMHP_REUSE_VMEMMAP, which signals that
>> struct pages are onlined with a given alignment, and should reuse the
>> tail pages vmemmap areas. On that circunstamce we reuse the PFN backing
> 
> s/On that circunstamce we reuse/Reuse/
> 
> Kills a "we" and switches to imperative tense. I noticed a couple
> other "we"s in the previous patches, but this crossed my threshold to
> make a comment.
> 
/me nods. Will fix.

>> only the tail pages subsections, while letting the head page PFN remain
>> different. This presumes that the backing page structs are compound
>> pages, such as the case for compound pagemaps (i.e. ZONE_DEVICE with
>> PGMAP_COMPOUND set)
>>
>> On 2M compound pagemaps, it lets us save 6 pages out of the 8 necessary
> 
> s/lets us save/saves/
> 
Will fix.

>> PFNs necessary
> 
> s/8 necessary PFNs necessary/8 PFNs necessary/

Will fix.

> 
>> to describe the subsection's 32K struct pages we are
>> onlining.
> 
> s/we are onlining/being mapped/
> 
> ...because ZONE_DEVICE pages are never "onlined".
> 
>> On a 1G compound pagemap it let us save 4096 pages.
> 
> s/lets us save/saves/
> 

Will fix both.

>>
>> Sections are 128M (or bigger/smaller),
> 
> Huh?
> 

Section size is arch-dependent if we are being holistic.
On x86 it's 64M, 128M or 512M, right?

 #ifdef CONFIG_X86_32
 # ifdef CONFIG_X86_PAE
 #  define SECTION_SIZE_BITS     29
 #  define MAX_PHYSMEM_BITS      36
 # else
 #  define SECTION_SIZE_BITS     26
 #  define MAX_PHYSMEM_BITS      32
 # endif
 #else /* CONFIG_X86_32 */
 # define SECTION_SIZE_BITS      27 /* matt - 128 is convenient right now */
 # define MAX_PHYSMEM_BITS       (pgtable_l5_enabled() ? 52 : 46)
 #endif

Also, the reason I point out section sizes is that a 1GB+ page vmemmap population will
cross sections in how sparsemem populates the vmemmap. And in that case we have to reuse
the PTE/PMD pages across multiple invocations of vmemmap_populate_basepages(). Either
that, or look at the previous page PTE, but that might be inefficient.
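
To make the crossing concrete (x86-64 default of 128M sections, 64-byte struct page):

    1G compound page -> 262144 struct pages * 64B = 16M of vmemmap
    128M section     ->  32768 struct pages * 64B =  2M of vmemmap

i.e. a single 1G page spans 8 sections, so its vmemmap gets populated by 8
separate calls, and the reused tail page (plus the PTE/PMD pages backing it)
has to survive across those calls.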

>> @@ -229,38 +235,95 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>>         for (; addr < end; addr += PAGE_SIZE) {
>>                 pgd = vmemmap_pgd_populate(addr, node);
>>                 if (!pgd)
>> -                       return -ENOMEM;
>> +                       return NULL;
>>                 p4d = vmemmap_p4d_populate(pgd, addr, node);
>>                 if (!p4d)
>> -                       return -ENOMEM;
>> +                       return NULL;
>>                 pud = vmemmap_pud_populate(p4d, addr, node);
>>                 if (!pud)
>> -                       return -ENOMEM;
>> +                       return NULL;
>>                 pmd = vmemmap_pmd_populate(pud, addr, node);
>>                 if (!pmd)
>> -                       return -ENOMEM;
>> -               pte = vmemmap_pte_populate(pmd, addr, node, altmap);
>> +                       return NULL;
>> +               pte = vmemmap_pte_populate(pmd, addr, node, altmap, block);
>>                 if (!pte)
>> -                       return -ENOMEM;
>> +                       return NULL;
>>                 vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
>>         }
>>
>> +       return __va(__pfn_to_phys(pte_pfn(*pte)));
>> +}
>> +
>> +int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>> +                                        int node, struct vmem_altmap *altmap)
>> +{
>> +       if (!__vmemmap_populate_basepages(start, end, node, altmap, NULL))
>> +               return -ENOMEM;
>>         return 0;
>>  }
>>
>> +static struct page * __meminit vmemmap_populate_reuse(unsigned long start,
>> +                                       unsigned long end, int node,
>> +                                       struct vmem_context *ctx)
>> +{
>> +       unsigned long size, addr = start;
>> +       unsigned long psize = PHYS_PFN(ctx->align) * sizeof(struct page);
>> +
>> +       size = min(psize, end - start);
>> +
>> +       for (; addr < end; addr += size) {
>> +               unsigned long head = addr + PAGE_SIZE;
>> +               unsigned long tail = addr;
>> +               unsigned long last = addr + size;
>> +               void *area;
>> +
>> +               if (ctx->block_page &&
>> +                   IS_ALIGNED((addr - ctx->block_page), psize))
>> +                       ctx->block = NULL;
>> +
>> +               area  = ctx->block;
>> +               if (!area) {
>> +                       if (!__vmemmap_populate_basepages(addr, head, node,
>> +                                                         ctx->altmap, NULL))
>> +                               return NULL;
>> +
>> +                       tail = head + PAGE_SIZE;
>> +                       area = __vmemmap_populate_basepages(head, tail, node,
>> +                                                           ctx->altmap, NULL);
>> +                       if (!area)
>> +                               return NULL;
>> +
>> +                       ctx->block = area;
>> +                       ctx->block_page = addr;
>> +               }
>> +
>> +               if (!__vmemmap_populate_basepages(tail, last, node,
>> +                                                 ctx->altmap, area))
>> +                       return NULL;
>> +       }
> 
> I think that compound page accounting and combined altmap accounting
> makes this difficult to read, and I think the compound page case
> deserves its own first-class loop rather than reusing
> vmemmap_populate_basepages(). With the suggestion to drop altmap
> support I'd expect a vmemmap_populate_compound that takes a compound
> page size and does the right thing with respect to mapping all the
> tail pages to the same pfn.
> 
I can move this to a separate loop as suggested.

But to be able to map all tail pages in one call of vmemmap_populate_compound()
this might require changes in sparsemem generic code that I am not so sure
are warranted by the added complexity. Otherwise I'll probably have to keep
this logic of @ctx to be able to pass the page to be reused (i.e. @block and
@block_page). That's actually the main reason that made me introduce
a struct vmem_context.
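
To illustrate what I mean, a first-class loop could look roughly like the
below (a sketch only: it reuses the __vmemmap_populate_basepages() helper
from this patch, and the @align parameter/name is made up):

static int __meminit vmemmap_populate_compound(unsigned long start,
				unsigned long end, int node,
				unsigned long align)
{
	/* bytes of vmemmap needed to describe one @align-sized page */
	unsigned long size = PHYS_PFN(align) * sizeof(struct page);
	unsigned long addr;

	for (addr = start; addr < end; addr += size) {
		unsigned long head = addr + PAGE_SIZE;
		unsigned long tail = head + PAGE_SIZE;
		void *block;

		/* the page describing the head gets its own backing page */
		if (!__vmemmap_populate_basepages(addr, head, node, NULL, NULL))
			return -ENOMEM;

		/* the first tail vmemmap page is real, and is what we reuse */
		block = __vmemmap_populate_basepages(head, tail, node, NULL, NULL);
		if (!block)
			return -ENOMEM;

		/* the rest of the tail area points back at @block's pfn */
		if (!__vmemmap_populate_basepages(tail, addr + size, node,
						  NULL, block))
			return -ENOMEM;
	}
	return 0;
}

What this doesn't capture is exactly the cross-section case: @block has to
outlive a single invocation, which is what the vmem_context carries today.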

* Re: [PATCH RFC 4/9] mm/page_alloc: Reuse tail struct pages for compound pagemaps
  2021-02-20  6:17   ` Dan Williams
@ 2021-02-22 12:01     ` Joao Martins
  0 siblings, 0 replies; 67+ messages in thread
From: Joao Martins @ 2021-02-22 12:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton



On 2/20/21 6:17 AM, Dan Williams wrote:
> On Tue, Dec 8, 2020 at 9:31 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> When PGMAP_COMPOUND is set, all pages are onlined at a given huge page
>> alignment and using compound pages to describe them as opposed to a
>> struct per 4K.
>>
> 
> Same s/online/mapped/ comment as other changelogs.
> 
Ack.

>> To minimize struct page overhead and given the usage of compound pages we
>> utilize the fact that most tail pages look the same, we online the
>> subsection while pointing to the same pages. Thus request VMEMMAP_REUSE
>> in add_pages.
>>
>> With VMEMMAP_REUSE, provided we reuse most tail pages the amount of
>> struct pages we need to initialize is a lot smaller than the total
>> amount of structs we would normally online. Thus allow an @init_order
>> to be passed to specify how many pages we want to prep upon creating a
>> compound page.
>>
>> Finally when onlining all struct pages in memmap_init_zone_device, make
>> sure that we only initialize the unique struct pages i.e. the first 2
>> 4K pages from @align which means 128 struct pages out of 32768 for 2M
>> @align or 262144 for a 1G @align.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  mm/memremap.c   |  4 +++-
>>  mm/page_alloc.c | 23 ++++++++++++++++++++---
>>  2 files changed, 23 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index ecfa74848ac6..3eca07916b9d 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -253,8 +253,10 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>>                         goto err_kasan;
>>                 }
>>
>> -               if (pgmap->flags & PGMAP_COMPOUND)
>> +               if (pgmap->flags & PGMAP_COMPOUND) {
>>                         params->align = pgmap->align;
>> +                       params->flags = MEMHP_REUSE_VMEMMAP;
> 
> The "reuse" naming is not my favorite. 

I also dislike it, but couldn't come up with a better one :(

> Yes, page reuse is happening,
> but what is more relevant is that the vmemmap is in a given minimum
> page size mode. So it's less of a flag and more of enum that selects
> between PAGE_SIZE, HPAGE_SIZE, and PUD_PAGE_SIZE (GPAGE_SIZE?).
> 
That does sound cleaner, but at the same time, won't we get confused
with the pgmap @align and the vmemmap/memhp @align?

Hmm, I also think there's value in having two different attributes as
they have two different intents. A pgmap @align means 'represent its
metadata as a huge page of a given size', and the vmemmap/memhp @align
tells sparsemem that we are mapping metadata of a given @align.

The compound pages (pgmap->align) might be useful for other ZONE_DEVICE
users. But I am not sure everybody will want to immediately switch to the
'struct page reuse' trick.
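
To make sure we're talking about the same thing, is something like this
(strawman names, purely illustrative) what you have in mind, kept separate
from pgmap->align?

/* granularity at which the vmemmap itself is backed/populated */
enum memhp_memmap_size {
	MEMHP_MEMMAP_PAGE_SIZE,	/* base pages, no tail-page reuse */
	MEMHP_MEMMAP_PMD_SIZE,	/* 2M metadata granularity */
	MEMHP_MEMMAP_PUD_SIZE,	/* 1G metadata granularity */
};

with pgmap->align continuing to mean "describe this range with compound
pages of this size".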

>> +               }
>>
>>                 error = arch_add_memory(nid, range->start, range_len(range),
>>                                         params);
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 9716ecd58e29..180a7d4e9285 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -691,10 +691,11 @@ void free_compound_page(struct page *page)
>>         __free_pages_ok(page, compound_order(page), FPI_NONE);
>>  }
>>
>> -void prep_compound_page(struct page *page, unsigned int order)
>> +static void __prep_compound_page(struct page *page, unsigned int order,
>> +                                unsigned int init_order)
>>  {
>>         int i;
>> -       int nr_pages = 1 << order;
>> +       int nr_pages = 1 << init_order;
>>
>>         __SetPageHead(page);
>>         for (i = 1; i < nr_pages; i++) {
>> @@ -711,6 +712,11 @@ void prep_compound_page(struct page *page, unsigned int order)
>>                 atomic_set(compound_pincount_ptr(page), 0);
>>  }
>>
>> +void prep_compound_page(struct page *page, unsigned int order)
>> +{
>> +       __prep_compound_page(page, order, order);
>> +}
>> +
>>  #ifdef CONFIG_DEBUG_PAGEALLOC
>>  unsigned int _debug_guardpage_minorder;
>>
>> @@ -6108,6 +6114,9 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>>  }
>>
>>  #ifdef CONFIG_ZONE_DEVICE
>> +
>> +#define MEMMAP_COMPOUND_SIZE (2 * (PAGE_SIZE/sizeof(struct page)))
>> +
>>  void __ref memmap_init_zone_device(struct zone *zone,
>>                                    unsigned long start_pfn,
>>                                    unsigned long nr_pages,
>> @@ -6138,6 +6147,12 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>         for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>>                 struct page *page = pfn_to_page(pfn);
>>
>> +               /* Skip already initialized pages. */
>> +               if (compound && (pfn % align >= MEMMAP_COMPOUND_SIZE)) {
>> +                       pfn = ALIGN(pfn, align) - 1;
>> +                       continue;
>> +               }
>> +
>>                 __init_single_page(page, pfn, zone_idx, nid);
>>
>>                 /*
>> @@ -6175,7 +6190,9 @@ void __ref memmap_init_zone_device(struct zone *zone,
>>
>>         if (compound) {
>>                 for (pfn = start_pfn; pfn < end_pfn; pfn += align)
>> -                       prep_compound_page(pfn_to_page(pfn), order_base_2(align));
>> +                       __prep_compound_page(pfn_to_page(pfn),
>> +                                          order_base_2(align),
>> +                                          order_base_2(MEMMAP_COMPOUND_SIZE));
>>         }
> 
> Alex did quite a bit of work to optimize this path, and this
> organization appears to undo it. I'd prefer to keep it all in one loop
> so a 'struct page' is only initialized once. Otherwise by the time the
> above loop finishes and this one starts the 'struct page's are
> probably cache cold again.
> 
> So I'd break prep_compound_page into separate head and tail init and
> call them at the right time in one loop.
> 
Ah, makes sense! I'll split it into head/tail counterparts -- it might get even
faster than it already is.
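
Something along these lines is what I have in mind (helper names are
placeholders; the logic is just prep_compound_page() split in two):

static void prep_compound_head(struct page *page, unsigned int order)
{
	__SetPageHead(page);
	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
	set_compound_order(page, order);
	atomic_set(compound_mapcount_ptr(page), -1);
	if (hpage_pincount_available(page))
		atomic_set(compound_pincount_ptr(page), 0);
}

static void prep_compound_tail(struct page *head, struct page *tail)
{
	set_page_count(tail, 0);
	tail->mapping = TAIL_MAPPING;
	set_compound_head(tail, head);
}

so that memmap_init_zone_device() can call the tail helper right after
__init_single_page() while the struct page is still hot in cache, and the
head helper once per @align.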

Which makes me wonder if we shouldn't replace that line:

"memmap_init_zone_device initialized NNNNNN pages in 0ms\n"

to use 'us' or 'ns' where applicable. That ought to be more useful information
to the user.
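
e.g. something like (a sketch):

	u64 start = ktime_get_ns();
	...
	pr_info("%s initialised %lu pages in %lluus\n", __func__, nr_pages,
		div_u64(ktime_get_ns() - start, NSEC_PER_USEC));

instead of the millisecond granularity we have today.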

* Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
  2021-02-22 11:06   ` Joao Martins
@ 2021-02-22 14:32     ` Joao Martins
  0 siblings, 0 replies; 67+ messages in thread
From: Joao Martins @ 2021-02-22 14:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton



On 2/22/21 11:06 AM, Joao Martins wrote:
> On 2/20/21 1:18 AM, Dan Williams wrote:
>> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>
>>> The link above describes it quite nicely, but the idea is to reuse tail
>>> page vmemmap areas, particular the area which only describes tail pages.
>>> So a vmemmap page describes 64 struct pages, and the first page for a given
>>> ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second
>>> vmemmap page would contain only tail pages, and that's what gets reused across
>>> the rest of the subsection/section. The bigger the page size, the bigger the
>>> savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).
>>>
>>> In terms of savings, per 1Tb of memory, the struct page cost would go down
>>> with compound pagemap:
>>>
>>> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
>>> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)
>>
>> Nice!
>>
> 
> I failed to mention this in the cover letter but I should say that with this trick we will
> need to build the vmemmap page tables with basepages for 2M align, as opposed to hugepages
> in the vmemmap page tables (as you probably could tell from the patches). 

Also to be clear: by "we will need to build the vmemmap page tables with basepages for 2M
align" I strictly refer to the ZONE_DEVICE range we are mapping the struct pages. It's not
the enterity of the vmemmap!

> This means that
> we have to allocate a PMD page, and that costs 2GB per 1Tb (as opposed to 4M). This is
> fixable for 1G align by reusing PMD pages (albeit I haven't done that in this RFC series).
> 
> The footprint reduction is still big, so to reiterate the numbers above (and I will fix this
> in the v2 cover letter):
> 
> * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
> * with 1G pages we lose 8MB instead of 16G (0.0007% instead of 1.5% of total memory)
> 
> For vmemmap page tables, we need to use base pages for 2M pages. So taking that into
> account, in this RFC series:
> 
> * with 2M pages we lose 6G instead of 16G (0.586% instead of 1.5% of total memory)
> * with 1G pages we lose ~2GB instead of 16G (0.19% instead of 1.5% of total memory)
> 
> For 1G align, we are able to reuse vmemmap PMDs that only point to tail pages, so
> ultimately we can get the page table overhead from 2GB to 12MB:
> 
> * with 1G pages we lose 20MB instead of 16G (0.0019% instead of 1.5% of total memory)
> 
>>>
>>> The RDMA patch (patch 8/9) is to demonstrate the improvement for an existing
>>> user. For unpin_user_pages() we have an additional test to demonstrate the
>>> improvement.  The test performs MR reg/unreg continuously and measuring its
>>> rate for a given period. So essentially ib_mem_get and ib_mem_release being
>>> stress tested which at the end of day means: pin_user_pages_longterm() and
>>> unpin_user_pages() for a scatterlist:
>>>
>>>     Before:
>>>     159 rounds in 5.027 sec: 31617.923 usec / round (device-dax)
>>>     466 rounds in 5.009 sec: 10748.456 usec / round (hugetlbfs)
>>>
>>>     After:
>>>     305 rounds in 5.010 sec: 16426.047 usec / round (device-dax)
>>>     1073 rounds in 5.004 sec: 4663.622 usec / round (hugetlbfs)
>>
>> Why does hugetlbfs get faster for a ZONE_DEVICE change? Might answer
>> that question myself when I get to patch 8.
>>
> Because the unpinning improvements aren't ZONE_DEVICE specific.
> 
> FWIW, I moved those two offending patches outside of this series:
> 
>   https://lore.kernel.org/linux-mm/20210212130843.13865-1-joao.m.martins@oracle.com/
> 
>>>
>>> Patch 9: Improves {pin,get}_user_pages() and its longterm counterpart. It
>>> is very experimental, and I imported most of follow_hugetlb_page(), except
>>> that we do the same trick as gup-fast. In doing the patch I feel this batching
>>> should live in follow_page_mask() and having that being changed to return a set
>>> of pages/something-else when walking over PMD/PUDs for THP / devmap pages. This
>>> patch then brings the previous test of mr reg/unreg (above) on parity
>>> between device-dax and hugetlbfs.
>>>
>>> Some of the patches are a little fresh/WIP (especially patches 3 and 9) and we are
>>> still running tests. Hence the RFC, asking for comments and general direction
>>> of the work before continuing.
>>
>> Will go look at the code, but I don't see anything scary conceptually
>> here. The fact that pfn_to_page() does not need to change is among the
>> most compelling features of this approach.
>>
> Glad to hear that :D
> 

* Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
  2021-02-22 11:24       ` Joao Martins
@ 2021-02-22 20:37         ` Dan Williams
  2021-02-23 15:46           ` Joao Martins
  2021-03-10 18:12           ` Joao Martins
  0 siblings, 2 replies; 67+ messages in thread
From: Dan Williams @ 2021-02-22 20:37 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton, John Hubbard

On Mon, Feb 22, 2021 at 3:24 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 2/20/21 1:43 AM, Dan Williams wrote:
> > On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >> On 12/8/20 9:28 AM, Joao Martins wrote:
> >>> diff --git a/mm/memremap.c b/mm/memremap.c
> >>> index 16b2fb482da1..287a24b7a65a 100644
> >>> --- a/mm/memremap.c
> >>> +++ b/mm/memremap.c
> >>> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> >>>       memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> >>>                               PHYS_PFN(range->start),
> >>>                               PHYS_PFN(range_len(range)), pgmap);
> >>> -     percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> >>> -                     - pfn_first(pgmap, range_id));
> >>> +     if (pgmap->flags & PGMAP_COMPOUND)
> >>> +             percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> >>> +                     - pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
> >>
> >> Is there some reason that we cannot use range_len(), instead of pfn_end() minus
> >> pfn_first()? (Yes, this more about the pre-existing code than about your change.)
> >>
> >> And if not, then why are the nearby range_len() uses OK? I realize that range_len()
> >> is simpler and skips a case, but it's not clear that it's required here. But I'm
> >> new to this area so be warned. :)
> >
> > There's a subtle distinction between the range that was passed in and
> > the pfns that are activated inside of it. See the offset trickery in
> > pfn_first().
> >
> >> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you
> >> happen to want, but is not named accordingly. Can you use or create something
> >> more accurately named? Like "number of pages in this large page"?
> >
> > It's not the number of pages in a large page it's converting bytes to
> > pages. Other place in the kernel write it as (x >> PAGE_SHIFT), but my
> > though process was if I'm going to add () might as well use a macro
> > that already does this.
> >
> > That said I think this calculation is broken precisely because
> > pfn_first() makes the result unaligned.
> >
> > Rather than fix the unaligned pfn_first() problem I would use this
> > support as an opportunity to revisit the option of storing pages in
> > the vmem_altmap reserve space. The altmap's whole reason for existence
> > was that 1.5% of large PMEM might completely swamp DRAM. However if
> > that overhead is reduced by an order (or orders) of magnitude the
> > primary need for vmem_altmap vanishes.
> >
> > Now, we'll still need to keep it around for the ->align == PAGE_SIZE
> > case, but for the most part existing deployments that are specifying page
> > map on PMEM and an align > PAGE_SIZE can instead just transparently be
> > upgraded to page map on a smaller amount of DRAM.
> >
> I feel the altmap is still relevant. Even with the struct page reuse for
> tail pages, the overhead for 2M align is still non-negligible, i.e. 4G per
> 1Tb (strictly speaking about what's stored in the altmap). Muchun and
> Matthew were thinking (in another thread) on compound_head() adjustments
> that probably can make this overhead go to 2G (if we learn to differentiate
> the reused head page from the real head page).

I think that realization is more justification to make a new first
class vmemmap_populate_compound_pages() rather than try to reuse
vmemmap_populate_basepages() with new parameters.

> But even there it's still
> 2G per 1Tb. 1G pages, though, have a better story to remove altmap need.

The concern that led to altmap is that someone would build a system
with a 96:1 (PMEM:RAM) ratio where that correlates to maximum PMEM and
minimum RAM, and mapping all PMEM consumes all RAM. As far as I
understand, real-world populations are rarely going past 8:1, which
seems to make 'struct page' in RAM feasible even for the 2M compound
page case.
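
Roughly, with struct page at 64 bytes (64/4096 = ~1.56% of capacity):

    96:1 PMEM:RAM -> memmap for PMEM = 1.56% * 96 = ~1.5x the RAM (swamped)
     8:1 PMEM:RAM -> memmap for PMEM = 1.56% * 8  = ~12.5% of RAM
     8:1 with 2M compound reuse (~0.4% of PMEM)   = ~3% of RAM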

Let me ask you for a data point, since you're one of the people
actively deploying such systems, would you still use the 'struct page'
in PMEM capability after this set was merged?

> One thing to point out about altmap is that the degradation (in pinning and
> unpinning) we observed with struct pages in device memory is no longer observed
> once 1) we batch ref count updates as we move to compound pages and 2) reusing
> tail pages seems to lead to these struct pages more likely staying in cache,
> which perhaps contributes to dirtying a lot fewer cachelines.

True, it makes it more palatable to survive 'struct page' in PMEM, but
it's an ongoing maintenance burden that I'm not sure will still have users
after putting 'struct page' on a diet. Don't get me wrong, the
capability is still needed for filesystem-dax, but the distinction is
that vmemmap_populate_compound_pages() need never worry about an
altmap.

* Re: [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given page size
  2021-02-22 11:42     ` Joao Martins
@ 2021-02-22 22:40       ` Dan Williams
  2021-02-23 15:46         ` Joao Martins
  0 siblings, 1 reply; 67+ messages in thread
From: Dan Williams @ 2021-02-22 22:40 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton

On Mon, Feb 22, 2021 at 3:42 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
>
>
> On 2/20/21 3:34 AM, Dan Williams wrote:
> > On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> Introduce a new flag, MEMHP_REUSE_VMEMMAP, which signals that
> >> struct pages are onlined with a given alignment, and should reuse the
> >> tail pages vmemmap areas. On that circunstamce we reuse the PFN backing
> >
> > s/On that circunstamce we reuse/Reuse/
> >
> > Kills a "we" and switches to imperative tense. I noticed a couple
> > other "we"s in the previous patches, but this crossed my threshold to
> > make a comment.
> >
> /me nods. Will fix.
>
> >> only the tail pages subsections, while letting the head page PFN remain
> >> different. This presumes that the backing page structs are compound
> >> pages, such as the case for compound pagemaps (i.e. ZONE_DEVICE with
> >> PGMAP_COMPOUND set)
> >>
> >> On 2M compound pagemaps, it lets us save 6 pages out of the 8 necessary
> >
> > s/lets us save/saves/
> >
> Will fix.
>
> >> PFNs necessary
> >
> > s/8 necessary PFNs necessary/8 PFNs necessary/
>
> Will fix.
>
> >
> >> to describe the subsection's 32K struct pages we are
> >> onlining.
> >
> > s/we are onlining/being mapped/
> >
> > ...because ZONE_DEVICE pages are never "onlined".
> >
> >> On a 1G compound pagemap it let us save 4096 pages.
> >
> > s/lets us save/saves/
> >
>
> Will fix both.
>
> >>
> >> Sections are 128M (or bigger/smaller),
> >
> > Huh?
> >
>
> Section size is arch-dependent if we are being holistic.
> On x86 it's 64M, 128M or 512M, right?
>
>  #ifdef CONFIG_X86_32
>  # ifdef CONFIG_X86_PAE
>  #  define SECTION_SIZE_BITS     29
>  #  define MAX_PHYSMEM_BITS      36
>  # else
>  #  define SECTION_SIZE_BITS     26
>  #  define MAX_PHYSMEM_BITS      32
>  # endif
>  #else /* CONFIG_X86_32 */
>  # define SECTION_SIZE_BITS      27 /* matt - 128 is convenient right now */
>  # define MAX_PHYSMEM_BITS       (pgtable_l5_enabled() ? 52 : 46)
>  #endif
>
> Also, the reason I point out section sizes is that a 1GB+ page vmemmap population will
> cross sections in how sparsemem populates the vmemmap. And in that case we have to reuse
> the PTE/PMD pages across multiple invocations of vmemmap_populate_basepages(). Either
> that, or look at the previous page PTE, but that might be inefficient.

Ok, makes sense. I think describing the need to handle
section crossing is clearer than mentioning one of the section sizes.

>
> >> @@ -229,38 +235,95 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> >>         for (; addr < end; addr += PAGE_SIZE) {
> >>                 pgd = vmemmap_pgd_populate(addr, node);
> >>                 if (!pgd)
> >> -                       return -ENOMEM;
> >> +                       return NULL;
> >>                 p4d = vmemmap_p4d_populate(pgd, addr, node);
> >>                 if (!p4d)
> >> -                       return -ENOMEM;
> >> +                       return NULL;
> >>                 pud = vmemmap_pud_populate(p4d, addr, node);
> >>                 if (!pud)
> >> -                       return -ENOMEM;
> >> +                       return NULL;
> >>                 pmd = vmemmap_pmd_populate(pud, addr, node);
> >>                 if (!pmd)
> >> -                       return -ENOMEM;
> >> -               pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> >> +                       return NULL;
> >> +               pte = vmemmap_pte_populate(pmd, addr, node, altmap, block);
> >>                 if (!pte)
> >> -                       return -ENOMEM;
> >> +                       return NULL;
> >>                 vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
> >>         }
> >>
> >> +       return __va(__pfn_to_phys(pte_pfn(*pte)));
> >> +}
> >> +
> >> +int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> >> +                                        int node, struct vmem_altmap *altmap)
> >> +{
> >> +       if (!__vmemmap_populate_basepages(start, end, node, altmap, NULL))
> >> +               return -ENOMEM;
> >>         return 0;
> >>  }
> >>
> >> +static struct page * __meminit vmemmap_populate_reuse(unsigned long start,
> >> +                                       unsigned long end, int node,
> >> +                                       struct vmem_context *ctx)
> >> +{
> >> +       unsigned long size, addr = start;
> >> +       unsigned long psize = PHYS_PFN(ctx->align) * sizeof(struct page);
> >> +
> >> +       size = min(psize, end - start);
> >> +
> >> +       for (; addr < end; addr += size) {
> >> +               unsigned long head = addr + PAGE_SIZE;
> >> +               unsigned long tail = addr;
> >> +               unsigned long last = addr + size;
> >> +               void *area;
> >> +
> >> +               if (ctx->block_page &&
> >> +                   IS_ALIGNED((addr - ctx->block_page), psize))
> >> +                       ctx->block = NULL;
> >> +
> >> +               area  = ctx->block;
> >> +               if (!area) {
> >> +                       if (!__vmemmap_populate_basepages(addr, head, node,
> >> +                                                         ctx->altmap, NULL))
> >> +                               return NULL;
> >> +
> >> +                       tail = head + PAGE_SIZE;
> >> +                       area = __vmemmap_populate_basepages(head, tail, node,
> >> +                                                           ctx->altmap, NULL);
> >> +                       if (!area)
> >> +                               return NULL;
> >> +
> >> +                       ctx->block = area;
> >> +                       ctx->block_page = addr;
> >> +               }
> >> +
> >> +               if (!__vmemmap_populate_basepages(tail, last, node,
> >> +                                                 ctx->altmap, area))
> >> +                       return NULL;
> >> +       }
> >
> > I think that compound page accounting and combined altmap accounting
> > makes this difficult to read, and I think the compound page case
> > deserves its own first-class loop rather than reusing
> > vmemmap_populate_basepages(). With the suggestion to drop altmap
> > support I'd expect a vmemmap_populate_compound that takes a compound
> > page size and does the right thing with respect to mapping all the
> > tail pages to the same pfn.
> >
> I can move this to a separate loop as suggested.
>
> But to be able to map all tail pages in one call of vmemmap_populate_compound()
> this might require changes in sparsemem generic code that I am not so sure
> are warranted by the added complexity. Otherwise I'll probably have to keep
> this logic of @ctx to be able to pass the page to be reused (i.e. @block and
> @block_page). That's actually the main reason that made me introduce
> a struct vmem_context.

Do you need to pass in a vmem_context? Isn't that context local to
vmemmap_populate_compound_pages()?

* Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
  2021-02-22 20:37         ` Dan Williams
@ 2021-02-23 15:46           ` Joao Martins
  2021-02-23 16:50             ` Dan Williams
  2021-03-10 18:12           ` Joao Martins
  1 sibling, 1 reply; 67+ messages in thread
From: Joao Martins @ 2021-02-23 15:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton, John Hubbard

On 2/22/21 8:37 PM, Dan Williams wrote:
> On Mon, Feb 22, 2021 at 3:24 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 2/20/21 1:43 AM, Dan Williams wrote:
>>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>> On 12/8/20 9:28 AM, Joao Martins wrote:
>>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>>> index 16b2fb482da1..287a24b7a65a 100644
>>>>> --- a/mm/memremap.c
>>>>> +++ b/mm/memremap.c
>>>>> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>>>>>       memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>>>>>                               PHYS_PFN(range->start),
>>>>>                               PHYS_PFN(range_len(range)), pgmap);
>>>>> -     percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>>>>> -                     - pfn_first(pgmap, range_id));
>>>>> +     if (pgmap->flags & PGMAP_COMPOUND)
>>>>> +             percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
>>>>> +                     - pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
>>>>
>>>> Is there some reason that we cannot use range_len(), instead of pfn_end() minus
>>>> pfn_first()? (Yes, this more about the pre-existing code than about your change.)
>>>>
>>>> And if not, then why are the nearby range_len() uses OK? I realize that range_len()
>>>> is simpler and skips a case, but it's not clear that it's required here. But I'm
>>>> new to this area so be warned. :)
>>>
>>> There's a subtle distinction between the range that was passed in and
>>> the pfns that are activated inside of it. See the offset trickery in
>>> pfn_first().
>>>
>>>> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you
>>>> happen to want, but is not named accordingly. Can you use or create something
>>>> more accurately named? Like "number of pages in this large page"?
>>>
>>> It's not the number of pages in a large page it's converting bytes to
>>> pages. Other place in the kernel write it as (x >> PAGE_SHIFT), but my
>>> though process was if I'm going to add () might as well use a macro
>>> that already does this.
>>>
>>> That said I think this calculation is broken precisely because
>>> pfn_first() makes the result unaligned.
>>>
>>> Rather than fix the unaligned pfn_first() problem I would use this
>>> support as an opportunity to revisit the option of storing pages in
>>> the vmem_altmap reserve space. The altmap's whole reason for existence
>>> was that 1.5% of large PMEM might completely swamp DRAM. However if
>>> that overhead is reduced by an order (or orders) of magnitude the
>>> primary need for vmem_altmap vanishes.
>>>
>>> Now, we'll still need to keep it around for the ->align == PAGE_SIZE
>>> case, but for most part existing deployments that are specifying page
>>> map on PMEM and an align > PAGE_SIZE can instead just transparently be
>>> upgraded to page map on a smaller amount of DRAM.
>>>
>> I feel the altmap is still relevant. Even with the struct page reuse for
>> tail pages, the overhead for 2M align is still non-negligible, i.e. 4G per
>> 1Tb (strictly speaking about what's stored in the altmap). Muchun and
>> Matthew were thinking (in another thread) on compound_head() adjustments
>> that probably can make this overhead go to 2G (if we learn to differentiate
>> the reused head page from the real head page).
> 
> I think that realization is more justification to make a new first
> class vmemmap_populate_compound_pages() rather than try to reuse
> vmemmap_populate_basepages() with new parameters.
> 
I was already going to move this to vmemmap_populate_compound_pages() based
on your earlier suggestion :)

>> But even there it's still
>> 2G per 1Tb. 1G pages, though, have a better story to remove altmap need.
> 
> The concern that led to altmap is that someone would build a system
> with a 96:1 (PMEM:RAM) ratio where that correlates to maximum PMEM and
> minimum RAM, and mapping all PMEM consumes all RAM. As far as I
> understand, real-world populations are rarely going past 8:1, which
> seems to make 'struct page' in RAM feasible even for the 2M compound
> page case.
> 
> Let me ask you for a data point, since you're one of the people
> actively deploying such systems, would you still use the 'struct page'
> in PMEM capability after this set was merged?
> 
We might be sticking to RAM-stored 'struct page', yes, but it's hard to say atm
what the future holds.

>> One thing to point out about altmap is that the degradation (in pinning and
>> unpinning) we observed with struct pages in device memory is no longer observed
>> once 1) we batch ref count updates as we move to compound pages and 2) reusing
>> tail pages seems to lead to these struct pages more likely staying in cache,
>> which perhaps contributes to dirtying a lot fewer cachelines.
> 
> True, it makes it more palatable to survive 'struct page' in PMEM, but
> it's an ongoing maintenance burden that I'm not sure will still have users
> after putting 'struct page' on a diet. 

FWIW all I was trying to point out is that the 2M huge page overhead is still
non-trivial. It is indeed much better than it is ATM, yes, but still 6G per 1TB with 2M huge
pages. Only with 1G pages would the overhead be non-existent, but then we have a trade-off
elsewhere in terms of poisoning a whole 1G page and whatnot.

> Don't get me wrong the
> capability is still needed for filesystem-dax, but the distinction is
> that vmemmap_populate_compound_pages() need never worry about an
> altmap.
> 
IMO there's not much added complexity, strictly speaking, for the altmap. We still use the
same vmemmap_{pmd,pte,pgd}_populate helpers, which just take an altmap. So whatever is
being maintained for fsdax or other altmap consumers (e.g. we seem to be working towards
hotplug making use of it), we are using in the exact same way.

The complexity of the future vmemmap_populate_compound_pages() has more to do with reusing
vmemmap blocks allocated in previous vmemmap pages, and preserving that across section
onlining (for 1G pages).

	Joao

* Re: [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given page size
  2021-02-22 22:40       ` Dan Williams
@ 2021-02-23 15:46         ` Joao Martins
  0 siblings, 0 replies; 67+ messages in thread
From: Joao Martins @ 2021-02-23 15:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton

On 2/22/21 10:40 PM, Dan Williams wrote:
> On Mon, Feb 22, 2021 at 3:42 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 2/20/21 3:34 AM, Dan Williams wrote:
>>> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> Sections are 128M (or bigger/smaller),
>>>
>>> Huh?
>>>
>>
>> Section size is arch-dependent if we are being holistic.
>> On x86 it's 64M, 128M or 512M, right?
>>
>>  #ifdef CONFIG_X86_32
>>  # ifdef CONFIG_X86_PAE
>>  #  define SECTION_SIZE_BITS     29
>>  #  define MAX_PHYSMEM_BITS      36
>>  # else
>>  #  define SECTION_SIZE_BITS     26
>>  #  define MAX_PHYSMEM_BITS      32
>>  # endif
>>  #else /* CONFIG_X86_32 */
>>  # define SECTION_SIZE_BITS      27 /* matt - 128 is convenient right now */
>>  # define MAX_PHYSMEM_BITS       (pgtable_l5_enabled() ? 52 : 46)
>>  #endif
>>
>> Also, the reason I point out section sizes is that a 1GB+ page vmemmap population will
>> cross sections in how sparsemem populates the vmemmap. And in that case we have to reuse
>> the PTE/PMD pages across multiple invocations of vmemmap_populate_basepages(). Either
>> that, or look at the previous page PTE, but that might be inefficient.
> 
> Ok, makes sense. I think describing the need to handle
> section crossing is clearer than mentioning one of the section sizes.
> 
I'll amend the commit message to have this.

>>
>>>> @@ -229,38 +235,95 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>>>>         for (; addr < end; addr += PAGE_SIZE) {
>>>>                 pgd = vmemmap_pgd_populate(addr, node);
>>>>                 if (!pgd)
>>>> -                       return -ENOMEM;
>>>> +                       return NULL;
>>>>                 p4d = vmemmap_p4d_populate(pgd, addr, node);
>>>>                 if (!p4d)
>>>> -                       return -ENOMEM;
>>>> +                       return NULL;
>>>>                 pud = vmemmap_pud_populate(p4d, addr, node);
>>>>                 if (!pud)
>>>> -                       return -ENOMEM;
>>>> +                       return NULL;
>>>>                 pmd = vmemmap_pmd_populate(pud, addr, node);
>>>>                 if (!pmd)
>>>> -                       return -ENOMEM;
>>>> -               pte = vmemmap_pte_populate(pmd, addr, node, altmap);
>>>> +                       return NULL;
>>>> +               pte = vmemmap_pte_populate(pmd, addr, node, altmap, block);
>>>>                 if (!pte)
>>>> -                       return -ENOMEM;
>>>> +                       return NULL;
>>>>                 vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
>>>>         }
>>>>
>>>> +       return __va(__pfn_to_phys(pte_pfn(*pte)));
>>>> +}
>>>> +
>>>> +int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>>>> +                                        int node, struct vmem_altmap *altmap)
>>>> +{
>>>> +       if (!__vmemmap_populate_basepages(start, end, node, altmap, NULL))
>>>> +               return -ENOMEM;
>>>>         return 0;
>>>>  }
>>>>
>>>> +static struct page * __meminit vmemmap_populate_reuse(unsigned long start,
>>>> +                                       unsigned long end, int node,
>>>> +                                       struct vmem_context *ctx)
>>>> +{
>>>> +       unsigned long size, addr = start;
>>>> +       unsigned long psize = PHYS_PFN(ctx->align) * sizeof(struct page);
>>>> +
>>>> +       size = min(psize, end - start);
>>>> +
>>>> +       for (; addr < end; addr += size) {
>>>> +               unsigned long head = addr + PAGE_SIZE;
>>>> +               unsigned long tail = addr;
>>>> +               unsigned long last = addr + size;
>>>> +               void *area;
>>>> +
>>>> +               if (ctx->block_page &&
>>>> +                   IS_ALIGNED((addr - ctx->block_page), psize))
>>>> +                       ctx->block = NULL;
>>>> +
>>>> +               area  = ctx->block;
>>>> +               if (!area) {
>>>> +                       if (!__vmemmap_populate_basepages(addr, head, node,
>>>> +                                                         ctx->altmap, NULL))
>>>> +                               return NULL;
>>>> +
>>>> +                       tail = head + PAGE_SIZE;
>>>> +                       area = __vmemmap_populate_basepages(head, tail, node,
>>>> +                                                           ctx->altmap, NULL);
>>>> +                       if (!area)
>>>> +                               return NULL;
>>>> +
>>>> +                       ctx->block = area;
>>>> +                       ctx->block_page = addr;
>>>> +               }
>>>> +
>>>> +               if (!__vmemmap_populate_basepages(tail, last, node,
>>>> +                                                 ctx->altmap, area))
>>>> +                       return NULL;
>>>> +       }
>>>
>>> I think that compound page accounting and combined altmap accounting
>>> makes this difficult to read, and I think the compound page case
>>> deserves its own first-class loop rather than reusing
>>> vmemmap_populate_basepages(). With the suggestion to drop altmap
>>> support I'd expect a vmemmap_populate_compound that takes a compound
>>> page size and does the right thing with respect to mapping all the
>>> tail pages to the same pfn.
>>>
>> I can move this to a separate loop as suggested.
>>
>> But to be able to map all tail pages in one call of vmemmap_populate_compound()
>> this might require changes in sparsemem generic code that I am not so sure
>> are warranted by the added complexity. Otherwise I'll probably have to keep
>> this logic of @ctx to be able to pass the page to be reused (i.e. @block and
>> @block_page). That's actually the main reason that made me introduce
>> a struct vmem_context.
> 
> Do you need to pass in a vmem_context? Isn't that context local to
> vmemmap_populate_compound_pages()?
> 

Hmm, so we allocate a vmem_context (inited to zeroes) in __add_pages(), and then we use
the same vmem_context across all sections we are onlining from the pfn range passed in
__add_pages(). So all sections use the same vmem_context. Then we take care in
vmemmap_populate_compound_pages() to check whether there was a @block allocated that needs
to be reused.

So while the content itself is private/local to vmemmap_populate_compound_pages() we still
rely on the fact that vmemmap_populate_compound_pages() always gets the same
vmem_context location passed in for all sections being onlined in the whole pfn range.
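
For clarity, the context in question looks roughly like this (paraphrasing
the fields used in the hunks above, not the literal definition):

struct vmem_context {
	struct vmem_altmap *altmap;
	unsigned long align;		/* compound page size, in bytes */
	void *block;			/* vmemmap page backing the tail area */
	unsigned long block_page;	/* vmemmap address @block was created for */
};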

* Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
  2021-02-20  1:18 ` Dan Williams
  2021-02-22 11:06   ` Joao Martins
@ 2021-02-23 16:28   ` Joao Martins
  2021-02-23 16:44     ` Dan Williams
  1 sibling, 1 reply; 67+ messages in thread
From: Joao Martins @ 2021-02-23 16:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton

On 2/20/21 1:18 AM, Dan Williams wrote:
> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>> Patch 6 - 8: Optimize grabbing/release a page refcount changes given that we
>> are working with compound pages i.e. we do 1 increment/decrement to the head
>> page for a given set of N subpages compared as opposed to N individual writes.
>> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
>> improves considerably, and unpin_user_pages() improves as well when passed a
>> set of consecutive pages:
>>
>>                                            before          after
>>     (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
>>     (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
> 
> Compelling!
> 

BTW is there any reason why we don't support pin_user_pages_fast() with FOLL_LONGTERM for
device-dax?

Looking at the history, I understand that fsdax can't support it atm, but I am not sure
that the same holds for device-dax. I have this small chunk (see below the scissors mark)
which relaxes this for a pgmap of type MEMORY_DEVICE_GENERIC, albeit not sure if there is
a fundamental issue for the other types that makes this an unwelcome change.

	Joao

--------------------->8---------------------

Subject: [PATCH] mm/gup: allow FOLL_LONGTERM pin-fast for
 MEMORY_DEVICE_GENERIC

The downside would be one extra lookup in dev_pagemap tree
for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
per gup-fast() call.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/mm.h |  5 +++++
 mm/gup.c           | 24 +++++++++++++-----------
 2 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 32f0c3986d4f..c89a049bbd7a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1171,6 +1171,11 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }

+static inline bool devmap_longterm_available(const struct dev_pagemap *pgmap)
+{
+	return pgmap->type == MEMORY_DEVICE_GENERIC;
+}
+
 /* 127: arbitrary random number, small enough to assemble well */
 #define page_ref_zero_or_close_to_overflow(page) \
 	((unsigned int) page_ref_count(page) + 127u <= 127u)
diff --git a/mm/gup.c b/mm/gup.c
index 222d1fdc5cfa..03e370d360e6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2092,14 +2092,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned
long end,
 			goto pte_unmap;

 		if (pte_devmap(pte)) {
-			if (unlikely(flags & FOLL_LONGTERM))
-				goto pte_unmap;
-
 			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
 			if (unlikely(!pgmap)) {
 				undo_dev_pagemap(nr, nr_start, flags, pages);
 				goto pte_unmap;
 			}
+
+			if (unlikely(flags & FOLL_LONGTERM) &&
+			    !devmap_longterm_available(pgmap)) {
+				undo_dev_pagemap(nr, nr_start, flags, pages);
+				goto pte_unmap;
+			}
+
 		} else if (pte_special(pte))
 			goto pte_unmap;

@@ -2195,6 +2199,10 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 			return 0;
 		}

+		if (unlikely(flags & FOLL_LONGTERM) &&
+		    !devmap_longterm_available(pgmap))
+			return 0;
+
@@ -2356,12 +2364,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;

-	if (pmd_devmap(orig)) {
-		if (unlikely(flags & FOLL_LONGTERM))
-			return 0;
+	if (pmd_devmap(orig))
 		return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
 					     pages, nr);
-	}

 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
@@ -2390,12 +2395,9 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	if (!pud_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;

-	if (pud_devmap(orig)) {
-		if (unlikely(flags & FOLL_LONGTERM))
-			return 0;
+	if (pud_devmap(orig))
 		return __gup_device_huge_pud(orig, pudp, addr, end, flags,
 					     pages, nr);
-	}

 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
-- 
2.17.1

* Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
  2021-02-23 16:28   ` Joao Martins
@ 2021-02-23 16:44     ` Dan Williams
  2021-02-23 17:15       ` Joao Martins
       [not found]       ` <20210223185435.GO2643399@ziepe.ca>
  0 siblings, 2 replies; 67+ messages in thread
From: Dan Williams @ 2021-02-23 16:44 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton

On Tue, Feb 23, 2021 at 8:30 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 2/20/21 1:18 AM, Dan Williams wrote:
> > On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >> Patch 6 - 8: Optimize grabbing/release a page refcount changes given that we
> >> are working with compound pages i.e. we do 1 increment/decrement to the head
> >> page for a given set of N subpages compared as opposed to N individual writes.
> >> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
> >> improves considerably, and unpin_user_pages() improves as well when passed a
> >> set of consecutive pages:
> >>
> >>                                            before          after
> >>     (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
> >>     (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
> >
> > Compelling!
> >
>
> BTW is there any reason why we don't support pin_user_pages_fast() with FOLL_LONGTERM for
> device-dax?
>

Good catch.

Must have been an oversight of the conversion. FOLL_LONGTERM collides
with filesystem operations, but not device-dax. In fact that's the
motivation for device-dax in the first instance, no need to coordinate
runtime physical address layout changes because the device is
statically allocated.

> Looking at the history, I understand that fsdax can't support it atm, but I am not sure
> that the same holds for device-dax. I have this small chunk (see below the scissors mark)
> which relaxes this for a pgmap of type MEMORY_DEVICE_GENERIC, albeit not sure if there is
> a fundamental issue for the other types that makes this an unwelcome change.
>
>         Joao
>
> --------------------->8---------------------
>
> Subject: [PATCH] mm/gup: allow FOLL_LONGTERM pin-fast for
>  MEMORY_DEVICE_GENERIC
>
> The downside would be one extra lookup in dev_pagemap tree
> for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> per gup-fast() call.

I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
path. It should be measurable that this change is at least as fast or
faster than falling back to the slow path, but it would be good to
measure.

Code changes look good to me.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/linux/mm.h |  5 +++++
>  mm/gup.c           | 24 +++++++++++++-----------
>  2 files changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 32f0c3986d4f..c89a049bbd7a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1171,6 +1171,11 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
>                 page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
>  }
>
> +static inline bool devmap_longterm_available(const struct dev_pagemap *pgmap)
> +{

I'd call this devmap_can_longterm().

> +       return pgmap->type == MEMORY_DEVICE_GENERIC;
> +}
> +
>  /* 127: arbitrary random number, small enough to assemble well */
>  #define page_ref_zero_or_close_to_overflow(page) \
>         ((unsigned int) page_ref_count(page) + 127u <= 127u)
> diff --git a/mm/gup.c b/mm/gup.c
> index 222d1fdc5cfa..03e370d360e6 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2092,14 +2092,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned
> long end,
>                         goto pte_unmap;
>
>                 if (pte_devmap(pte)) {
> -                       if (unlikely(flags & FOLL_LONGTERM))
> -                               goto pte_unmap;
> -
>                         pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
>                         if (unlikely(!pgmap)) {
>                                 undo_dev_pagemap(nr, nr_start, flags, pages);
>                                 goto pte_unmap;
>                         }
> +
> +                       if (unlikely(flags & FOLL_LONGTERM) &&
> +                           !devmap_longterm_available(pgmap)) {
> +                               undo_dev_pagemap(nr, nr_start, flags, pages);
> +                               goto pte_unmap;
> +                       }
> +
>                 } else if (pte_special(pte))
>                         goto pte_unmap;
>
> @@ -2195,6 +2199,10 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>                         return 0;
>                 }
>
> +               if (unlikely(flags & FOLL_LONGTERM) &&
> +                   !devmap_longterm_available(pgmap))
> +                       return 0;
> +
> @@ -2356,12 +2364,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>         if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
>                 return 0;
>
> -       if (pmd_devmap(orig)) {
> -               if (unlikely(flags & FOLL_LONGTERM))
> -                       return 0;
> +       if (pmd_devmap(orig))
>                 return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
>                                              pages, nr);
> -       }
>
>         page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
>         refs = record_subpages(page, addr, end, pages + *nr);
> @@ -2390,12 +2395,9 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>         if (!pud_access_permitted(orig, flags & FOLL_WRITE))
>                 return 0;
>
> -       if (pud_devmap(orig)) {
> -               if (unlikely(flags & FOLL_LONGTERM))
> -                       return 0;
> +       if (pud_devmap(orig))
>                 return __gup_device_huge_pud(orig, pudp, addr, end, flags,
>                                              pages, nr);
> -       }
>
>         page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
>         refs = record_subpages(page, addr, end, pages + *nr);
> --
> 2.17.1

* Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
  2021-02-23 15:46           ` Joao Martins
@ 2021-02-23 16:50             ` Dan Williams
  2021-02-23 17:18               ` Joao Martins
  0 siblings, 1 reply; 67+ messages in thread
From: Dan Williams @ 2021-02-23 16:50 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton, John Hubbard

On Tue, Feb 23, 2021 at 7:46 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 2/22/21 8:37 PM, Dan Williams wrote:
> > On Mon, Feb 22, 2021 at 3:24 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >> On 2/20/21 1:43 AM, Dan Williams wrote:
> >>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >>>> On 12/8/20 9:28 AM, Joao Martins wrote:
> >>>>> diff --git a/mm/memremap.c b/mm/memremap.c
> >>>>> index 16b2fb482da1..287a24b7a65a 100644
> >>>>> --- a/mm/memremap.c
> >>>>> +++ b/mm/memremap.c
> >>>>> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> >>>>>       memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> >>>>>                               PHYS_PFN(range->start),
> >>>>>                               PHYS_PFN(range_len(range)), pgmap);
> >>>>> -     percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> >>>>> -                     - pfn_first(pgmap, range_id));
> >>>>> +     if (pgmap->flags & PGMAP_COMPOUND)
> >>>>> +             percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> >>>>> +                     - pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
> >>>>
> >>>> Is there some reason that we cannot use range_len(), instead of pfn_end() minus
> >>>> pfn_first()? (Yes, this more about the pre-existing code than about your change.)
> >>>>
> >>>> And if not, then why are the nearby range_len() uses OK? I realize that range_len()
> >>>> is simpler and skips a case, but it's not clear that it's required here. But I'm
> >>>> new to this area so be warned. :)
> >>>
> >>> There's a subtle distinction between the range that was passed in and
> >>> the pfns that are activated inside of it. See the offset trickery in
> >>> pfn_first().
> >>>
> >>>> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you
> >>>> happen to want, but is not named accordingly. Can you use or create something
> >>>> more accurately named? Like "number of pages in this large page"?
> >>>
> >>> It's not the number of pages in a large page, it's converting bytes to
> >>> pages. Other places in the kernel write it as (x >> PAGE_SHIFT), but my
> >>> thought process was that if I'm going to add () I might as well use a
> >>> macro that already does this.
> >>>
> >>> That said I think this calculation is broken precisely because
> >>> pfn_first() makes the result unaligned.
> >>>
> >>> Rather than fix the unaligned pfn_first() problem I would use this
> >>> support as an opportunity to revisit the option of storing pages in
> >>> the vmem_altmap reserve space. The altmap's whole reason for existence
> >>> was that 1.5% of large PMEM might completely swamp DRAM. However if
> >>> that overhead is reduced by an order (or orders) of magnitude the
> >>> primary need for vmem_altmap vanishes.
> >>>
> >>> Now, we'll still need to keep it around for the ->align == PAGE_SIZE
> >>> case, but for the most part existing deployments that are specifying page
> >>> map on PMEM and an align > PAGE_SIZE can instead just transparently be
> >>> upgraded to page map on a smaller amount of DRAM.
> >>>
> >> I feel the altmap is still relevant. Even with the struct page reuse for
> >> tail pages, the overhead for 2M align is still non-negligible, i.e. 4G per
> >> 1TB (strictly speaking about what's stored in the altmap). Muchun and
> >> Matthew were thinking (in another thread) about compound_head() adjustments
> >> that could probably bring this overhead down to 2G (if we learn to
> >> differentiate the reused head page from the real head page).
> >
> > I think that realization is more justification to make a new first
> > class vmemmap_populate_compound_pages() rather than try to reuse
> > vmemmap_populate_basepages() with new parameters.
> >
> I was already going to move this to vmemmap_populate_compound_pages() based
> on your earlier suggestion :)
>
> >> But even there it's still
> >> 2G per 1TB. 1G pages, though, have a better story for removing the need for altmap.
> >
> > The concern that led to altmap is that someone would build a system
> > with a 96:1 (PMEM:RAM) ratio where that correlates to maximum PMEM and
> > minimum RAM, and mapping all PMEM consumes all RAM. As far as I
> > understand real world populations are rarely going past 8:1, that
> > seems to make 'struct page' in RAM feasible even for the 2M compound
> > page case.
> >
> > Let me ask you for a data point, since you're one of the people
> > actively deploying such systems, would you still use the 'struct page'
> > in PMEM capability after this set was merged?
> >
> We might be sticking to RAM-stored 'struct page', yes, but it's hard to say
> atm what the future holds.
>
> >> One thing to point out about altmap is that the degradation (in pinning and
> >> unpinning) we observed with struct pages in device memory is no longer observed
> >> once 1) we batch ref count updates as we move to compound pages and 2) reusing
> >> tail pages seems to lead to these struct pages staying more likely in cache,
> >> which perhaps contributes to dirtying a lot fewer cachelines.
> >
> > True, it makes it more palatable to survive 'struct page' in PMEM, but
> > it's an ongoing maintenance burden, and I'm not sure there will still be
> > users for it after putting 'struct page' on a diet.
>
> FWIW all I was trying to point out is that the 2M huge page overhead is still
> non-trivial. It is indeed much better than it is ATM, yes, but still 6G per 1TB
> with 2M huge pages. Only with 1G pages would the overhead be non-existent, but then
> we have a trade-off elsewhere in terms of poisoning a whole 1G page and whatnot.
>
> > Don't get me wrong the
> > capability is still needed for filesystem-dax, but the distinction is
> > that vmemmap_populate_compound_pages() need never worry about an
> > altmap.
> >
> IMO there's not much added complexity strictly speaking about altmap. We still use the
> same vmemmap_{pmd,pte,pgd}_populate helpers, which just pass an altmap. So whatever is
> being maintained for fsdax or other altmap consumers (e.g. we seem to be working towards
> hotplug making use of it), we are using it in exactly the same way.
>
> The complexity of the future vmemmap_populate_compound_pages() has more to do with reusing
> vmemmap blocks allocated in previous vmemmap pages, and preserving that across section
> onlining (for 1G pages).

True, I'm less worried about the complexity as much as
opportunistically converting configurations to RAM backed pages. It's
already the case that poison handling is page mapping size aligned for
device-dax, and filesystem-dax needs to stick with non-compound-pages
for the foreseeable future.

Ok, let's try to keep altmap in vmemmap_populate_compound_pages() and
see how it looks.
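
As a reference for the per-1TB numbers quoted above, this is roughly the
arithmetic behind them (assuming a 64-byte 'struct page' and 4K base pages;
a sketch, the exact accounting of the 6G figure may differ):

    no vmemmap reuse:   (1TB / 4KB) struct pages * 64B             = 16GB of vmemmap
    2M compound pages:  8 vmemmap pages per 2MB, 2 left unique  -> (1TB / 2MB) * 8KB = 4GB
    + reused head page: 1 vmemmap page left unique per 2MB      -> (1TB / 2MB) * 4KB = 2GB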

* Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
  2021-02-23 16:44     ` Dan Williams
@ 2021-02-23 17:15       ` Joao Martins
  2021-02-23 18:15         ` Dan Williams
       [not found]       ` <20210223185435.GO2643399@ziepe.ca>
  1 sibling, 1 reply; 67+ messages in thread
From: Joao Martins @ 2021-02-23 17:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton

On 2/23/21 4:44 PM, Dan Williams wrote:
> On Tue, Feb 23, 2021 at 8:30 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> On 2/20/21 1:18 AM, Dan Williams wrote:
>>> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> Patch 6 - 8: Optimize grabbing/release a page refcount changes given that we
>>>> are working with compound pages i.e. we do 1 increment/decrement to the head
>>>> page for a given set of N subpages compared as opposed to N individual writes.
>>>> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
>>>> improves considerably, and unpin_user_pages() improves as well when passed a
>>>> set of consecutive pages:
>>>>
>>>>                                            before          after
>>>>     (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
>>>>     (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
>>>
>>> Compelling!
>>>
>>
>> BTW is there any reason why we don't support pin_user_pages_fast() with FOLL_LONGTERM for
>> device-dax?
>>
> 
> Good catch.
> 
> Must have been an oversight of the conversion. FOLL_LONGTERM collides
> with filesystem operations, but not device-dax. 

hmmmm, fwiw, it was unilaterally disabled for any devmap pmd/pud in commit 7af75561e171
("mm/gup: add FOLL_LONGTERM capability to GUP fast"), and I can only assume that
by "DAX pages" the submitter was referring only to fs-dax pages.

> In fact that's the
> motivation for device-dax in the first instance, no need to coordinate
> runtime physical address layout changes because the device is
> statically allocated.
> 
/me nods

>> Looking at the history, I understand that fsdax can't support it atm, but I am not sure
>> that the same holds for device-dax. I have this small chunk (see below the scissors mark)
>> which relaxes this for a pgmap of type MEMORY_DEVICE_GENERIC, though I am not sure if
>> there is a fundamental issue for the other types that makes this an unwelcome change.
>>
>>         Joao
>>
>> --------------------->8---------------------
>>
>> Subject: [PATCH] mm/gup: allow FOLL_LONGTERM pin-fast for
>>  MEMORY_DEVICE_GENERIC
>>
>> The downside would be one extra lookup in dev_pagemap tree
>> for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
>> per gup-fast() call.
> 
> I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> path. It should be measurable that this change is at least as fast or
> faster than falling back to the slow path, but it would be good to
> measure.
> 
But with the changes I am/will be making, I hope gup-fast and gup-slow
will be just as fast (for present pmds/puds ofc, as the fault path makes it slower).

I'll formally submit the patch below once I've gone over the numbers.

> Code changes look good to me.
> 
Cool! Will add in the suggested change below.

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  include/linux/mm.h |  5 +++++
>>  mm/gup.c           | 24 +++++++++++++-----------
>>  2 files changed, 18 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 32f0c3986d4f..c89a049bbd7a 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1171,6 +1171,11 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
>>                 page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
>>  }
>>
>> +static inline bool devmap_longterm_available(const struct dev_pagemap *pgmap)
>> +{
> 
> I'd call this devmap_can_longterm().
> 
Ack.
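
i.e. the helper would end up as something like:

static inline bool devmap_can_longterm(const struct dev_pagemap *pgmap)
{
	/* Only device-dax (MEMORY_DEVICE_GENERIC) has a statically
	 * allocated physical layout, so a longterm pin does not
	 * collide with filesystem block layout changes. */
	return pgmap->type == MEMORY_DEVICE_GENERIC;
}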

* Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
  2021-02-23 16:50             ` Dan Williams
@ 2021-02-23 17:18               ` Joao Martins
  2021-02-23 18:18                 ` Dan Williams
  0 siblings, 1 reply; 67+ messages in thread
From: Joao Martins @ 2021-02-23 17:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton, John Hubbard

On 2/23/21 4:50 PM, Dan Williams wrote:
> On Tue, Feb 23, 2021 at 7:46 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 2/22/21 8:37 PM, Dan Williams wrote:
>>> On Mon, Feb 22, 2021 at 3:24 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> On 2/20/21 1:43 AM, Dan Williams wrote:
>>>>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>>>> On 12/8/20 9:28 AM, Joao Martins wrote:

[...]

>>> Don't get me wrong the
>>> capability is still needed for filesystem-dax, but the distinction is
>>> that vmemmap_populate_compound_pages() need never worry about an
>>> altmap.
>>>
>> IMO there's not much added complexity strictly speaking about altmap. We still use the
>> same vmemmap_{pmd,pte,pgd}_populate helpers, which just pass an altmap. So whatever is
>> being maintained for fsdax or other altmap consumers (e.g. we seem to be working towards
>> hotplug making use of it), we are using it in exactly the same way.
>>
>> The complexity of the future vmemmap_populate_compound_pages() has more to do with reusing
>> vmemmap blocks allocated in previous vmemmap pages, and preserving that across section
>> onlining (for 1G pages).
> 
> True, I'm less worried about the complexity as much as
> opportunistically converting configurations to RAM backed pages. It's
> already the case that poison handling is page mapping size aligned for
> device-dax, and filesystem-dax needs to stick with non-compound-pages
> for the foreseeable future.
> 
Hmm, I was sort of wondering whether fsdax could move to compound pages too, as
opposed to base pages, albeit not necessarily using the vmemmap page reuse,
since it splits pages IIUC.

> Ok, let's try to keep altmap in vmemmap_populate_compound_pages() and
> see how it looks.
> 
OK, will do.

* Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
  2021-02-23 17:15       ` Joao Martins
@ 2021-02-23 18:15         ` Dan Williams
  0 siblings, 0 replies; 67+ messages in thread
From: Dan Williams @ 2021-02-23 18:15 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton

On Tue, Feb 23, 2021 at 9:16 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 2/23/21 4:44 PM, Dan Williams wrote:
> > On Tue, Feb 23, 2021 at 8:30 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> On 2/20/21 1:18 AM, Dan Williams wrote:
> >>> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>> Patch 6 - 8: Optimize grabbing/release a page refcount changes given that we
> >>>> are working with compound pages i.e. we do 1 increment/decrement to the head
> >>>> page for a given set of N subpages compared as opposed to N individual writes.
> >>>> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
> >>>> improves considerably, and unpin_user_pages() improves as well when passed a
> >>>> set of consecutive pages:
> >>>>
> >>>>                                            before          after
> >>>>     (get_user_pages_fast 1G;2M page size) ~75k  us -> ~3.2k ; ~5.2k us
> >>>>     (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
> >>>
> >>> Compelling!
> >>>
> >>
> >> BTW is there any reason why we don't support pin_user_pages_fast() with FOLL_LONGTERM for
> >> device-dax?
> >>
> >
> > Good catch.
> >
> > Must have been an oversight of the conversion. FOLL_LONGTERM collides
> > with filesystem operations, but not device-dax.
>
> hmmmm, fwiw, it was unilaterally disabled for any devmap pmd/pud in commit 7af75561e171
> ("mm/gup: add FOLL_LONGTERM capability to GUP fast"), and I can only assume that
> by "DAX pages" the submitter was referring only to fs-dax pages.

Agree, that was an fsdax only assumption. Maybe this went unnoticed
because the primary gup-fast case for direct-I/O was not impacted.

* Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
  2021-02-23 17:18               ` Joao Martins
@ 2021-02-23 18:18                 ` Dan Williams
  0 siblings, 0 replies; 67+ messages in thread
From: Dan Williams @ 2021-02-23 18:18 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton, John Hubbard

On Tue, Feb 23, 2021 at 9:19 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 2/23/21 4:50 PM, Dan Williams wrote:
> > On Tue, Feb 23, 2021 at 7:46 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >> On 2/22/21 8:37 PM, Dan Williams wrote:
> >>> On Mon, Feb 22, 2021 at 3:24 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>> On 2/20/21 1:43 AM, Dan Williams wrote:
> >>>>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >>>>>> On 12/8/20 9:28 AM, Joao Martins wrote:
>
> [...]
>
> >>> Don't get me wrong the
> >>> capability is still needed for filesystem-dax, but the distinction is
> >>> that vmemmap_populate_compound_pages() need never worry about an
> >>> altmap.
> >>>
> >> IMO there's not much added complexity strictly speaking about altmap. We still use the
> >> same vmemmap_{pmd,pte,pgd}_populate helpers, which just pass an altmap. So whatever is
> >> being maintained for fsdax or other altmap consumers (e.g. we seem to be working towards
> >> hotplug making use of it), we are using it in exactly the same way.
> >>
> >> The complexity of the future vmemmap_populate_compound_pages() has more to do with reusing
> >> vmemmap blocks allocated in previous vmemmap pages, and preserving that across section
> >> onlining (for 1G pages).
> >
> > True, I'm less worried about the complexity as much as
> > opportunistically converting configurations to RAM backed pages. It's
> > already the case that poison handling is page mapping size aligned for
> > device-dax, and filesystem-dax needs to stick with non-compound-pages
> > for the foreseeable future.
> >
> Hmm, I was sort of wondering whether fsdax could move to compound pages too, as
> opposed to base pages, albeit not necessarily using the vmemmap page reuse,
> since it splits pages IIUC.

I'm not sure compound pages for fsdax would work long term because
there's no infrastructure to reassemble compound pages after a split.
So if you fracture a block and then coalesce it back to a 2MB or 1GB
aligned block there's nothing to go fixup the compound page... unless
the filesystem wants to get into mm metadata fixups.

* Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
       [not found]       ` <20210223185435.GO2643399@ziepe.ca>
@ 2021-02-23 22:48         ` Dan Williams
       [not found]           ` <20210223230723.GP2643399@ziepe.ca>
  0 siblings, 1 reply; 67+ messages in thread
From: Dan Williams @ 2021-02-23 22:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joao Martins, Linux MM, linux-nvdimm, Matthew Wilcox,
	Muchun Song, Mike Kravetz, Andrew Morton

On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
>
> > > The downside would be one extra lookup in dev_pagemap tree
> > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > per gup-fast() call.
> >
> > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > path. It should be measurable that this change is at least as fast or
> > faster than falling back to the slow path, but it would be good to
> > measure.
>
> What is the dev_pagemap thing doing in gup fast anyhow?
>
> I've been wondering for a while..

It's there to synchronize against dax-device removal. The device will
suspend removal awaiting all page references to be dropped, but
gup-fast could be racing device removal. So gup-fast checks for
pte_devmap() to grab a live reference to the device before assuming it
can pin a page.
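
i.e. in gup_pte_range() the devmap case looks roughly like this (simplified
from the same code the patch in this thread touches, comments added):

		if (pte_devmap(pte)) {
			/* take a reference on the hosting dev_pagemap; this
			 * is what keeps the device alive while the page
			 * reference is being acquired */
			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
			if (unlikely(!pgmap)) {
				/* device is going away: drop what was pinned
				 * so far and fall back to the slow path */
				undo_dev_pagemap(nr, nr_start, flags, pages);
				goto pte_unmap;
			}
		} else if (pte_special(pte))
			goto pte_unmap;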

* Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
       [not found]           ` <20210223230723.GP2643399@ziepe.ca>
@ 2021-02-24  0:14             ` Dan Williams
       [not found]               ` <20210224010017.GQ2643399@ziepe.ca>
  0 siblings, 1 reply; 67+ messages in thread
From: Dan Williams @ 2021-02-24  0:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joao Martins, Linux MM, linux-nvdimm, Matthew Wilcox,
	Muchun Song, Mike Kravetz, Andrew Morton, Ralph Campbell

[ add Ralph ]

On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote:
> > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> > >
> > > > > The downside would be one extra lookup in dev_pagemap tree
> > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > > > per gup-fast() call.
> > > >
> > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > > > path. It should be measurable that this change is at least as fast or
> > > > faster than falling back to the slow path, but it would be good to
> > > > measure.
> > >
> > > What is the dev_pagemap thing doing in gup fast anyhow?
> > >
> > > I've been wondering for a while..
> >
> > It's there to synchronize against dax-device removal. The device will
> > suspend removal awaiting all page references to be dropped, but
> > gup-fast could be racing device removal. So gup-fast checks for
> > pte_devmap() to grab a live reference to the device before assuming it
> > can pin a page.
>
> From the perspective of CPU A it can't tell if CPU B is doing a HW
> page table walk or a GUP fast when it invalidates a page table. The
> design of gup-fast is supposed to be the same as the design of a HW
> page table walk, and the tlb invalidate CPU A does when removing a
> page from a page table is supposed to serialize against both a HW page
> table walk and gup-fast.
>
> Given that the HW page table walker does not do dev_pagemap stuff, why
> does gup-fast?

gup-fast historically assumed that the 'struct page' and memory
backing the page-table walk could not physically be removed from the
system during its walk because those pages were allocated from the
page allocator before being mapped into userspace. So there is an
implied elevated reference on any page that gup-fast would be asked to
walk, or pte_special() is there to say "wait, nevermind, this isn't a
page allocator page, fall back to gup-slow()". pte_devmap() is there to
say "wait, there is no implied elevated reference for this page, check
and hold dev_pagemap alive until a page reference can be taken". So it
splits the difference between pte_special() and typical page allocator
pages.

> Can you sketch the exact race this is protecting against?

Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and
issues direct I/O with that mapping as the target buffer, Thread2 does
"echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without
the dev_pagemap check reference gup-fast could execute
get_page(pte_page(pte)) on a page that doesn't even exist anymore
because the driver unbind has already performed remove_pages().

Effectively the same percpu_ref that protects the pmem0 block device
from new command submissions while the device is dying also prevents
new dax page references being taken while the device is dying.

This could be solved with the traditional gup-fast rules if the device
driver could tell the filesystem to unmap all dax files and force them
to re-fault through the gup-slow path to see that the device is now
dying. I'll likely be working on that sooner rather than later given
some of the expectations of the CXL persistent memory "dirty shutdown"
detection.
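
Spelled out as a sequence (simplified):

    1. Thread1: gup-fast finds a devmap pte for the mmap'd dax file and is
       about to take a page reference for the direct I/O.
    2. Thread2: the unbind, seeing no outstanding references, completes
       remove_pages(), tearing down the ZONE_DEVICE 'struct page' array for
       that pfn range.
    3. Thread1: get_page(pte_page(pte)) now touches a 'struct page' that no
       longer exists.

The get_dev_pagemap() call between steps 1 and 3 closes that window: either
it takes a reference the unbind must wait for, or it fails and gup-fast
falls back to the slow path.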

* Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
       [not found]               ` <20210224010017.GQ2643399@ziepe.ca>
@ 2021-02-24  1:32                 ` Dan Williams
  0 siblings, 0 replies; 67+ messages in thread
From: Dan Williams @ 2021-02-24  1:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joao Martins, Linux MM, linux-nvdimm, Matthew Wilcox,
	Muchun Song, Mike Kravetz, Andrew Morton, Ralph Campbell

On Tue, Feb 23, 2021 at 5:00 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Feb 23, 2021 at 04:14:01PM -0800, Dan Williams wrote:
> > [ add Ralph ]
> >
> > On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote:
> > > > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > >
> > > > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> > > > >
> > > > > > > The downside would be one extra lookup in dev_pagemap tree
> > > > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > > > > > per gup-fast() call.
> > > > > >
> > > > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > > > > > path. It should be measurable that this change is at least as fast or
> > > > > > faster than falling back to the slow path, but it would be good to
> > > > > > measure.
> > > > >
> > > > > What is the dev_pagemap thing doing in gup fast anyhow?
> > > > >
> > > > > I've been wondering for a while..
> > > >
> > > > It's there to synchronize against dax-device removal. The device will
> > > > suspend removal awaiting all page references to be dropped, but
> > > > gup-fast could be racing device removal. So gup-fast checks for
> > > > pte_devmap() to grab a live reference to the device before assuming it
> > > > can pin a page.
> > >
> > > From the perspective of CPU A it can't tell if CPU B is doing a HW
> > > page table walk or a GUP fast when it invalidates a page table. The
> > > design of gup-fast is supposed to be the same as the design of a HW
> > > page table walk, and the tlb invalidate CPU A does when removing a
> > > page from a page table is supposed to serialize against both a HW page
> > > table walk and gup-fast.
> > >
> > > Given that the HW page table walker does not do dev_pagemap stuff, why
> > > does gup-fast?
> >
> > gup-fast historically assumed that the 'struct page' and memory
> > backing the page-table walk could not physically be removed from the
> > system during its walk because those pages were allocated from the
> > page allocator before being mapped into userspace.
>
> No, I'd say gup-fast assumes that any non-special PTE it finds in a
> page table must have a struct page.
>
> If something wants to remove that struct page it must first remove all
> the PTEs pointing at it from the entire system and flush the TLBs,
> which directly prevents a future gup-fast from running and trying to
> access the struct page. No extra locking needed
>
> > implied elevated reference on any page that gup-fast would be asked to
> > walk, or pte_special() is there to say "wait, nevermind, this isn't a
> > page allocator page, fall back to gup-slow()".
>
> pte_special says there is no struct page, and some of those cases can
> be fixed up in gup-slow.
>
> > > Can you sketch the exact race this is protecting against?
> >
> > Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and
> > issues direct I/O with that mapping as the target buffer, Thread2 does
> > "echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without
> > the dev_pagemap check reference gup-fast could execute
> > get_page(pte_page(pte)) on a page that doesn't even exist anymore
> > because the driver unbind has already performed remove_pages().
>
> Surely the unbind either waits for all the VMAs to be destroyed or
> zaps them before allowing things to progress to remove_pages()?

If we're talking about device-dax this is precisely what it does, zaps
and prevents new faults from resolving, but filesystem-dax...

> Having a situation where the CPU page tables still point at physical
> pages that have been removed sounds so crazy/insecure, that can't be
> what is happening, can it??

Hmm, that may be true and an original dax bug! The unbind of a
block-device from underneath the filesystem does trigger the
filesystem to emergency shutdown / go read-only, but unless that
process also includes a global zap of all dax mappings not only is
that violating expectations of "page-tables to disappearing memory",
but the filesystem may also want to guarantee that no further dax
writes can happen after shutdown. Right now I believe it only assumes
that mmap I/O will come from page writeback so there's no need to
bother applications with mappings to page cache, but dax mappings need
to be ripped away.

/me goes to look at what filesystems guarantee when the block device is
surprise-removed out from under them.

In any event, this accelerates the effort to go implement
fs-global-dax-zap at the request of the device driver.

* Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
  2021-02-22 20:37         ` Dan Williams
  2021-02-23 15:46           ` Joao Martins
@ 2021-03-10 18:12           ` Joao Martins
  2021-03-12  5:54             ` Dan Williams
  1 sibling, 1 reply; 67+ messages in thread
From: Joao Martins @ 2021-03-10 18:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton, John Hubbard

On 2/22/21 8:37 PM, Dan Williams wrote:
> On Mon, Feb 22, 2021 at 3:24 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 2/20/21 1:43 AM, Dan Williams wrote:
>>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>>> On 12/8/20 9:28 AM, Joao Martins wrote:
>> One thing to point out about altmap is that the degradation (in pinning and
>> unpinning) we observed with struct pages in device memory is no longer observed
>> once 1) we batch ref count updates as we move to compound pages and 2) reusing
>> tail pages seems to lead to these struct pages staying more likely in cache,
>> which perhaps contributes to dirtying a lot fewer cachelines.
> 
> True, it makes it more palatable to survive 'struct page' in PMEM, 

I want to retract for now what I said above wrt the comment about no degradation
with struct pages in device memory. I was fooled by a bug in a patch later in
this series. In particular, I accidentally cleared PGMAP_ALTMAP_VALID when
unilaterally setting PGMAP_COMPOUND, which consequently led to struct pages
always being allocated from regular memory. No wonder the numbers were just as
fast. I am still confident that it's going to be faster and show less degradation
in pinning/init. Init for now is worst-case 2x faster. But it may still be too
early to say it is *as fast as* struct pages in memory.
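
In other words, the mistake was of this shape (a sketch; the actual hunk isn't
quoted here):

	/* buggy: assignment clobbers pgmap->flags and drops
	 * PGMAP_ALTMAP_VALID, so the memmap silently came from
	 * regular memory rather than from the altmap reserve */
	pgmap->flags = PGMAP_COMPOUND;

	/* intended: set PGMAP_COMPOUND while preserving the altmap flag */
	pgmap->flags |= PGMAP_COMPOUND;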

The broken masking of the PGMAP_ALTMAP_VALID bit did hide one flaw, where
we don't support altmap for basepages on x86/mm and it apparently depends
on architectures to implement it (and a couple of other issues). The vmemmap
allocation isn't the problem, so the previous comment in this thread that
altmap doesn't change much in vmemmap_populate_compound_pages() is
still accurate.

The problem, though, resides in the freeing of vmemmap pagetables with
basepages *with altmap* (e.g. at dax-device teardown), which requires arch
support. Doing it properly would mean making the altmap reserve smaller
(given fewer pages are allocated), and giving the altmap pfn allocator
the ability to track references per pfn. But I think it deserves its own
separate patch series (probably almost just as big).

Perhaps for this set I can go without altmap as you suggested, and
use hugepage vmemmap population (which wouldn't
lead to device memory savings) instead of reusing base pages. I would
still leave the compound page support logic as the metadata representation
for > 4K @align, as I think that's the right thing to do. And then do
a separate series on improving altmap to leverage the metadata reduction
support as done with non-device struct pages.

Thoughts?

	Joao

* Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
  2021-03-10 18:12           ` Joao Martins
@ 2021-03-12  5:54             ` Dan Williams
  0 siblings, 0 replies; 67+ messages in thread
From: Dan Williams @ 2021-03-12  5:54 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
	Muchun Song, Mike Kravetz, Andrew Morton, John Hubbard

On Wed, Mar 10, 2021 at 10:13 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 2/22/21 8:37 PM, Dan Williams wrote:
> > On Mon, Feb 22, 2021 at 3:24 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >> On 2/20/21 1:43 AM, Dan Williams wrote:
> >>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@nvidia.com> wrote:
> >>>> On 12/8/20 9:28 AM, Joao Martins wrote:
> >> One thing to point out about altmap is that the degradation (in pinning and
> >> unpinning) we observed with struct pages in device memory is no longer observed
> >> once 1) we batch ref count updates as we move to compound pages and 2) reusing
> >> tail pages seems to lead to these struct pages staying more likely in cache,
> >> which perhaps contributes to dirtying a lot fewer cachelines.
> >
> > True, it makes it more palatable to survive 'struct page' in PMEM,
>
> I want to retract for now what I said above wrt the comment about no degradation
> with struct pages in device memory. I was fooled by a bug in a patch later in
> this series. In particular, I accidentally cleared PGMAP_ALTMAP_VALID when
> unilaterally setting PGMAP_COMPOUND, which consequently led to struct pages
> always being allocated from regular memory. No wonder the numbers were just as
> fast. I am still confident that it's going to be faster and show less degradation
> in pinning/init. Init for now is worst-case 2x faster. But it may still be too
> early to say it is *as fast as* struct pages in memory.
>
> The broken masking of the PGMAP_ALTMAP_VALID bit did hide one flaw, where
> we don't support altmap for basepages on x86/mm and it apparently depends
> on architectures to implement it (and a couple of other issues). The vmemmap
> allocation isn't the problem, so the previous comment in this thread that
> altmap doesn't change much in vmemmap_populate_compound_pages() is
> still accurate.
>
> The problem, though, resides in the freeing of vmemmap pagetables with
> basepages *with altmap* (e.g. at dax-device teardown), which requires arch
> support. Doing it properly would mean making the altmap reserve smaller
> (given fewer pages are allocated), and giving the altmap pfn allocator
> the ability to track references per pfn. But I think it deserves its own
> separate patch series (probably almost just as big).
>
> Perhaps for this set I can go without altmap as you suggested, and
> use hugepage vmemmap population (which wouldn't
> lead to device memory savings) instead of reusing base pages. I would
> still leave the compound page support logic as the metadata representation
> for > 4K @align, as I think that's the right thing to do. And then do
> a separate series on improving altmap to leverage the metadata reduction
> support as done with non-device struct pages.
>
> Thoughts?

The space savings is the whole point. So I agree with moving altmap
support to a follow-on enhancement, but land the non-altmap basepage
support in the first round.

end of thread

Thread overview: 67+ messages
2020-12-08 17:28 [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
2020-12-08 17:28 ` [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages Joao Martins
2020-12-09  5:59   ` John Hubbard
2020-12-09  6:33     ` Matthew Wilcox
2020-12-09 13:12       ` Joao Martins
2021-02-20  1:43     ` Dan Williams
2021-02-22 11:24       ` Joao Martins
2021-02-22 20:37         ` Dan Williams
2021-02-23 15:46           ` Joao Martins
2021-02-23 16:50             ` Dan Williams
2021-02-23 17:18               ` Joao Martins
2021-02-23 18:18                 ` Dan Williams
2021-03-10 18:12           ` Joao Martins
2021-03-12  5:54             ` Dan Williams
2021-02-20  1:24   ` Dan Williams
2021-02-22 11:09     ` Joao Martins
2020-12-08 17:28 ` [PATCH RFC 2/9] sparse-vmemmap: Consolidate arguments in vmemmap section populate Joao Martins
2020-12-09  6:16   ` John Hubbard
2020-12-09 13:51     ` Joao Martins
2021-02-20  1:49   ` Dan Williams
2021-02-22 11:26     ` Joao Martins
2020-12-08 17:28 ` [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given mhp_params::align Joao Martins
2020-12-08 17:38   ` Joao Martins
2020-12-08 17:28 ` [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given page size Joao Martins
2021-02-20  3:34   ` Dan Williams
2021-02-22 11:42     ` Joao Martins
2021-02-22 22:40       ` Dan Williams
2021-02-23 15:46         ` Joao Martins
2020-12-08 17:28 ` [PATCH RFC 4/9] mm/page_alloc: Reuse tail struct pages for compound pagemaps Joao Martins
2021-02-20  6:17   ` Dan Williams
2021-02-22 12:01     ` Joao Martins
2020-12-08 17:28 ` [PATCH RFC 5/9] device-dax: Compound pagemap support Joao Martins
2020-12-08 17:28 ` [PATCH RFC 6/9] mm/gup: Grab head page refcount once for group of subpages Joao Martins
2020-12-09  4:40   ` John Hubbard
2020-12-09 13:44     ` Joao Martins
     [not found]   ` <20201208194905.GQ5487@ziepe.ca>
2020-12-09 11:05     ` Joao Martins
     [not found]       ` <20201209151505.GV5487@ziepe.ca>
2020-12-09 16:02         ` Joao Martins
     [not found]           ` <20201209162438.GW5487@ziepe.ca>
2020-12-09 17:27             ` Joao Martins
2020-12-09 18:14             ` Matthew Wilcox
2020-12-10 15:43               ` Joao Martins
2020-12-08 17:28 ` [PATCH RFC 7/9] mm/gup: Decrement head page " Joao Martins
     [not found]   ` <20201208193446.GP5487@ziepe.ca>
2020-12-09  5:06     ` John Hubbard
2020-12-09 12:17     ` Joao Martins
2020-12-17 19:05     ` Joao Martins
     [not found]       ` <20201217200530.GK5487@ziepe.ca>
2020-12-17 22:34         ` Joao Martins
2020-12-19  2:06         ` John Hubbard
2020-12-19 13:10           ` Joao Martins
2020-12-08 17:29 ` [PATCH RFC 8/9] RDMA/umem: batch page unpin in __ib_mem_release() Joao Martins
2020-12-09  5:18   ` John Hubbard
     [not found]   ` <20201208192935.GA1908088@ziepe.ca>
2020-12-09 10:59     ` Joao Martins
2020-12-19 13:15       ` Joao Martins
2020-12-08 17:29 ` [PATCH RFC 9/9] mm: Add follow_devmap_page() for devdax vmas Joao Martins
2020-12-09  5:23   ` John Hubbard
     [not found]   ` <20201208195754.GR5487@ziepe.ca>
2020-12-09  8:05     ` Christoph Hellwig
2020-12-09 11:19     ` Joao Martins
2020-12-09  9:38 ` [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps David Hildenbrand
2020-12-09  9:52 ` [External] " Muchun Song
2021-02-20  1:18 ` Dan Williams
2021-02-22 11:06   ` Joao Martins
2021-02-22 14:32     ` Joao Martins
2021-02-23 16:28   ` Joao Martins
2021-02-23 16:44     ` Dan Williams
2021-02-23 17:15       ` Joao Martins
2021-02-23 18:15         ` Dan Williams
     [not found]       ` <20210223185435.GO2643399@ziepe.ca>
2021-02-23 22:48         ` Dan Williams
     [not found]           ` <20210223230723.GP2643399@ziepe.ca>
2021-02-24  0:14             ` Dan Williams
     [not found]               ` <20210224010017.GQ2643399@ziepe.ca>
2021-02-24  1:32                 ` Dan Williams
