* [PATCH v1 01/11] memory-failure: fetch compound_head after pgmap_pfn_valid()
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
Mike Kravetz, Andrew Morton, Joao Martins
memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
dax_lock_page()). For pagemaps with compound pages, fetch the
compound_head in case we are handling a tail page memory failure.
Currently this is a no-op, but with the advent of compound pages in
dev_pagemap it allows memory_failure_dev_pagemap() to keep working.
Reported-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
mm/memory-failure.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 24210c9bd843..94240d772623 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1318,6 +1318,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
goto out;
}
+ page = compound_head(page);
+
/*
* Prevent the inode from being freed while we are interrogating
* the address_space, typically this would be handled by
--
2.17.1
* [PATCH v1 02/11] mm/page_alloc: split prep_compound_page into head and tail subparts
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
Mike Kravetz, Andrew Morton, Joao Martins
Split the utility function prep_compound_page() into head and tail
counterparts, and use them accordingly.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
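For reference, a minimal sketch of how the two halves are meant to
compose; this is a hypothetical wrapper that mirrors the refactored
prep_compound_page() below, in the same tails-then-head order that a
later patch in this series relies on:

static void prep_compound(struct page *head, unsigned int order)
{
	int i, nr_pages = 1 << order;

	__SetPageHead(head);
	for (i = 1; i < nr_pages; i++)
		prep_compound_tail(head, i);

	/* head metadata lives in the first tail pages; prep it last */
	prep_compound_head(head, order);
}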
mm/page_alloc.c | 32 +++++++++++++++++++++-----------
1 file changed, 21 insertions(+), 11 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c53fe4fa10bf..43dd98446b0b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -692,24 +692,34 @@ void free_compound_page(struct page *page)
__free_pages_ok(page, compound_order(page), FPI_NONE);
}
+static void prep_compound_head(struct page *page, unsigned int order)
+{
+ set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
+ set_compound_order(page, order);
+ atomic_set(compound_mapcount_ptr(page), -1);
+ if (hpage_pincount_available(page))
+ atomic_set(compound_pincount_ptr(page), 0);
+}
+
+static void prep_compound_tail(struct page *head, int tail_idx)
+{
+ struct page *p = head + tail_idx;
+
+ set_page_count(p, 0);
+ p->mapping = TAIL_MAPPING;
+ set_compound_head(p, head);
+}
+
void prep_compound_page(struct page *page, unsigned int order)
{
int i;
int nr_pages = 1 << order;
__SetPageHead(page);
- for (i = 1; i < nr_pages; i++) {
- struct page *p = page + i;
- set_page_count(p, 0);
- p->mapping = TAIL_MAPPING;
- set_compound_head(p, page);
- }
+ for (i = 1; i < nr_pages; i++)
+ prep_compound_tail(page, i);
- set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
- set_compound_order(page, order);
- atomic_set(compound_mapcount_ptr(page), -1);
- if (hpage_pincount_available(page))
- atomic_set(compound_pincount_ptr(page), 0);
+ prep_compound_head(page, order);
}
#ifdef CONFIG_DEBUG_PAGEALLOC
--
2.17.1
* [PATCH v1 03/11] mm/page_alloc: refactor memmap_init_zone_device() page init
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
Mike Kravetz, Andrew Morton, Joao Martins
Move the struct page init to a helper function, __init_zone_device_page().
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
mm/page_alloc.c | 74 +++++++++++++++++++++++++++----------------------
1 file changed, 41 insertions(+), 33 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 43dd98446b0b..58974067bbd4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6237,6 +6237,46 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
}
#ifdef CONFIG_ZONE_DEVICE
+static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
+ unsigned long zone_idx, int nid,
+ struct dev_pagemap *pgmap)
+{
+
+ __init_single_page(page, pfn, zone_idx, nid);
+
+ /*
+ * Mark page reserved as it will need to wait for onlining
+ * phase for it to be fully associated with a zone.
+ *
+ * We can use the non-atomic __set_bit operation for setting
+ * the flag as we are still initializing the pages.
+ */
+ __SetPageReserved(page);
+
+ /*
+ * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
+ * and zone_device_data. It is a bug if a ZONE_DEVICE page is
+ * ever freed or placed on a driver-private list.
+ */
+ page->pgmap = pgmap;
+ page->zone_device_data = NULL;
+
+ /*
+ * Mark the block movable so that blocks are reserved for
+ * movable at startup. This will force kernel allocations
+ * to reserve their blocks rather than leaking throughout
+ * the address space during boot when many long-lived
+ * kernel allocations are made.
+ *
+ * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
+ * because this is done early in section_activate()
+ */
+ if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ cond_resched();
+ }
+}
+
void __ref memmap_init_zone_device(struct zone *zone,
unsigned long start_pfn,
unsigned long nr_pages,
@@ -6265,39 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
struct page *page = pfn_to_page(pfn);
- __init_single_page(page, pfn, zone_idx, nid);
-
- /*
- * Mark page reserved as it will need to wait for onlining
- * phase for it to be fully associated with a zone.
- *
- * We can use the non-atomic __set_bit operation for setting
- * the flag as we are still initializing the pages.
- */
- __SetPageReserved(page);
-
- /*
- * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
- * and zone_device_data. It is a bug if a ZONE_DEVICE page is
- * ever freed or placed on a driver-private list.
- */
- page->pgmap = pgmap;
- page->zone_device_data = NULL;
-
- /*
- * Mark the block movable so that blocks are reserved for
- * movable at startup. This will force kernel allocations
- * to reserve their blocks rather than leaking throughout
- * the address space during boot when many long-lived
- * kernel allocations are made.
- *
- * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
- * because this is done early in section_activate()
- */
- if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
- cond_resched();
- }
+ __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
}
pr_info("%s initialised %lu pages in %ums\n", __func__,
--
2.17.1
* [PATCH v1 04/11] mm/memremap: add ZONE_DEVICE support for compound pages
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
Mike Kravetz, Andrew Morton, Joao Martins
Add a new align property for struct dev_pagemap which specifies that a
pagemap is composed of a set of compound pages of size @align, instead
of base pages. When these pages are initialised, most are initialised as
tail pages instead of order-0 pages.
For certain ZONE_DEVICE users like device-dax, which have a fixed page
size, this creates an opportunity to optimize GUP and GUP-fast walkers
by treating these pages the same way as THP or hugetlb pages.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
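To make the refcount arithmetic concrete, a small standalone sketch
(assuming 4K base pages; the two helpers mirror the pgmap_align() and
pgmap_pfn_align() added below):

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Mirrors pgmap_align(): no align set means base pages. */
static unsigned long pgmap_align(unsigned long align)
{
	return align ? align : PAGE_SIZE;
}

/* Mirrors pgmap_pfn_align(): base pages per compound page. */
static unsigned long pgmap_pfn_align(unsigned long align)
{
	return pgmap_align(align) / PAGE_SIZE;
}

int main(void)
{
	unsigned long range = 128UL << 30;	/* 128G device */
	unsigned long aligns[] = { 0, 2UL << 20, 1UL << 30 };
	int i;

	for (i = 0; i < 3; i++) {
		unsigned long refs = (range / PAGE_SIZE) /
				     pgmap_pfn_align(aligns[i]);

		/* One percpu ref per compound page, not per pfn. */
		printf("align=%lu: pfn_align=%lu, refs=%lu\n",
		       pgmap_align(aligns[i]),
		       pgmap_pfn_align(aligns[i]), refs);
	}
	return 0;
}

For a 2M @align this takes one reference per 512 pfns (65536 in total
for 128G) rather than 33554432 individual references.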
include/linux/memremap.h | 13 +++++++++++++
mm/memremap.c | 8 ++++++--
mm/page_alloc.c | 24 +++++++++++++++++++++++-
3 files changed, 42 insertions(+), 3 deletions(-)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index b46f63dcaed3..bb28d82dda5e 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -114,6 +114,7 @@ struct dev_pagemap {
struct completion done;
enum memory_type type;
unsigned int flags;
+ unsigned long align;
const struct dev_pagemap_ops *ops;
void *owner;
int nr_range;
@@ -130,6 +131,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
return NULL;
}
+static inline unsigned long pgmap_align(struct dev_pagemap *pgmap)
+{
+ if (!pgmap || !pgmap->align)
+ return PAGE_SIZE;
+ return pgmap->align;
+}
+
+static inline unsigned long pgmap_pfn_align(struct dev_pagemap *pgmap)
+{
+ return PHYS_PFN(pgmap_align(pgmap));
+}
+
#ifdef CONFIG_ZONE_DEVICE
bool pfn_zone_device_reserved(unsigned long pfn);
void *memremap_pages(struct dev_pagemap *pgmap, int nid);
diff --git a/mm/memremap.c b/mm/memremap.c
index 805d761740c4..d160853670c4 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
PHYS_PFN(range->start),
PHYS_PFN(range_len(range)), pgmap);
- percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
- - pfn_first(pgmap, range_id));
+ if (pgmap_align(pgmap) > PAGE_SIZE)
+ percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
+ - pfn_first(pgmap, range_id)) / pgmap_pfn_align(pgmap));
+ else
+ percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
+ - pfn_first(pgmap, range_id));
return 0;
err_add_memory:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 58974067bbd4..3a77f9e43f3a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6285,6 +6285,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
unsigned long pfn, end_pfn = start_pfn + nr_pages;
struct pglist_data *pgdat = zone->zone_pgdat;
struct vmem_altmap *altmap = pgmap_altmap(pgmap);
+ unsigned int pfn_align = pgmap_pfn_align(pgmap);
+ unsigned int order_align = order_base_2(pfn_align);
unsigned long zone_idx = zone_idx(zone);
unsigned long start = jiffies;
int nid = pgdat->node_id;
@@ -6302,10 +6304,30 @@ void __ref memmap_init_zone_device(struct zone *zone,
nr_pages = end_pfn - start_pfn;
}
- for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+ for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
struct page *page = pfn_to_page(pfn);
+ unsigned long i;
__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+
+ if (pfn_align == 1)
+ continue;
+
+ __SetPageHead(page);
+
+ for (i = 1; i < pfn_align; i++) {
+ __init_zone_device_page(page + i, pfn + i, zone_idx,
+ nid, pgmap);
+ prep_compound_tail(page, i);
+
+ /*
+ * The first and second tail pages need to be
+ * initialized first: prep_compound_head() writes
+ * into their struct page fields, hence the head
+ * page is prepared last.
+ */
+ if (i == 2)
+ prep_compound_head(page, order_align);
+ }
}
pr_info("%s initialised %lu pages in %ums\n", __func__,
--
2.17.1
* [PATCH v1 05/11] mm/sparse-vmemmap: add a pgmap argument to section activation
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
Mike Kravetz, Andrew Morton, Joao Martins
@altmap is stored in a dev_pagemap, but it will be repurposed for
memory hotplug, for storing the memmap in the hotplugged memory
itself[*], reusing the altmap infrastructure to that end. This is to
say that @altmap can't simply be replaced with a @pgmap, as it is
going to cover more than dev_pagemap-backed altmaps.
So in addition to @altmap, pass @pgmap to the sparse section populate
functions, namely:
sparse_add_section
section_activate
populate_section_memmap
__populate_section_memmap
Passing @pgmap allows __populate_section_memmap() to both fetch the
align at which the memmap metadata is created, and to let
sparse-vmemmap fetch the pgmap ranges and correlate them with a given
section, deciding whether to just reuse tail pages from previously
onlined sections.
[*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
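For context, the eventual supplier of @pgmap is memremap_pages(); the
wiring added later in this series (patch 07) looks like:

	struct mhp_params params = {
		.altmap = pgmap_altmap(pgmap),
		.pgmap = pgmap,
		.pgprot = PAGE_KERNEL,
	};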
include/linux/memory_hotplug.h | 5 ++++-
include/linux/mm.h | 3 ++-
mm/memory_hotplug.c | 3 ++-
mm/sparse-vmemmap.c | 3 ++-
mm/sparse.c | 24 +++++++++++++++---------
5 files changed, 25 insertions(+), 13 deletions(-)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index a85d4b7d15c2..45532192934c 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -14,6 +14,7 @@ struct mem_section;
struct memory_block;
struct resource;
struct vmem_altmap;
+struct dev_pagemap;
#ifdef CONFIG_MEMORY_HOTPLUG
struct page *pfn_to_online_page(unsigned long pfn);
@@ -72,6 +73,7 @@ typedef int __bitwise mhp_t;
struct mhp_params {
struct vmem_altmap *altmap;
pgprot_t pgprot;
+ struct dev_pagemap *pgmap;
};
bool mhp_range_allowed(u64 start, u64 size, bool need_mapping);
@@ -360,7 +362,8 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
unsigned long nr_pages);
extern bool is_memblock_offlined(struct memory_block *mem);
extern int sparse_add_section(int nid, unsigned long pfn,
- unsigned long nr_pages, struct vmem_altmap *altmap);
+ unsigned long nr_pages, struct vmem_altmap *altmap,
+ struct dev_pagemap *pgmap);
extern void sparse_remove_section(struct mem_section *ms,
unsigned long pfn, unsigned long nr_pages,
unsigned long map_offset, struct vmem_altmap *altmap);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cb1e191da319..61474602c2b1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3033,7 +3033,8 @@ static inline void print_vma_addr(char *prefix, unsigned long rip)
void *sparse_buffer_alloc(unsigned long size);
struct page * __populate_section_memmap(unsigned long pfn,
- unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
+ unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+ struct dev_pagemap *pgmap);
pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e0b499a1fee4..2df3b2a7b4b5 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -384,7 +384,8 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
/* Select all remaining pages up to the next section boundary */
cur_nr_pages = min(end_pfn - pfn,
SECTION_ALIGN_UP(pfn + 1) - pfn);
- err = sparse_add_section(nid, pfn, cur_nr_pages, altmap);
+ err = sparse_add_section(nid, pfn, cur_nr_pages, altmap,
+ params->pgmap);
if (err)
break;
cond_resched();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 16183d85a7d5..370728c206ee 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -249,7 +249,8 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
}
struct page * __meminit __populate_section_memmap(unsigned long pfn,
- unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+ unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+ struct dev_pagemap *pgmap)
{
unsigned long start = (unsigned long) pfn_to_page(pfn);
unsigned long end = start + nr_pages * sizeof(struct page);
diff --git a/mm/sparse.c b/mm/sparse.c
index be66a62e22c3..c2abf1281a89 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -443,7 +443,8 @@ static unsigned long __init section_map_size(void)
}
struct page __init *__populate_section_memmap(unsigned long pfn,
- unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+ unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+ struct dev_pagemap *pgmap)
{
unsigned long size = section_map_size();
struct page *map = sparse_buffer_alloc(size);
@@ -542,7 +543,7 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
break;
map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
- nid, NULL);
+ nid, NULL, NULL);
if (!map) {
pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
__func__, nid);
@@ -648,9 +649,10 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
#ifdef CONFIG_SPARSEMEM_VMEMMAP
static struct page * __meminit populate_section_memmap(unsigned long pfn,
- unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+ unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+ struct dev_pagemap *pgmap)
{
- return __populate_section_memmap(pfn, nr_pages, nid, altmap);
+ return __populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
}
static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
@@ -719,7 +721,8 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
}
#else
struct page * __meminit populate_section_memmap(unsigned long pfn,
- unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+ unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+ struct dev_pagemap *pgmap)
{
return kvmalloc_node(array_size(sizeof(struct page),
PAGES_PER_SECTION), GFP_KERNEL, nid);
@@ -842,7 +845,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
}
static struct page * __meminit section_activate(int nid, unsigned long pfn,
- unsigned long nr_pages, struct vmem_altmap *altmap)
+ unsigned long nr_pages, struct vmem_altmap *altmap,
+ struct dev_pagemap *pgmap)
{
struct mem_section *ms = __pfn_to_section(pfn);
struct mem_section_usage *usage = NULL;
@@ -874,7 +878,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
if (nr_pages < PAGES_PER_SECTION && early_section(ms))
return pfn_to_page(pfn);
- memmap = populate_section_memmap(pfn, nr_pages, nid, altmap);
+ memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
if (!memmap) {
section_deactivate(pfn, nr_pages, altmap);
return ERR_PTR(-ENOMEM);
@@ -889,6 +893,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
* @start_pfn: start pfn of the memory range
* @nr_pages: number of pfns to add in the section
* @altmap: device page map
+ * @pgmap: device page map object that owns the section
*
* This is only intended for hotplug.
*
@@ -902,7 +907,8 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
* * -ENOMEM - Out of memory.
*/
int __meminit sparse_add_section(int nid, unsigned long start_pfn,
- unsigned long nr_pages, struct vmem_altmap *altmap)
+ unsigned long nr_pages, struct vmem_altmap *altmap,
+ struct dev_pagemap *pgmap)
{
unsigned long section_nr = pfn_to_section_nr(start_pfn);
struct mem_section *ms;
@@ -913,7 +919,7 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
if (ret < 0)
return ret;
- memmap = section_activate(nid, start_pfn, nr_pages, altmap);
+ memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
if (IS_ERR(memmap))
return PTR_ERR(memmap);
--
2.17.1
* [PATCH v1 06/11] mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
Mike Kravetz, Andrew Morton, Joao Martins
Move the actual pte population logic into a separate function
vmemmap_populate_address() and have vmemmap_populate_basepages()
walk through all base pages it needs to populate.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
mm/sparse-vmemmap.c | 44 ++++++++++++++++++++++++++------------------
1 file changed, 26 insertions(+), 18 deletions(-)
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 370728c206ee..8814876edcfa 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -216,33 +216,41 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
return pgd;
}
-int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
- int node, struct vmem_altmap *altmap)
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+ struct vmem_altmap *altmap)
{
- unsigned long addr = start;
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
+ pgd = vmemmap_pgd_populate(addr, node);
+ if (!pgd)
+ return -ENOMEM;
+ p4d = vmemmap_p4d_populate(pgd, addr, node);
+ if (!p4d)
+ return -ENOMEM;
+ pud = vmemmap_pud_populate(p4d, addr, node);
+ if (!pud)
+ return -ENOMEM;
+ pmd = vmemmap_pmd_populate(pud, addr, node);
+ if (!pmd)
+ return -ENOMEM;
+ pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+ if (!pte)
+ return -ENOMEM;
+ vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+
+ return 0;
+}
+
+int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
+ int node, struct vmem_altmap *altmap)
+{
+ unsigned long addr = start;
+
for (; addr < end; addr += PAGE_SIZE) {
- pgd = vmemmap_pgd_populate(addr, node);
- if (!pgd)
- return -ENOMEM;
- p4d = vmemmap_p4d_populate(pgd, addr, node);
- if (!p4d)
- return -ENOMEM;
- pud = vmemmap_pud_populate(p4d, addr, node);
- if (!pud)
- return -ENOMEM;
- pmd = vmemmap_pmd_populate(pud, addr, node);
- if (!pmd)
- return -ENOMEM;
- pte = vmemmap_pte_populate(pmd, addr, node, altmap);
- if (!pte)
+ if (vmemmap_populate_address(addr, node, altmap))
return -ENOMEM;
- vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
}
return 0;
--
2.17.1
* [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
Mike Kravetz, Andrew Morton, Joao Martins
A compound pagemap is a dev_pagemap with @align > PAGE_SIZE, meaning
that pages are mapped at a given huge page alignment and use compound
pages, as opposed to order-0 pages.
To minimize struct page overhead we take advantage of the fact that
most tail pages look the same (except the first two). We allocate a
separate page for the vmemmap area which contains the head page, and a
separate one for the next 64 pages. The rest of the subsections then
reuse this tail vmemmap page to initialize the rest of the tail pages.
Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
when initializing a compound pagemap with a big enough @align (e.g. a
1G PUD) it will cross several sections. To be able to reuse tail pages
across sections belonging to the same gigantic page, we fetch the
@range being mapped (nr_ranges + 1). If the section being mapped is
not at offset 0 of the @align, then look up the PFN of the struct page
address that precedes it and use that to populate the entire section.
On compound pagemaps with a 2M align, this mechanism saves 6 of the 8
vmemmap pages necessary to describe the subsection's 512 struct pages
being mapped. On a 1G compound pagemap it saves 4094 pages.
Altmap isn't supported yet, given various restrictions in the altmap
pfn allocator; thus we fall back to the already-in-use
vmemmap_populate().
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
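The savings quoted above can be reproduced with a small standalone
sketch, assuming a 64-byte struct page and 4K base pages (as on
x86_64):

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define STRUCT_PAGE	64UL	/* sizeof(struct page) on x86_64 */

int main(void)
{
	unsigned long aligns[] = { 2UL << 20, 1UL << 30 };
	int i;

	for (i = 0; i < 2; i++) {
		unsigned long nr_struct_pages = aligns[i] / PAGE_SIZE;
		unsigned long vmemmap_pages =
			nr_struct_pages * STRUCT_PAGE / PAGE_SIZE;

		/*
		 * One unique vmemmap page holds the head plus the first
		 * 63 tails; a second holds the next 64 tails and is then
		 * reused for all remaining tail areas.
		 */
		printf("%4luM align: %lu vmemmap pages, 2 unique, %lu saved\n",
		       aligns[i] >> 20, vmemmap_pages, vmemmap_pages - 2);
	}
	return 0;
}

This prints 8 pages / 6 saved for 2M, and 4096 pages / 4094 saved for 1G.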
include/linux/mm.h | 2 +-
mm/memremap.c | 1 +
mm/sparse-vmemmap.c | 139 ++++++++++++++++++++++++++++++++++++++++----
3 files changed, 131 insertions(+), 11 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 61474602c2b1..49d717ae40ae 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3040,7 +3040,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
- struct vmem_altmap *altmap);
+ struct vmem_altmap *altmap, void *block);
void *vmemmap_alloc_block(unsigned long size, int node);
struct vmem_altmap;
void *vmemmap_alloc_block_buf(unsigned long size, int node,
diff --git a/mm/memremap.c b/mm/memremap.c
index d160853670c4..2e6bc0b1ff00 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
{
struct mhp_params params = {
.altmap = pgmap_altmap(pgmap),
+ .pgmap = pgmap,
.pgprot = PAGE_KERNEL,
};
const int nr_range = pgmap->nr_range;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 8814876edcfa..f57c5eada099 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
}
pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
- struct vmem_altmap *altmap)
+ struct vmem_altmap *altmap, void *block)
{
pte_t *pte = pte_offset_kernel(pmd, addr);
if (pte_none(*pte)) {
pte_t entry;
- void *p;
-
- p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
- if (!p)
- return NULL;
+ void *p = block;
+
+ if (!block) {
+ p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+ if (!p)
+ return NULL;
+ } else if (!altmap) {
+ get_page(virt_to_page(block));
+ }
entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
set_pte_at(&init_mm, addr, pte, entry);
}
@@ -217,7 +221,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
}
static int __meminit vmemmap_populate_address(unsigned long addr, int node,
- struct vmem_altmap *altmap)
+ struct vmem_altmap *altmap,
+ void *page, void **ptr)
{
pgd_t *pgd;
p4d_t *p4d;
@@ -237,10 +242,14 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
pmd = vmemmap_pmd_populate(pud, addr, node);
if (!pmd)
return -ENOMEM;
- pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+ pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
if (!pte)
return -ENOMEM;
vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+
+ if (ptr)
+ *ptr = __va(__pfn_to_phys(pte_pfn(*pte)));
+ return 0;
}
int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
@@ -249,7 +258,110 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
unsigned long addr = start;
for (; addr < end; addr += PAGE_SIZE) {
- if (vmemmap_populate_address(addr, node, altmap))
+ if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static int __meminit vmemmap_populate_range(unsigned long start,
+ unsigned long end,
+ int node, void *page)
+{
+ unsigned long addr = start;
+
+ for (; addr < end; addr += PAGE_SIZE) {
+ if (vmemmap_populate_address(addr, node, NULL, page, NULL))
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
+ void **ptr)
+{
+ return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
+}
+
+static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
+{
+ pgd_t *pgd;
+ p4d_t *p4d;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ pgd = pgd_offset_k(addr);
+ if (pgd_none(*pgd))
+ return NULL;
+
+ p4d = p4d_offset(pgd, addr);
+ if (p4d_none(*p4d))
+ return NULL;
+
+ pud = pud_offset(p4d, addr);
+ if (pud_none(*pud))
+ return NULL;
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ return NULL;
+
+ pte = pte_offset_kernel(pmd, addr);
+ if (pte_none(*pte))
+ return NULL;
+
+ return pte;
+}
+
+static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
+ unsigned long start,
+ unsigned long end, int node,
+ struct dev_pagemap *pgmap)
+{
+ unsigned long offset, size, addr;
+
+ /*
+ * For compound pages bigger than section size (e.g. 1G) fill the rest
+ * of sections as tail pages.
+ *
+ * Note that memremap_pages() resets the @nr_range value and will
+ * increment it after each successful range onlining. Thus the value
+ * of @nr_range at section memmap populate time corresponds to the
+ * in-progress range being onlined that we care about.
+ */
+ offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
+ if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
+ pgmap_align(pgmap) > SUBSECTION_SIZE) {
+ pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
+
+ if (!ptep)
+ return -ENOMEM;
+
+ return vmemmap_populate_range(start, end, node,
+ page_to_virt(pte_page(*ptep)));
+ }
+
+ size = min(end - start, pgmap_pfn_align(pgmap) * sizeof(struct page));
+ for (addr = start; addr < end; addr += size) {
+ unsigned long next = addr, last = addr + size;
+ void *block;
+
+ /* Populate the head page vmemmap page */
+ if (vmemmap_populate_page(addr, node, NULL))
+ return -ENOMEM;
+
+ /* Populate the tail pages vmemmap page */
+ block = NULL;
+ next = addr + PAGE_SIZE;
+ if (vmemmap_populate_page(next, node, &block))
+ return -ENOMEM;
+
+ /* Reuse the previous page for the rest of tail pages */
+ next += PAGE_SIZE;
+ if (vmemmap_populate_range(next, last, node, block))
return -ENOMEM;
}
@@ -262,12 +374,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
{
unsigned long start = (unsigned long) pfn_to_page(pfn);
unsigned long end = start + nr_pages * sizeof(struct page);
+ unsigned int align = pgmap_align(pgmap);
+ int r;
if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
return NULL;
- if (vmemmap_populate(start, end, nid, altmap))
+ if (align > PAGE_SIZE && !altmap)
+ r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
+ else
+ r = vmemmap_populate(start, end, nid, altmap);
+
+ if (r < 0)
return NULL;
return pfn_to_page(pfn);
--
2.17.1
* [PATCH v1 08/11] mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
Mike Kravetz, Andrew Morton, Joao Martins
Right now basepages are used to populate the PUD tail pages, and it
picks the address of the previous page of the subsection that precedes
the memmap we are initializing. This is done when a given memmap
address isn't aligned to the pgmap @align (which is safe to do because
@ranges are guaranteed to be aligned to @align).
For pagemaps with an align which spans several sections, this means
that PMD pages are unnecessarily allocated just to reuse the same tail
pages. Effectively, on x86 a PUD can span 8 sections (depending on
config), and a page is allocated for each PMD only so that it can
reuse the same tail vmemmap page across its PTEs. In short, the PTEs
under the PMDs covering the tail vmemmap areas all map the same PFN.
So instead, populate a new PMD on the second section of the compound
page (the tail vmemmap PMD), and have the following sections reuse
that previously populated PMD, which only contains tail pages.
After this scheme, for a 1GB-aligned pagemap area, the first PMD
(section) contains the head page and 32767 tail pages, while the
second PMD contains the full 32768 tail pages. The latter gets its PMD
reused across the mapping of the remaining sections of the same
pagemap.
Besides allocating fewer pagetable entries and keeping parity with
hugepages in the directmap (as done by vmemmap_populate_hugepages()),
this further increases the savings per compound page. For each
PUD-aligned pagemap we go from 40960 bytes down to 16384 bytes: rather
than requiring 8 PMD page allocations we only need 2, plus two base
pages allocated for the head and tail areas of the first PMD. 2M pages
still require using base pages, though.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
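The 40960 -> 16384 figure can be reproduced with a standalone sketch,
assuming x86 defaults (4K base pages, 128M sections, one 4K page per
PMD table):

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define SECTION_SIZE	(128UL << 20)	/* x86 default */
#define PUD_SIZE	(1UL << 30)

int main(void)
{
	unsigned long sections = PUD_SIZE / SECTION_SIZE;	/* 8 */

	/* Before: one PMD page per section, plus head/tail base pages. */
	unsigned long before = sections * PAGE_SIZE + 2 * PAGE_SIZE;

	/* After: head PMD + reusable tail PMD, plus the same two pages. */
	unsigned long after = 2 * PAGE_SIZE + 2 * PAGE_SIZE;

	printf("per 1G compound page: %lu -> %lu bytes\n", before, after);
	return 0;
}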
include/linux/mm.h | 3 +-
mm/sparse-vmemmap.c | 79 ++++++++++++++++++++++++++++++++++-----------
2 files changed, 63 insertions(+), 19 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 49d717ae40ae..9c1a676d6b95 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3038,7 +3038,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
-pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
+pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+ void *block);
pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
struct vmem_altmap *altmap, void *block);
void *vmemmap_alloc_block(unsigned long size, int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index f57c5eada099..291a8a32480a 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -172,13 +172,20 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
return p;
}
-pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
+pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+ void *block)
{
pmd_t *pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
- void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
- if (!p)
- return NULL;
+ void *p = block;
+
+ if (!block) {
+ p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
+ if (!p)
+ return NULL;
+ } else {
+ get_page(virt_to_page(block));
+ }
pmd_populate_kernel(&init_mm, pmd, p);
}
return pmd;
@@ -220,15 +227,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
return pgd;
}
-static int __meminit vmemmap_populate_address(unsigned long addr, int node,
- struct vmem_altmap *altmap,
- void *page, void **ptr)
+static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
+ struct vmem_altmap *altmap,
+ void *page, pmd_t **ptr)
{
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
- pte_t *pte;
pgd = vmemmap_pgd_populate(addr, node);
if (!pgd)
@@ -239,9 +245,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
pud = vmemmap_pud_populate(p4d, addr, node);
if (!pud)
return -ENOMEM;
- pmd = vmemmap_pmd_populate(pud, addr, node);
+ pmd = vmemmap_pmd_populate(pud, addr, node, page);
if (!pmd)
return -ENOMEM;
+ if (ptr)
+ *ptr = pmd;
+ return 0;
+}
+
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+ struct vmem_altmap *altmap,
+ void *page, void **ptr)
+{
+ pmd_t *pmd;
+ pte_t *pte;
+
+ if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
+ return -ENOMEM;
+
pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
if (!pte)
return -ENOMEM;
@@ -285,13 +306,26 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
}
-static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
+static int __meminit vmemmap_populate_pmd_range(unsigned long start,
+ unsigned long end,
+ int node, void *page)
+{
+ unsigned long addr = start;
+
+ for (; addr < end; addr += PMD_SIZE) {
+ if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static pmd_t * __meminit vmemmap_lookup_address(unsigned long addr)
{
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
- pte_t *pte;
pgd = pgd_offset_k(addr);
if (pgd_none(*pgd))
@@ -309,11 +343,7 @@ static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
if (pmd_none(*pmd))
return NULL;
- pte = pte_offset_kernel(pmd, addr);
- if (pte_none(*pte))
- return NULL;
-
- return pte;
+ return pmd;
}
static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
@@ -335,9 +365,22 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
pgmap_align(pgmap) > SUBSECTION_SIZE) {
- pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
+ pmd_t *pmdp;
+ pte_t *ptep;
+
+ addr = start - PAGE_SIZE;
+ pmdp = vmemmap_lookup_address(addr);
+ if (!pmdp)
+ return -ENOMEM;
+
+ /* Reuse the tail pages vmemmap pmd page */
+ if (offset % pgmap->align > PFN_PHYS(PAGES_PER_SECTION))
+ return vmemmap_populate_pmd_range(start, end, node,
+ page_to_virt(pmd_page(*pmdp)));
- if (!ptep)
+ /* Reuse the tail pages vmemmap page (pte level) */
+ ptep = pte_offset_kernel(pmdp, addr);
+ if (pte_none(*ptep))
return -ENOMEM;
return vmemmap_populate_range(start, end, node,
--
2.17.1
* [PATCH v1 09/11] mm/page_alloc: reuse tail struct pages for compound pagemaps
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
Mike Kravetz, Andrew Morton, Joao Martins
When a pgmap @align is set, all pages are mapped at a given huge page
alignment, and compound pages are used to describe them, as opposed to
one struct page per 4K page.
With @align > PAGE_SIZE and when struct pages are stored in RAM
(!altmap) most tail pages are reused. Consequently, the amount of
unique struct pages is a lot smaller than the total amount of struct
pages being mapped.
When struct pages are initialized in memmap_init_zone_device(), make
sure that only the unique struct pages are initialized, i.e. the first
two 4K pages' worth per @align. That means 128 struct pages, instead
of 512 for a 2M @align or 262144 for a 1G @align.
Keep the old behaviour with altmap, given that it doesn't support
reuse of the tail vmemmap areas.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
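A standalone sketch of the ntails arithmetic, assuming a 64-byte
struct page and 4K base pages:

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define STRUCT_PAGE	64UL
#define MEMMAP_NR_PAGES	(2 * (PAGE_SIZE / STRUCT_PAGE))	/* 128 */

int main(void)
{
	unsigned long aligns[] = { 2UL << 20, 1UL << 30 };
	int i;

	for (i = 0; i < 2; i++) {
		unsigned long pfn_align = aligns[i] / PAGE_SIZE;
		unsigned long ntails = pfn_align < MEMMAP_NR_PAGES ?
				       pfn_align : MEMMAP_NR_PAGES;

		/* Only the struct pages backed by unique vmemmap pages. */
		printf("%4luM align: init %lu of %lu struct pages\n",
		       aligns[i] >> 20, ntails, pfn_align);
	}
	return 0;
}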
mm/page_alloc.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3a77f9e43f3a..948dfad6754b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6277,6 +6277,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
}
}
+#define MEMMAP_NR_PAGES (2 * (PAGE_SIZE/sizeof(struct page)))
+
void __ref memmap_init_zone_device(struct zone *zone,
unsigned long start_pfn,
unsigned long nr_pages,
@@ -6287,6 +6289,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
struct vmem_altmap *altmap = pgmap_altmap(pgmap);
unsigned int pfn_align = pgmap_pfn_align(pgmap);
unsigned int order_align = order_base_2(pfn_align);
+ unsigned long ntails = min_t(unsigned long, pfn_align, MEMMAP_NR_PAGES);
unsigned long zone_idx = zone_idx(zone);
unsigned long start = jiffies;
int nid = pgdat->node_id;
@@ -6302,6 +6305,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
if (altmap) {
start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
nr_pages = end_pfn - start_pfn;
+ ntails = pfn_align;
}
for (pfn = start_pfn; pfn < end_pfn; pfn += pfn_align) {
@@ -6315,7 +6319,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
__SetPageHead(page);
- for (i = 1; i < pfn_align; i++) {
+ for (i = 1; i < ntails; i++) {
__init_zone_device_page(page + i, pfn + i, zone_idx,
nid, pgmap);
prep_compound_tail(page, i);
--
2.17.1
* [PATCH v1 10/11] device-dax: compound pagemap support
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
Mike Kravetz, Andrew Morton, Joao Martins
dax devices are created with a fixed @align (huge page size), which is
also enforced at mmap() of the device. Faults consequently happen at
the @align specified at creation time, and that doesn't change
throughout the dax device's lifetime. An MCE poisons a whole dax huge
page, and splits occur at the configured page size as well.
Use the newly added compound pagemap facility, which maps the assigned
dax ranges as compound pages at a page size of @align.
Currently, this means that region/namespace bootstrap takes
considerably less time, given that considerably fewer pages are
initialized.
On setups with 128G NVDIMMs, initialization with DRAM-stored struct
pages improves from ~268-358 ms to ~78-100 ms with 2M pages, and to
less than 1 ms with 1G pages.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
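A sketch of how a lookup still resolves once only the head page
carries the mapping. These helpers are hypothetical (not part of this
patch); they only illustrate the invariant that set_compound_mapping()
establishes:

static struct address_space *dax_page_mapping(struct page *page)
{
	/* Tail pages leave ->mapping unset; derive it from the head. */
	return compound_head(page)->mapping;
}

static pgoff_t dax_page_index(struct page *page)
{
	struct page *head = compound_head(page);

	/* The tail's offset from the head gives the per-page index. */
	return head->index + (page - head);
}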
drivers/dax/device.c | 58 ++++++++++++++++++++++++++++++++++----------
1 file changed, 45 insertions(+), 13 deletions(-)
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index db92573c94e8..e3dcc4ad1727 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -192,6 +192,43 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
}
#endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
+ unsigned long fault_size,
+ struct address_space *f_mapping)
+{
+ unsigned long i;
+ pgoff_t pgoff;
+
+ pgoff = linear_page_index(vmf->vma, vmf->address
+ & ~(fault_size - 1));
+
+ for (i = 0; i < fault_size / PAGE_SIZE; i++) {
+ struct page *page;
+
+ page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+ if (page->mapping)
+ continue;
+ page->mapping = f_mapping;
+ page->index = pgoff + i;
+ }
+}
+
+static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
+ unsigned long fault_size,
+ struct address_space *f_mapping)
+{
+ struct page *head;
+
+ head = pfn_to_page(pfn_t_to_pfn(pfn));
+ head = compound_head(head);
+ if (head->mapping)
+ return;
+
+ head->mapping = f_mapping;
+ head->index = linear_page_index(vmf->vma, vmf->address
+ & ~(fault_size - 1));
+}
+
static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
enum page_entry_size pe_size)
{
@@ -225,8 +262,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
}
if (rc == VM_FAULT_NOPAGE) {
- unsigned long i;
- pgoff_t pgoff;
+ struct dev_pagemap *pgmap = pfn_t_to_page(pfn)->pgmap;
/*
* In the device-dax case the only possibility for a
@@ -234,17 +270,10 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
* mapped. No need to consider the zero page, or racing
* conflicting mappings.
*/
- pgoff = linear_page_index(vmf->vma, vmf->address
- & ~(fault_size - 1));
- for (i = 0; i < fault_size / PAGE_SIZE; i++) {
- struct page *page;
-
- page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
- if (page->mapping)
- continue;
- page->mapping = filp->f_mapping;
- page->index = pgoff + i;
- }
+ if (pgmap->align > PAGE_SIZE)
+ set_compound_mapping(vmf, pfn, fault_size, filp->f_mapping);
+ else
+ set_page_mapping(vmf, pfn, fault_size, filp->f_mapping);
}
dax_read_unlock(id);
@@ -426,6 +455,9 @@ int dev_dax_probe(struct dev_dax *dev_dax)
}
pgmap->type = MEMORY_DEVICE_GENERIC;
+ if (dev_dax->align > PAGE_SIZE)
+ pgmap->align = dev_dax->align;
+
addr = devm_memremap_pages(dev, pgmap);
if (IS_ERR(addr))
return PTR_ERR(addr);
--
2.17.1
* [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages
From: Joao Martins @ 2021-03-25 23:09 UTC (permalink / raw)
To: linux-mm
Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
Mike Kravetz, Andrew Morton, Joao Martins
Much like hugetlbfs or THPs, treat device pagemaps with
compound pages like the rest of GUP handling of compound pages.
Rather than incrementing the refcount every 4K, we record
all sub pages and increment by @refs amount *once*.
Performance measured by gup_benchmark improves considerably
get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:
$ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
(get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
(pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
[altmap]
(get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
(pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms
$ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
(get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
(pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
[altmap with -m 127004]
(get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
(pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
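A standalone sketch of the batching: one pass records the subpages,
then a single atomic addition of @refs pins the head (assuming a fully
covered 2M compound page; record_subpages() here mirrors the helper
moved in this patch):

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Mirrors record_subpages(): one entry per base page in [addr, end). */
static int record_subpages(unsigned long addr, unsigned long end)
{
	int nr;

	for (nr = 0; addr != end; addr += PAGE_SIZE)
		nr++;
	return nr;
}

int main(void)
{
	unsigned long start = 0, end = 2UL << 20;	/* one 2M page */
	int refs = record_subpages(start, end);

	/* try_grab_compound_head(head, refs, flags) adds @refs once. */
	printf("%d subpages: 1 atomic add of %d, not %d increments\n",
	       refs, refs, refs);
	return 0;
}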
mm/gup.c | 52 ++++++++++++++++++++++++++++++++--------------------
1 file changed, 32 insertions(+), 20 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index b3e647c8b7ee..514f12157a0f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2159,31 +2159,54 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
}
#endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
+
+static int record_subpages(struct page *page, unsigned long addr,
+ unsigned long end, struct page **pages)
+{
+ int nr;
+
+ for (nr = 0; addr != end; addr += PAGE_SIZE)
+ pages[nr++] = page++;
+
+ return nr;
+}
+
#if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
static int __gup_device_huge(unsigned long pfn, unsigned long addr,
unsigned long end, unsigned int flags,
struct page **pages, int *nr)
{
- int nr_start = *nr;
+ int refs, nr_start = *nr;
struct dev_pagemap *pgmap = NULL;
do {
- struct page *page = pfn_to_page(pfn);
+ struct page *head, *page = pfn_to_page(pfn);
+ unsigned long next;
pgmap = get_dev_pagemap(pfn, pgmap);
if (unlikely(!pgmap)) {
undo_dev_pagemap(nr, nr_start, flags, pages);
return 0;
}
- SetPageReferenced(page);
- pages[*nr] = page;
- if (unlikely(!try_grab_page(page, flags))) {
- undo_dev_pagemap(nr, nr_start, flags, pages);
+
+ head = compound_head(page);
+ next = PageCompound(head) ? end : addr + PAGE_SIZE;
+ refs = record_subpages(page, addr, next, pages + *nr);
+
+ SetPageReferenced(head);
+ head = try_grab_compound_head(head, refs, flags);
+ if (!head) {
+ if (PageCompound(head)) {
+ ClearPageReferenced(head);
+ put_dev_pagemap(pgmap);
+ } else {
+ undo_dev_pagemap(nr, nr_start, flags, pages);
+ }
return 0;
}
- (*nr)++;
- pfn++;
- } while (addr += PAGE_SIZE, addr != end);
+ *nr += refs;
+ pfn += refs;
+ } while (addr += (refs << PAGE_SHIFT), addr != end);
if (pgmap)
put_dev_pagemap(pgmap);
@@ -2243,17 +2266,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
}
#endif
-static int record_subpages(struct page *page, unsigned long addr,
- unsigned long end, struct page **pages)
-{
- int nr;
-
- for (nr = 0; addr != end; addr += PAGE_SIZE)
- pages[nr++] = page++;
-
- return nr;
-}
-
#ifdef CONFIG_ARCH_HAS_HUGEPD
static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
unsigned long sz)
--
2.17.1
* Re: [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages
From: Dan Williams @ 2021-06-02 1:05 UTC (permalink / raw)
To: Joao Martins
Cc: Linux MM, Ira Weiny, Matthew Wilcox, Jason Gunthorpe, Jane Chu,
Muchun Song, Mike Kravetz, Andrew Morton, nvdimm
On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Much like hugetlbfs or THPs, treat device pagemaps with
> compound pages like the rest of GUP handling of compound pages.
>
How about:
"Use try_grab_compound_head() for device-dax GUP when configured with
a compound pagemap."
> Rather than incrementing the refcount every 4K, we record
> all sub pages and increment by @refs amount *once*.
"Rather than incrementing the refcount for each page, do one atomic
addition for all the pages to be pinned."
>
> Performance measured by gup_benchmark improves considerably
> get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:
>
> $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
> (get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
> (pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
> [altmap]
> (get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
> (pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms
Hmm what is altmap representing here? The altmap case does not support
compound geometry, so this last test is comparing pinning this amount
of memory without compound pages where the memmap is in PMEM to the
speed *with* compound pages and the memmap in DRAM?
>
> $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
> (get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
> (pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
> [altmap with -m 127004]
> (get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
> (pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
> mm/gup.c | 52 ++++++++++++++++++++++++++++++++--------------------
> 1 file changed, 32 insertions(+), 20 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index b3e647c8b7ee..514f12157a0f 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2159,31 +2159,54 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> }
> #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>
> +
> +static int record_subpages(struct page *page, unsigned long addr,
> + unsigned long end, struct page **pages)
> +{
> + int nr;
> +
> + for (nr = 0; addr != end; addr += PAGE_SIZE)
> + pages[nr++] = page++;
> +
> + return nr;
> +}
> +
> #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> unsigned long end, unsigned int flags,
> struct page **pages, int *nr)
> {
> - int nr_start = *nr;
> + int refs, nr_start = *nr;
> struct dev_pagemap *pgmap = NULL;
>
> do {
> - struct page *page = pfn_to_page(pfn);
> + struct page *head, *page = pfn_to_page(pfn);
> + unsigned long next;
>
> pgmap = get_dev_pagemap(pfn, pgmap);
> if (unlikely(!pgmap)) {
> undo_dev_pagemap(nr, nr_start, flags, pages);
> return 0;
> }
> - SetPageReferenced(page);
> - pages[*nr] = page;
> - if (unlikely(!try_grab_page(page, flags))) {
> - undo_dev_pagemap(nr, nr_start, flags, pages);
> +
> + head = compound_head(page);
> + next = PageCompound(head) ? end : addr + PAGE_SIZE;
This looks a tad messy, and makes assumptions that upper layers are
not sending this routine multiple huge pages to map. next should be
set to the next compound page, not end.
> + refs = record_subpages(page, addr, next, pages + *nr);
> +
> + SetPageReferenced(head);
> + head = try_grab_compound_head(head, refs, flags);
> + if (!head) {
> + if (PageCompound(head)) {
@head is NULL here, I think you wanted to rename the result of
try_grab_compound_head() to something like pinned_head so that you
don't undo the work you did above. However I feel like there's one too
many PageCompound() checks.
> + ClearPageReferenced(head);
> + put_dev_pagemap(pgmap);
> + } else {
> + undo_dev_pagemap(nr, nr_start, flags, pages);
> + }
> return 0;
> }
> - (*nr)++;
> - pfn++;
> - } while (addr += PAGE_SIZE, addr != end);
> + *nr += refs;
> + pfn += refs;
> + } while (addr += (refs << PAGE_SHIFT), addr != end);
>
> if (pgmap)
> put_dev_pagemap(pgmap);
> @@ -2243,17 +2266,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
> }
> #endif
>
> -static int record_subpages(struct page *page, unsigned long addr,
> - unsigned long end, struct page **pages)
> -{
> - int nr;
> -
> - for (nr = 0; addr != end; addr += PAGE_SIZE)
> - pages[nr++] = page++;
> -
> - return nr;
> -}
> -
> #ifdef CONFIG_ARCH_HAS_HUGEPD
> static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
> unsigned long sz)
> --
> 2.17.1
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages
2021-06-02 1:05 ` Dan Williams
@ 2021-06-07 15:21 ` Joao Martins
2021-06-07 19:22 ` Dan Williams
0 siblings, 1 reply; 23+ messages in thread
From: Joao Martins @ 2021-06-07 15:21 UTC (permalink / raw)
To: Dan Williams
Cc: Linux MM, Ira Weiny, Matthew Wilcox, Jason Gunthorpe, Jane Chu,
Muchun Song, Mike Kravetz, Andrew Morton, nvdimm
On 6/2/21 2:05 AM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Much like hugetlbfs or THPs, treat device pagemaps with
>> compound pages like the rest of GUP handling of compound pages.
>>
>
> How about:
>
> "Use try_grab_compound_head() for device-dax GUP when configured with
> a compound pagemap."
>
Yeap, a bit clearer indeed.
>> Rather than incrementing the refcount every 4K, we record
>> all sub pages and increment by @refs amount *once*.
>
> "Rather than incrementing the refcount for each page, do one atomic
> addition for all the pages to be pinned."
>
ACK.
>>
>> Performance measured by gup_benchmark improves considerably for
>> get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:
>>
>> $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
>> (get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
>> (pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
>> [altmap]
>> (get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
>> (pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms
>
> Hmm what is altmap representing here? The altmap case does not support
> compound geometry,
It does support compound geometry and so we use compound pages with the altmap case.
What altmap doesn't support is the memory savings in the vmemmap that can be
done when using compound pages. That's what is represented here.
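To make the tradeoff concrete (back of the envelope, assuming 64-byte
struct pages, and if I have the math right): a 2M compound page covers
512 base pages, i.e. 32K or 8 base pages worth of memmap. With the
vmemmap dedup in this series only 2 of those 8 vmemmap pages need to be
backed by unique memory; with altmap all 8 stay allocated, just placed
in device memory instead of DRAM.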
> so this last test is comparing pinning this amount
> of memory without compound pages where the memmap is in PMEM to the
> speed *with* compound pages and the memmap in DRAM?
>
The test compares pinning this amount of memory with compound pages placed
in PMEM and in DRAM. It just exposes how inefficient this can get if huge pages aren't
represented with compound pages.
>>
>> $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
>> (get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
>> (pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
>> [altmap with -m 127004]
>> (get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
>> (pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> mm/gup.c | 52 ++++++++++++++++++++++++++++++++--------------------
>> 1 file changed, 32 insertions(+), 20 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index b3e647c8b7ee..514f12157a0f 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -2159,31 +2159,54 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>> }
>> #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
>>
>> +
>> +static int record_subpages(struct page *page, unsigned long addr,
>> + unsigned long end, struct page **pages)
>> +{
>> + int nr;
>> +
>> + for (nr = 0; addr != end; addr += PAGE_SIZE)
>> + pages[nr++] = page++;
>> +
>> + return nr;
>> +}
>> +
>> #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>> static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>> unsigned long end, unsigned int flags,
>> struct page **pages, int *nr)
>> {
>> - int nr_start = *nr;
>> + int refs, nr_start = *nr;
>> struct dev_pagemap *pgmap = NULL;
>>
>> do {
>> - struct page *page = pfn_to_page(pfn);
>> + struct page *head, *page = pfn_to_page(pfn);
>> + unsigned long next;
>>
>> pgmap = get_dev_pagemap(pfn, pgmap);
>> if (unlikely(!pgmap)) {
>> undo_dev_pagemap(nr, nr_start, flags, pages);
>> return 0;
>> }
>> - SetPageReferenced(page);
>> - pages[*nr] = page;
>> - if (unlikely(!try_grab_page(page, flags))) {
>> - undo_dev_pagemap(nr, nr_start, flags, pages);
>> +
>> + head = compound_head(page);
>> + next = PageCompound(head) ? end : addr + PAGE_SIZE;
>
> This looks a tad messy, and makes assumptions that upper layers are
> not sending this routine multiple huge pages to map. next should be
> set to the next compound page, not end.
Although for devmap (and the same could be said for hugetlbfs), __gup_device_huge() (as called
by __gup_device_huge_{pud,pmd}) would only ever be called on a compound page representing
that same level, as opposed to many compound pages, i.e. @end already represents
the next compound page at the PMD or PUD level.
But of course, should we represent devmap pages in geometries with
hpagesize/align values other than PMD or PUD size, then it's true that relying on the @end
value being the next compound page is fragile. But so is the rest of the surrounding code.
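For reference, this is roughly what the PMD-level caller looks like in
mm/gup.c; gup_pmd_range() has already clamped @end with pmd_addr_end()
before we get here, so the range never spans more than one huge page:

static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
				 unsigned long end, unsigned int flags,
				 struct page **pages, int *nr)
{
	unsigned long fault_pfn;
	int nr_start = *nr;

	/* pfn of the base page that @addr falls on within this PMD */
	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
		return 0;

	/* raced with a PMD update: drop everything we grabbed */
	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
		undo_dev_pagemap(nr, nr_start, flags, pages);
		return 0;
	}
	return 1;
}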
>
>> + refs = record_subpages(page, addr, next, pages + *nr);
>> +
>> + SetPageReferenced(head);
>> + head = try_grab_compound_head(head, refs, flags);
>> + if (!head) {
>> + if (PageCompound(head)) {
>
> @head is NULL here, I think you wanted to rename the result of
> try_grab_compound_head() to something like pinned_head so that you
> don't undo the work you did above.
Yes. pinned_head is what I actually should have written. Let me fix that.
> However I feel like there's one too
> many PageCompound() checks.
>
I agree, but I am not fully sure how I can remove them :(
The previous approach was to separate the logic into two distinct helpers namely
__gup_device_huge() and __gup_device_compound_huge(). But that sort of special casing
wasn't a good idea, so I tried merging both cases in __gup_device_huge() solely
differentiating on PageCompound().
I could make this slightly less bad by moving the error case PageCompound checks to
undo_dev_pagemap() and record_subpages().
But we still have the pagemap refcount to be taken until your other series removes the
need for it. So perhaps I should place the remaining PageCompound based check inside
record_subpages() to accommodate the PAGE_SIZE geometry case (similarly hinted by Jason in
the previous version but that I didn't fully address).
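Something along these lines (untested sketch; the extra @head argument
is just for illustration):

static int record_subpages(struct page *head, struct page *page,
			   unsigned long addr, unsigned long end,
			   struct page **pages)
{
	int nr;

	/* PAGE_SIZE geometry: a non-compound devmap page is recorded alone */
	if (!PageCompound(head)) {
		pages[0] = page;
		return 1;
	}

	/* compound geometry: record every subpage covered by [addr, end) */
	for (nr = 0; addr != end; addr += PAGE_SIZE)
		pages[nr++] = page++;

	return nr;
}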
How does the above sound?
Longterm, once we stop having devmap use non-compound struct pages on PMDs/PUDs and the
pgmap refcount on gup is removed, then perhaps we can move to the existing regular huge page
path that is not devmap specific.
>
>> + ClearPageReferenced(head);
>> + put_dev_pagemap(pgmap);
>> + } else {
>> + undo_dev_pagemap(nr, nr_start, flags, pages);
>> + }
>> return 0;
>> }
>> - (*nr)++;
>> - pfn++;
>> - } while (addr += PAGE_SIZE, addr != end);
>> + *nr += refs;
>> + pfn += refs;
>> + } while (addr += (refs << PAGE_SHIFT), addr != end);
>>
>> if (pgmap)
>> put_dev_pagemap(pgmap);
>> @@ -2243,17 +2266,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
>> }
>> #endif
>>
>> -static int record_subpages(struct page *page, unsigned long addr,
>> - unsigned long end, struct page **pages)
>> -{
>> - int nr;
>> -
>> - for (nr = 0; addr != end; addr += PAGE_SIZE)
>> - pages[nr++] = page++;
>> -
>> - return nr;
>> -}
>> -
>> #ifdef CONFIG_ARCH_HAS_HUGEPD
>> static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
>> unsigned long sz)
>> --
>> 2.17.1
>>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages
2021-06-07 15:21 ` Joao Martins
@ 2021-06-07 19:22 ` Dan Williams
0 siblings, 0 replies; 23+ messages in thread
From: Dan Williams @ 2021-06-07 19:22 UTC (permalink / raw)
To: Joao Martins
Cc: Linux MM, Ira Weiny, Matthew Wilcox, Jason Gunthorpe, Jane Chu,
Muchun Song, Mike Kravetz, Andrew Morton, nvdimm
On Mon, Jun 7, 2021 at 8:22 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 6/2/21 2:05 AM, Dan Williams wrote:
> > On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> Much like hugetlbfs or THPs, treat device pagemaps with
> >> compound pages like the rest of GUP handling of compound pages.
> >>
> >
> > How about:
> >
> > "Use try_grab_compound_head() for device-dax GUP when configured with
> > a compound pagemap."
> >
> Yeap, a bit clearer indeed.
>
> >> Rather than incrementing the refcount every 4K, we record
> >> all sub pages and increment by @refs amount *once*.
> >
> > "Rather than incrementing the refcount for each page, do one atomic
> > addition for all the pages to be pinned."
> >
> ACK.
>
> >>
> >> Performance measured by gup_benchmark improves considerably for
> >> get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:
> >>
> >> $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
> >> (get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
> >> (pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
> >> [altmap]
> >> (get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
> >> (pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms
> >
> > Hmm what is altmap representing here? The altmap case does not support
> > compound geometry,
>
> It does support compound geometry and so we use compound pages with the altmap case.
> What altmap doesn't support is the memory savings in the vmemmap that can be
> done when using compound pages. That's what is represented here.
Ah, I missed that detail, might be good to mention this in the
Documentation/vm/ overview doc for this capability.
>
> > so this last test is comparing pinning this amount
> > of memory without compound pages where the memmap is in PMEM to the
> > speed *with* compound pages and the memmap in DRAM?
> >
> The test compares pinning this amount of memory with compound pages placed
> in PMEM and in DRAM. It just exposes how inefficient this can get if huge pages aren't
> represented with compound pages.
Got it.
>
> >>
> >> $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
> >> (get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
> >> (pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
> >> [altmap with -m 127004]
> >> (get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
> >> (pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms
> >>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >> mm/gup.c | 52 ++++++++++++++++++++++++++++++++--------------------
> >> 1 file changed, 32 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index b3e647c8b7ee..514f12157a0f 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> >> @@ -2159,31 +2159,54 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> >> }
> >> #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
> >>
> >> +
> >> +static int record_subpages(struct page *page, unsigned long addr,
> >> + unsigned long end, struct page **pages)
> >> +{
> >> + int nr;
> >> +
> >> + for (nr = 0; addr != end; addr += PAGE_SIZE)
> >> + pages[nr++] = page++;
> >> +
> >> + return nr;
> >> +}
> >> +
> >> #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> >> static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> >> unsigned long end, unsigned int flags,
> >> struct page **pages, int *nr)
> >> {
> >> - int nr_start = *nr;
> >> + int refs, nr_start = *nr;
> >> struct dev_pagemap *pgmap = NULL;
> >>
> >> do {
> >> - struct page *page = pfn_to_page(pfn);
> >> + struct page *head, *page = pfn_to_page(pfn);
> >> + unsigned long next;
> >>
> >> pgmap = get_dev_pagemap(pfn, pgmap);
> >> if (unlikely(!pgmap)) {
> >> undo_dev_pagemap(nr, nr_start, flags, pages);
> >> return 0;
> >> }
> >> - SetPageReferenced(page);
> >> - pages[*nr] = page;
> >> - if (unlikely(!try_grab_page(page, flags))) {
> >> - undo_dev_pagemap(nr, nr_start, flags, pages);
> >> +
> >> + head = compound_head(page);
> >> + next = PageCompound(head) ? end : addr + PAGE_SIZE;
> >
> > This looks a tad messy, and makes assumptions that upper layers are
> > not sending this routine multiple huge pages to map. next should be
> > set to the next compound page, not end.
>
> Although for devmap (and the same could be said for hugetlbfs), __gup_device_huge() (as called
> by __gup_device_huge_{pud,pmd}) would only ever be called on a compound page representing
> that same level, as opposed to many compound pages, i.e. @end already represents
> the next compound page at the PMD or PUD level.
>
> But of course, should we represent devmap pages in geometries with
> hpagesize/align values other than PMD or PUD size, then it's true that relying on the @end
> value being the next compound page is fragile. But so is the rest of the surrounding code.
Ok, for now maybe a:
/* @end is assumed to be limited to at most 1 compound page */
...would remind whoever refactors this later about the assumption.
>
> >
> >> + refs = record_subpages(page, addr, next, pages + *nr);
> >> +
> >> + SetPageReferenced(head);
> >> + head = try_grab_compound_head(head, refs, flags);
> >> + if (!head) {
> >> + if (PageCompound(head)) {
> >
> > @head is NULL here, I think you wanted to rename the result of
> > try_grab_compound_head() to something like pinned_head so that you
> > don't undo the work you did above.
>
> Yes. pinned_head is what I actually should have written. Let me fix that.
>
> > However I feel like there's one too
> > many PageCompound() checks.
> >
>
> I agree, but I am not fully sure how I can remove them :(
If you fix the bug above that's sufficient for me; I may be wishing
for something prettier that is not possible in practice...
>
> The previous approach was to separate the logic into two distinct helpers namely
> __gup_device_huge() and __gup_device_compound_huge(). But that sort of special casing
> wasn't a good idea, so I tried merging both cases in __gup_device_huge() solely
> differentiating on PageCompound().
>
> I could make this slightly less bad by moving the error case PageCompound checks to
> undo_dev_pagemap() and record_subpages().
>
> But we still have the pagemap refcount to be taken until your other series removes the
> need for it. So perhaps I should place the remaining PageCompound based check inside
> record_subpages() to accommodate the PAGE_SIZE geometry case (similarly hinted by Jason in
> the previous version but that I didn't fully address).
>
> How does the above sound?
Sounds worth a try, but not a hard requirement for this to move
forward from my perspective.
>
> Longterm, once we stop having devmap use non-compound struct pages on PMDs/PUDs and the
> pgmap refcount on gup is removed, then perhaps we can move to the existing regular huge page
> path that is not devmap specific.
>
Ok.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps
2021-03-25 23:09 [PATCH v1 00/11] mm, sparse-vmemmap: Introduce compound pagemaps Joao Martins
` (10 preceding siblings ...)
2021-03-25 23:09 ` [PATCH v1 11/11] mm/gup: grab head page refcount once for group of subpages Joao Martins
@ 2021-04-01 9:38 ` Joao Martins
11 siblings, 0 replies; 23+ messages in thread
From: Joao Martins @ 2021-04-01 9:38 UTC (permalink / raw)
To: linux-mm
Cc: linux-nvdimm, Matthew Wilcox, Jason Gunthorpe, Muchun Song,
Mike Kravetz, Andrew Morton
On 3/25/21 11:09 PM, Joao Martins wrote:
[...]
> Patch 11: Optimize grabbing of page refcounts given that we
> are working with compound pages, i.e. we do 1 increment to the head
> page for a given set of N subpages as opposed to N individual writes.
> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
> improves considerably with DRAM-stored struct pages. It also *greatly*
> improves pinning with altmap. Results with gup_test:
>
> before after
> (16G get_user_pages_fast 2M page size) ~59 ms -> ~6.1 ms
> (16G pin_user_pages_fast 2M page size) ~87 ms -> ~6.2 ms
> (16G get_user_pages_fast altmap 2M page size) ~494 ms -> ~9 ms
> (16G pin_user_pages_fast altmap 2M page size) ~494 ms -> ~10 ms
>
> altmap performance gets specially interesting when pinning a pmem dimm:
>
> before after
> (128G get_user_pages_fast 2M page size) ~492 ms -> ~49 ms
> (128G pin_user_pages_fast 2M page size) ~493 ms -> ~50 ms
> (128G get_user_pages_fast altmap 2M page size) ~3.91 ms -> ~70 ms
> (128G pin_user_pages_fast altmap 2M page size) ~3.97 ms -> ~74 ms
>
Quick correction: the last two values in the left column, 3.91 and 3.97, are in *seconds* not
milliseconds. By mistake I added an extra 'm'. Sorry about that.
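For context, the gain on the gup side comes from taking all the
references with a single atomic operation, roughly what
try_get_compound_head() in mm/gup.c does (simplified here, FOLL_PIN
accounting left out):

static inline struct page *try_get_compound_head(struct page *page, int refs)
{
	struct page *head = compound_head(page);

	if (WARN_ON_ONCE(page_ref_count(head) < 0))
		return NULL;
	/* one atomic add covering all @refs subpages */
	if (unlikely(!page_cache_add_speculative(head, refs)))
		return NULL;
	return head;
}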
> The unpinning improvement patches are in mmotm/linux-next so removed from this
> series.
>
> I have deferred the __get_user_pages() patch to outside this series
> (https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.martins@oracle.com/),
> as I found a simpler way to address it that is also applicable to
> THP. But I will submit that as a follow-up to this.
>
> Patches apply on top of linux-next tag next-20210325 (commit b4f20b70784a).
>
> Comments and suggestions very much appreciated!
>
> Changelog,
>
> RFC -> v1:
> (New patches 1-3, 5-8 but the diffstat is that different)
> * Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
> * Fix/Massage commit messages to be more clear and remove the 'we' occurrences (Dan, John, Matthew)
I just noticed that I haven't fully removed the 'we' occurrences. Patches 7 and 8 had
their commit messages rewritten and I mistakenly re-introduced remnants of 'we'. I will
have it fixed for v2, albeit I'll still wait for comments on this series before following up.
> * Use pfn_align to be clear it's nr of pages for @align value (John, Dan)
> * Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
> * Remove the gup_device_compound_huge special path and have the same code
> work both ways while special casing when devmap page is compound (Jason, John)
> * Avoid usage of vmemmap_populate_basepages() and introduce a first class
> loop that doesn't care about passing an altmap for memmap reuse. (Dan)
> * Completely rework vmemmap_populate_compound() to avoid the sparse_add_section
> hack of passing a block across sparse_add_section() calls. It's a lot easier to
> follow and more explicit in what it does.
> * Replace the vmemmap refactoring with adding a @pgmap argument and moving
> parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a result)
> * Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
> * Improve memmap_init_zone_device() to initialize compound pages when
> struct pages are cache warm. That led to an even further speedup from the
> RFC series, from 190ms -> 80-120ms. Patches 2 and 3 are the new ones
> as a result (Dan)
> * Remove PGMAP_COMPOUND and use @align as the property to detect whether
> or not to reuse vmemmap areas (Dan)
>
> Thanks,
> Joao
>
> Joao Martins (11):
> memory-failure: fetch compound_head after pgmap_pfn_valid()
> mm/page_alloc: split prep_compound_page into head and tail subparts
> mm/page_alloc: refactor memmap_init_zone_device() page init
> mm/memremap: add ZONE_DEVICE support for compound pages
> mm/sparse-vmemmap: add a pgmap argument to section activation
> mm/sparse-vmemmap: refactor vmemmap_populate_basepages()
> mm/sparse-vmemmap: populate compound pagemaps
> mm/sparse-vmemmap: use hugepages for PUD compound pagemaps
> mm/page_alloc: reuse tail struct pages for compound pagemaps
> device-dax: compound pagemap support
> mm/gup: grab head page refcount once for group of subpages
>
> drivers/dax/device.c | 58 +++++++--
> include/linux/memory_hotplug.h | 5 +-
> include/linux/memremap.h | 13 ++
> include/linux/mm.h | 8 +-
> mm/gup.c | 52 +++++---
> mm/memory-failure.c | 2 +
> mm/memory_hotplug.c | 3 +-
> mm/memremap.c | 9 +-
> mm/page_alloc.c | 126 +++++++++++++------
> mm/sparse-vmemmap.c | 221 +++++++++++++++++++++++++++++----
> mm/sparse.c | 24 ++--
> 11 files changed, 406 insertions(+), 115 deletions(-)
>
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org
^ permalink raw reply [flat|nested] 23+ messages in thread