* [PATCH v5 0/5] sparse-vmemmap: memory savings for compound devmaps (device-dax)
@ 2022-02-10 19:33 Joao Martins
  2022-02-10 19:33 ` [PATCH v5 1/5] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Joao Martins @ 2022-02-10 19:33 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, Christoph Hellwig, nvdimm, linux-doc,
	Joao Martins

Changes since v4[0]:
 * Rebased to next-20220210.
 * Adjust patch 3, given Muchun's changes to the comment block; retained
 the Rb tags considering it is still just a move of text.
 * Rename @geometry to @vmemmap_shift in all patches/cover-letter.
 * Rename pgmap_geometry() calls to pgmap_vmemmap_nr().
 * HugeTLB in mmotm now remaps all but the first head vmemmap page,
 hence adjust patch 4 to also document how device-dax is slightly different
 in the vmemmap pagetable setup (1 less deduplicated page per hugepage).
 * Remove the last patch that reuses PMD tail pages[1], to be a follow up
 after getting the core improvement first.
 * Rework cover-letter.

[0] https://lore.kernel.org/linux-mm/20210827145819.16471-1-joao.m.martins@oracle.com/
[1] https://lore.kernel.org/linux-mm/20210827145819.16471-15-joao.m.martins@oracle.com/

This used to be part of v4, but it was split into three subsets:
 1) compound page conversion of device-dax (in v5.17), 2) vmemmap deduplication
 for device-dax (this series) and 3) GUP improvements (to be respun).

Full changelog (only for this series subset) at the bottom of cover letter.

---

This series minimizes 'struct page' overhead by pursuing a similar approach to
Muchun Song's series "Free some vmemmap pages of hugetlb page" (merged since
v5.14), but applied to devmaps with compound pages.

The original idea of vmemmap deduplication (already used in HugeTLB) is to
reuse/deduplicate tail page vmemmap areas, in particular the areas which only
describe tail pages. A vmemmap page describes 64 struct pages, so the first
vmemmap page for a given ZONE_DEVICE range contains the head page and 63
tail pages. The second vmemmap page contains only tail pages, and that is
what gets reused across the rest of the subsection/section. The bigger the page
size, the bigger the savings (2M hpage -> save 6 vmemmap pages;
1G hpage -> save 4094 vmemmap pages).
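
As a back-of-the-envelope check of the numbers above (assuming 4K base pages
and a 64-byte struct page, as on x86-64):

	2M compound page: 512 struct pages * 64 bytes = 8 vmemmap pages
	                  keep 1 head + 1 tail vmemmap page -> save 6
	1G compound page: 262144 struct pages * 64 bytes = 4096 vmemmap pages
	                  keep 1 head + 1 tail vmemmap page -> save 4094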

This is done for PMEM, /specifically only/ on device-dax configured
namespaces, not fsdax. In other words, a devmap with a @vmemmap_shift.
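
For reference, the @vmemmap_shift knob lives in struct dev_pagemap and is
translated into a number of base pages via pgmap_vmemmap_nr(); roughly the
following (illustrative sketch, not copied verbatim from the tree):

	/* compound page size, in base pages, for a given devmap */
	static inline unsigned long pgmap_vmemmap_nr(struct dev_pagemap *pgmap)
	{
		return 1 << pgmap->vmemmap_shift;
	}

	/* e.g. device-dax with 2M alignment: vmemmap_shift == 9 -> 512 pages */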

In terms of savings, per 1TB of memory, the struct page cost would go down
with a compound devmap (see the quick check after the list):

* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
* with 1G pages we lose 40MB instead of 16G (0.0014% instead of 1.5% of total memory)
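
Quick check of the 2M case (again assuming 4K base pages and a 64-byte
struct page):

	1TB / 2M        = 524288 compound pages
	524288 * 2 * 4K = 4G of vmemmap kept resident (vs 16G without reuse)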

The series is mostly summed up by patch 4. To summarize what each patch does:

Patches 1 - 3: Minor cleanups in preparation for patch 4.  Move the very nice
docs of hugetlb_vmemmap.c into a Documentation/vm/ entry and adjust them to the
device-dax case.

Patch 4: Patch 4 is the one that takes care of the struct page savings (also
referred to here as tail-page/vmemmap deduplication). Much like Muchun's
series, we reuse the second PTE tail page vmemmap areas across a given
@vmemmap_shift. One important difference, though, is that contrary to the
hugetlbfs series, there is no pre-existing vmemmap for the area because we are
late-populating it, as opposed to remapping a system-RAM range. IOW there is no
freeing of pages of already-initialized vmemmap as in the hugetlbfs case, which
greatly simplifies the logic (besides not being arch-specific). The altmap case
is unchanged and still goes via vmemmap_populate().
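
For illustration, the per-compound-page loop that patch 4 adds (see
vmemmap_populate_compound_pages() further below) boils down to the following
sketch, with the cross-section reuse path and error handling dropped for
brevity:

	size = pgmap_vmemmap_nr(pgmap) * sizeof(struct page);
	for (addr = start; addr < end; addr += size) {
		struct page *block = NULL;

		/* fresh vmemmap page for the head struct page (+ 63 tails) */
		vmemmap_populate_page(addr, node, NULL);
		/* fresh vmemmap page holding only tail struct pages */
		vmemmap_populate_page(addr + PAGE_SIZE, node, &block);
		/* remaining vmemmap addresses all map to that tail page */
		vmemmap_populate_range(addr + 2 * PAGE_SIZE, addr + size,
				       node, block);
	}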

[Note that device-dax is still a little behind HugeTLB in terms of savings.
I have an additional simple patch that reuses the head vmemmap page too,
as a follow-up. That will double the savings and further speed up namespace
initialization.]

Patch 5: Initialize fewer struct pages depending on the page size, for DRAM
backed struct pages -- because with a bigger vmemmap_shift fewer struct pages
are unique and most are tail pages.

    NVDIMM namespace bootstrap improves from ~268-358 ms to ~80-110ms/<1ms on 128G NVDIMMs
    with 2M and 1G pages respectively. And the struct page capacity needed will be 3.8x / 1071x
    smaller for 2M and 1G respectively. Tested on x86 with 1TB+ of pmem.

Patches apply on top of linux-next tag next-20220210 (commit 395a61741f7e).

Comments and suggestions very much appreciated!

Older changelog:

v3[3] -> v4[4]:

 * Collect Dan's Reviewed-by on patches 8,9,11
 * Collect Muchun's Reviewed-by on patches 1,2,11
 * Reorder patches to first introduce compound pages in ZONE_DEVICE with
 device-dax (for pmem) as first user (patches 1-8) followed by implementing
 the sparse-vmemmap changes to minimize struct page overhead for devmap (patches 9-14)
 * Eliminate remnant @align references to use @geometry (Dan)
 * Convert mentions of 'compound pagemap' to 'compound devmap' throughout
   the series to avoid confusion about this work conflicting with/referring to
   anything Folio or pagemap related.
 * Delete pgmap_pfn_geometry() on patch 4
   and rework other patches to use pgmap_geometry() instead (Dan)
 * Convert @geometry to be a number of pages rather than page size in patch 4 (Dan)
 * Make pgmap_geometry() more readable (Christoph)
 * Fix kdoc of @altmap and improve kdoc for @pgmap in patch 9 (Dan)
 * Fix up missing return in vmemmap_populate_address() in patch 10
 * Change error handling style in all patches (Dan)
 * Change title of vmemmap_dedup.rst to be more representative of the purpose in patch 12 (Dan)
 * Move some of the section and subsection tail page reuse code into helpers
 reuse_compound_section() and compound_section_tail_page() for readability in patch 12 (Dan)
 * Commit description fixes for clarity in various patches (Dan)
 * Add pgmap_geometry_order() helper and
   drop unneeded geometry_size, order variables in patch 12
 * Drop unneeded byte based computation to be PFN in patch 12
 * Add a compound_nr_pages() helper and use it in memmap_init_zone_device to calculate
 the number of unique struct pages to initialize depending on @altmap existence in patch 13 (Dan)
 * Add compound_section_tail_huge_page() for the tail page PMD reuse in patch 14 (Dan)
 * Reword cover letter.

v2 -> v3[3]:
 * Rename compound_pagemaps.rst doc page (and its mentions) to vmemmap_dedup.rst (Mike, Muchun)
 * Rebased to next-20210714

v1[1] -> v2[2]:

 (New patches 7, 10, 11)
 * Remove occurrences of 'we' in the commit descriptions (now for real) [Dan]
 * Massage commit descriptions of cleanup/refactor patches to reflect
 that it's in preparation for bigger infra in sparse-vmemmap. (Patch 5) [Dan]
 * Greatly improve all commit messages in terms of grammar/wording and clarity. [Dan]
 * Simplify patch 9 as a result of having compound initialization differently [Dan]
 * Rename Subject of patch 6 [Dan]
 * Move hugetlb_vmemmap.c comment block to Documentation/vm Patch 7 [Dan]
 * Add some type-safety to @block and use 'struct page *' rather than
 void, Patch 8 [Dan]
 * Add some comments to less obvious parts on 1G compound page case, Patch 8 [Dan]
 * Remove vmemmap lookup function in favor of
 pmd_off_k() + pte_offset_kernel() given some guarantees on section onlining
 serialization, Patch 8
 * Add a comment to get_page() mentioning where/how it is freed, Patch 8 [Dan]
 * Add docs about device-dax usage of tail dedup technique in newly added
 compound_pagemaps.rst doc entry.
 * Rebased to next-20210617 

 RFC[0] -> v1:
 (New patches 1-3, 5-8 but the diffstat isn't that different)
 * Fix/Massage commit messages to be more clear and remove the 'we' occurrences (Dan, John, Matthew)
 * Use pfn_align to be clear it's nr of pages for @align value (John, Dan)
 * Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
 * Avoid usage of vmemmap_populate_basepages() and introduce a first class
   loop that doesn't care about passing an altmap for memmap reuse. (Dan)
 * Completely rework the vmemmap_populate_compound() to avoid the sparse_add_section
   hack of passing block across sparse_add_section calls. It's a lot easier to
   follow and more explicit in what it does.
 * Replace the vmemmap refactoring with adding a @pgmap argument and moving
   parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a result)
 * Improve memmap_init_zone_device() to initialize compound pages when
   struct pages are cache warm. That led to an even further speedup over the
   RFC series, from 190ms -> 80-120ms. Patches 2 and 3 are the new ones
   as a result (Dan)
 * Remove PGMAP_COMPOUND and use @align as the property to detect whether
   or not to reuse vmemmap areas (Dan)

[0] https://lore.kernel.org/linux-mm/20201208172901.17384-1-joao.m.martins@oracle.com/
[1] https://lore.kernel.org/linux-mm/20210325230938.30752-1-joao.m.martins@oracle.com/
[2] https://lore.kernel.org/linux-mm/20210617184507.3662-1-joao.m.martins@oracle.com/
[3] https://lore.kernel.org/linux-mm/20210617184507.3662-1-joao.m.martins@oracle.com/
[4] https://lore.kernel.org/linux-mm/20210827145819.16471-1-joao.m.martins@oracle.com/

Joao Martins (5):
  mm/sparse-vmemmap: add a pgmap argument to section activation
  mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to
    helper
  mm/hugetlb_vmemmap: move comment block to Documentation/vm
  mm/sparse-vmemmap: improve memory savings for compound devmaps
  mm/page_alloc: reuse tail struct pages for compound devmaps

 Documentation/vm/index.rst         |   1 +
 Documentation/vm/vmemmap_dedup.rst | 225 +++++++++++++++++++++++++++++
 include/linux/memory_hotplug.h     |   5 +-
 include/linux/mm.h                 |   5 +-
 mm/hugetlb_vmemmap.c               | 168 +--------------------
 mm/memory_hotplug.c                |   3 +-
 mm/memremap.c                      |   1 +
 mm/page_alloc.c                    |  16 +-
 mm/sparse-vmemmap.c                | 196 ++++++++++++++++++++++---
 mm/sparse.c                        |  26 ++--
 10 files changed, 440 insertions(+), 206 deletions(-)
 create mode 100644 Documentation/vm/vmemmap_dedup.rst

-- 
2.17.2



* [PATCH v5 1/5] mm/sparse-vmemmap: add a pgmap argument to section activation
  2022-02-10 19:33 [PATCH v5 0/5] sparse-vmemmap: memory savings for compound devmaps (device-dax) Joao Martins
@ 2022-02-10 19:33 ` Joao Martins
  2022-02-11  8:03   ` Muchun Song
  2022-02-10 19:33 ` [PATCH v5 2/5] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Joao Martins @ 2022-02-10 19:33 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, Christoph Hellwig, nvdimm, linux-doc,
	Joao Martins

In support of using compound pages for devmap mappings, plumb the pgmap
down to the vmemmap_populate implementation. Note that while altmap is
retrievable from pgmap, the memory hotplug code passes altmap without
pgmap[*], so both need to be independently plumbed.

So in addition to @altmap, pass @pgmap to sparse section populate
functions namely:

	sparse_add_section
	  section_activate
	    populate_section_memmap
   	      __populate_section_memmap

Passing @pgmap allows __populate_section_memmap() to both fetch the
vmemmap_shift with which memmap metadata is created and also lets
sparse-vmemmap fetch pgmap ranges to correlate to a given section, to pick
whether to just reuse tail pages from past onlined sections.

While at it, fix the kdoc for @altmap for sparse_add_section().

[*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/memory_hotplug.h |  5 ++++-
 include/linux/mm.h             |  3 ++-
 mm/memory_hotplug.c            |  3 ++-
 mm/sparse-vmemmap.c            |  3 ++-
 mm/sparse.c                    | 26 ++++++++++++++++----------
 5 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 76bf2de86def..405aa2b4ae4f 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -15,6 +15,7 @@ struct memory_block;
 struct memory_group;
 struct resource;
 struct vmem_altmap;
+struct dev_pagemap;
 
 #ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
 /*
@@ -122,6 +123,7 @@ typedef int __bitwise mhp_t;
 struct mhp_params {
 	struct vmem_altmap *altmap;
 	pgprot_t pgprot;
+	struct dev_pagemap *pgmap;
 };
 
 bool mhp_range_allowed(u64 start, u64 size, bool need_mapping);
@@ -335,7 +337,8 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
 				       unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern int sparse_add_section(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap);
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap);
 extern void sparse_remove_section(struct mem_section *ms,
 		unsigned long pfn, unsigned long nr_pages,
 		unsigned long map_offset, struct vmem_altmap *altmap);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf60d947503e..f6a439582f63 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3165,7 +3165,8 @@ int vmemmap_remap_alloc(unsigned long start, unsigned long end,
 
 void *sparse_buffer_alloc(unsigned long size);
 struct page * __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap);
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ce68098832aa..79e96b6a9036 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -328,7 +328,8 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages = min(end_pfn - pfn,
 				   SECTION_ALIGN_UP(pfn + 1) - pfn);
-		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap);
+		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap,
+					 params->pgmap);
 		if (err)
 			break;
 		cond_resched();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 8aecd6b3896c..c506f77cff23 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -641,7 +641,8 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 }
 
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
diff --git a/mm/sparse.c b/mm/sparse.c
index 952f06d8f373..d2d76d158b39 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -427,7 +427,8 @@ static unsigned long __init section_map_size(void)
 }
 
 struct page __init *__populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long size = section_map_size();
 	struct page *map = sparse_buffer_alloc(size);
@@ -524,7 +525,7 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 			break;
 
 		map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
-				nid, NULL);
+				nid, NULL, NULL);
 		if (!map) {
 			pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
 			       __func__, nid);
@@ -629,9 +630,10 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 static struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
-	return __populate_section_memmap(pfn, nr_pages, nid, altmap);
+	return __populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
 }
 
 static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
@@ -700,7 +702,8 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
 }
 #else
 struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	return kvmalloc_node(array_size(sizeof(struct page),
 					PAGES_PER_SECTION), GFP_KERNEL, nid);
@@ -823,7 +826,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 }
 
 static struct page * __meminit section_activate(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	struct mem_section *ms = __pfn_to_section(pfn);
 	struct mem_section_usage *usage = NULL;
@@ -855,7 +859,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
 	if (nr_pages < PAGES_PER_SECTION && early_section(ms))
 		return pfn_to_page(pfn);
 
-	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap);
+	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
 	if (!memmap) {
 		section_deactivate(pfn, nr_pages, altmap);
 		return ERR_PTR(-ENOMEM);
@@ -869,7 +873,8 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  * @nid: The node to add section on
  * @start_pfn: start pfn of the memory range
  * @nr_pages: number of pfns to add in the section
- * @altmap: device page map
+ * @altmap: alternate pfns to allocate the memmap backing store
+ * @pgmap: alternate compound page geometry for devmap mappings
  *
  * This is only intended for hotplug.
  *
@@ -883,7 +888,8 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  * * -ENOMEM	- Out of memory.
  */
 int __meminit sparse_add_section(int nid, unsigned long start_pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
 	struct mem_section *ms;
@@ -894,7 +900,7 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 	if (ret < 0)
 		return ret;
 
-	memmap = section_activate(nid, start_pfn, nr_pages, altmap);
+	memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
 	if (IS_ERR(memmap))
 		return PTR_ERR(memmap);
 
-- 
2.17.2



* [PATCH v5 2/5] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper
  2022-02-10 19:33 [PATCH v5 0/5] sparse-vmemmap: memory savings for compound devmaps (device-dax) Joao Martins
  2022-02-10 19:33 ` [PATCH v5 1/5] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
@ 2022-02-10 19:33 ` Joao Martins
  2022-02-11  7:54   ` Muchun Song
  2022-02-10 19:33 ` [PATCH v5 3/5] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Joao Martins @ 2022-02-10 19:33 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, Christoph Hellwig, nvdimm, linux-doc,
	Joao Martins

In preparation for describing a memmap with compound pages, move the
actual pte population logic into a separate function
vmemmap_populate_address() and have vmemmap_populate_basepages() walk
through all base pages it needs to populate.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/sparse-vmemmap.c | 51 ++++++++++++++++++++++++++++-----------------
 1 file changed, 32 insertions(+), 19 deletions(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index c506f77cff23..e7be2ef4454b 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -608,33 +608,46 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
-					 int node, struct vmem_altmap *altmap)
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+					      struct vmem_altmap *altmap)
 {
-	unsigned long addr = start;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
+	pgd = vmemmap_pgd_populate(addr, node);
+	if (!pgd)
+		return -ENOMEM;
+	p4d = vmemmap_p4d_populate(pgd, addr, node);
+	if (!p4d)
+		return -ENOMEM;
+	pud = vmemmap_pud_populate(p4d, addr, node);
+	if (!pud)
+		return -ENOMEM;
+	pmd = vmemmap_pmd_populate(pud, addr, node);
+	if (!pmd)
+		return -ENOMEM;
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+	if (!pte)
+		return -ENOMEM;
+	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+
+	return 0;
+}
+
+int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
+					 int node, struct vmem_altmap *altmap)
+{
+	unsigned long addr = start;
+	int rc;
+
 	for (; addr < end; addr += PAGE_SIZE) {
-		pgd = vmemmap_pgd_populate(addr, node);
-		if (!pgd)
-			return -ENOMEM;
-		p4d = vmemmap_p4d_populate(pgd, addr, node);
-		if (!p4d)
-			return -ENOMEM;
-		pud = vmemmap_pud_populate(p4d, addr, node);
-		if (!pud)
-			return -ENOMEM;
-		pmd = vmemmap_pmd_populate(pud, addr, node);
-		if (!pmd)
-			return -ENOMEM;
-		pte = vmemmap_pte_populate(pmd, addr, node, altmap);
-		if (!pte)
-			return -ENOMEM;
-		vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+		rc = vmemmap_populate_address(addr, node, altmap);
+		if (rc)
+			return rc;
+
 	}
 
 	return 0;
-- 
2.17.2



* [PATCH v5 3/5] mm/hugetlb_vmemmap: move comment block to Documentation/vm
  2022-02-10 19:33 [PATCH v5 0/5] sparse-vmemmap: memory savings for compound devmaps (device-dax) Joao Martins
  2022-02-10 19:33 ` [PATCH v5 1/5] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
  2022-02-10 19:33 ` [PATCH v5 2/5] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
@ 2022-02-10 19:33 ` Joao Martins
  2022-02-10 19:33 ` [PATCH v5 4/5] mm/sparse-vmemmap: improve memory savings for compound devmaps Joao Martins
  2022-02-10 19:33 ` [PATCH v5 5/5] mm/page_alloc: reuse tail struct pages " Joao Martins
  4 siblings, 0 replies; 18+ messages in thread
From: Joao Martins @ 2022-02-10 19:33 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, Christoph Hellwig, nvdimm, linux-doc,
	Joao Martins

In preparation for device-dax to use the hugetlbfs compound page tail
deduplication technique, move the comment block explanation into a
common place in Documentation/vm.

Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/vm/index.rst         |   1 +
 Documentation/vm/vmemmap_dedup.rst | 175 +++++++++++++++++++++++++++++
 mm/hugetlb_vmemmap.c               | 168 +--------------------------
 3 files changed, 177 insertions(+), 167 deletions(-)
 create mode 100644 Documentation/vm/vmemmap_dedup.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index 44365c4574a3..2fb612bb72c9 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -37,5 +37,6 @@ algorithms.  If you are looking for advice on simply allocating memory, see the
    transhuge
    unevictable-lru
    vmalloced-kernel-stacks
+   vmemmap_dedup
    z3fold
    zsmalloc
diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
new file mode 100644
index 000000000000..8143b2ce414d
--- /dev/null
+++ b/Documentation/vm/vmemmap_dedup.rst
@@ -0,0 +1,175 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _vmemmap_dedup:
+
+==================================
+Free some vmemmap pages of HugeTLB
+==================================
+
+The struct page structures (page structs) are used to describe a physical
+page frame. By default, there is a one-to-one mapping from a page frame to
+it's corresponding page struct.
+
+HugeTLB pages consist of multiple base page size pages and is supported by
+many architectures. See hugetlbpage.rst in the Documentation directory for
+more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB
+are currently supported. Since the base page size on x86 is 4KB, a 2MB
+HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
+4096 base pages. For each base page, there is a corresponding page struct.
+
+Within the HugeTLB subsystem, only the first 4 page structs are used to
+contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
+this upper limit. The only 'useful' information in the remaining page structs
+is the compound_head field, and this field is the same for all tail pages.
+
+By removing redundant page structs for HugeTLB pages, memory can be returned
+to the buddy allocator for other uses.
+
+Different architectures support different HugeTLB pages. For example, the
+following table is the HugeTLB page size supported by x86 and arm64
+architectures. Because arm64 supports 4k, 16k, and 64k base pages and
+supports contiguous entries, so it supports many kinds of sizes of HugeTLB
+page.
+
++--------------+-----------+-----------------------------------------------+
+| Architecture | Page Size |                HugeTLB Page Size              |
++--------------+-----------+-----------+-----------+-----------+-----------+
+|    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
++--------------+-----------+-----------+-----------+-----------+-----------+
+|              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
+|              +-----------+-----------+-----------+-----------+-----------+
+|    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
+|              +-----------+-----------+-----------+-----------+-----------+
+|              |   64KB    |    2MB    |  512MB    |    16GB   |           |
++--------------+-----------+-----------+-----------+-----------+-----------+
+
+When the system boot up, every HugeTLB page has more than one struct page
+structs which size is (unit: pages):
+
+   struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
+
+Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
+of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
+relationship.
+
+   HugeTLB_Size = n * PAGE_SIZE
+
+Then,
+
+   struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
+               = n * sizeof(struct page) / PAGE_SIZE
+
+We can use huge mapping at the pud/pmd level for the HugeTLB page.
+
+For the HugeTLB page of the pmd level mapping, then
+
+   struct_size = n * sizeof(struct page) / PAGE_SIZE
+               = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
+               = sizeof(struct page) / sizeof(pte_t)
+               = 64 / 8
+               = 8 (pages)
+
+Where n is how many pte entries which one page can contains. So the value of
+n is (PAGE_SIZE / sizeof(pte_t)).
+
+This optimization only supports 64-bit system, so the value of sizeof(pte_t)
+is 8. And this optimization also applicable only when the size of struct page
+is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
+x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
+size of struct page structs of it is 8 page frames which size depends on the
+size of the base page.
+
+For the HugeTLB page of the pud level mapping, then
+
+   struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
+               = PAGE_SIZE / 8 * 8 (pages)
+               = PAGE_SIZE (pages)
+
+Where the struct_size(pmd) is the size of the struct page structs of a
+HugeTLB page of the pmd level mapping.
+
+E.g.: A 2MB HugeTLB page on x86_64 consists in 8 page frames while 1GB
+HugeTLB page consists in 4096.
+
+Next, we take the pmd level mapping of the HugeTLB page as an example to
+show the internal implementation of this optimization. There are 8 pages
+struct page structs associated with a HugeTLB page which is pmd mapped.
+
+Here is how things look before optimization.
+
+    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
+ +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ |           |                     |     0     | -------------> |     0     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     1     | -------------> |     1     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     2     | -------------> |     2     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     3     | -------------> |     3     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     4     | -------------> |     4     |
+ |    PMD    |                     +-----------+                +-----------+
+ |   level   |                     |     5     | -------------> |     5     |
+ |  mapping  |                     +-----------+                +-----------+
+ |           |                     |     6     | -------------> |     6     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     7     | -------------> |     7     |
+ |           |                     +-----------+                +-----------+
+ |           |
+ |           |
+ |           |
+ +-----------+
+
+The value of page->compound_head is the same for all tail pages. The first
+page of page structs (page 0) associated with the HugeTLB page contains the 4
+page structs necessary to describe the HugeTLB. The only use of the remaining
+pages of page structs (page 1 to page 7) is to point to page->compound_head.
+Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of page structs
+will be used for each HugeTLB page. This will allow us to free the remaining
+7 pages to the buddy allocator.
+
+Here is how things look after remapping.
+
+    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
+ +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ |           |                     |     0     | -------------> |     0     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
+ |           |                     +-----------+                  | | | | | |
+ |           |                     |     2     | -----------------+ | | | | |
+ |           |                     +-----------+                    | | | | |
+ |           |                     |     3     | -------------------+ | | | |
+ |           |                     +-----------+                      | | | |
+ |           |                     |     4     | ---------------------+ | | |
+ |    PMD    |                     +-----------+                        | | |
+ |   level   |                     |     5     | -----------------------+ | |
+ |  mapping  |                     +-----------+                          | |
+ |           |                     |     6     | -------------------------+ |
+ |           |                     +-----------+                            |
+ |           |                     |     7     | ---------------------------+
+ |           |                     +-----------+
+ |           |
+ |           |
+ |           |
+ +-----------+
+
+When a HugeTLB is freed to the buddy system, we should allocate 7 pages for
+vmemmap pages and restore the previous mapping relationship.
+
+For the HugeTLB page of the pud level mapping. It is similar to the former.
+We also can use this approach to free (PAGE_SIZE - 1) vmemmap pages.
+
+Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
+(e.g. aarch64) provides a contiguous bit in the translation table entries
+that hints to the MMU to indicate that it is one of a contiguous set of
+entries that can be cached in a single TLB entry.
+
+The contiguous bit is used to increase the mapping size at the pmd and pte
+(last) level. So this type of HugeTLB page can be optimized only when its
+size of the struct page structs is greater than 1 page.
+
+Notice: The head vmemmap page is not freed to the buddy allocator and all
+tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
+more than one struct page struct with PG_head (e.g. 8 per 2 MB HugeTLB page)
+associated with each HugeTLB page. The compound_head() can handle this
+correctly (more details refer to the comment above compound_head()).
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 791626983c2e..dbaa837b19c6 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -6,173 +6,7 @@
  *
  *     Author: Muchun Song <songmuchun@bytedance.com>
  *
- * The struct page structures (page structs) are used to describe a physical
- * page frame. By default, there is a one-to-one mapping from a page frame to
- * it's corresponding page struct.
- *
- * HugeTLB pages consist of multiple base page size pages and is supported by
- * many architectures. See hugetlbpage.rst in the Documentation directory for
- * more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB
- * are currently supported. Since the base page size on x86 is 4KB, a 2MB
- * HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
- * 4096 base pages. For each base page, there is a corresponding page struct.
- *
- * Within the HugeTLB subsystem, only the first 4 page structs are used to
- * contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
- * this upper limit. The only 'useful' information in the remaining page structs
- * is the compound_head field, and this field is the same for all tail pages.
- *
- * By removing redundant page structs for HugeTLB pages, memory can be returned
- * to the buddy allocator for other uses.
- *
- * Different architectures support different HugeTLB pages. For example, the
- * following table is the HugeTLB page size supported by x86 and arm64
- * architectures. Because arm64 supports 4k, 16k, and 64k base pages and
- * supports contiguous entries, so it supports many kinds of sizes of HugeTLB
- * page.
- *
- * +--------------+-----------+-----------------------------------------------+
- * | Architecture | Page Size |                HugeTLB Page Size              |
- * +--------------+-----------+-----------+-----------+-----------+-----------+
- * |    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
- * +--------------+-----------+-----------+-----------+-----------+-----------+
- * |              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
- * |              +-----------+-----------+-----------+-----------+-----------+
- * |    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
- * |              +-----------+-----------+-----------+-----------+-----------+
- * |              |   64KB    |    2MB    |  512MB    |    16GB   |           |
- * +--------------+-----------+-----------+-----------+-----------+-----------+
- *
- * When the system boot up, every HugeTLB page has more than one struct page
- * structs which size is (unit: pages):
- *
- *    struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
- *
- * Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
- * of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
- * relationship.
- *
- *    HugeTLB_Size = n * PAGE_SIZE
- *
- * Then,
- *
- *    struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
- *                = n * sizeof(struct page) / PAGE_SIZE
- *
- * We can use huge mapping at the pud/pmd level for the HugeTLB page.
- *
- * For the HugeTLB page of the pmd level mapping, then
- *
- *    struct_size = n * sizeof(struct page) / PAGE_SIZE
- *                = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
- *                = sizeof(struct page) / sizeof(pte_t)
- *                = 64 / 8
- *                = 8 (pages)
- *
- * Where n is how many pte entries which one page can contains. So the value of
- * n is (PAGE_SIZE / sizeof(pte_t)).
- *
- * This optimization only supports 64-bit system, so the value of sizeof(pte_t)
- * is 8. And this optimization also applicable only when the size of struct page
- * is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
- * x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
- * size of struct page structs of it is 8 page frames which size depends on the
- * size of the base page.
- *
- * For the HugeTLB page of the pud level mapping, then
- *
- *    struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
- *                = PAGE_SIZE / 8 * 8 (pages)
- *                = PAGE_SIZE (pages)
- *
- * Where the struct_size(pmd) is the size of the struct page structs of a
- * HugeTLB page of the pmd level mapping.
- *
- * E.g.: A 2MB HugeTLB page on x86_64 consists in 8 page frames while 1GB
- * HugeTLB page consists in 4096.
- *
- * Next, we take the pmd level mapping of the HugeTLB page as an example to
- * show the internal implementation of this optimization. There are 8 pages
- * struct page structs associated with a HugeTLB page which is pmd mapped.
- *
- * Here is how things look before optimization.
- *
- *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
- * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
- * |           |                     |     0     | -------------> |     0     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     1     | -------------> |     1     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     2     | -------------> |     2     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     3     | -------------> |     3     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     4     | -------------> |     4     |
- * |    PMD    |                     +-----------+                +-----------+
- * |   level   |                     |     5     | -------------> |     5     |
- * |  mapping  |                     +-----------+                +-----------+
- * |           |                     |     6     | -------------> |     6     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     7     | -------------> |     7     |
- * |           |                     +-----------+                +-----------+
- * |           |
- * |           |
- * |           |
- * +-----------+
- *
- * The value of page->compound_head is the same for all tail pages. The first
- * page of page structs (page 0) associated with the HugeTLB page contains the 4
- * page structs necessary to describe the HugeTLB. The only use of the remaining
- * pages of page structs (page 1 to page 7) is to point to page->compound_head.
- * Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of page structs
- * will be used for each HugeTLB page. This will allow us to free the remaining
- * 7 pages to the buddy allocator.
- *
- * Here is how things look after remapping.
- *
- *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
- * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
- * |           |                     |     0     | -------------> |     0     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
- * |           |                     +-----------+                  | | | | | |
- * |           |                     |     2     | -----------------+ | | | | |
- * |           |                     +-----------+                    | | | | |
- * |           |                     |     3     | -------------------+ | | | |
- * |           |                     +-----------+                      | | | |
- * |           |                     |     4     | ---------------------+ | | |
- * |    PMD    |                     +-----------+                        | | |
- * |   level   |                     |     5     | -----------------------+ | |
- * |  mapping  |                     +-----------+                          | |
- * |           |                     |     6     | -------------------------+ |
- * |           |                     +-----------+                            |
- * |           |                     |     7     | ---------------------------+
- * |           |                     +-----------+
- * |           |
- * |           |
- * |           |
- * +-----------+
- *
- * When a HugeTLB is freed to the buddy system, we should allocate 7 pages for
- * vmemmap pages and restore the previous mapping relationship.
- *
- * For the HugeTLB page of the pud level mapping. It is similar to the former.
- * We also can use this approach to free (PAGE_SIZE - 1) vmemmap pages.
- *
- * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
- * (e.g. aarch64) provides a contiguous bit in the translation table entries
- * that hints to the MMU to indicate that it is one of a contiguous set of
- * entries that can be cached in a single TLB entry.
- *
- * The contiguous bit is used to increase the mapping size at the pmd and pte
- * (last) level. So this type of HugeTLB page can be optimized only when its
- * size of the struct page structs is greater than 1 page.
- *
- * Notice: The head vmemmap page is not freed to the buddy allocator and all
- * tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
- * more than one struct page struct with PG_head (e.g. 8 per 2 MB HugeTLB page)
- * associated with each HugeTLB page. The compound_head() can handle this
- * correctly (more details refer to the comment above compound_head()).
+ * See Documentation/vm/vmemmap_dedup.rst
  */
 #define pr_fmt(fmt)	"HugeTLB: " fmt
 
-- 
2.17.2



* [PATCH v5 4/5] mm/sparse-vmemmap: improve memory savings for compound devmaps
  2022-02-10 19:33 [PATCH v5 0/5] sparse-vmemmap: memory savings for compound devmaps (device-dax) Joao Martins
                   ` (2 preceding siblings ...)
  2022-02-10 19:33 ` [PATCH v5 3/5] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
@ 2022-02-10 19:33 ` Joao Martins
  2022-02-11  7:54   ` Muchun Song
  2022-02-10 19:33 ` [PATCH v5 5/5] mm/page_alloc: reuse tail struct pages " Joao Martins
  4 siblings, 1 reply; 18+ messages in thread
From: Joao Martins @ 2022-02-10 19:33 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, Christoph Hellwig, nvdimm, linux-doc,
	Joao Martins

A compound devmap is a dev_pagemap with @vmemmap_shift > 0 and it
means that pages are mapped at a given huge page alignment and use
compound pages as opposed to order-0 pages.

Take advantage of the fact that most tail pages look the same (except
the first two) to minimize struct page overhead. Allocate a separate
page for the vmemmap area which contains the head page and a separate one
for the next 64 pages. The rest of the subsections then reuse this tail
vmemmap page to initialize the rest of the tail pages.

Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
when initializing a compound devmap with a big enough @vmemmap_shift (e.g.
a 1G PUD) it may cross multiple sections.
consult @pgmap so that multiple sections that all map the same tail
data can refer back to the first copy of that data for a given
gigantic page.
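
For example, taking the common 128M section size on x86-64:

	1G compound page / 128M section = 8 sections

Only the range that starts the compound page populates a fresh tail vmemmap
page; the later sections look it up through the already populated page tables
and reuse it (see compound_section_tail_page() below).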

On compound devmaps with 2M alignment, this mechanism lets 6 pages be
saved out of the 8 PFNs needed to map the subsection's
512 struct pages. On a 1G compound devmap it saves
4094 pages.

Altmap isn't supported yet, given various restrictions in the altmap pfn
allocator; thus fall back to the already in use vmemmap_populate().  It
is worth noting that altmap for devmap mappings was there to relieve the
pressure of inordinate amounts of memmap space to map terabytes of pmem.
With compound pages the motivation for altmaps for pmem gets reduced.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 Documentation/vm/vmemmap_dedup.rst |  56 ++++++++++-
 include/linux/mm.h                 |   2 +-
 mm/memremap.c                      |   1 +
 mm/sparse-vmemmap.c                | 150 +++++++++++++++++++++++++++--
 4 files changed, 197 insertions(+), 12 deletions(-)

diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
index 8143b2ce414d..de958bbbf78c 100644
--- a/Documentation/vm/vmemmap_dedup.rst
+++ b/Documentation/vm/vmemmap_dedup.rst
@@ -2,9 +2,12 @@
 
 .. _vmemmap_dedup:
 
-==================================
-Free some vmemmap pages of HugeTLB
-==================================
+=========================================
+A vmemmap diet for HugeTLB and Device DAX
+=========================================
+
+HugeTLB
+=======
 
 The struct page structures (page structs) are used to describe a physical
 page frame. By default, there is a one-to-one mapping from a page frame to
@@ -173,3 +176,50 @@ tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
 more than one struct page struct with PG_head (e.g. 8 per 2 MB HugeTLB page)
 associated with each HugeTLB page. The compound_head() can handle this
 correctly (more details refer to the comment above compound_head()).
+
+Device DAX
+==========
+
+The device-dax interface uses the same tail deduplication technique explained
+in the previous chapter, except when used with the vmemmap in
+the device (altmap).
+
+The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
+PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
+
+The differences with HugeTLB are relatively minor.
+
+It only uses 3 page structs for storing all information as opposed
+to 4 on HugeTLB pages.
+
+There's no remapping of vmemmap given that device-dax memory is not part of
+System RAM ranges initialized at boot. Thus the tail page deduplication
+happens at a later stage when we populate the sections. HugeTLB reuses
+the head vmemmap page, whereas device-dax reuses the tail vmemmap page.
+This results in only half of the savings compared to HugeTLB.
+
+Deduplicated tail pages are not mapped read-only.
+
+Here is how things look on device-dax after the sections are populated:
+
+ +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ |           |                     |     0     | -------------> |     0     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     1     | -------------> |     1     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
+ |           |                     +-----------+                   | | | | |
+ |           |                     |     3     | ------------------+ | | | |
+ |           |                     +-----------+                     | | | |
+ |           |                     |     4     | --------------------+ | | |
+ |    PMD    |                     +-----------+                       | | |
+ |   level   |                     |     5     | ----------------------+ | |
+ |  mapping  |                     +-----------+                         | |
+ |           |                     |     6     | ------------------------+ |
+ |           |                     +-----------+                           |
+ |           |                     |     7     | --------------------------+
+ |           |                     +-----------+
+ |           |
+ |           |
+ |           |
+ +-----------+
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6a439582f63..0b7028b9ff2f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3172,7 +3172,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-			    struct vmem_altmap *altmap);
+			    struct vmem_altmap *altmap, struct page *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
 struct vmem_altmap;
 void *vmemmap_alloc_block_buf(unsigned long size, int node,
diff --git a/mm/memremap.c b/mm/memremap.c
index 71b8d42d820c..a0ef95f09397 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -323,6 +323,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 {
 	struct mhp_params params = {
 		.altmap = pgmap_altmap(pgmap),
+		.pgmap = pgmap,
 		.pgprot = PAGE_KERNEL,
 	};
 	const int nr_range = pgmap->nr_range;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index e7be2ef4454b..2e2b063ed285 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -533,16 +533,31 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 }
 
 pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-				       struct vmem_altmap *altmap)
+				       struct vmem_altmap *altmap,
+				       struct page *block)
 {
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte)) {
 		pte_t entry;
 		void *p;
 
-		p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
-		if (!p)
-			return NULL;
+		if (!block) {
+			p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+			if (!p)
+				return NULL;
+		} else {
+			/*
+			 * When a PTE/PMD entry is freed from the init_mm
+			 * there's a free_pages() call to this page allocated
+			 * above. Thus this get_page() is paired with the
+			 * put_page_testzero() on the freeing path.
+			 * This can only be called by certain ZONE_DEVICE paths,
+			 * and through vmemmap_populate_compound_pages() when
+			 * slab is available.
+			 */
+			get_page(block);
+			p = page_to_virt(block);
+		}
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
 		set_pte_at(&init_mm, addr, pte, entry);
 	}
@@ -609,7 +624,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 }
 
 static int __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap)
+					      struct vmem_altmap *altmap,
+					      struct page *reuse, struct page **page)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -629,11 +645,13 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pmd = vmemmap_pmd_populate(pud, addr, node);
 	if (!pmd)
 		return -ENOMEM;
-	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
 	if (!pte)
 		return -ENOMEM;
 	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
 
+	if (page)
+		*page = pte_page(*pte);
 	return 0;
 }
 
@@ -644,10 +662,120 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 	int rc;
 
 	for (; addr < end; addr += PAGE_SIZE) {
-		rc = vmemmap_populate_address(addr, node, altmap);
+		rc = vmemmap_populate_address(addr, node, altmap, NULL, NULL);
 		if (rc)
 			return rc;
+	}
+
+	return 0;
+}
+
+static int __meminit vmemmap_populate_range(unsigned long start,
+					    unsigned long end,
+					    int node, struct page *page)
+{
+	unsigned long addr = start;
+	int rc;
 
+	for (; addr < end; addr += PAGE_SIZE) {
+		rc = vmemmap_populate_address(addr, node, NULL, page, NULL);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
+static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
+						  struct page **page)
+{
+	return vmemmap_populate_address(addr, node, NULL, NULL, page);
+}
+
+/*
+ * For compound pages bigger than section size (e.g. x86 1G compound
+ * pages with 2M subsection size) fill the rest of sections as tail
+ * pages.
+ *
+ * Note that memremap_pages() resets @nr_range value and will increment
+ * it after each successful range onlining. Thus the value of @nr_range
+ * at section memmap populate corresponds to the in-progress range
+ * being onlined here.
+ */
+static bool __meminit reuse_compound_section(unsigned long start_pfn,
+					     struct dev_pagemap *pgmap)
+{
+	unsigned long nr_pages = pgmap_vmemmap_nr(pgmap);
+	unsigned long offset = start_pfn -
+		PHYS_PFN(pgmap->ranges[pgmap->nr_range].start);
+
+	return !IS_ALIGNED(offset, nr_pages) && nr_pages > PAGES_PER_SUBSECTION;
+}
+
+static struct page * __meminit compound_section_tail_page(unsigned long addr)
+{
+	pte_t *ptep;
+
+	addr -= PAGE_SIZE;
+
+	/*
+	 * Assuming sections are populated sequentially, the previous section's
+	 * page data can be reused.
+	 */
+	ptep = pte_offset_kernel(pmd_off_k(addr), addr);
+	if (!ptep)
+		return NULL;
+
+	return pte_page(*ptep);
+}
+
+static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
+						     unsigned long start,
+						     unsigned long end, int node,
+						     struct dev_pagemap *pgmap)
+{
+	unsigned long size, addr;
+
+	if (reuse_compound_section(start_pfn, pgmap)) {
+		struct page *page;
+
+		page = compound_section_tail_page(start);
+		if (!page)
+			return -ENOMEM;
+
+		/*
+		 * Reuse the page that was populated in the prior iteration
+		 * with just tail struct pages.
+		 */
+		return vmemmap_populate_range(start, end, node, page);
+	}
+
+	size = min(end - start, pgmap_vmemmap_nr(pgmap) * sizeof(struct page));
+	for (addr = start; addr < end; addr += size) {
+		unsigned long next = addr, last = addr + size;
+		struct page *block;
+		int rc;
+
+		/* Populate the head page vmemmap page */
+		rc = vmemmap_populate_page(addr, node, NULL);
+		if (rc)
+			return rc;
+
+		/* Populate the tail pages vmemmap page */
+		block = NULL;
+		next = addr + PAGE_SIZE;
+		rc = vmemmap_populate_page(next, node, &block);
+		if (rc)
+			return rc;
+
+		/*
+		 * Reuse the previous page for the rest of tail pages
+		 * See layout diagram in Documentation/vm/vmemmap_dedup.rst
+		 */
+		next += PAGE_SIZE;
+		rc = vmemmap_populate_range(next, last, node, block);
+		if (rc)
+			return rc;
 	}
 
 	return 0;
@@ -659,12 +787,18 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
+	int r;
 
 	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
 		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
 		return NULL;
 
-	if (vmemmap_populate(start, end, nid, altmap))
+	if (pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
+		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
+	else
+		r = vmemmap_populate(start, end, nid, altmap);
+
+	if (r < 0)
 		return NULL;
 
 	return pfn_to_page(pfn);
-- 
2.17.2



* [PATCH v5 5/5] mm/page_alloc: reuse tail struct pages for compound devmaps
  2022-02-10 19:33 [PATCH v5 0/5] sparse-vmemmap: memory savings for compound devmaps (device-dax) Joao Martins
                   ` (3 preceding siblings ...)
  2022-02-10 19:33 ` [PATCH v5 4/5] mm/sparse-vmemmap: improve memory savings for compound devmaps Joao Martins
@ 2022-02-10 19:33 ` Joao Martins
  2022-02-11  5:07   ` Muchun Song
  4 siblings, 1 reply; 18+ messages in thread
From: Joao Martins @ 2022-02-10 19:33 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Matthew Wilcox, Jason Gunthorpe,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, Christoph Hellwig, nvdimm, linux-doc,
	Joao Martins

Currently memmap_init_zone_device() ends up initializing 32768 pages
when it only needs to initialize 128 given tail page reuse. That
number is worse with 1GB compound pages, 262144 instead of 128. Update
memmap_init_zone_device() to skip redundant initialization, detailed
below.

When a pgmap @vmemmap_shift is set, all pages are mapped at a given
huge page alignment and use compound pages to describe them, as opposed
to a struct page per 4K page.

With @vmemmap_shift > 0 and when struct pages are stored in RAM
(!altmap) most tail pages are reused. Consequently, the number of
unique struct pages is a lot smaller than the total number of struct
pages being mapped.

The altmap path is left alone since it does not support memory savings
with compound devmaps.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/page_alloc.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cface1d38093..c10df2fd0ec2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6666,6 +6666,20 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	}
 }
 
+/*
+ * With compound page geometry and when struct pages are stored in RAM most
+ * tail pages are reused. Consequently, the number of unique struct pages to
+ * initialize is a lot smaller than the total number of struct pages being
+ * mapped. This is a paired / mild layering violation with explicit knowledge
+ * of how the sparse-vmemmap internals handle compound pages in the absence
+ * of an altmap. See vmemmap_populate_compound_pages().
+ */
+static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
+					      unsigned long nr_pages)
+{
+	return !altmap ? 2 * (PAGE_SIZE/sizeof(struct page)) : nr_pages;
+}
+
 static void __ref memmap_init_compound(struct page *head,
 				       unsigned long head_pfn,
 				       unsigned long zone_idx, int nid,
@@ -6730,7 +6744,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 			continue;
 
 		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
-				     pfns_per_compound);
+				     compound_nr_pages(altmap, pfns_per_compound));
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 5/5] mm/page_alloc: reuse tail struct pages for compound devmaps
  2022-02-10 19:33 ` [PATCH v5 5/5] mm/page_alloc: reuse tail struct pages " Joao Martins
@ 2022-02-11  5:07   ` Muchun Song
  2022-02-11 12:48     ` Joao Martins
  0 siblings, 1 reply; 18+ messages in thread
From: Muchun Song @ 2022-02-11  5:07 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Matthew Wilcox, Jason Gunthorpe, Jane Chu, Mike Kravetz,
	Andrew Morton, Jonathan Corbet, Christoph Hellwig, nvdimm,
	Linux Doc Mailing List

On Fri, Feb 11, 2022 at 3:34 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Currently memmap_init_zone_device() ends up initializing 32768 pages
> when it only needs to initialize 128 given tail page reuse. That
> number is worse with 1GB compound pages, 262144 instead of 128. Update
> memmap_init_zone_device() to skip redundant initialization, detailed
> below.
>
> When a pgmap @vmemmap_shift is set, all pages are mapped at a given
> huge page alignment and use compound pages to describe them as opposed
> to a struct per 4K.
>
> With @vmemmap_shift > 0 and when struct pages are stored in ram
> (!altmap) most tail pages are reused. Consequently, the amount of
> unique struct pages is a lot smaller that the total amount of struct
> pages being mapped.
>
> The altmap path is left alone since it does not support memory savings
> based on compound pages devmap.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/page_alloc.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cface1d38093..c10df2fd0ec2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6666,6 +6666,20 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>         }
>  }
>
> +/*
> + * With compound page geometry and when struct pages are stored in ram most
> + * tail pages are reused. Consequently, the amount of unique struct pages to
> + * initialize is a lot smaller that the total amount of struct pages being
> + * mapped. This is a paired / mild layering violation with explicit knowledge
> + * of how the sparse_vmemmap internals handle compound pages in the lack
> + * of an altmap. See vmemmap_populate_compound_pages().
> + */
> +static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
> +                                             unsigned long nr_pages)
> +{
> +       return !altmap ? 2 * (PAGE_SIZE/sizeof(struct page)) : nr_pages;
> +}
> +

This means only the first 2 pages will be initialized; the remaining 6 or 4094
pages will not be.  In the HugeTLB case, those tail pages are mapped read-only
to catch invalid usage of tail pages (e.g. write operations). Quick question:
should we also do something similar for DAX?
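
(Spelling out the numbers, assuming 4K base pages and a 64-byte struct page --
both of which are assumptions on my part:

    4096 / 64 = 64 struct pages per vmemmap page
    2 vmemmap pages -> 128 struct pages actually initialized per compound page
    2M compound page: 512 struct pages -> 8 vmemmap pages, 6 of them reused
    1G compound page: 262144 struct pages -> 4096 vmemmap pages, 4094 reused)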

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 2/5] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper
  2022-02-10 19:33 ` [PATCH v5 2/5] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
@ 2022-02-11  7:54   ` Muchun Song
  0 siblings, 0 replies; 18+ messages in thread
From: Muchun Song @ 2022-02-11  7:54 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Matthew Wilcox, Jason Gunthorpe, Jane Chu, Mike Kravetz,
	Andrew Morton, Jonathan Corbet, Christoph Hellwig, nvdimm,
	Linux Doc Mailing List

On Fri, Feb 11, 2022 at 3:34 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> In preparation for describing a memmap with compound pages, move the
> actual pte population logic into a separate function
> vmemmap_populate_address() and have vmemmap_populate_basepages() walk
> through all base pages it needs to populate.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/sparse-vmemmap.c | 51 ++++++++++++++++++++++++++++-----------------
>  1 file changed, 32 insertions(+), 19 deletions(-)
>
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index c506f77cff23..e7be2ef4454b 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -608,33 +608,46 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>         return pgd;
>  }
>
> -int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> -                                        int node, struct vmem_altmap *altmap)
> +static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> +                                             struct vmem_altmap *altmap)

How about making it return a "pte_t *" instead of an int, with a NULL return
meaning -ENOMEM?
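
Roughly like this, just to sketch the shape of the idea (the body below only
mirrors the helper introduced in this patch, so the real conversion may end up
looking different):

static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
						  struct vmem_altmap *altmap)
{
	pgd_t *pgd;
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;

	pgd = vmemmap_pgd_populate(addr, node);
	if (!pgd)
		return NULL;
	p4d = vmemmap_p4d_populate(pgd, addr, node);
	if (!p4d)
		return NULL;
	pud = vmemmap_pud_populate(p4d, addr, node);
	if (!pud)
		return NULL;
	pmd = vmemmap_pmd_populate(pud, addr, node);
	if (!pmd)
		return NULL;
	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
	if (!pte)
		return NULL;
	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);

	return pte;
}

Callers would then treat a NULL return as -ENOMEM, and anyone that needs the
struct page can derive it with pte_page(*pte).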

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/5] mm/sparse-vmemmap: improve memory savings for compound devmaps
  2022-02-10 19:33 ` [PATCH v5 4/5] mm/sparse-vmemmap: improve memory savings for compound devmaps Joao Martins
@ 2022-02-11  7:54   ` Muchun Song
  2022-02-11 12:37     ` Joao Martins
  0 siblings, 1 reply; 18+ messages in thread
From: Muchun Song @ 2022-02-11  7:54 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Matthew Wilcox, Jason Gunthorpe, Jane Chu, Mike Kravetz,
	Andrew Morton, Jonathan Corbet, Christoph Hellwig, nvdimm,
	Linux Doc Mailing List

On Fri, Feb 11, 2022 at 3:34 AM Joao Martins <joao.m.martins@oracle.com> wrote:
[...]
>  pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> -                                      struct vmem_altmap *altmap)
> +                                      struct vmem_altmap *altmap,
> +                                      struct page *block)

Why not use the name "reuse" instead of "block"?
"reuse" seems clearer.

>  {
>         pte_t *pte = pte_offset_kernel(pmd, addr);
>         if (pte_none(*pte)) {
>                 pte_t entry;
>                 void *p;
>
> -               p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> -               if (!p)
> -                       return NULL;
> +               if (!block) {
> +                       p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> +                       if (!p)
> +                               return NULL;
> +               } else {
> +                       /*
> +                        * When a PTE/PMD entry is freed from the init_mm
> +                        * there's a a free_pages() call to this page allocated
> +                        * above. Thus this get_page() is paired with the
> +                        * put_page_testzero() on the freeing path.
> +                        * This can only called by certain ZONE_DEVICE path,
> +                        * and through vmemmap_populate_compound_pages() when
> +                        * slab is available.
> +                        */
> +                       get_page(block);
> +                       p = page_to_virt(block);
> +               }
>                 entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
>                 set_pte_at(&init_mm, addr, pte, entry);
>         }
> @@ -609,7 +624,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>  }
>
>  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> -                                             struct vmem_altmap *altmap)
> +                                             struct vmem_altmap *altmap,
> +                                             struct page *reuse, struct page **page)

We can remove the last argument (struct page **page) if we change
the return type to "pte_t *".  Simpler, don't you think?

>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
> @@ -629,11 +645,13 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>         pmd = vmemmap_pmd_populate(pud, addr, node);
>         if (!pmd)
>                 return -ENOMEM;
> -       pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> +       pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
>         if (!pte)
>                 return -ENOMEM;
>         vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
>
> +       if (page)
> +               *page = pte_page(*pte);
>         return 0;
>  }
>
> @@ -644,10 +662,120 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>         int rc;
>
>         for (; addr < end; addr += PAGE_SIZE) {
> -               rc = vmemmap_populate_address(addr, node, altmap);
> +               rc = vmemmap_populate_address(addr, node, altmap, NULL, NULL);
>                 if (rc)
>                         return rc;
> +       }
> +
> +       return 0;
> +}
> +
> +static int __meminit vmemmap_populate_range(unsigned long start,
> +                                           unsigned long end,
> +                                           int node, struct page *page)
> +{
> +       unsigned long addr = start;
> +       int rc;
>
> +       for (; addr < end; addr += PAGE_SIZE) {
> +               rc = vmemmap_populate_address(addr, node, NULL, page, NULL);
> +               if (rc)
> +                       return rc;
> +       }
> +
> +       return 0;
> +}
> +
> +static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
> +                                                 struct page **page)
> +{
> +       return vmemmap_populate_address(addr, node, NULL, NULL, page);
> +}
> +
> +/*
> + * For compound pages bigger than section size (e.g. x86 1G compound
> + * pages with 2M subsection size) fill the rest of sections as tail
> + * pages.
> + *
> + * Note that memremap_pages() resets @nr_range value and will increment
> + * it after each range successful onlining. Thus the value or @nr_range
> + * at section memmap populate corresponds to the in-progress range
> + * being onlined here.
> + */
> +static bool __meminit reuse_compound_section(unsigned long start_pfn,
> +                                            struct dev_pagemap *pgmap)
> +{
> +       unsigned long nr_pages = pgmap_vmemmap_nr(pgmap);
> +       unsigned long offset = start_pfn -
> +               PHYS_PFN(pgmap->ranges[pgmap->nr_range].start);
> +
> +       return !IS_ALIGNED(offset, nr_pages) && nr_pages > PAGES_PER_SUBSECTION;
> +}
> +
> +static struct page * __meminit compound_section_tail_page(unsigned long addr)
> +{
> +       pte_t *ptep;
> +
> +       addr -= PAGE_SIZE;
> +
> +       /*
> +        * Assuming sections are populated sequentially, the previous section's
> +        * page data can be reused.
> +        */
> +       ptep = pte_offset_kernel(pmd_off_k(addr), addr);
> +       if (!ptep)
> +               return NULL;
> +
> +       return pte_page(*ptep);
> +}
> +
> +static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
> +                                                    unsigned long start,
> +                                                    unsigned long end, int node,
> +                                                    struct dev_pagemap *pgmap)
> +{
> +       unsigned long size, addr;
> +
> +       if (reuse_compound_section(start_pfn, pgmap)) {
> +               struct page *page;
> +
> +               page = compound_section_tail_page(start);
> +               if (!page)
> +                       return -ENOMEM;
> +
> +               /*
> +                * Reuse the page that was populated in the prior iteration
> +                * with just tail struct pages.
> +                */
> +               return vmemmap_populate_range(start, end, node, page);
> +       }
> +
> +       size = min(end - start, pgmap_vmemmap_nr(pgmap) * sizeof(struct page));
> +       for (addr = start; addr < end; addr += size) {
> +               unsigned long next = addr, last = addr + size;
> +               struct page *block;
> +               int rc;
> +
> +               /* Populate the head page vmemmap page */
> +               rc = vmemmap_populate_page(addr, node, NULL);
> +               if (rc)
> +                       return rc;
> +
> +               /* Populate the tail pages vmemmap page */
> +               block = NULL;
> +               next = addr + PAGE_SIZE;
> +               rc = vmemmap_populate_page(next, node, &block);
> +               if (rc)
> +                       return rc;
> +
> +               /*
> +                * Reuse the previous page for the rest of tail pages
> +                * See layout diagram in Documentation/vm/vmemmap_dedup.rst
> +                */
> +               next += PAGE_SIZE;
> +               rc = vmemmap_populate_range(next, last, node, block);
> +               if (rc)
> +                       return rc;
>         }
>
>         return 0;
> @@ -659,12 +787,18 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>  {
>         unsigned long start = (unsigned long) pfn_to_page(pfn);
>         unsigned long end = start + nr_pages * sizeof(struct page);
> +       int r;
>
>         if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
>                 !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>                 return NULL;
>
> -       if (vmemmap_populate(start, end, nid, altmap))
> +       if (pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)

Should we add a check like "is_power_of_2(sizeof(struct page))", since
this optimization only applies when struct pages do not cross page
boundaries?
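
For example, just to sketch where the check could go (this is not the posted
code, only an illustration):

	if (pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap &&
	    is_power_of_2(sizeof(struct page)))
		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
	else
		r = vmemmap_populate(start, end, nid, altmap);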

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 1/5] mm/sparse-vmemmap: add a pgmap argument to section activation
  2022-02-10 19:33 ` [PATCH v5 1/5] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
@ 2022-02-11  8:03   ` Muchun Song
  2022-02-11 12:37     ` Joao Martins
  0 siblings, 1 reply; 18+ messages in thread
From: Muchun Song @ 2022-02-11  8:03 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Matthew Wilcox, Jason Gunthorpe, Jane Chu, Mike Kravetz,
	Andrew Morton, Jonathan Corbet, Christoph Hellwig, nvdimm,
	Linux Doc Mailing List

On Fri, Feb 11, 2022 at 3:34 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> In support of using compound pages for devmap mappings, plumb the pgmap
> down to the vmemmap_populate implementation. Note that while altmap is
> retrievable from pgmap the memory hotplug code passes altmap without
> pgmap[*], so both need to be independently plumbed.
>
> So in addition to @altmap, pass @pgmap to sparse section populate
> functions namely:
>
>         sparse_add_section
>           section_activate
>             populate_section_memmap
>               __populate_section_memmap
>
> Passing @pgmap allows __populate_section_memmap() to both fetch the
> vmemmap_shift in which memmap metadata is created for and also to let
> sparse-vmemmap fetch pgmap ranges to co-relate to a given section and pick
> whether to just reuse tail pages from past onlined sections.
>
> While at it, fix the kdoc for @altmap for sparse_add_section().
>
> [*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 1/5] mm/sparse-vmemmap: add a pgmap argument to section activation
  2022-02-11  8:03   ` Muchun Song
@ 2022-02-11 12:37     ` Joao Martins
  0 siblings, 0 replies; 18+ messages in thread
From: Joao Martins @ 2022-02-11 12:37 UTC (permalink / raw)
  To: Muchun Song
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Matthew Wilcox, Jason Gunthorpe, Jane Chu, Mike Kravetz,
	Andrew Morton, Jonathan Corbet, Christoph Hellwig, nvdimm,
	Linux Doc Mailing List

On 2/11/22 08:03, Muchun Song wrote:
> On Fri, Feb 11, 2022 at 3:34 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> In support of using compound pages for devmap mappings, plumb the pgmap
>> down to the vmemmap_populate implementation. Note that while altmap is
>> retrievable from pgmap the memory hotplug code passes altmap without
>> pgmap[*], so both need to be independently plumbed.
>>
>> So in addition to @altmap, pass @pgmap to sparse section populate
>> functions namely:
>>
>>         sparse_add_section
>>           section_activate
>>             populate_section_memmap
>>               __populate_section_memmap
>>
>> Passing @pgmap allows __populate_section_memmap() to both fetch the
>> vmemmap_shift in which memmap metadata is created for and also to let
>> sparse-vmemmap fetch pgmap ranges to co-relate to a given section and pick
>> whether to just reuse tail pages from past onlined sections.
>>
>> While at it, fix the kdoc for @altmap for sparse_add_section().
>>
>> [*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> 
> Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thank you!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/5] mm/sparse-vmemmap: improve memory savings for compound devmaps
  2022-02-11  7:54   ` Muchun Song
@ 2022-02-11 12:37     ` Joao Martins
  2022-02-12 10:08       ` Muchun Song
  0 siblings, 1 reply; 18+ messages in thread
From: Joao Martins @ 2022-02-11 12:37 UTC (permalink / raw)
  To: Muchun Song
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Matthew Wilcox, Jason Gunthorpe, Jane Chu, Mike Kravetz,
	Andrew Morton, Jonathan Corbet, Christoph Hellwig, nvdimm,
	Linux Doc Mailing List

On 2/11/22 07:54, Muchun Song wrote:
> On Fri, Feb 11, 2022 at 3:34 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> [...]
>>  pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>> -                                      struct vmem_altmap *altmap)
>> +                                      struct vmem_altmap *altmap,
>> +                                      struct page *block)
> 
> Why not use the name of "reuse" instead of "block"?
> Seems like "reuse" is more clear.
> 
Good idea, let me rename that to @reuse.

>>  {
>>         pte_t *pte = pte_offset_kernel(pmd, addr);
>>         if (pte_none(*pte)) {
>>                 pte_t entry;
>>                 void *p;
>>
>> -               p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
>> -               if (!p)
>> -                       return NULL;
>> +               if (!block) {
>> +                       p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
>> +                       if (!p)
>> +                               return NULL;
>> +               } else {
>> +                       /*
>> +                        * When a PTE/PMD entry is freed from the init_mm
>> +                        * there's a a free_pages() call to this page allocated
>> +                        * above. Thus this get_page() is paired with the
>> +                        * put_page_testzero() on the freeing path.
>> +                        * This can only called by certain ZONE_DEVICE path,
>> +                        * and through vmemmap_populate_compound_pages() when
>> +                        * slab is available.
>> +                        */
>> +                       get_page(block);
>> +                       p = page_to_virt(block);
>> +               }
>>                 entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
>>                 set_pte_at(&init_mm, addr, pte, entry);
>>         }
>> @@ -609,7 +624,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>>  }
>>
>>  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> -                                             struct vmem_altmap *altmap)
>> +                                             struct vmem_altmap *altmap,
>> +                                             struct page *reuse, struct page **page)
> 
> We can remove the last argument (struct page **page) if we change
> the return type to "pte_t *".  More simple, don't you think?
> 

Hmmm, perhaps it is simpler, especially since the only error code is -ENOMEM.

Although perhaps what we want is a `struct page *` rather than a pte.

>>  {
>>         pgd_t *pgd;
>>         p4d_t *p4d;
>> @@ -629,11 +645,13 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>>         pmd = vmemmap_pmd_populate(pud, addr, node);
>>         if (!pmd)
>>                 return -ENOMEM;
>> -       pte = vmemmap_pte_populate(pmd, addr, node, altmap);
>> +       pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
>>         if (!pte)
>>                 return -ENOMEM;
>>         vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
>>
>> +       if (page)
>> +               *page = pte_page(*pte);
>>         return 0;
>>  }
>>
>> @@ -644,10 +662,120 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>>         int rc;
>>
>>         for (; addr < end; addr += PAGE_SIZE) {
>> -               rc = vmemmap_populate_address(addr, node, altmap);
>> +               rc = vmemmap_populate_address(addr, node, altmap, NULL, NULL);
>>                 if (rc)
>>                         return rc;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static int __meminit vmemmap_populate_range(unsigned long start,
>> +                                           unsigned long end,
>> +                                           int node, struct page *page)
>> +{
>> +       unsigned long addr = start;
>> +       int rc;
>>
>> +       for (; addr < end; addr += PAGE_SIZE) {
>> +               rc = vmemmap_populate_address(addr, node, NULL, page, NULL);
>> +               if (rc)
>> +                       return rc;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>> +                                                 struct page **page)
>> +{
>> +       return vmemmap_populate_address(addr, node, NULL, NULL, page);
>> +}
>> +
>> +/*
>> + * For compound pages bigger than section size (e.g. x86 1G compound
>> + * pages with 2M subsection size) fill the rest of sections as tail
>> + * pages.
>> + *
>> + * Note that memremap_pages() resets @nr_range value and will increment
>> + * it after each range successful onlining. Thus the value or @nr_range
>> + * at section memmap populate corresponds to the in-progress range
>> + * being onlined here.
>> + */
>> +static bool __meminit reuse_compound_section(unsigned long start_pfn,
>> +                                            struct dev_pagemap *pgmap)
>> +{
>> +       unsigned long nr_pages = pgmap_vmemmap_nr(pgmap);
>> +       unsigned long offset = start_pfn -
>> +               PHYS_PFN(pgmap->ranges[pgmap->nr_range].start);
>> +
>> +       return !IS_ALIGNED(offset, nr_pages) && nr_pages > PAGES_PER_SUBSECTION;
>> +}
>> +
>> +static struct page * __meminit compound_section_tail_page(unsigned long addr)
>> +{
>> +       pte_t *ptep;
>> +
>> +       addr -= PAGE_SIZE;
>> +
>> +       /*
>> +        * Assuming sections are populated sequentially, the previous section's
>> +        * page data can be reused.
>> +        */
>> +       ptep = pte_offset_kernel(pmd_off_k(addr), addr);
>> +       if (!ptep)
>> +               return NULL;
>> +
>> +       return pte_page(*ptep);
>> +}
>> +
>> +static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>> +                                                    unsigned long start,
>> +                                                    unsigned long end, int node,
>> +                                                    struct dev_pagemap *pgmap)
>> +{
>> +       unsigned long size, addr;
>> +
>> +       if (reuse_compound_section(start_pfn, pgmap)) {
>> +               struct page *page;
>> +
>> +               page = compound_section_tail_page(start);
>> +               if (!page)
>> +                       return -ENOMEM;
>> +
>> +               /*
>> +                * Reuse the page that was populated in the prior iteration
>> +                * with just tail struct pages.
>> +                */
>> +               return vmemmap_populate_range(start, end, node, page);
>> +       }
>> +
>> +       size = min(end - start, pgmap_vmemmap_nr(pgmap) * sizeof(struct page));
>> +       for (addr = start; addr < end; addr += size) {
>> +               unsigned long next = addr, last = addr + size;
>> +               struct page *block;
>> +               int rc;
>> +
>> +               /* Populate the head page vmemmap page */
>> +               rc = vmemmap_populate_page(addr, node, NULL);
>> +               if (rc)
>> +                       return rc;
>> +
>> +               /* Populate the tail pages vmemmap page */
>> +               block = NULL;
>> +               next = addr + PAGE_SIZE;
>> +               rc = vmemmap_populate_page(next, node, &block);
>> +               if (rc)
>> +                       return rc;
>> +
>> +               /*
>> +                * Reuse the previous page for the rest of tail pages
>> +                * See layout diagram in Documentation/vm/vmemmap_dedup.rst
>> +                */
>> +               next += PAGE_SIZE;
>> +               rc = vmemmap_populate_range(next, last, node, block);
>> +               if (rc)
>> +                       return rc;
>>         }
>>
>>         return 0;
>> @@ -659,12 +787,18 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>>  {
>>         unsigned long start = (unsigned long) pfn_to_page(pfn);
>>         unsigned long end = start + nr_pages * sizeof(struct page);
>> +       int r;
>>
>>         if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
>>                 !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>>                 return NULL;
>>
>> -       if (vmemmap_populate(start, end, nid, altmap))
>> +       if (pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
> 
> Should we add a judgment like "is_power_of_2(sizeof(struct page))" since
> this optimization is only applied when the size of the struct page does not
> cross page boundaries?

Totally missed that -- let me make that adjustment.

Can I ask on which architectures/configurations this happens?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 5/5] mm/page_alloc: reuse tail struct pages for compound devmaps
  2022-02-11  5:07   ` Muchun Song
@ 2022-02-11 12:48     ` Joao Martins
  2022-02-12 11:11       ` Muchun Song
  0 siblings, 1 reply; 18+ messages in thread
From: Joao Martins @ 2022-02-11 12:48 UTC (permalink / raw)
  To: Muchun Song
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Matthew Wilcox, Jason Gunthorpe, Jane Chu, Mike Kravetz,
	Andrew Morton, Jonathan Corbet, Christoph Hellwig, nvdimm,
	Linux Doc Mailing List

On 2/11/22 05:07, Muchun Song wrote:
> On Fri, Feb 11, 2022 at 3:34 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index cface1d38093..c10df2fd0ec2 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6666,6 +6666,20 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
>>         }
>>  }
>>
>> +/*
>> + * With compound page geometry and when struct pages are stored in ram most
>> + * tail pages are reused. Consequently, the amount of unique struct pages to
>> + * initialize is a lot smaller that the total amount of struct pages being
>> + * mapped. This is a paired / mild layering violation with explicit knowledge
>> + * of how the sparse_vmemmap internals handle compound pages in the lack
>> + * of an altmap. See vmemmap_populate_compound_pages().
>> + */
>> +static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
>> +                                             unsigned long nr_pages)
>> +{
>> +       return !altmap ? 2 * (PAGE_SIZE/sizeof(struct page)) : nr_pages;
>> +}
>> +
> 
> This means only the first 2 pages will be modified, the reset 6 or 4094 pages
> do not.  In the HugeTLB case, those tail pages are mapped with read-only
> to catch invalid usage on tail pages (e.g. write operations). Quick question:
> should we also do similar things on DAX?
> 
What stands in the way of marking deduplicated pages as read-only is one
particular CONFIG_DEBUG_VM feature, namely page_init_poison(). HugeTLB
gets its memory from the page allocator or from already pre-populated (at boot)
system RAM sections, and needs those to be 'given back' before they can be
hotunplugged. So I guess it never goes through page_init_poison(). With
device-dax, though, the sections are populated and dedicated to device-dax when
hotplugged, and then torn down on hotunplug when the last devdax user drops the
page reference.

So page_init_poison() is called on those two occasions. It actually writes to
whole sections of memmap, not just one page. So either I gate the read-only page
protection on CONFIG_DEBUG_VM=n (which feels very wrong), or I detect inside
page_init_poison() that the caller is trying to init compound devmap backed
struct pages that were already watermarked (i.e. essentially when the pfn offset
between the passed page and the head page is bigger than 128).
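
As a rough sketch of that second option (purely hypothetical, never posted as a
patch; the helper name and the threshold are only illustrative):

/*
 * Struct pages past the first two vmemmap pages of a compound devmap are
 * backed by a reused vmemmap page, so poisoning them would scribble over
 * the shared copy.
 */
static bool vmemmap_dedup_backed(const struct page *head,
				 const struct page *page)
{
	/* two vmemmap pages worth of struct pages, i.e. 128 with 4K/64B */
	return (unsigned long)(page - head) >= 2 * (PAGE_SIZE / sizeof(struct page));
}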

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/5] mm/sparse-vmemmap: improve memory savings for compound devmaps
  2022-02-11 12:37     ` Joao Martins
@ 2022-02-12 10:08       ` Muchun Song
  2022-02-12 14:49         ` Muchun Song
  2022-02-14 10:55         ` Joao Martins
  0 siblings, 2 replies; 18+ messages in thread
From: Muchun Song @ 2022-02-12 10:08 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Matthew Wilcox, Jason Gunthorpe, Jane Chu, Mike Kravetz,
	Andrew Morton, Jonathan Corbet, Christoph Hellwig, nvdimm,
	Linux Doc Mailing List

On Fri, Feb 11, 2022 at 8:37 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 2/11/22 07:54, Muchun Song wrote:
> > On Fri, Feb 11, 2022 at 3:34 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> > [...]
> >>  pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> >> -                                      struct vmem_altmap *altmap)
> >> +                                      struct vmem_altmap *altmap,
> >> +                                      struct page *block)
> >
> > Why not use the name of "reuse" instead of "block"?
> > Seems like "reuse" is more clear.
> >
> Good idea, let me rename that to @reuse.
>
> >>  {
> >>         pte_t *pte = pte_offset_kernel(pmd, addr);
> >>         if (pte_none(*pte)) {
> >>                 pte_t entry;
> >>                 void *p;
> >>
> >> -               p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> >> -               if (!p)
> >> -                       return NULL;
> >> +               if (!block) {
> >> +                       p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> >> +                       if (!p)
> >> +                               return NULL;
> >> +               } else {
> >> +                       /*
> >> +                        * When a PTE/PMD entry is freed from the init_mm
> >> +                        * there's a a free_pages() call to this page allocated
> >> +                        * above. Thus this get_page() is paired with the
> >> +                        * put_page_testzero() on the freeing path.
> >> +                        * This can only called by certain ZONE_DEVICE path,
> >> +                        * and through vmemmap_populate_compound_pages() when
> >> +                        * slab is available.
> >> +                        */
> >> +                       get_page(block);
> >> +                       p = page_to_virt(block);
> >> +               }
> >>                 entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
> >>                 set_pte_at(&init_mm, addr, pte, entry);
> >>         }
> >> @@ -609,7 +624,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
> >>  }
> >>
> >>  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> >> -                                             struct vmem_altmap *altmap)
> >> +                                             struct vmem_altmap *altmap,
> >> +                                             struct page *reuse, struct page **page)
> >
> > We can remove the last argument (struct page **page) if we change
> > the return type to "pte_t *".  More simple, don't you think?
> >
>
> Hmmm, perhaps it is simpler, specially provided the only error code is ENOMEM.
>
> Albeit perhaps what we want is a `struct page *` rather than a pte.

The caller can extract `struct page` from a pte.

[...]

> >> -       if (vmemmap_populate(start, end, nid, altmap))
> >> +       if (pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
> >
> > Should we add a judgment like "is_power_of_2(sizeof(struct page))" since
> > this optimization is only applied when the size of the struct page does not
> > cross page boundaries?
>
> Totally miss that -- let me make that adjustment.
>
> Can I ask which architectures/conditions this happens?

E.g. arm64 when !CONFIG_MEMCG.

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 5/5] mm/page_alloc: reuse tail struct pages for compound devmaps
  2022-02-11 12:48     ` Joao Martins
@ 2022-02-12 11:11       ` Muchun Song
  0 siblings, 0 replies; 18+ messages in thread
From: Muchun Song @ 2022-02-12 11:11 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Matthew Wilcox, Jason Gunthorpe, Jane Chu, Mike Kravetz,
	Andrew Morton, Jonathan Corbet, Christoph Hellwig, nvdimm,
	Linux Doc Mailing List

On Fri, Feb 11, 2022 at 8:48 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 2/11/22 05:07, Muchun Song wrote:
> > On Fri, Feb 11, 2022 at 3:34 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index cface1d38093..c10df2fd0ec2 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -6666,6 +6666,20 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
> >>         }
> >>  }
> >>
> >> +/*
> >> + * With compound page geometry and when struct pages are stored in ram most
> >> + * tail pages are reused. Consequently, the amount of unique struct pages to
> >> + * initialize is a lot smaller that the total amount of struct pages being
> >> + * mapped. This is a paired / mild layering violation with explicit knowledge
> >> + * of how the sparse_vmemmap internals handle compound pages in the lack
> >> + * of an altmap. See vmemmap_populate_compound_pages().
> >> + */
> >> +static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
> >> +                                             unsigned long nr_pages)
> >> +{
> >> +       return !altmap ? 2 * (PAGE_SIZE/sizeof(struct page)) : nr_pages;
> >> +}
> >> +
> >
> > This means only the first 2 pages will be modified, the reset 6 or 4094 pages
> > do not.  In the HugeTLB case, those tail pages are mapped with read-only
> > to catch invalid usage on tail pages (e.g. write operations). Quick question:
> > should we also do similar things on DAX?
> >
> What's sort of in the way of marking deduplicated pages as read-only is one
> particular CONFIG_DEBUG_VM feature, particularly page_init_poison(). HugeTLB
> gets its memory from the page allocator of already has pre-populated (at boot)
> system RAM sections and needs those to be 'given back' before they can be
> hotunplugged. So I guess it never goes through page_init_poison(). Although
> device-dax, the sections are populated and dedicated to device-dax when
> hotplugged, and then on hotunplug when the last user devdax user drops the page
> reference.
>
> So page_init_poison() is called on those two occasions. It actually writes to
> whole sections of memmap, not just one page. So either I gate read-only page
> protection when CONFIG_DEBUG_VM=n (which feels very wrong), or I detect inside
> page_init_poison() that the caller is trying to init compound devmap backed
> struct pages that were already watermarked (i.e. essentially when pfn offset
> between passed page and head page is bigger than 128).

Got it. I hadn't realized page_init_poison() would poison the struct pages.
I agree with you that mapping them read-only would be wrong.

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/5] mm/sparse-vmemmap: improve memory savings for compound devmaps
  2022-02-12 10:08       ` Muchun Song
@ 2022-02-12 14:49         ` Muchun Song
  2022-02-14 10:57           ` Joao Martins
  2022-02-14 10:55         ` Joao Martins
  1 sibling, 1 reply; 18+ messages in thread
From: Muchun Song @ 2022-02-12 14:49 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Matthew Wilcox, Jason Gunthorpe, Jane Chu, Mike Kravetz,
	Andrew Morton, Jonathan Corbet, Christoph Hellwig, nvdimm,
	Linux Doc Mailing List

On Sat, Feb 12, 2022 at 6:08 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> On Fri, Feb 11, 2022 at 8:37 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >
> > On 2/11/22 07:54, Muchun Song wrote:
> > > On Fri, Feb 11, 2022 at 3:34 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> > > [...]
> > >>  pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> > >> -                                      struct vmem_altmap *altmap)
> > >> +                                      struct vmem_altmap *altmap,
> > >> +                                      struct page *block)
> > >
> > > Why not use the name of "reuse" instead of "block"?
> > > Seems like "reuse" is more clear.
> > >
> > Good idea, let me rename that to @reuse.
> >
> > >>  {
> > >>         pte_t *pte = pte_offset_kernel(pmd, addr);
> > >>         if (pte_none(*pte)) {
> > >>                 pte_t entry;
> > >>                 void *p;
> > >>
> > >> -               p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> > >> -               if (!p)
> > >> -                       return NULL;
> > >> +               if (!block) {
> > >> +                       p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> > >> +                       if (!p)
> > >> +                               return NULL;
> > >> +               } else {
> > >> +                       /*
> > >> +                        * When a PTE/PMD entry is freed from the init_mm
> > >> +                        * there's a a free_pages() call to this page allocated
> > >> +                        * above. Thus this get_page() is paired with the
> > >> +                        * put_page_testzero() on the freeing path.
> > >> +                        * This can only called by certain ZONE_DEVICE path,
> > >> +                        * and through vmemmap_populate_compound_pages() when
> > >> +                        * slab is available.
> > >> +                        */
> > >> +                       get_page(block);
> > >> +                       p = page_to_virt(block);
> > >> +               }
> > >>                 entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
> > >>                 set_pte_at(&init_mm, addr, pte, entry);
> > >>         }
> > >> @@ -609,7 +624,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
> > >>  }
> > >>
> > >>  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> > >> -                                             struct vmem_altmap *altmap)
> > >> +                                             struct vmem_altmap *altmap,
> > >> +                                             struct page *reuse, struct page **page)
> > >
> > > We can remove the last argument (struct page **page) if we change
> > > the return type to "pte_t *".  More simple, don't you think?
> > >
> >
> > Hmmm, perhaps it is simpler, specially provided the only error code is ENOMEM.
> >
> > Albeit perhaps what we want is a `struct page *` rather than a pte.
>
> The caller can extract `struct page` from a pte.
>
> [...]
>
> > >> -       if (vmemmap_populate(start, end, nid, altmap))
> > >> +       if (pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
> > >
> > > Should we add a judgment like "is_power_of_2(sizeof(struct page))" since
> > > this optimization is only applied when the size of the struct page does not
> > > cross page boundaries?
> >
> > Totally miss that -- let me make that adjustment.
> >
> > Can I ask which architectures/conditions this happens?
>
> E.g. arm64 when !CONFIG_MEMCG.

Plus !CONFIG_SLUB even on x86_64.
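
(For anyone wondering how that happens: struct page is usually padded out to 64
bytes, but config combinations that drop fields -- the memcg pointer, for
instance -- or that skip SLUB's struct page alignment can leave it at a
non-power-of-two size such as 56 bytes, at which point struct pages straddle
page boundaries. That is my reading of mm_types.h, so treat it as an assumption.)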

>
> Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/5] mm/sparse-vmemmap: improve memory savings for compound devmaps
  2022-02-12 10:08       ` Muchun Song
  2022-02-12 14:49         ` Muchun Song
@ 2022-02-14 10:55         ` Joao Martins
  1 sibling, 0 replies; 18+ messages in thread
From: Joao Martins @ 2022-02-14 10:55 UTC (permalink / raw)
  To: Muchun Song
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Matthew Wilcox, Jason Gunthorpe, Jane Chu, Mike Kravetz,
	Andrew Morton, Jonathan Corbet, Christoph Hellwig, nvdimm,
	Linux Doc Mailing List

On 2/12/22 10:08, Muchun Song wrote:
> On Fri, Feb 11, 2022 at 8:37 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 2/11/22 07:54, Muchun Song wrote:
>>> On Fri, Feb 11, 2022 at 3:34 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> @@ -609,7 +624,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>>>>  }
>>>>
>>>>  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>>>> -                                             struct vmem_altmap *altmap)
>>>> +                                             struct vmem_altmap *altmap,
>>>> +                                             struct page *reuse, struct page **page)
>>>
>>> We can remove the last argument (struct page **page) if we change
>>> the return type to "pte_t *".  More simple, don't you think?
>>
>> Hmmm, perhaps it is simpler, specially provided the only error code is ENOMEM.
>>
>> Albeit perhaps what we want is a `struct page *` rather than a pte.
> 
> The caller can extract `struct page` from a pte.
> 

Yeap, we do that here already. Anyway, I can try switching to the style you suggest
and see how it looks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 4/5] mm/sparse-vmemmap: improve memory savings for compound devmaps
  2022-02-12 14:49         ` Muchun Song
@ 2022-02-14 10:57           ` Joao Martins
  0 siblings, 0 replies; 18+ messages in thread
From: Joao Martins @ 2022-02-14 10:57 UTC (permalink / raw)
  To: Muchun Song
  Cc: Linux Memory Management List, Dan Williams, Vishal Verma,
	Matthew Wilcox, Jason Gunthorpe, Jane Chu, Mike Kravetz,
	Andrew Morton, Jonathan Corbet, Christoph Hellwig, nvdimm,
	Linux Doc Mailing List

On 2/12/22 14:49, Muchun Song wrote:
> On Sat, Feb 12, 2022 at 6:08 PM Muchun Song <songmuchun@bytedance.com> wrote:
>> On Fri, Feb 11, 2022 at 8:37 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>> On 2/11/22 07:54, Muchun Song wrote:
>>>> On Fri, Feb 11, 2022 at 3:34 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>> -       if (vmemmap_populate(start, end, nid, altmap))
>>>>> +       if (pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
>>>>
>>>> Should we add a judgment like "is_power_of_2(sizeof(struct page))" since
>>>> this optimization is only applied when the size of the struct page does not
>>>> cross page boundaries?
>>>
>>> Totally miss that -- let me make that adjustment.
>>>
>>> Can I ask which architectures/conditions this happens?
>>
>> E.g. arm64 when !CONFIG_MEMCG.
> 
> Plus !CONFIG_SLUB even on x86_64.

Oh, thanks for the ref -- hadn't realized that this was
possible on arm64/x86.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-02-14 10:57 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-10 19:33 [PATCH v5 0/5] sparse-vmemmap: memory savings for compound devmaps (device-dax) Joao Martins
2022-02-10 19:33 ` [PATCH v5 1/5] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
2022-02-11  8:03   ` Muchun Song
2022-02-11 12:37     ` Joao Martins
2022-02-10 19:33 ` [PATCH v5 2/5] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
2022-02-11  7:54   ` Muchun Song
2022-02-10 19:33 ` [PATCH v5 3/5] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
2022-02-10 19:33 ` [PATCH v5 4/5] mm/sparse-vmemmap: improve memory savings for compound devmaps Joao Martins
2022-02-11  7:54   ` Muchun Song
2022-02-11 12:37     ` Joao Martins
2022-02-12 10:08       ` Muchun Song
2022-02-12 14:49         ` Muchun Song
2022-02-14 10:57           ` Joao Martins
2022-02-14 10:55         ` Joao Martins
2022-02-10 19:33 ` [PATCH v5 5/5] mm/page_alloc: reuse tail struct pages " Joao Martins
2022-02-11  5:07   ` Muchun Song
2022-02-11 12:48     ` Joao Martins
2022-02-12 11:11       ` Muchun Song
