* [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax
@ 2021-08-27 14:58 Joao Martins
  2021-08-27 14:58 ` [PATCH v4 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
                   ` (13 more replies)
  0 siblings, 14 replies; 48+ messages in thread
From: Joao Martins @ 2021-08-27 14:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Changes since v3[0]:

 * Collect Dan's Reviewed-by on patches 1-5,8,9,11
 * Collect Muchun's Reviewed-by on patches 1, 2, 11
 * Reorder patches to first introduce compound pages in ZONE_DEVICE with
 device-dax (for pmem) as the first user (patches 1-8), followed by the
 sparse-vmemmap changes that minimize struct page overhead for devmap (patches 9-14)
 * Eliminate remnant @align references to use @geometry (Dan)
 * Convert mentions of 'compound pagemap' to 'compound devmap' throughout
   the series to avoid the impression that this work conflicts with or refers
   to anything folio or pagemap related.
 * Delete pgmap_pfn_geometry() on patch 4
   and rework other patches to use pgmap_geometry() instead (Dan)
 * Convert @geometry to be a number of pages rather than page size in patch 4 (Dan)
 * Make pgmap_geometry() more readable (Christoph)
 * Simplify pgmap refcount pfn computation in memremap_pages() (Christoph)
 * Rework memmap_init_compound() in patch 4 to use the same style as
 memmap_init_zone_device i.e. iterating over PFNs, rather than struct pages (Dan)
 * Add comment on devmap prep_compound_head callsite explaining why it needs
 to be used after first+second tail pages have been initialized (Dan, Jane)
 * Initialize tail page refcount to zero in patch 4
 * Make sure pfn_next() iterates over compound pages (rather than base pages) in
 patch 4 to tackle the zone_device elevated page refcount.
 [ Note these last two bullet points above are unneeded once this patch is merged:
   https://lore.kernel.org/linux-mm/20210825034828.12927-3-alex.sierra@amd.com/ ]
 * Remove usage of ternary operator when computing @end in gup_device_huge() in patch 8 (Dan)
 * Remove pinned_head variable in patch 8
 * Remove put_dev_pagemap() need for compound case as that is now fixed for the general case
 in patch 8
 * Switch to PageHead() instead of PageCompound() as we only work with either base pages
 or head pages in patch 8 (Matthew)
 * Fix kdoc of @altmap and improve kdoc for @pgmap in patch 9 (Dan)
 * Fix up missing return in vmemmap_populate_address() in patch 10
 * Change error handling style in all patches (Dan)
 * Change title of vmemmap_dedup.rst to be more representative of the purpose in patch 12 (Dan)
 * Move some of the section and subsection tail page reuse code into helpers
 reuse_compound_section() and compound_section_tail_page() for readability in patch 12 (Dan)
 * Commit description fixes for clarity in various patches (Dan)
 * Add pgmap_geometry_order() helper and
   drop unneeded geometry_size, order variables in patch 12
 * Drop unneeded byte based computation to be PFN in patch 12
 * Handle the dynamic dax region properly when ensuring a stable dev_dax->pgmap in patch 6.
 * Add a compound_nr_pages() helper and use it in memmap_init_zone_device to calculate
 the number of unique struct pages to initialize depending on @altmap existence in patch 13 (Dan)
 * Add compound_section_tail_huge_page() for the tail page PMD reuse in patch 14 (Dan)
 * Reword cover letter.

[0] https://lore.kernel.org/linux-mm/20210617184507.3662-1-joao.m.martins@oracle.com/

Full changelog of previous versions at the bottom of cover letter.
---
This series attempts to minimize 'struct page' overhead by
pursuing a similar approach to Muchun Song's series "Free some vmemmap
pages of hugetlb page"[0] (now merged for v5.14), but applied to devmap/ZONE_DEVICE.

[0] https://lore.kernel.org/linux-mm/20210308102807.59745-1-songmuchun@bytedance.com/

The link above describes it quite nicely, but the idea is to reuse/deduplicate the
tail page vmemmap areas, in particular the areas which only describe tail pages.
A vmemmap page describes 64 struct pages, so the first vmemmap page for a given
ZONE_DEVICE compound page contains the head page and 63 tail pages. The second
vmemmap page contains only tail pages, and that's what gets reused across
the rest of the subsection/section. The bigger the page size, the bigger the
savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages).

This series also takes one step further for 1G pages and *also* deduplicates the
PMD pages which only contain tail pages, which allows keeping parity with the
current hugepage-based vmemmap. This further lets us more than halve the overhead
with 1G pages (40M -> 16M per TB).

In terms of savings, per 1TB of memory, the struct page cost would go down
with a compound devmap (worked example for the 2M case below):

* with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory)
* with 1G pages we lose 16MB instead of 16G (0.0014% instead of 1.5% of total memory)
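
A worked example of where the 2M numbers come from (assuming 4K base pages
and a 64-byte struct page):

    2M compound page = 512 struct pages * 64 bytes  = 8 vmemmap pages
    with deduplication we keep the head vmemmap page plus one tail vmemmap
    page (reused for the remaining 6)               = 2 vmemmap pages (save 6)
    per 1TB of 2M pages: 16G of struct pages * 2/8  = 4G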

Along the way I've extended it past 'struct page' overhead, *trying* to address a
few performance issues we knew about for pmem, specifically
{pin,get}_user_pages_fast() with device-dax VMAs, which are really
slow even with the fast variants. THP is great with the -fast variants but all
except hugetlbfs perform rather poorly on non-fast gup. I deferred the
__get_user_pages() improvements to a small follow-up series I have stashed, as
that is orthogonal to device-dax (THP suffers from the same syndrome).

To summarize what the series does:

Patch 1: Prepare hwpoisoning to work with dax compound pages.

Patches 2-7: Prepare devmap infra to use compound pages (for device-dax).
Specifically, split the current utility function prep_compound_page()
into head and tail counterparts and use those two helpers where appropriate to
take advantage of caches being warm after __init_single_page(). Have
memmap_init_zone_device() initialize its metadata as compound pages, thus
introducing a new devmap property known as geometry which
outlines how the vmemmap is structured (defaults to base pages as done today).
Finally, enable device-dax to set the devmap @geometry to a value
based on its own @align property. The @geometry getter routine returns 1 by
default (which is today's case of base pages in devmap) and the use of a
compound devmap is optional. Starting with device-dax (*not* fsdax) we enable
it by default, but for fsdax and the rest of the devmap users there's no change.
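
For illustration, this is roughly how a ZONE_DEVICE user opts into a compound
devmap, mirroring what patch 7 does for device-dax (@geometry is a number of
base pages):

        pgmap->type = MEMORY_DEVICE_GENERIC;
        /* e.g. a 2M @align gives a geometry of 512 base pages */
        if (dev_dax->align > PAGE_SIZE)
                pgmap->geometry = dev_dax->align >> PAGE_SHIFT;
        dev_dax->pgmap = pgmap;
        addr = devm_memremap_pages(dev, pgmap);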

Patch 8: Optimize grabbing page refcounts given that we
are working with compound pages, i.e. do 1 increment to the head
page for a given set of N subpages as opposed to N individual writes.
{get,pin}_user_pages_fast() for zone_device with a compound devmap consequently
improves considerably with DRAM-stored struct pages. It also *greatly*
improves pinning with altmap. Results with gup_test (a simplified sketch of the
refcount change follows the numbers):

                                                   before     after
    (16G get_user_pages_fast 2M page size)         ~59 ms -> ~6.1 ms
    (16G pin_user_pages_fast 2M page size)         ~87 ms -> ~6.2 ms
    (16G get_user_pages_fast altmap 2M page size) ~494 ms -> ~9 ms
    (16G pin_user_pages_fast altmap 2M page size) ~494 ms -> ~10 ms

    altmap performance gets especially interesting when pinning a pmem dimm:

                                                   before     after
    (128G get_user_pages_fast 2M page size)         ~492 ms -> ~49 ms
    (128G pin_user_pages_fast 2M page size)         ~493 ms -> ~50 ms
    (128G get_user_pages_fast altmap 2M page size)  ~3.91 s -> ~70 ms
    (128G pin_user_pages_fast altmap 2M page size)  ~3.97 s -> ~74 ms
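
Conceptually, the fast-gup device path goes from one refcount increment per
base page to a single atomic addition on the compound head; a simplified
sketch of the idea (not the exact gup code):

        /* before: N atomic increments, one per base page */
        for (i = 0; i < refs; i++)
                try_grab_page(pfn_to_page(pfn + i), flags);

        /* after: one atomic addition of @refs on the compound head */
        head = compound_head(pfn_to_page(pfn));
        try_grab_compound_head(head, refs, flags);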

Patches 9-14: These patches take care of the struct page savings (also referred
to here as tail-page/vmemmap deduplication). Much like Muchun's series, we reuse
the PTE (and PMD) tail page vmemmap areas across a given geometry (which is
seeded by the @align property of dax-core code) and enable memremap to
initialize the ZONE_DEVICE pages as compound pages of a given @align order. The
main difference, though, is that contrary to the hugetlbfs series, there's no
pre-existing vmemmap for the area, because we are populating it as opposed to
remapping it. IOW there is no freeing of pages of an already initialized vmemmap
as in the hugetlbfs case, which simplifies the logic (besides not being
arch-specific). After these, there's also a quite visible speedup in region
bootstrap of the pmem vmemmap with DRAM-backed struct pages, given that we
initialize fewer struct pages depending on the page size -- because fewer pages
are unique and most are tail pages (with bigger geometry sizes). altmap sees no
significant difference in bootstrap AFAICT.
Patch 14 comes last as it's an improvement, not mandated for the initial
functionality. Also move the very nice docs of hugetlb_vmemmap.c into a
Documentation/vm/ entry and adjust them to the DAX case.

    NVDIMM namespace bootstrap improves from ~268-358 ms to ~80-110ms / <1ms on 128G NVDIMMs
    with 2M and 1G respectively. And the required struct page capacity will be 3.8x / 1071x
    smaller for 2M and 1G respectively.

Tested on x86 with 1TB+ of pmem (alongside registering it with RDMA with and
without altmap), alongside gup_test selftests with dynamic dax regions and
static dax regions. Coupled with ndctl unit tests for both kinds of device-dax
regions (static/dynamic) that exercise all of this.
Note: Build-tested arm64 and ppc64.

I have deferred the __get_user_pages() patch to outside this series
(https://lore.kernel.org/linux-mm/20201208172901.17384-11-joao.m.martins@oracle.com/),
as I found a simpler way to address it that is also applicable to
THP. I will submit that as a follow-up to this.

Patches apply on top of linux-next tag next-20210827 (commit 5e63226c7228).

Comments and suggestions very much appreciated!

Older Changelog,

v2 -> v3[3]:
 * Collect Mike's Ack on patch 2 (Mike)
 * Collect Naoya's Reviewed-by on patch 1 (Naoya)
 * Rename compound_pagemaps.rst doc page (and its mentions) to vmemmap_dedup.rst (Mike, Muchun)
 * Rebased to next-20210714

v1[1] -> v2[2]:

 (New patches 7, 10, 11)
 * Remove occurrences of 'we' in the commit descriptions (now for real) [Dan]
 * Add comment on top of compound_head() for fsdax (Patch 1) [Dan]
 * Massage commit descriptions of cleanup/refactor patches to reflect [Dan]
 that it's in preparation for bigger infra in sparse-vmemmap. (Patch 2,3,5) [Dan]
 * Greatly improve all commit messages in terms of grammar/wording and clarity. [Dan]
 * Rename variable/helpers from dev_pagemap::align to @geometry, reflecting
 that it's not the same thing as dev_dax->align, Patch 4 [Dan]
 * Move compound page init logic into separate memmap_init_compound() helper, Patch 4 [Dan]
 * Simplify patch 9 as a result of having compound initialization differently [Dan]
 * Rename @pfn_align variable in memmap_init_zone_device to @pfns_per_compound [Dan]
 * Rename Subject of patch 6 [Dan]
 * Move hugetlb_vmemmap.c comment block to Documentation/vm Patch 7 [Dan]
 * Add some type-safety to @block and use 'struct page *' rather than
 void, Patch 8 [Dan]
 * Add some comments to less obvious parts on 1G compound page case, Patch 8 [Dan]
 * Remove vmemmap lookup function in place of
 pmd_off_k() + pte_offset_kernel() given some guarantees on section onlining
 serialization, Patch 8
 * Add a comment to get_page() mentioning where/how it is freed, Patch 8 [Dan]
 * Add docs about device-dax usage of tail dedup technique in newly added
 compound_pagemaps.rst doc entry.
 * Add cleanup patch for device-dax for ensuring dev_dax::pgmap is always set [Dan]
 * Add cleanup patch for device-dax for using ALIGN() [Dan]
 * Store pinned head in separate @pinned_head variable and fix error case, patch 13 [Dan]
 * Add comment on difference of @next value for PageCompound(), patch 13 [Dan]
 * Move PUD compound page to be last patch [Dan]
 * Add vmemmap layout for PUD compound geometry in compound_pagemaps.rst doc, patch 14 [Dan]
 * Rebased to next-20210617 

 RFC[0] -> v1:
 (New patches 1-3, 5-8 but the diffstat isn't that different)
 * Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
 * Fix/Massage commit messages to be more clear and remove the 'we' occurrences (Dan, John, Matthew)
 * Use pfn_align to be clear it's nr of pages for @align value (John, Dan)
 * Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
 * Remove the gup_device_compound_huge special path and have the same code
   work both ways while special casing when devmap page is compound (Jason, John)
 * Avoid usage of vmemmap_populate_basepages() and introduce a first class
   loop that doesn't care about passing an altmap for memmap reuse. (Dan)
 * Completely rework the vmemmap_populate_compound() to avoid the sparse_add_section
   hack into passing block across sparse_add_section calls. It's a lot easier to
   follow and more explicit in what it does.
 * Replace the vmemmap refactoring with adding a @pgmap argument and moving
   parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a result)
 * Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
 * Improve memmap_init_zone_device() to initialize compound pages when
   struct pages are cache warm. That led to an even further speedup over the
   RFC series, from 190ms -> 80-120ms. Patches 2 and 3 are the new ones
   as a result (Dan)
 * Remove PGMAP_COMPOUND and use @align as the property to detect whether
   or not to reuse vmemmap areas (Dan)

[0] https://lore.kernel.org/linux-mm/20201208172901.17384-1-joao.m.martins@oracle.com/
[1] https://lore.kernel.org/linux-mm/20210325230938.30752-1-joao.m.martins@oracle.com/
[2] https://lore.kernel.org/linux-mm/20210617184507.3662-1-joao.m.martins@oracle.com/
[3] https://lore.kernel.org/linux-mm/20210617184507.3662-1-joao.m.martins@oracle.com/

Joao Martins (14):
  memory-failure: fetch compound_head after pgmap_pfn_valid()
  mm/page_alloc: split prep_compound_page into head and tail subparts
  mm/page_alloc: refactor memmap_init_zone_device() page init
  mm/memremap: add ZONE_DEVICE support for compound pages
  device-dax: use ALIGN() for determining pgoff
  device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  device-dax: compound devmap support
  mm/gup: grab head page refcount once for group of subpages
  mm/sparse-vmemmap: add a pgmap argument to section activation
  mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to
    helper
  mm/hugetlb_vmemmap: move comment block to Documentation/vm
  mm/sparse-vmemmap: populate compound devmaps
  mm/page_alloc: reuse tail struct pages for compound devmaps
  mm/sparse-vmemmap: improve memory savings for compound pud geometry

 Documentation/vm/index.rst         |   1 +
 Documentation/vm/vmemmap_dedup.rst | 300 +++++++++++++++++++++++++++++
 drivers/dax/bus.c                  |   8 +
 drivers/dax/device.c               |  58 ++++--
 include/linux/memory_hotplug.h     |   5 +-
 include/linux/memremap.h           |  17 ++
 include/linux/mm.h                 |   8 +-
 mm/gup.c                           |  51 +++--
 mm/hugetlb_vmemmap.c               | 162 +---------------
 mm/memory-failure.c                |   6 +
 mm/memory_hotplug.c                |   3 +-
 mm/memremap.c                      |  13 +-
 mm/page_alloc.c                    | 151 +++++++++++----
 mm/sparse-vmemmap.c                | 278 +++++++++++++++++++++++---
 mm/sparse.c                        |  26 ++-
 15 files changed, 803 insertions(+), 284 deletions(-)
 create mode 100644 Documentation/vm/vmemmap_dedup.rst

-- 
2.17.1



* [PATCH v4 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
@ 2021-08-27 14:58 ` Joao Martins
  2021-08-27 14:58 ` [PATCH v4 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2021-08-27 14:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
dax_lock_page()).  For devmap with compound pages fetch the
compound_head in case a tail page memory failure is being handled.

Currently this is a nop, but with the advent of compound pages in
dev_pagemap it allows memory_failure_dev_pagemap() to keep working.

Reported-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memory-failure.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 60df8fcd0444..beee19a5aa0f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1538,6 +1538,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		goto out;
 	}
 
+	/*
+	 * Pages instantiated by device-dax (not filesystem-dax)
+	 * may be compound pages.
+	 */
+	page = compound_head(page);
+
 	/*
 	 * Prevent the inode from being freed while we are interrogating
 	 * the address_space, typically this would be handled by
-- 
2.17.1



* [PATCH v4 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts
  2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
  2021-08-27 14:58 ` [PATCH v4 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
@ 2021-08-27 14:58 ` Joao Martins
  2021-08-27 14:58 ` [PATCH v4 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2021-08-27 14:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Split the utility function prep_compound_page() into head and tail
counterparts, and use them accordingly.

This is in preparation for sharing the storage for / deduplicating
compound page metadata.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/page_alloc.c | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91edb930b8ab..2f735b2ff417 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -728,23 +728,33 @@ void free_compound_page(struct page *page)
 	free_the_page(page, compound_order(page));
 }
 
+static void prep_compound_head(struct page *page, unsigned int order)
+{
+	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
+	set_compound_order(page, order);
+	atomic_set(compound_mapcount_ptr(page), -1);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+}
+
+static void prep_compound_tail(struct page *head, int tail_idx)
+{
+	struct page *p = head + tail_idx;
+
+	p->mapping = TAIL_MAPPING;
+	set_compound_head(p, head);
+}
+
 void prep_compound_page(struct page *page, unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
 
 	__SetPageHead(page);
-	for (i = 1; i < nr_pages; i++) {
-		struct page *p = page + i;
-		p->mapping = TAIL_MAPPING;
-		set_compound_head(p, page);
-	}
+	for (i = 1; i < nr_pages; i++)
+		prep_compound_tail(page, i);
 
-	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
-	set_compound_order(page, order);
-	atomic_set(compound_mapcount_ptr(page), -1);
-	if (hpage_pincount_available(page))
-		atomic_set(compound_pincount_ptr(page), 0);
+	prep_compound_head(page, order);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
-- 
2.17.1



* [PATCH v4 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init
  2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
  2021-08-27 14:58 ` [PATCH v4 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
  2021-08-27 14:58 ` [PATCH v4 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
@ 2021-08-27 14:58 ` Joao Martins
  2021-08-27 14:58 ` [PATCH v4 04/14] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2021-08-27 14:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Move struct page init to a helper function __init_zone_device_page().

This is in preparation for sharing the storage for / deduplicating
compound page metadata.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/page_alloc.c | 74 +++++++++++++++++++++++++++----------------------
 1 file changed, 41 insertions(+), 33 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2f735b2ff417..57ef05800c06 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6570,6 +6570,46 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
 }
 
 #ifdef CONFIG_ZONE_DEVICE
+static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
+					  unsigned long zone_idx, int nid,
+					  struct dev_pagemap *pgmap)
+{
+
+	__init_single_page(page, pfn, zone_idx, nid);
+
+	/*
+	 * Mark page reserved as it will need to wait for onlining
+	 * phase for it to be fully associated with a zone.
+	 *
+	 * We can use the non-atomic __set_bit operation for setting
+	 * the flag as we are still initializing the pages.
+	 */
+	__SetPageReserved(page);
+
+	/*
+	 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
+	 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
+	 * ever freed or placed on a driver-private list.
+	 */
+	page->pgmap = pgmap;
+	page->zone_device_data = NULL;
+
+	/*
+	 * Mark the block movable so that blocks are reserved for
+	 * movable at startup. This will force kernel allocations
+	 * to reserve their blocks rather than leaking throughout
+	 * the address space during boot when many long-lived
+	 * kernel allocations are made.
+	 *
+	 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
+	 * because this is done early in section_activate()
+	 */
+	if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
+		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+		cond_resched();
+	}
+}
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6598,39 +6638,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
-		__init_single_page(page, pfn, zone_idx, nid);
-
-		/*
-		 * Mark page reserved as it will need to wait for onlining
-		 * phase for it to be fully associated with a zone.
-		 *
-		 * We can use the non-atomic __set_bit operation for setting
-		 * the flag as we are still initializing the pages.
-		 */
-		__SetPageReserved(page);
-
-		/*
-		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
-		 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
-		 * ever freed or placed on a driver-private list.
-		 */
-		page->pgmap = pgmap;
-		page->zone_device_data = NULL;
-
-		/*
-		 * Mark the block movable so that blocks are reserved for
-		 * movable at startup. This will force kernel allocations
-		 * to reserve their blocks rather than leaking throughout
-		 * the address space during boot when many long-lived
-		 * kernel allocations are made.
-		 *
-		 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
-		 * because this is done early in section_activate()
-		 */
-		if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
-			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-			cond_resched();
-		}
+		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1



* [PATCH v4 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
                   ` (2 preceding siblings ...)
  2021-08-27 14:58 ` [PATCH v4 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
@ 2021-08-27 14:58 ` Joao Martins
  2021-08-27 15:33   ` Christoph Hellwig
  2021-08-27 14:58 ` [PATCH v4 05/14] device-dax: use ALIGN() for determining pgoff Joao Martins
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2021-08-27 14:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Add a new @geometry property for struct dev_pagemap which specifies that a
devmap is composed of a set of compound pages of size @geometry, instead of
base pages. When a compound page geometry is requested, all but the first
page are initialised as tail pages instead of order-0 pages.

For certain ZONE_DEVICE users like device-dax which have a fixed page size,
this creates an opportunity to optimize GUP and GUP-fast walkers, treating
it the same way as THP or hugetlb pages.

Additionally, commit 7118fc2906e2 ("hugetlb: address ref count racing in
prep_compound_gigantic_page") removed set_page_count() because the
setting of the page ref count to zero was redundant. devmap pages don't come
from the page allocator, though, and only the head page refcount is used for
compound pages, hence initialize the tail page refcount to zero.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/memremap.h | 17 +++++++++++++++++
 mm/memremap.c            | 12 ++++++------
 mm/page_alloc.c          | 37 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 59 insertions(+), 7 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 119f130ef8f1..4b78d30c3987 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -99,6 +99,10 @@ struct dev_pagemap_ops {
  * @done: completion for @internal_ref
  * @type: memory type: see MEMORY_* in memory_hotplug.h
  * @flags: PGMAP_* flags to specify defailed behavior
+ * @geometry: structural definition of how the vmemmap metadata is populated.
+ *	A zero or 1 defaults to using base pages as the memmap metadata
+ *	representation. A bigger value will set up compound struct pages
+ *	representative of the requested geometry size.
  * @ops: method table
  * @owner: an opaque pointer identifying the entity that manages this
  *	instance.  Used by various helpers to make sure that no
@@ -114,6 +118,7 @@ struct dev_pagemap {
 	struct completion done;
 	enum memory_type type;
 	unsigned int flags;
+	unsigned long geometry;
 	const struct dev_pagemap_ops *ops;
 	void *owner;
 	int nr_range;
@@ -130,6 +135,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
 	return NULL;
 }
 
+static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
+{
+	if (pgmap && pgmap->geometry)
+		return pgmap->geometry;
+	return 1;
+}
+
+static inline unsigned long pgmap_geometry_order(struct dev_pagemap *pgmap)
+{
+	return order_base_2(pgmap_geometry(pgmap));
+}
+
 #ifdef CONFIG_ZONE_DEVICE
 bool pfn_zone_device_reserved(unsigned long pfn);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
diff --git a/mm/memremap.c b/mm/memremap.c
index 84de22c14567..99646082436f 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -102,11 +102,11 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
 	return (range->start + range_len(range)) >> PAGE_SHIFT;
 }
 
-static unsigned long pfn_next(unsigned long pfn)
+static unsigned long pfn_next(struct dev_pagemap *pgmap, unsigned long pfn)
 {
-	if (pfn % 1024 == 0)
+	if (pfn % (1024 << pgmap_geometry_order(pgmap)))
 		cond_resched();
-	return pfn + 1;
+	return pfn + pgmap_geometry(pgmap);
 }
 
 /*
@@ -130,7 +130,7 @@ bool pfn_zone_device_reserved(unsigned long pfn)
 }
 
 #define for_each_device_pfn(pfn, map, i) \
-	for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
+	for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(map, pfn))
 
 static void dev_pagemap_kill(struct dev_pagemap *pgmap)
 {
@@ -315,8 +315,8 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
 				PHYS_PFN(range->start),
 				PHYS_PFN(range_len(range)), pgmap);
-	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
-			- pfn_first(pgmap, range_id));
+	percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
+			- pfn_first(pgmap, range_id)) / pgmap_geometry(pgmap));
 	return 0;
 
 err_add_memory:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 57ef05800c06..c1497a928005 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6610,6 +6610,34 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	}
 }
 
+static void __ref memmap_init_compound(struct page *head, unsigned long head_pfn,
+				       unsigned long zone_idx, int nid,
+				       struct dev_pagemap *pgmap,
+				       unsigned long nr_pages)
+{
+	unsigned long pfn, end_pfn = head_pfn + nr_pages;
+	unsigned int order = pgmap_geometry_order(pgmap);
+
+	__SetPageHead(head);
+	for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+		prep_compound_tail(head, pfn - head_pfn);
+		set_page_count(page, 0);
+
+		/*
+		 * The first tail page stores compound_mapcount_ptr() and
+		 * compound_order() and the second tail page stores
+		 * compound_pincount_ptr(). Call prep_compound_head() after
+		 * the first and second tail pages have been initialized to
+		 * not have the data overwritten.
+		 */
+		if (pfn == head_pfn + 2)
+			prep_compound_head(head, order);
+	}
+}
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6618,6 +6646,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	unsigned long pfn, end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
+	unsigned int pfns_per_compound = pgmap_geometry(pgmap);
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
@@ -6635,10 +6664,16 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		nr_pages = end_pfn - start_pfn;
 	}
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
 		struct page *page = pfn_to_page(pfn);
 
 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+
+		if (pfns_per_compound == 1)
+			continue;
+
+		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
+				     pfns_per_compound);
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1



* [PATCH v4 05/14] device-dax: use ALIGN() for determining pgoff
  2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
                   ` (3 preceding siblings ...)
  2021-08-27 14:58 ` [PATCH v4 04/14] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
@ 2021-08-27 14:58 ` Joao Martins
  2021-08-27 14:58 ` [PATCH v4 06/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2021-08-27 14:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Rather than calculating @pgoff manually, use ALIGN() instead.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index dd8222a42808..0b82159b3564 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -234,8 +234,8 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		 * mapped. No need to consider the zero page, or racing
 		 * conflicting mappings.
 		 */
-		pgoff = linear_page_index(vmf->vma, vmf->address
-				& ~(fault_size - 1));
+		pgoff = linear_page_index(vmf->vma,
+				ALIGN(vmf->address, fault_size));
 		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
 			struct page *page;
 
-- 
2.17.1



* [PATCH v4 06/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
                   ` (4 preceding siblings ...)
  2021-08-27 14:58 ` [PATCH v4 05/14] device-dax: use ALIGN() for determining pgoff Joao Martins
@ 2021-08-27 14:58 ` Joao Martins
  2021-11-05  0:31   ` Dan Williams
  2021-08-27 14:58 ` [PATCH v4 07/14] device-dax: compound devmap support Joao Martins
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2021-08-27 14:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Right now, only static dax regions have a valid @pgmap pointer in their
struct dev_dax. The dynamic dax case, however, does not.

In preparation for device-dax compound devmap support, make sure that
the dev_dax pgmap field is set after it has been allocated and initialized.

A dynamic dax device has its @pgmap allocated at probe() and it's
managed by devm (in contrast to a static dax region, where a pgmap is provided
and the dax core kfrees it). So in addition to ensuring a valid @pgmap, clear
the pgmap when the dynamic dax device is released to avoid the same
pgmap ranges being re-requested across multiple region device reconfigs.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/bus.c    | 8 ++++++++
 drivers/dax/device.c | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 6cc4da4c713d..49dbff9ba609 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -363,6 +363,14 @@ void kill_dev_dax(struct dev_dax *dev_dax)
 
 	kill_dax(dax_dev);
 	unmap_mapping_range(inode->i_mapping, 0, 0, 1);
+
+	/*
+	 * Dynamic dax region have the pgmap allocated via dev_kzalloc()
+	 * and thus freed by devm. Clear the pgmap to not have stale pgmap
+	 * ranges on probe() from previous reconfigurations of region devices.
+	 */
+	if (!is_static(dev_dax->region))
+		dev_dax->pgmap = NULL;
 }
 EXPORT_SYMBOL_GPL(kill_dev_dax);
 
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 0b82159b3564..6e348b5f9d45 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -426,6 +426,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 	}
 
 	pgmap->type = MEMORY_DEVICE_GENERIC;
+	dev_dax->pgmap = pgmap;
+
 	addr = devm_memremap_pages(dev, pgmap);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
-- 
2.17.1



* [PATCH v4 07/14] device-dax: compound devmap support
  2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
                   ` (5 preceding siblings ...)
  2021-08-27 14:58 ` [PATCH v4 06/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
@ 2021-08-27 14:58 ` Joao Martins
  2021-11-05  0:38   ` Dan Williams
  2021-08-27 14:58 ` [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages Joao Martins
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2021-08-27 14:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Use the newly added compound devmap facility which maps the assigned dax
ranges as compound pages at a page size of @align. Currently, this means
that region/namespace bootstrap takes considerably less time, given that
considerably fewer pages are initialized.

On setups with 128G NVDIMMs the initialization with DRAM stored struct
pages improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less
than 1 msec with 1G pages.

dax devices are created with a fixed @align (huge page size) which is
also enforced at mmap() of the device. Faults consequently happen at the
@align specified at creation time, and it doesn't change throughout the dax
device lifetime. MCEs poison a whole dax huge page, and splits occur at the
configured page size.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/device.c | 56 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 43 insertions(+), 13 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 6e348b5f9d45..5d23128f9a60 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -192,6 +192,42 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 }
 #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
+static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
+			     unsigned long fault_size,
+			     struct address_space *f_mapping)
+{
+	unsigned long i;
+	pgoff_t pgoff;
+
+	pgoff = linear_page_index(vmf->vma, ALIGN(vmf->address, fault_size));
+
+	for (i = 0; i < fault_size / PAGE_SIZE; i++) {
+		struct page *page;
+
+		page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+		if (page->mapping)
+			continue;
+		page->mapping = f_mapping;
+		page->index = pgoff + i;
+	}
+}
+
+static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
+				 unsigned long fault_size,
+				 struct address_space *f_mapping)
+{
+	struct page *head;
+
+	head = pfn_to_page(pfn_t_to_pfn(pfn));
+	head = compound_head(head);
+	if (head->mapping)
+		return;
+
+	head->mapping = f_mapping;
+	head->index = linear_page_index(vmf->vma,
+			ALIGN(vmf->address, fault_size));
+}
+
 static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
@@ -225,8 +261,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 	}
 
 	if (rc == VM_FAULT_NOPAGE) {
-		unsigned long i;
-		pgoff_t pgoff;
+		struct dev_pagemap *pgmap = dev_dax->pgmap;
 
 		/*
 		 * In the device-dax case the only possibility for a
@@ -234,17 +269,10 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		 * mapped. No need to consider the zero page, or racing
 		 * conflicting mappings.
 		 */
-		pgoff = linear_page_index(vmf->vma,
-				ALIGN(vmf->address, fault_size));
-		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
-			struct page *page;
-
-			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
-			if (page->mapping)
-				continue;
-			page->mapping = filp->f_mapping;
-			page->index = pgoff + i;
-		}
+		if (pgmap_geometry(pgmap) > 1)
+			set_compound_mapping(vmf, pfn, fault_size, filp->f_mapping);
+		else
+			set_page_mapping(vmf, pfn, fault_size, filp->f_mapping);
 	}
 	dax_read_unlock(id);
 
@@ -426,6 +454,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 	}
 
 	pgmap->type = MEMORY_DEVICE_GENERIC;
+	if (dev_dax->align > PAGE_SIZE)
+		pgmap->geometry = dev_dax->align >> PAGE_SHIFT;
 	dev_dax->pgmap = pgmap;
 
 	addr = devm_memremap_pages(dev, pgmap);
-- 
2.17.1



* [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
                   ` (6 preceding siblings ...)
  2021-08-27 14:58 ` [PATCH v4 07/14] device-dax: compound devmap support Joao Martins
@ 2021-08-27 14:58 ` Joao Martins
  2021-08-27 16:25   ` Jason Gunthorpe
  2021-10-08 11:54   ` Jason Gunthorpe
  2021-08-27 14:58 ` [PATCH v4 09/14] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
                   ` (5 subsequent siblings)
  13 siblings, 2 replies; 48+ messages in thread
From: Joao Martins @ 2021-08-27 14:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Use try_grab_compound_head() for device-dax GUP when configured with a
compound devmap.

Rather than incrementing the refcount for each page, do one atomic
addition for all the pages to be pinned.

Performance measured by gup_benchmark improves considerably for
get_user_pages_fast() and pin_user_pages_fast() with NVDIMMs:

 $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S [-u,-a] -n 512 -w
(get_user_pages_fast 2M pages) ~59 ms -> ~6.1 ms
(pin_user_pages_fast 2M pages) ~87 ms -> ~6.2 ms
[altmap]
(get_user_pages_fast 2M pages) ~494 ms -> ~9 ms
(pin_user_pages_fast 2M pages) ~494 ms -> ~10 ms

 $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S [-u,-a] -n 512 -w
(get_user_pages_fast 2M pages) ~492 ms -> ~49 ms
(pin_user_pages_fast 2M pages) ~493 ms -> ~50 ms
[altmap with -m 127004]
(get_user_pages_fast 2M pages) ~3.91 sec -> ~70 ms
(pin_user_pages_fast 2M pages) ~3.97 sec -> ~74 ms

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/gup.c | 51 +++++++++++++++++++++++++++++++--------------------
 1 file changed, 31 insertions(+), 20 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 7a406d79bd2e..0741d2c0ba5e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2234,17 +2234,30 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 }
 #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
 
+
+static int record_subpages(struct page *page, unsigned long addr,
+			   unsigned long end, struct page **pages)
+{
+	int nr;
+
+	for (nr = 0; addr != end; addr += PAGE_SIZE)
+		pages[nr++] = page++;
+
+	return nr;
+}
+
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 			     unsigned long end, unsigned int flags,
 			     struct page **pages, int *nr)
 {
-	int nr_start = *nr;
+	int refs, nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
 	int ret = 1;
 
 	do {
-		struct page *page = pfn_to_page(pfn);
+		struct page *head, *page = pfn_to_page(pfn);
+		unsigned long next = addr + PAGE_SIZE;
 
 		pgmap = get_dev_pagemap(pfn, pgmap);
 		if (unlikely(!pgmap)) {
@@ -2252,16 +2265,25 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 			ret = 0;
 			break;
 		}
-		SetPageReferenced(page);
-		pages[*nr] = page;
-		if (unlikely(!try_grab_page(page, flags))) {
-			undo_dev_pagemap(nr, nr_start, flags, pages);
+
+		head = compound_head(page);
+		/* @end is assumed to be limited at most one compound page */
+		if (PageHead(head))
+			next = end;
+		refs = record_subpages(page, addr, next, pages + *nr);
+
+		SetPageReferenced(head);
+		if (unlikely(!try_grab_compound_head(head, refs, flags))) {
+			if (PageHead(head))
+				ClearPageReferenced(head);
+			else
+				undo_dev_pagemap(nr, nr_start, flags, pages);
 			ret = 0;
 			break;
 		}
-		(*nr)++;
-		pfn++;
-	} while (addr += PAGE_SIZE, addr != end);
+		*nr += refs;
+		pfn += refs;
+	} while (addr += (refs << PAGE_SHIFT), addr != end);
 
 	put_dev_pagemap(pgmap);
 	return ret;
@@ -2320,17 +2342,6 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
-static int record_subpages(struct page *page, unsigned long addr,
-			   unsigned long end, struct page **pages)
-{
-	int nr;
-
-	for (nr = 0; addr != end; addr += PAGE_SIZE)
-		pages[nr++] = page++;
-
-	return nr;
-}
-
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
-- 
2.17.1



* [PATCH v4 09/14] mm/sparse-vmemmap: add a pgmap argument to section activation
  2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
                   ` (7 preceding siblings ...)
  2021-08-27 14:58 ` [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages Joao Martins
@ 2021-08-27 14:58 ` Joao Martins
  2021-08-27 14:58 ` [PATCH v4 10/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2021-08-27 14:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

In support of using compound pages for devmap mappings, plumb the pgmap
down to the vmemmap_populate implementation. Note that while altmap is
retrievable from pgmap the memory hotplug code passes altmap without
pgmap[*], so both need to be independently plumbed.

So in addition to @altmap, pass @pgmap to sparse section populate
functions namely:

	sparse_add_section
	  section_activate
	    populate_section_memmap
   	      __populate_section_memmap

Passing @pgmap allows __populate_section_memmap() to both fetch the
geometry for which memmap metadata is created and also lets
sparse-vmemmap fetch pgmap ranges to correlate to a given section and pick
whether to just reuse tail pages from past onlined sections.

While at it, fix the kdoc for @altmap for sparse_add_section().

[*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/memory_hotplug.h |  5 ++++-
 include/linux/mm.h             |  3 ++-
 mm/memory_hotplug.c            |  3 ++-
 mm/sparse-vmemmap.c            |  3 ++-
 mm/sparse.c                    | 26 ++++++++++++++++----------
 5 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index e5a867c950b2..a5e648ae86e9 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -15,6 +15,7 @@ struct memory_block;
 struct memory_group;
 struct resource;
 struct vmem_altmap;
+struct dev_pagemap;
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 struct page *pfn_to_online_page(unsigned long pfn);
@@ -66,6 +67,7 @@ typedef int __bitwise mhp_t;
 struct mhp_params {
 	struct vmem_altmap *altmap;
 	pgprot_t pgprot;
+	struct dev_pagemap *pgmap;
 };
 
 bool mhp_range_allowed(u64 start, u64 size, bool need_mapping);
@@ -342,7 +344,8 @@ extern void remove_pfn_range_from_zone(struct zone *zone,
 				       unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern int sparse_add_section(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap);
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap);
 extern void sparse_remove_section(struct mem_section *ms,
 		unsigned long pfn, unsigned long nr_pages,
 		unsigned long map_offset, struct vmem_altmap *altmap);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a3cc83d64564..4fca4942c0ab 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3167,7 +3167,8 @@ int vmemmap_remap_alloc(unsigned long start, unsigned long end,
 
 void *sparse_buffer_alloc(unsigned long size);
 struct page * __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap);
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap);
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0488eed3327c..81d36de86842 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -335,7 +335,8 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 		/* Select all remaining pages up to the next section boundary */
 		cur_nr_pages = min(end_pfn - pfn,
 				   SECTION_ALIGN_UP(pfn + 1) - pfn);
-		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap);
+		err = sparse_add_section(nid, pfn, cur_nr_pages, altmap,
+					 params->pgmap);
 		if (err)
 			break;
 		cond_resched();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index bdce883f9286..80d3ba30d345 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -603,7 +603,8 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 }
 
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
diff --git a/mm/sparse.c b/mm/sparse.c
index 120bc8ea5293..122ac881f27b 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -427,7 +427,8 @@ static unsigned long __init section_map_size(void)
 }
 
 struct page __init *__populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long size = section_map_size();
 	struct page *map = sparse_buffer_alloc(size);
@@ -524,7 +525,7 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
 			break;
 
 		map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
-				nid, NULL);
+				nid, NULL, NULL);
 		if (!map) {
 			pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
 			       __func__, nid);
@@ -629,9 +630,10 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 static struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
-	return __populate_section_memmap(pfn, nr_pages, nid, altmap);
+	return __populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
 }
 
 static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
@@ -700,7 +702,8 @@ static int fill_subsection_map(unsigned long pfn, unsigned long nr_pages)
 }
 #else
 struct page * __meminit populate_section_memmap(unsigned long pfn,
-		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
+		unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	return kvmalloc_node(array_size(sizeof(struct page),
 					PAGES_PER_SECTION), GFP_KERNEL, nid);
@@ -823,7 +826,8 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 }
 
 static struct page * __meminit section_activate(int nid, unsigned long pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	struct mem_section *ms = __pfn_to_section(pfn);
 	struct mem_section_usage *usage = NULL;
@@ -855,7 +859,7 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
 	if (nr_pages < PAGES_PER_SECTION && early_section(ms))
 		return pfn_to_page(pfn);
 
-	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap);
+	memmap = populate_section_memmap(pfn, nr_pages, nid, altmap, pgmap);
 	if (!memmap) {
 		section_deactivate(pfn, nr_pages, altmap);
 		return ERR_PTR(-ENOMEM);
@@ -869,7 +873,8 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  * @nid: The node to add section on
  * @start_pfn: start pfn of the memory range
  * @nr_pages: number of pfns to add in the section
- * @altmap: device page map
+ * @altmap: alternate pfns to allocate the memmap backing store
+ * @pgmap: alternate compound page geometry for devmap mappings
  *
  * This is only intended for hotplug.
  *
@@ -883,7 +888,8 @@ static struct page * __meminit section_activate(int nid, unsigned long pfn,
  * * -ENOMEM	- Out of memory.
  */
 int __meminit sparse_add_section(int nid, unsigned long start_pfn,
-		unsigned long nr_pages, struct vmem_altmap *altmap)
+		unsigned long nr_pages, struct vmem_altmap *altmap,
+		struct dev_pagemap *pgmap)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
 	struct mem_section *ms;
@@ -894,7 +900,7 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 	if (ret < 0)
 		return ret;
 
-	memmap = section_activate(nid, start_pfn, nr_pages, altmap);
+	memmap = section_activate(nid, start_pfn, nr_pages, altmap, pgmap);
 	if (IS_ERR(memmap))
 		return PTR_ERR(memmap);
 
-- 
2.17.1



* [PATCH v4 10/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper
  2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
                   ` (8 preceding siblings ...)
  2021-08-27 14:58 ` [PATCH v4 09/14] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
@ 2021-08-27 14:58 ` Joao Martins
  2021-08-27 14:58 ` [PATCH v4 11/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2021-08-27 14:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

In preparation for describing a memmap with compound pages, move the
actual pte population logic into a separate function
vmemmap_populate_address() and have vmemmap_populate_basepages() walk
through all base pages it needs to populate.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/sparse-vmemmap.c | 51 ++++++++++++++++++++++++++++-----------------
 1 file changed, 32 insertions(+), 19 deletions(-)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 80d3ba30d345..58e8e77bd5b5 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -570,33 +570,46 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
-					 int node, struct vmem_altmap *altmap)
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+					      struct vmem_altmap *altmap)
 {
-	unsigned long addr = start;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 
+	pgd = vmemmap_pgd_populate(addr, node);
+	if (!pgd)
+		return -ENOMEM;
+	p4d = vmemmap_p4d_populate(pgd, addr, node);
+	if (!p4d)
+		return -ENOMEM;
+	pud = vmemmap_pud_populate(p4d, addr, node);
+	if (!pud)
+		return -ENOMEM;
+	pmd = vmemmap_pmd_populate(pud, addr, node);
+	if (!pmd)
+		return -ENOMEM;
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+	if (!pte)
+		return -ENOMEM;
+	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+
+	return 0;
+}
+
+int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
+					 int node, struct vmem_altmap *altmap)
+{
+	unsigned long addr = start;
+	int rc;
+
 	for (; addr < end; addr += PAGE_SIZE) {
-		pgd = vmemmap_pgd_populate(addr, node);
-		if (!pgd)
-			return -ENOMEM;
-		p4d = vmemmap_p4d_populate(pgd, addr, node);
-		if (!p4d)
-			return -ENOMEM;
-		pud = vmemmap_pud_populate(p4d, addr, node);
-		if (!pud)
-			return -ENOMEM;
-		pmd = vmemmap_pmd_populate(pud, addr, node);
-		if (!pmd)
-			return -ENOMEM;
-		pte = vmemmap_pte_populate(pmd, addr, node, altmap);
-		if (!pte)
-			return -ENOMEM;
-		vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+		rc = vmemmap_populate_address(addr, node, altmap);
+		if (rc)
+			return rc;
+
 	}
 
 	return 0;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v4 11/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm
  2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
                   ` (9 preceding siblings ...)
  2021-08-27 14:58 ` [PATCH v4 10/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
@ 2021-08-27 14:58 ` Joao Martins
  2021-08-27 14:58 ` [PATCH v4 12/14] mm/sparse-vmemmap: populate compound devmaps Joao Martins
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2021-08-27 14:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

In preparation for device-dax to use the hugetlbfs compound page tail
deduplication technique, move the comment block explanation into a
common place in Documentation/vm.

Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/vm/index.rst         |   1 +
 Documentation/vm/vmemmap_dedup.rst | 170 +++++++++++++++++++++++++++++
 mm/hugetlb_vmemmap.c               | 162 +--------------------------
 3 files changed, 172 insertions(+), 161 deletions(-)
 create mode 100644 Documentation/vm/vmemmap_dedup.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index b51f0d8992f8..68fe9b953b0a 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -52,5 +52,6 @@ descriptions of data structures and algorithms.
    split_page_table_lock
    transhuge
    unevictable-lru
+   vmemmap_dedup
    z3fold
    zsmalloc
diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
new file mode 100644
index 000000000000..215ae2ef3bce
--- /dev/null
+++ b/Documentation/vm/vmemmap_dedup.rst
@@ -0,0 +1,170 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _vmemmap_dedup:
+
+==================================
+Free some vmemmap pages of HugeTLB
+==================================
+
+The struct page structures (page structs) are used to describe a physical
+page frame. By default, there is a one-to-one mapping from a page frame to
+its corresponding page struct.
+
+HugeTLB pages consist of multiple base page size pages and are supported by
+many architectures. See hugetlbpage.rst in the Documentation directory for
+more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB
+are currently supported. Since the base page size on x86 is 4KB, a 2MB
+HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
+4096 base pages. For each base page, there is a corresponding page struct.
+
+Within the HugeTLB subsystem, only the first 4 page structs are used to
+contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
+this upper limit. The only 'useful' information in the remaining page structs
+is the compound_head field, and this field is the same for all tail pages.
+
+By removing redundant page structs for HugeTLB pages, memory can be returned
+to the buddy allocator for other uses.
+
+Different architectures support different HugeTLB page sizes. For example,
+the following table lists the HugeTLB page sizes supported by the x86 and
+arm64 architectures. Because arm64 supports 4k, 16k, and 64k base pages as
+well as contiguous entries, it supports many different HugeTLB page
+sizes.
+
++--------------+-----------+-----------------------------------------------+
+| Architecture | Page Size |                HugeTLB Page Size              |
++--------------+-----------+-----------+-----------+-----------+-----------+
+|    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
++--------------+-----------+-----------+-----------+-----------+-----------+
+|              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
+|              +-----------+-----------+-----------+-----------+-----------+
+|    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
+|              +-----------+-----------+-----------+-----------+-----------+
+|              |   64KB    |    2MB    |  512MB    |    16GB   |           |
++--------------+-----------+-----------+-----------+-----------+-----------+
+
+When the system boots up, every HugeTLB page is described by more than one
+struct page, and the total size of those page structs is (unit: pages):
+
+   struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
+
+Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
+of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
+relationship.
+
+   HugeTLB_Size = n * PAGE_SIZE
+
+Then,
+
+   struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
+               = n * sizeof(struct page) / PAGE_SIZE
+
+We can use huge mapping at the pud/pmd level for the HugeTLB page.
+
+For the HugeTLB page of the pmd level mapping, then
+
+   struct_size = n * sizeof(struct page) / PAGE_SIZE
+               = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
+               = sizeof(struct page) / sizeof(pte_t)
+               = 64 / 8
+               = 8 (pages)
+
+Where n is how many pte entries one page can contain, so the value of
+n is (PAGE_SIZE / sizeof(pte_t)).
+
+This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
+is 8. This optimization is also applicable only when the size of struct page
+is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
+x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, its
+struct page structs occupy 8 page frames, whose size depends on the size
+of the base page.
+
+For the HugeTLB page of the pud level mapping, then
+
+   struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
+               = PAGE_SIZE / 8 * 8 (pages)
+               = PAGE_SIZE (pages)
+
+Where the struct_size(pmd) is the size of the struct page structs of a
+HugeTLB page of the pmd level mapping.
+
+E.g.: The struct pages of a 2MB HugeTLB page on x86_64 occupy 8 page frames,
+while those of a 1GB HugeTLB page occupy 4096.
+
+Next, we take the pmd level mapping of the HugeTLB page as an example to
+show the internal implementation of this optimization. There are 8 pages of
+struct page structs associated with a HugeTLB page which is pmd mapped.
+
+Here is how things look before optimization.
+
+    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
+ +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ |           |                     |     0     | -------------> |     0     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     1     | -------------> |     1     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     2     | -------------> |     2     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     3     | -------------> |     3     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     4     | -------------> |     4     |
+ |    PMD    |                     +-----------+                +-----------+
+ |   level   |                     |     5     | -------------> |     5     |
+ |  mapping  |                     +-----------+                +-----------+
+ |           |                     |     6     | -------------> |     6     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     7     | -------------> |     7     |
+ |           |                     +-----------+                +-----------+
+ |           |
+ |           |
+ |           |
+ +-----------+
+
+The value of page->compound_head is the same for all tail pages. The first
+page of page structs (page 0) associated with the HugeTLB page contains the 4
+page structs necessary to describe the HugeTLB. The only use of the remaining
+pages of page structs (page 1 to page 7) is to point to page->compound_head.
+Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
+will be used for each HugeTLB page. This will allow us to free the remaining
+6 pages to the buddy allocator.
+
+Here is how things look after remapping.
+
+    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
+ +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
+ |           |                     |     0     | -------------> |     0     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     1     | -------------> |     1     |
+ |           |                     +-----------+                +-----------+
+ |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
+ |           |                     +-----------+                   | | | | |
+ |           |                     |     3     | ------------------+ | | | |
+ |           |                     +-----------+                     | | | |
+ |           |                     |     4     | --------------------+ | | |
+ |    PMD    |                     +-----------+                       | | |
+ |   level   |                     |     5     | ----------------------+ | |
+ |  mapping  |                     +-----------+                         | |
+ |           |                     |     6     | ------------------------+ |
+ |           |                     +-----------+                           |
+ |           |                     |     7     | --------------------------+
+ |           |                     +-----------+
+ |           |
+ |           |
+ |           |
+ +-----------+
+
+When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
+vmemmap pages and restore the previous mapping relationship.
+
+The HugeTLB page of the pud level mapping is similar to the former case;
+we can also use this approach to free (PAGE_SIZE - 2) vmemmap pages.
+
+Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
+(e.g. aarch64) provide a contiguous bit in the translation table entries
+that hints to the MMU to indicate that it is one of a contiguous set of
+entries that can be cached in a single TLB entry.
+
+The contiguous bit is used to increase the mapping size at the pmd and pte
+(last) level. So this type of HugeTLB page can be optimized only when the
+size of its struct page structs is greater than 2 pages.
+
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index c540c21e26f5..e2994e50ddee 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -6,167 +6,7 @@
  *
  *     Author: Muchun Song <songmuchun@bytedance.com>
  *
- * The struct page structures (page structs) are used to describe a physical
- * page frame. By default, there is a one-to-one mapping from a page frame to
- * it's corresponding page struct.
- *
- * HugeTLB pages consist of multiple base page size pages and is supported by
- * many architectures. See hugetlbpage.rst in the Documentation directory for
- * more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB
- * are currently supported. Since the base page size on x86 is 4KB, a 2MB
- * HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
- * 4096 base pages. For each base page, there is a corresponding page struct.
- *
- * Within the HugeTLB subsystem, only the first 4 page structs are used to
- * contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
- * this upper limit. The only 'useful' information in the remaining page structs
- * is the compound_head field, and this field is the same for all tail pages.
- *
- * By removing redundant page structs for HugeTLB pages, memory can be returned
- * to the buddy allocator for other uses.
- *
- * Different architectures support different HugeTLB pages. For example, the
- * following table is the HugeTLB page size supported by x86 and arm64
- * architectures. Because arm64 supports 4k, 16k, and 64k base pages and
- * supports contiguous entries, so it supports many kinds of sizes of HugeTLB
- * page.
- *
- * +--------------+-----------+-----------------------------------------------+
- * | Architecture | Page Size |                HugeTLB Page Size              |
- * +--------------+-----------+-----------+-----------+-----------+-----------+
- * |    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
- * +--------------+-----------+-----------+-----------+-----------+-----------+
- * |              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
- * |              +-----------+-----------+-----------+-----------+-----------+
- * |    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
- * |              +-----------+-----------+-----------+-----------+-----------+
- * |              |   64KB    |    2MB    |  512MB    |    16GB   |           |
- * +--------------+-----------+-----------+-----------+-----------+-----------+
- *
- * When the system boot up, every HugeTLB page has more than one struct page
- * structs which size is (unit: pages):
- *
- *    struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
- *
- * Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
- * of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
- * relationship.
- *
- *    HugeTLB_Size = n * PAGE_SIZE
- *
- * Then,
- *
- *    struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
- *                = n * sizeof(struct page) / PAGE_SIZE
- *
- * We can use huge mapping at the pud/pmd level for the HugeTLB page.
- *
- * For the HugeTLB page of the pmd level mapping, then
- *
- *    struct_size = n * sizeof(struct page) / PAGE_SIZE
- *                = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
- *                = sizeof(struct page) / sizeof(pte_t)
- *                = 64 / 8
- *                = 8 (pages)
- *
- * Where n is how many pte entries which one page can contains. So the value of
- * n is (PAGE_SIZE / sizeof(pte_t)).
- *
- * This optimization only supports 64-bit system, so the value of sizeof(pte_t)
- * is 8. And this optimization also applicable only when the size of struct page
- * is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
- * x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
- * size of struct page structs of it is 8 page frames which size depends on the
- * size of the base page.
- *
- * For the HugeTLB page of the pud level mapping, then
- *
- *    struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
- *                = PAGE_SIZE / 8 * 8 (pages)
- *                = PAGE_SIZE (pages)
- *
- * Where the struct_size(pmd) is the size of the struct page structs of a
- * HugeTLB page of the pmd level mapping.
- *
- * E.g.: A 2MB HugeTLB page on x86_64 consists in 8 page frames while 1GB
- * HugeTLB page consists in 4096.
- *
- * Next, we take the pmd level mapping of the HugeTLB page as an example to
- * show the internal implementation of this optimization. There are 8 pages
- * struct page structs associated with a HugeTLB page which is pmd mapped.
- *
- * Here is how things look before optimization.
- *
- *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
- * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
- * |           |                     |     0     | -------------> |     0     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     1     | -------------> |     1     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     2     | -------------> |     2     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     3     | -------------> |     3     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     4     | -------------> |     4     |
- * |    PMD    |                     +-----------+                +-----------+
- * |   level   |                     |     5     | -------------> |     5     |
- * |  mapping  |                     +-----------+                +-----------+
- * |           |                     |     6     | -------------> |     6     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     7     | -------------> |     7     |
- * |           |                     +-----------+                +-----------+
- * |           |
- * |           |
- * |           |
- * +-----------+
- *
- * The value of page->compound_head is the same for all tail pages. The first
- * page of page structs (page 0) associated with the HugeTLB page contains the 4
- * page structs necessary to describe the HugeTLB. The only use of the remaining
- * pages of page structs (page 1 to page 7) is to point to page->compound_head.
- * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
- * will be used for each HugeTLB page. This will allow us to free the remaining
- * 6 pages to the buddy allocator.
- *
- * Here is how things look after remapping.
- *
- *    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
- * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
- * |           |                     |     0     | -------------> |     0     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     1     | -------------> |     1     |
- * |           |                     +-----------+                +-----------+
- * |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
- * |           |                     +-----------+                   | | | | |
- * |           |                     |     3     | ------------------+ | | | |
- * |           |                     +-----------+                     | | | |
- * |           |                     |     4     | --------------------+ | | |
- * |    PMD    |                     +-----------+                       | | |
- * |   level   |                     |     5     | ----------------------+ | |
- * |  mapping  |                     +-----------+                         | |
- * |           |                     |     6     | ------------------------+ |
- * |           |                     +-----------+                           |
- * |           |                     |     7     | --------------------------+
- * |           |                     +-----------+
- * |           |
- * |           |
- * |           |
- * +-----------+
- *
- * When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
- * vmemmap pages and restore the previous mapping relationship.
- *
- * For the HugeTLB page of the pud level mapping. It is similar to the former.
- * We also can use this approach to free (PAGE_SIZE - 2) vmemmap pages.
- *
- * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
- * (e.g. aarch64) provides a contiguous bit in the translation table entries
- * that hints to the MMU to indicate that it is one of a contiguous set of
- * entries that can be cached in a single TLB entry.
- *
- * The contiguous bit is used to increase the mapping size at the pmd and pte
- * (last) level. So this type of HugeTLB page can be optimized only when its
- * size of the struct page structs is greater than 2 pages.
+ * See Documentation/vm/vmemmap_dedup.rst
  */
 #define pr_fmt(fmt)	"HugeTLB: " fmt
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v4 12/14] mm/sparse-vmemmap: populate compound devmaps
  2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
                   ` (10 preceding siblings ...)
  2021-08-27 14:58 ` [PATCH v4 11/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
@ 2021-08-27 14:58 ` Joao Martins
  2021-08-27 14:58 ` [PATCH v4 13/14] mm/page_alloc: reuse tail struct pages for " Joao Martins
  2021-08-27 14:58 ` [PATCH v4 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry Joao Martins
  13 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2021-08-27 14:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

A compound devmap is a dev_pagemap with @geometry > PAGE_SIZE, meaning
that pages are mapped at a given huge page alignment and use compound
pages, as opposed to order-0 pages, to describe them.

Take advantage of the fact that most tail pages look the same (except
the first two) to minimize struct page overhead. Allocate a separate
page for the vmemmap area which contains the head page, and a separate
one for the next 64 struct pages. The rest of the subsections then reuse
this tail vmemmap page to initialize the rest of the tail pages.

Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
when initializing a compound devmap with a big enough @geometry (e.g. 1G
PUD) it may cross multiple sections. The vmemmap code needs to consult
@pgmap so that multiple sections that all map the same tail data can
refer back to the first copy of that data for a given gigantic page.

On compound devmaps with 2M align, this mechanism lets 6 pages be
saved out of the 8 PFNs necessary to map the subsection's 512 struct
pages. On a 1G compound devmap it saves 4094 pages.

Altmap isn't supported yet, given various restrictions in the altmap pfn
allocator, so fall back to the already in use vmemmap_populate(). It is
worth noting that altmap for devmap mappings was there to relieve the
pressure of inordinate amounts of memmap space to map terabytes of pmem.
With compound pages the motivation for altmaps for pmem gets reduced.
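
As a back-of-the-envelope check of the figures above, here is a small
standalone sketch (not kernel code; it assumes 4K base pages and a
64-byte struct page):

	#include <stdio.h>

	#define PAGE_SZ		4096UL
	#define STRUCT_PAGE	64UL	/* assumed sizeof(struct page) */

	int main(void)
	{
		/* vmemmap pages needed to hold the struct pages of one compound page */
		unsigned long pmd = 512 * STRUCT_PAGE / PAGE_SZ;	/* 2M: 8 */
		unsigned long pud = 262144 * STRUCT_PAGE / PAGE_SZ;	/* 1G: 4096 */

		/* only a head and a tail vmemmap page are kept unique */
		printf("2M geometry saves %lu pages\n", pmd - 2);	/* 6 */
		printf("1G geometry saves %lu pages\n", pud - 2);	/* 4094 */
		return 0;
	}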

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 Documentation/vm/vmemmap_dedup.rst |  27 +++++-
 include/linux/mm.h                 |   2 +-
 mm/memremap.c                      |   1 +
 mm/sparse-vmemmap.c                | 150 +++++++++++++++++++++++++++--
 4 files changed, 168 insertions(+), 12 deletions(-)

diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
index 215ae2ef3bce..faac78bef01c 100644
--- a/Documentation/vm/vmemmap_dedup.rst
+++ b/Documentation/vm/vmemmap_dedup.rst
@@ -2,9 +2,12 @@
 
 .. _vmemmap_dedup:
 
-==================================
-Free some vmemmap pages of HugeTLB
-==================================
+=========================================
+A vmemmap diet for HugeTLB and Device DAX
+=========================================
+
+HugeTLB
+=======
 
 The struct page structures (page structs) are used to describe a physical
 page frame. By default, there is a one-to-one mapping from a page frame to
@@ -168,3 +171,21 @@ The contiguous bit is used to increase the mapping size at the pmd and pte
 (last) level. So this type of HugeTLB page can be optimized only when the
 size of its struct page structs is greater than 2 pages.
 
+Device DAX
+==========
+
+The device-dax interface uses the same tail deduplication technique explained
+in the previous chapter, except when used with the vmemmap in the device (altmap).
+
+The differences with HugeTLB are relatively minor.
+
+The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
+PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
+
+There's no remapping of vmemmap given that device-dax memory is not part of
+System RAM ranges initialized at boot, hence the tail deduplication happens
+at a later stage when we populate the sections.
+
+It only uses 3 page structs for storing all information, as opposed
+to 4 on HugeTLB pages. This does not affect the memory savings between the two.
+
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4fca4942c0ab..77eaeae497f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3174,7 +3174,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-			    struct vmem_altmap *altmap);
+			    struct vmem_altmap *altmap, struct page *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
 struct vmem_altmap;
 void *vmemmap_alloc_block_buf(unsigned long size, int node,
diff --git a/mm/memremap.c b/mm/memremap.c
index 99646082436f..0d4c98722c12 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -338,6 +338,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 {
 	struct mhp_params params = {
 		.altmap = pgmap_altmap(pgmap),
+		.pgmap = pgmap,
 		.pgprot = PAGE_KERNEL,
 	};
 	const int nr_range = pgmap->nr_range;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 58e8e77bd5b5..441bb95edd68 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -495,16 +495,31 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
 }
 
 pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-				       struct vmem_altmap *altmap)
+				       struct vmem_altmap *altmap,
+				       struct page *block)
 {
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte)) {
 		pte_t entry;
 		void *p;
 
-		p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
-		if (!p)
-			return NULL;
+		if (!block) {
+			p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+			if (!p)
+				return NULL;
+		} else {
+			/*
+			 * When a PTE/PMD entry is freed from the init_mm
+			 * there's a free_pages() call on this page allocated
+			 * above. Thus this get_page() is paired with the
+			 * put_page_testzero() on the freeing path.
+			 * This can only be called by certain ZONE_DEVICE paths,
+			 * and through vmemmap_populate_compound_pages() when
+			 * slab is available.
+			 */
+			get_page(block);
+			p = page_to_virt(block);
+		}
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
 		set_pte_at(&init_mm, addr, pte, entry);
 	}
@@ -571,7 +586,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 }
 
 static int __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap)
+					      struct vmem_altmap *altmap,
+					      struct page *reuse, struct page **page)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -591,11 +607,13 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pmd = vmemmap_pmd_populate(pud, addr, node);
 	if (!pmd)
 		return -ENOMEM;
-	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
 	if (!pte)
 		return -ENOMEM;
 	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
 
+	if (page)
+		*page = pte_page(*pte);
 	return 0;
 }
 
@@ -606,10 +624,120 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 	int rc;
 
 	for (; addr < end; addr += PAGE_SIZE) {
-		rc = vmemmap_populate_address(addr, node, altmap);
+		rc = vmemmap_populate_address(addr, node, altmap, NULL, NULL);
 		if (rc)
 			return rc;
+	}
+
+	return 0;
+}
+
+static int __meminit vmemmap_populate_range(unsigned long start,
+					    unsigned long end,
+					    int node, struct page *page)
+{
+	unsigned long addr = start;
+	int rc;
 
+	for (; addr < end; addr += PAGE_SIZE) {
+		rc = vmemmap_populate_address(addr, node, NULL, page, NULL);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
+static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
+						  struct page **page)
+{
+	return vmemmap_populate_address(addr, node, NULL, NULL, page);
+}
+
+/*
+ * For compound pages bigger than section size (e.g. x86 1G compound
+ * pages with 2M subsection size) fill the rest of sections as tail
+ * pages.
+ *
+ * Note that memremap_pages() resets @nr_range value and will increment
+ * it after each successful range onlining. Thus the value of @nr_range
+ * at section memmap populate corresponds to the in-progress range
+ * being onlined here.
+ */
+static bool __meminit reuse_compound_section(unsigned long start_pfn,
+					     struct dev_pagemap *pgmap)
+{
+	unsigned long geometry = pgmap_geometry(pgmap);
+	unsigned long offset = start_pfn -
+		PHYS_PFN(pgmap->ranges[pgmap->nr_range].start);
+
+	return !IS_ALIGNED(offset, geometry) && geometry > PAGES_PER_SUBSECTION;
+}
+
+static struct page * __meminit compound_section_tail_page(unsigned long addr)
+{
+	pte_t *ptep;
+
+	addr -= PAGE_SIZE;
+
+	/*
+	 * Assuming sections are populated sequentially, the previous section's
+	 * page data can be reused.
+	 */
+	ptep = pte_offset_kernel(pmd_off_k(addr), addr);
+	if (!ptep)
+		return NULL;
+
+	return pte_page(*ptep);
+}
+
+static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
+						     unsigned long start,
+						     unsigned long end, int node,
+						     struct dev_pagemap *pgmap)
+{
+	unsigned long size, addr;
+
+	if (reuse_compound_section(start_pfn, pgmap)) {
+		struct page *page;
+
+		page = compound_section_tail_page(start);
+		if (!page)
+			return -ENOMEM;
+
+		/*
+		 * Reuse the page that was populated in the prior iteration
+		 * with just tail struct pages.
+		 */
+		return vmemmap_populate_range(start, end, node, page);
+	}
+
+	size = min(end - start, pgmap_geometry(pgmap) * sizeof(struct page));
+	for (addr = start; addr < end; addr += size) {
+		unsigned long next = addr, last = addr + size;
+		struct page *block;
+		int rc;
+
+		/* Populate the head page vmemmap page */
+		rc = vmemmap_populate_page(addr, node, NULL);
+		if (rc)
+			return rc;
+
+		/* Populate the tail pages vmemmap page */
+		block = NULL;
+		next = addr + PAGE_SIZE;
+		rc = vmemmap_populate_page(next, node, &block);
+		if (rc)
+			return rc;
+
+		/*
+		 * Reuse the previous page for the rest of tail pages
+		 * See layout diagram in Documentation/vm/vmemmap_dedup.rst
+		 */
+		next += PAGE_SIZE;
+		rc = vmemmap_populate_range(next, last, node, block);
+		if (rc)
+			return rc;
 	}
 
 	return 0;
@@ -621,12 +749,18 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
+	int r;
 
 	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
 		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
 		return NULL;
 
-	if (vmemmap_populate(start, end, nid, altmap))
+	if (pgmap_geometry(pgmap) > 1 && !altmap)
+		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
+	else
+		r = vmemmap_populate(start, end, nid, altmap);
+
+	if (r < 0)
 		return NULL;
 
 	return pfn_to_page(pfn);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v4 13/14] mm/page_alloc: reuse tail struct pages for compound devmaps
  2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
                   ` (11 preceding siblings ...)
  2021-08-27 14:58 ` [PATCH v4 12/14] mm/sparse-vmemmap: populate compound devmaps Joao Martins
@ 2021-08-27 14:58 ` Joao Martins
  2021-08-27 14:58 ` [PATCH v4 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry Joao Martins
  13 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2021-08-27 14:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Currently memmap_init_zone_device() ends up initializing 32768 pages
when it only needs to initialize 128 given tail page reuse. That
number is worse with 1GB compound page geometries, 262144 instead of
128. Update memmap_init_zone_device() to skip redundant
initialization, detailed below.

When a pgmap @geometry is set, all pages are mapped at a given huge page
alignment and compound pages are used to describe them, as opposed to
one struct page per 4K page.

With @geometry > PAGE_SIZE and when struct pages are stored in ram
(!altmap) most tail pages are reused. Consequently, the amount of unique
struct pages is a lot smaller than the total amount of struct pages
being mapped.

The altmap path is left alone since it does not support memory savings
based on compound devmap geometries.
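
To make the 128 vs 262144 figures concrete, the arithmetic behind the
compound_nr_pages() helper below boils down to the following (a
standalone sketch, assuming 4K base pages and a 64-byte struct page):

	#include <stdio.h>

	#define PAGE_SZ		4096UL
	#define STRUCT_PAGE	64UL	/* assumed sizeof(struct page) */

	int main(void)
	{
		/* one head plus one tail vmemmap page worth of struct pages */
		unsigned long unique = 2 * (PAGE_SZ / STRUCT_PAGE);
		unsigned long total_1g = (1UL << 30) / PAGE_SZ;

		/* prints: unique: 128 total (1G): 262144 */
		printf("unique: %lu total (1G): %lu\n", unique, total_1g);
		return 0;
	}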

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/page_alloc.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c1497a928005..13464b4188b4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6610,6 +6610,20 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	}
 }
 
+/*
+ * With compound page geometry and when struct pages are stored in ram most
+ * tail pages are reused. Consequently, the amount of unique struct pages to
+ * initialize is a lot smaller than the total amount of struct pages being
+ * mapped. This is a paired / mild layering violation with explicit knowledge
+ * of how the sparse_vmemmap internals handle compound pages in the absence
+ * of an altmap. See vmemmap_populate_compound_pages().
+ */
+static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
+					      unsigned long nr_pages)
+{
+	return !altmap ? 2 * (PAGE_SIZE/sizeof(struct page)) : nr_pages;
+}
+
 static void __ref memmap_init_compound(struct page *head, unsigned long head_pfn,
 				       unsigned long zone_idx, int nid,
 				       struct dev_pagemap *pgmap,
@@ -6673,7 +6687,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 			continue;
 
 		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
-				     pfns_per_compound);
+				     compound_nr_pages(altmap, pfns_per_compound));
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v4 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry
  2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
                   ` (12 preceding siblings ...)
  2021-08-27 14:58 ` [PATCH v4 13/14] mm/page_alloc: reuse tail struct pages for " Joao Martins
@ 2021-08-27 14:58 ` Joao Martins
  13 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2021-08-27 14:58 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Currently, for compound PUD mappings, the implementation consumes 40MB
per TB but it can be optimized to 16MB per TB with the approach
detailed below.

Right now basepages are used to populate the PUD tail pages, and it
picks the address of the previous page of the subsection that precedes
the memmap being initialized.  This is done when a given memmap
address isn't aligned to the pgmap @geometry (which is safe to do because
@ranges are guaranteed to be aligned to @geometry).

For devmaps with an align which spans various sections, this means
that PMD pages are unnecessarily allocated for reusing the same tail
pages.  Effectively, on x86 a PUD can span 8 sections (depending on
config), and a page is allocated for each PMD just to reuse the tail
vmemmap across the rest of the PTEs. In short, effectively all the
tail vmemmap areas covered by those PMDs contain the same PFN. So
instead of doing it this way, populate a new PMD on the second section
of the compound page (the tail vmemmap PMD), and have the following
sections reuse that previously populated PMD, which only contains
tail pages.

After this scheme, for a 1GB devmap aligned area, the first PMD
(section) would contain the head page and 32767 tail pages, while the
second PMD contains the full 32768 tail pages.  The latter PMD gets
reused across future section mappings of the same devmap.

Besides allocating fewer pagetable entries and keeping parity with
hugepages in the directmap (as done by vmemmap_populate_hugepages()),
this further increases savings per compound page. Rather than
requiring 8 PMD page allocations, only 2 are needed (plus two base
pages allocated for the head and tail areas of the first PMD). 2M
pages still require using base pages, though.
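
A rough breakdown of where the 40MB -> 16MB per TB figure comes from
(a standalone sketch using the page counts described above; 4K base
pages assumed):

	#include <stdio.h>

	int main(void)
	{
		unsigned long pg = 4096;
		/* before: 8 PMD-level page frames + the 2 PTE page frames of the first PMD */
		unsigned long before = (8 + 2) * pg;
		/* after: only 2 PMD-level page frames are needed (plus the same 2 base pages) */
		unsigned long after = (2 + 2) * pg;

		/* 1T of 1G compound pages is 1024 compound pages */
		printf("per 1G: %lu -> %lu bytes\n", before, after);
		printf("per 1T: %lu -> %lu MB\n",
		       before * 1024 >> 20, after * 1024 >> 20);
		return 0;
	}

This matches the 40960 -> 16384 bytes per 1GB page numbers in the
documentation change below.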

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 Documentation/vm/vmemmap_dedup.rst | 109 +++++++++++++++++++++++++++++
 include/linux/mm.h                 |   3 +-
 mm/sparse-vmemmap.c                | 108 +++++++++++++++++++++++-----
 3 files changed, 203 insertions(+), 17 deletions(-)

diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
index faac78bef01c..65aabfa2ca0b 100644
--- a/Documentation/vm/vmemmap_dedup.rst
+++ b/Documentation/vm/vmemmap_dedup.rst
@@ -189,3 +189,112 @@ at a later stage when we populate the sections.
 It only uses 3 page structs for storing all information, as opposed
 to 4 on HugeTLB pages. This does not affect the memory savings between the two.
 
+Additionally, it further extends the tail page deduplication with 1GB
+device-dax compound pages.
+
+E.g.: The vmemmap of a 1G device-dax page on x86_64 consists of 4096 page
+frames, split across 8 PMD page frames, with the first PMD having 2 PTE
+page frames. In total this represents 40960 bytes per 1GB page.
+
+Here is how things look after the previously described tail page deduplication
+technique.
+
+   device-dax      page frames   struct pages(4096 pages)     page frame(2 pages)
+ +-----------+ -> +----------+ --> +-----------+   mapping to   +-------------+
+ |           |    |    0     |     |     0     | -------------> |      0      |
+ |           |    +----------+     +-----------+                +-------------+
+ |           |                     |     1     | -------------> |      1      |
+ |           |                     +-----------+                +-------------+
+ |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^ ^
+ |           |                     +-----------+                   | | | | | |
+ |           |                     |     3     | ------------------+ | | | | |
+ |           |                     +-----------+                     | | | | |
+ |           |                     |     4     | --------------------+ | | | |
+ |   PMD 0   |                     +-----------+                       | | | |
+ |           |                     |     5     | ----------------------+ | | |
+ |           |                     +-----------+                         | | |
+ |           |                     |     ..    | ------------------------+ | |
+ |           |                     +-----------+                           | |
+ |           |                     |     511   | --------------------------+ |
+ |           |                     +-----------+                             |
+ |           |                                                               |
+ |           |                                                               |
+ |           |                                                               |
+ +-----------+     page frames                                               |
+ +-----------+ -> +----------+ --> +-----------+    mapping to               |
+ |           |    |  1 .. 7  |     |    512    | ----------------------------+
+ |           |    +----------+     +-----------+                             |
+ |           |                     |    ..     | ----------------------------+
+ |           |                     +-----------+                             |
+ |           |                     |    ..     | ----------------------------+
+ |           |                     +-----------+                             |
+ |           |                     |    ..     | ----------------------------+
+ |           |                     +-----------+                             |
+ |           |                     |    ..     | ----------------------------+
+ |    PMD    |                     +-----------+                             |
+ |  1 .. 7   |                     |    ..     | ----------------------------+
+ |           |                     +-----------+                             |
+ |           |                     |    ..     | ----------------------------+
+ |           |                     +-----------+                             |
+ |           |                     |    4095   | ----------------------------+
+ +-----------+                     +-----------+
+
+The page frames of PMDs 1 through 7 are allocated and mapped to the same PTE page
+frame that stores the tail pages. As we can see in the diagram, PMDs 1 through 7
+all look the same. Therefore we can map PMDs 2 through 7 to PMD 1's page frame.
+This allows freeing 6 vmemmap pages per 1GB page, decreasing the overhead per
+1GB page from 40960 bytes to 16384 bytes.
+
+Here is how things look after PMD tail page deduplication.
+
+   device-dax      page frames   struct pages(4096 pages)     page frame(2 pages)
+ +-----------+ -> +----------+ --> +-----------+   mapping to   +-------------+
+ |           |    |    0     |     |     0     | -------------> |      0      |
+ |           |    +----------+     +-----------+                +-------------+
+ |           |                     |     1     | -------------> |      1      |
+ |           |                     +-----------+                +-------------+
+ |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^ ^
+ |           |                     +-----------+                   | | | | | |
+ |           |                     |     3     | ------------------+ | | | | |
+ |           |                     +-----------+                     | | | | |
+ |           |                     |     4     | --------------------+ | | | |
+ |   PMD 0   |                     +-----------+                       | | | |
+ |           |                     |     5     | ----------------------+ | | |
+ |           |                     +-----------+                         | | |
+ |           |                     |     ..    | ------------------------+ | |
+ |           |                     +-----------+                           | |
+ |           |                     |     511   | --------------------------+ |
+ |           |                     +-----------+                             |
+ |           |                                                               |
+ |           |                                                               |
+ |           |                                                               |
+ +-----------+     page frames                                               |
+ +-----------+ -> +----------+ --> +-----------+    mapping to               |
+ |           |    |    1     |     |    512    | ----------------------------+
+ |           |    +----------+     +-----------+                             |
+ |           |     ^ ^ ^ ^ ^ ^     |    ..     | ----------------------------+
+ |           |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    ..     | ----------------------------+
+ |           |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    ..     | ----------------------------+
+ |           |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    ..     | ----------------------------+
+ |   PMD 1   |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    ..     | ----------------------------+
+ |           |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    ..     | ----------------------------+
+ |           |     | | | | | |     +-----------+                             |
+ |           |     | | | | | |     |    4095   | ----------------------------+
+ +-----------+     | | | | | |     +-----------+
+ |   PMD 2   | ----+ | | | | |
+ +-----------+       | | | | |
+ |   PMD 3   | ------+ | | | |
+ +-----------+         | | | |
+ |   PMD 4   | --------+ | | |
+ +-----------+           | | |
+ |   PMD 5   | ----------+ | |
+ +-----------+             | |
+ |   PMD 6   | ------------+ |
+ +-----------+               |
+ |   PMD 7   | --------------+
+ +-----------+
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77eaeae497f9..ff0f7d40c6e6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3172,7 +3172,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
-pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
+pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+			    struct page *block);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
 			    struct vmem_altmap *altmap, struct page *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 441bb95edd68..dc3a137ec768 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -537,13 +537,22 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
 	return p;
 }
 
-pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
+pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+				       struct page *block)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
-		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
-		if (!p)
-			return NULL;
+		void *p;
+
+		if (!block) {
+			p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
+			if (!p)
+				return NULL;
+		} else {
+			/* See comment in vmemmap_pte_populate(). */
+			get_page(block);
+			p = page_to_virt(block);
+		}
 		pmd_populate_kernel(&init_mm, pmd, p);
 	}
 	return pmd;
@@ -585,15 +594,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-static int __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap,
-					      struct page *reuse, struct page **page)
+static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
+						  struct vmem_altmap *altmap,
+						  struct page *reuse, pmd_t **ptr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte;
 
 	pgd = vmemmap_pgd_populate(addr, node);
 	if (!pgd)
@@ -604,9 +612,26 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pud = vmemmap_pud_populate(p4d, addr, node);
 	if (!pud)
 		return -ENOMEM;
-	pmd = vmemmap_pmd_populate(pud, addr, node);
+	pmd = vmemmap_pmd_populate(pud, addr, node, reuse);
 	if (!pmd)
 		return -ENOMEM;
+	if (ptr)
+		*ptr = pmd;
+	return 0;
+}
+
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+					      struct vmem_altmap *altmap,
+					      struct page *reuse, struct page **page)
+{
+	pmd_t *pmd;
+	pte_t *pte;
+	int rc;
+
+	rc = vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd);
+	if (rc)
+		return rc;
+
 	pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
 	if (!pte)
 		return -ENOMEM;
@@ -654,6 +679,22 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
 	return vmemmap_populate_address(addr, node, NULL, NULL, page);
 }
 
+static int __meminit vmemmap_populate_pmd_range(unsigned long start,
+						unsigned long end,
+						int node, struct page *page)
+{
+	unsigned long addr = start;
+	int rc;
+
+	for (; addr < end; addr += PMD_SIZE) {
+		rc = vmemmap_populate_pmd_address(addr, node, NULL, page, NULL);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
 /*
  * For compound pages bigger than section size (e.g. x86 1G compound
  * pages with 2M subsection size) fill the rest of sections as tail
@@ -665,13 +706,14 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
  * being onlined here.
  */
 static bool __meminit reuse_compound_section(unsigned long start_pfn,
-					     struct dev_pagemap *pgmap)
+					     struct dev_pagemap *pgmap,
+					     unsigned long *offset)
 {
 	unsigned long geometry = pgmap_geometry(pgmap);
-	unsigned long offset = start_pfn -
-		PHYS_PFN(pgmap->ranges[pgmap->nr_range].start);
 
-	return !IS_ALIGNED(offset, geometry) && geometry > PAGES_PER_SUBSECTION;
+	*offset = start_pfn - PHYS_PFN(pgmap->ranges[pgmap->nr_range].start);
+
+	return !IS_ALIGNED(*offset, geometry) && geometry > PAGES_PER_SUBSECTION;
 }
 
 static struct page * __meminit compound_section_tail_page(unsigned long addr)
@@ -691,21 +733,55 @@ static struct page * __meminit compound_section_tail_page(unsigned long addr)
 	return pte_page(*ptep);
 }
 
+static struct page * __meminit compound_section_tail_huge_page(unsigned long addr,
+				unsigned long offset, struct dev_pagemap *pgmap)
+{
+	pmd_t *pmdp;
+
+	addr -= PAGE_SIZE;
+
+	/*
+	 * Assuming sections are populated sequentially, the previous section's
+	 * page data can be reused.
+	 */
+	pmdp = pmd_off_k(addr);
+	if (!pmdp)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * Reuse the tail pages vmemmap pmd page
+	 * See layout diagram in Documentation/vm/vmemmap_dedup.rst
+	 */
+	if (offset % pgmap_geometry(pgmap) > PAGES_PER_SECTION)
+		return pmd_page(*pmdp);
+
+	/* No reusable PMD, fallback to PTE tail page */
+	return NULL;
+}
+
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 						     unsigned long start,
 						     unsigned long end, int node,
 						     struct dev_pagemap *pgmap)
 {
-	unsigned long size, addr;
+	unsigned long offset, size, addr;
 
-	if (reuse_compound_section(start_pfn, pgmap)) {
-		struct page *page;
+	if (reuse_compound_section(start_pfn, pgmap, &offset)) {
+		struct page *page, *hpage;
+
+		hpage = compound_section_tail_huge_page(start, offset, pgmap);
+		if (IS_ERR(hpage))
+			return -ENOMEM;
+		if (hpage)
+			return vmemmap_populate_pmd_range(start, end, node,
+							  hpage);
 
 		page = compound_section_tail_page(start);
 		if (!page)
 			return -ENOMEM;
 
 		/*
+		 * Populate the tail pages vmemmap pmd page.
 		 * Reuse the page that was populated in the prior iteration
 		 * with just tail struct pages.
 		 */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-08-27 14:58 ` [PATCH v4 04/14] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
@ 2021-08-27 15:33   ` Christoph Hellwig
  2021-08-27 16:00     ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2021-08-27 15:33 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe, John Hubbard,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, Christoph Hellwig, nvdimm, linux-doc

On Fri, Aug 27, 2021 at 03:58:09PM +0100, Joao Martins wrote:
> + * @geometry: structural definition of how the vmemmap metadata is populated.
> + *	A zero or 1 defaults to using base pages as the memmap metadata
> + *	representation. A bigger value will set up compound struct pages
> + *	representative of the requested geometry size.
>   * @ops: method table
>   * @owner: an opaque pointer identifying the entity that manages this
>   *	instance.  Used by various helpers to make sure that no
> @@ -114,6 +118,7 @@ struct dev_pagemap {
>  	struct completion done;
>  	enum memory_type type;
>  	unsigned int flags;
> +	unsigned long geometry;

So why not make this a shift as I suggested somewhere deep in the
last thread?  Also geometry sounds a bit strange, even if I can't really
offer anything better offhand.

> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
> +{
> +	if (pgmap && pgmap->geometry)
> +		return pgmap->geometry;

Why does this need to support a NULL pgmap?

> +static void __ref memmap_init_compound(struct page *head, unsigned long head_pfn,

Overly long line.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-08-27 15:33   ` Christoph Hellwig
@ 2021-08-27 16:00     ` Joao Martins
  2021-09-01  9:44       ` Christoph Hellwig
  0 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2021-08-27 16:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe, John Hubbard,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, nvdimm, linux-doc

On 8/27/21 4:33 PM, Christoph Hellwig wrote:
> On Fri, Aug 27, 2021 at 03:58:09PM +0100, Joao Martins wrote:
>> + * @geometry: structural definition of how the vmemmap metadata is populated.
>> + *	A zero or 1 defaults to using base pages as the memmap metadata
>> + *	representation. A bigger value will set up compound struct pages
>> + *	representative of the requested geometry size.
>>   * @ops: method table
>>   * @owner: an opaque pointer identifying the entity that manages this
>>   *	instance.  Used by various helpers to make sure that no
>> @@ -114,6 +118,7 @@ struct dev_pagemap {
>>  	struct completion done;
>>  	enum memory_type type;
>>  	unsigned int flags;
>> +	unsigned long geometry;
> 
> So why not make this a shift as I suggested somewhere deep in the
> last thread? 

So what you previously suggested had the check for pgmap_geometry() > PAGE_SIZE,
but because pgmap_geometry() returns 1 by default, the pfn_end() - pfn_first()
computation wouldn't change for those that don't request a geometry.

So rather than this, which you initially suggested:

	refs = pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id);
	if (pgmap_geometry(pgmap) > 1)
		refs /= pgmap_geometry(pgmap);

I would turn it into this, because for users which don't select a geometry it's
now always a division by 1:

	refs = pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id);
	refs /= pgmap_geometry(pgmap);

So it felt simpler to just do it inline right away when calling percpu_ref_get_many():
	
	(pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id)) / pgmap_geometry(pgmap);

I can switch to a shift if you prefer:

	(pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id))
		>> pgmap_geometry_order(pgmap);
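
(For reference, the order helper here would be something new along these
lines -- pgmap_geometry_order() doesn't exist in this series yet, it's
just the shape I have in mind:)

static inline unsigned int pgmap_geometry_order(struct dev_pagemap *pgmap)
{
	/* @geometry is a power-of-2 number of pages; 1 (base pages) -> order 0 */
	return order_base_2(pgmap_geometry(pgmap));
}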

> Also geometry sounds a bit strange, even if I can't really
> offer anything better offhand.
> 
We started with @align (like in the device-dax core), and then we switched
to @geometry because these are slightly different things: one relates
to the vmemmap metadata structure (number of pages) and the other is how
the mmap is aligned to a page size. I couldn't suggest anything else,
besides a more verbose name like vmemmap_align maybe.

>> +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
>> +{
>> +	if (pgmap && pgmap->geometry)
>> +		return pgmap->geometry;
> 
> Why does this need to support a NULL pgmap?
> 
This was mainly a defensive choice, similar to put_dev_pagemap(). There's
only one caller that can pass a NULL pgmap, which is populate_section_memmap()
with an altmap but without a pgmap.

>> +static void __ref memmap_init_compound(struct page *head, unsigned long head_pfn,
> 
> Overly long line.
> 
Fixed.

Interesting that checkpatch doesn't complain about a line that is one character past 80 columns.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-08-27 14:58 ` [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages Joao Martins
@ 2021-08-27 16:25   ` Jason Gunthorpe
  2021-08-27 18:34     ` Joao Martins
  2021-10-08 11:54   ` Jason Gunthorpe
  1 sibling, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2021-08-27 16:25 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On Fri, Aug 27, 2021 at 03:58:13PM +0100, Joao Martins wrote:

>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>  			     unsigned long end, unsigned int flags,
>  			     struct page **pages, int *nr)
>  {
> -	int nr_start = *nr;
> +	int refs, nr_start = *nr;
>  	struct dev_pagemap *pgmap = NULL;
>  	int ret = 1;
>  
>  	do {
> -		struct page *page = pfn_to_page(pfn);
> +		struct page *head, *page = pfn_to_page(pfn);
> +		unsigned long next = addr + PAGE_SIZE;
>  
>  		pgmap = get_dev_pagemap(pfn, pgmap);
>  		if (unlikely(!pgmap)) {
> @@ -2252,16 +2265,25 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>  			ret = 0;
>  			break;
>  		}
> -		SetPageReferenced(page);
> -		pages[*nr] = page;
> -		if (unlikely(!try_grab_page(page, flags))) {
> -			undo_dev_pagemap(nr, nr_start, flags, pages);
> +
> +		head = compound_head(page);
> +		/* @end is assumed to be limited at most one compound page */
> +		if (PageHead(head))
> +			next = end;
> +		refs = record_subpages(page, addr, next, pages + *nr);
> +
> +		SetPageReferenced(head);
> +		if (unlikely(!try_grab_compound_head(head, refs, flags))) {
> +			if (PageHead(head))
> +				ClearPageReferenced(head);
> +			else
> +				undo_dev_pagemap(nr, nr_start, flags, pages);
>  			ret = 0;
>  			break;

Why is this special cased for devmap?

Shouldn't everything processing pud/pmds/etc use the same basic loop
that is similar in idea to the 'for_each_compound_head' scheme in
unpin_user_pages_dirty_lock()?

Doesn't that work for all the special page type cases here?

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-08-27 16:25   ` Jason Gunthorpe
@ 2021-08-27 18:34     ` Joao Martins
  2021-08-30 13:07       ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2021-08-27 18:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On 8/27/21 5:25 PM, Jason Gunthorpe wrote:
> On Fri, Aug 27, 2021 at 03:58:13PM +0100, Joao Martins wrote:
> 
>>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>  			     unsigned long end, unsigned int flags,
>>  			     struct page **pages, int *nr)
>>  {
>> -	int nr_start = *nr;
>> +	int refs, nr_start = *nr;
>>  	struct dev_pagemap *pgmap = NULL;
>>  	int ret = 1;
>>  
>>  	do {
>> -		struct page *page = pfn_to_page(pfn);
>> +		struct page *head, *page = pfn_to_page(pfn);
>> +		unsigned long next = addr + PAGE_SIZE;
>>  
>>  		pgmap = get_dev_pagemap(pfn, pgmap);
>>  		if (unlikely(!pgmap)) {
>> @@ -2252,16 +2265,25 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>  			ret = 0;
>>  			break;
>>  		}
>> -		SetPageReferenced(page);
>> -		pages[*nr] = page;
>> -		if (unlikely(!try_grab_page(page, flags))) {
>> -			undo_dev_pagemap(nr, nr_start, flags, pages);
>> +
>> +		head = compound_head(page);
>> +		/* @end is assumed to be limited at most one compound page */
>> +		if (PageHead(head))
>> +			next = end;
>> +		refs = record_subpages(page, addr, next, pages + *nr);
>> +
>> +		SetPageReferenced(head);
>> +		if (unlikely(!try_grab_compound_head(head, refs, flags))) {
>> +			if (PageHead(head))
>> +				ClearPageReferenced(head);
>> +			else
>> +				undo_dev_pagemap(nr, nr_start, flags, pages);
>>  			ret = 0;
>>  			break;
> 
> Why is this special cased for devmap?
> 
> Shouldn't everything processing pud/pmds/etc use the same basic loop
> that is similar in idea to the 'for_each_compound_head' scheme in
> unpin_user_pages_dirty_lock()?
> 
> Doesn't that work for all the special page type cases here?

We are iterating over PFNs to create an array of base pages (regardless of page table
type), rather than iterating over an array of pages to work on. Given that all these gup
functions already give you the boundary (end of pmd or end of pud, etc), we just need
to grab the ref to the pgmap and the head page and save the tails. But sadly we need to
handle the base page case, which is why this outer loop exists. If it was just head
pages we wouldn't need the outer loop (and hence no for_each_compound_head, like the
hugetlb variant).

But maybe I am being dense and you just mean to replace the outer loop with
for_each_compound_range(). I am a little stuck on the fact that I still need to record
the tail pages when iterating over the (single) head page. So I feel that it wouldn't
bring that much improvement, unless I missed your point.

	Joao

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-08-27 18:34     ` Joao Martins
@ 2021-08-30 13:07       ` Jason Gunthorpe
  2021-08-31 12:34         ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2021-08-30 13:07 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On Fri, Aug 27, 2021 at 07:34:54PM +0100, Joao Martins wrote:
> On 8/27/21 5:25 PM, Jason Gunthorpe wrote:
> > On Fri, Aug 27, 2021 at 03:58:13PM +0100, Joao Martins wrote:
> > 
> >>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> >>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> >>  			     unsigned long end, unsigned int flags,
> >>  			     struct page **pages, int *nr)
> >>  {
> >> -	int nr_start = *nr;
> >> +	int refs, nr_start = *nr;
> >>  	struct dev_pagemap *pgmap = NULL;
> >>  	int ret = 1;
> >>  
> >>  	do {
> >> -		struct page *page = pfn_to_page(pfn);
> >> +		struct page *head, *page = pfn_to_page(pfn);
> >> +		unsigned long next = addr + PAGE_SIZE;
> >>  
> >>  		pgmap = get_dev_pagemap(pfn, pgmap);
> >>  		if (unlikely(!pgmap)) {
> >> @@ -2252,16 +2265,25 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> >>  			ret = 0;
> >>  			break;
> >>  		}
> >> -		SetPageReferenced(page);
> >> -		pages[*nr] = page;
> >> -		if (unlikely(!try_grab_page(page, flags))) {
> >> -			undo_dev_pagemap(nr, nr_start, flags, pages);
> >> +
> >> +		head = compound_head(page);
> >> +		/* @end is assumed to be limited at most one compound page */
> >> +		if (PageHead(head))
> >> +			next = end;
> >> +		refs = record_subpages(page, addr, next, pages + *nr);
> >> +
> >> +		SetPageReferenced(head);
> >> +		if (unlikely(!try_grab_compound_head(head, refs, flags))) {
> >> +			if (PageHead(head))
> >> +				ClearPageReferenced(head);
> >> +			else
> >> +				undo_dev_pagemap(nr, nr_start, flags, pages);
> >>  			ret = 0;
> >>  			break;
> > 
> > Why is this special cased for devmap?
> > 
> > Shouldn't everything processing pud/pmds/etc use the same basic loop
> > that is similar in idea to the 'for_each_compound_head' scheme in
> > unpin_user_pages_dirty_lock()?
> > 
> > Doesn't that work for all the special page type cases here?
> 
> We are iterating over PFNs to create an array of base pages (regardless of page table
> type), rather than iterating over an array of pages to work on. 

That is part of it, yes, but the slow bit here is to minimally find
the head pages and do the atomics on them, much like the
unpin_user_pages_dirty_lock()

I would think this should be designed similar to how things work on
the unpin side.

Sweep the page tables to find a proper start/end - eg even if a
compound is spread across multiple pte/pmd/pud/etc we should find a
linear range of starting PFN (ie starting page*) and npages across as
much of the page tables as we can manage. This is the same as where
things end up in the unpin case where all the contiguous PFNs are
grouped together into a range.

Then 'assign' that range to the output array which requires walking
over each compound_head in the range and pinning it, then writing out
the tail pages to the output struct page array.

And this approach should apply universally no matter what is under the
pte's - ie huge pages, THPs and devmaps should all be treated the same
way. Currently each case is different, like above which is unique to
device_huge.

The more we can process groups of pages in bulk, the faster the whole
thing will be.
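
In rough, untested pseudo-code, the flush step for one accumulated range
would be something like the below (helper names as in today's gup.c,
dev_pagemap handling and error unwinding left to the caller):

/* Pin an already-accumulated, physically contiguous pfn range. */
static unsigned long pin_pfn_range(unsigned long pfn, unsigned long npages,
				   unsigned int flags, struct page **pages)
{
	unsigned long done = 0;

	while (done != npages) {
		struct page *page = pfn_to_page(pfn + done);
		struct page *head = compound_head(page);
		unsigned long i, refs = min_t(unsigned long, npages - done,
					compound_nr(head) - (page - head));

		/* one atomic per compound page covers all of its subpages */
		if (!try_grab_compound_head(head, refs, flags))
			break;		/* caller unwinds whatever was pinned */

		for (i = 0; i < refs; i++)
			pages[done + i] = page + i;
		done += refs;
	}
	return done;
}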

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-08-30 13:07       ` Jason Gunthorpe
@ 2021-08-31 12:34         ` Joao Martins
  2021-08-31 17:05           ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2021-08-31 12:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On 8/30/21 2:07 PM, Jason Gunthorpe wrote:
> On Fri, Aug 27, 2021 at 07:34:54PM +0100, Joao Martins wrote:
>> On 8/27/21 5:25 PM, Jason Gunthorpe wrote:
>>> On Fri, Aug 27, 2021 at 03:58:13PM +0100, Joao Martins wrote:
>>>
>>>>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>>>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>>>  			     unsigned long end, unsigned int flags,
>>>>  			     struct page **pages, int *nr)
>>>>  {
>>>> -	int nr_start = *nr;
>>>> +	int refs, nr_start = *nr;
>>>>  	struct dev_pagemap *pgmap = NULL;
>>>>  	int ret = 1;
>>>>  
>>>>  	do {
>>>> -		struct page *page = pfn_to_page(pfn);
>>>> +		struct page *head, *page = pfn_to_page(pfn);
>>>> +		unsigned long next = addr + PAGE_SIZE;
>>>>  
>>>>  		pgmap = get_dev_pagemap(pfn, pgmap);
>>>>  		if (unlikely(!pgmap)) {
>>>> @@ -2252,16 +2265,25 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>>>  			ret = 0;
>>>>  			break;
>>>>  		}
>>>> -		SetPageReferenced(page);
>>>> -		pages[*nr] = page;
>>>> -		if (unlikely(!try_grab_page(page, flags))) {
>>>> -			undo_dev_pagemap(nr, nr_start, flags, pages);
>>>> +
>>>> +		head = compound_head(page);
>>>> +		/* @end is assumed to be limited at most one compound page */
>>>> +		if (PageHead(head))
>>>> +			next = end;
>>>> +		refs = record_subpages(page, addr, next, pages + *nr);
>>>> +
>>>> +		SetPageReferenced(head);
>>>> +		if (unlikely(!try_grab_compound_head(head, refs, flags))) {
>>>> +			if (PageHead(head))
>>>> +				ClearPageReferenced(head);
>>>> +			else
>>>> +				undo_dev_pagemap(nr, nr_start, flags, pages);
>>>>  			ret = 0;
>>>>  			break;
>>>
>>> Why is this special cased for devmap?
>>>
>>> Shouldn't everything processing pud/pmds/etc use the same basic loop
>>> that is similar in idea to the 'for_each_compound_head' scheme in
>>> unpin_user_pages_dirty_lock()?
>>>
>>> Doesn't that work for all the special page type cases here?
>>
>> We are iterating over PFNs to create an array of base pages (regardless of page table
>> type), rather than iterating over an array of pages to work on. 
> 
> That is part of it, yes, but the slow bit here is to minimally find
> the head pages and do the atomics on them, much like the
> unpin_user_pages_dirty_lock()
> 
> I would think this should be designed similar to how things work on
> the unpin side.
> 
I don't think it's the same thing. The bit you say 'minimally find the
head pages' carries a visible overhead in unpin_user_pages() as we are
checking whether each of the pages belongs to the same head page -- because you
can pass an arbitrary set of pages. This does have a cost which is not
in gup-fast right now AIUI. Whereas in our gup-fast 'handler' you
already know that you are processing a contiguous chunk of pages.
If anything, we are closer to unpin_user_page_range*()
than unpin_user_pages().

> Sweep the page tables to find a proper start/end - eg even if a
> compound is spread across multiple pte/pmd/pud/etc we should find a
> linear range of starting PFN (ie starting page*) and npages across as
> much of the page tables as we can manage. This is the same as where
> things end up in the unpin case where all the contiguous PFNs are
> grouped together into a range.
> 
> Then 'assign' that range to the output array which requires walking
> over each compound_head in the range and pinning it, then writing out
> the tail pages to the output struct page array.
> 
> And this approach should apply universally no matter what is under the
> pte's - ie huge pages, THPs and devmaps should all be treated the same
> way. Currently each case is different, like above which is unique to
> device_huge.
> 
Only devmap gup-fast is different IIUC.

Switching to similar iteration logic to unpin would look something like
this (still untested):

        for_each_compound_range(index, &page, npages, head, refs) {
                pgmap = get_dev_pagemap(pfn + *nr, pgmap);
                if (unlikely(!pgmap)) {
                        undo_dev_pagemap(nr, nr_start, flags, pages);
                        ret = 0;
                        break;
                }

                SetPageReferenced(head);
                if (unlikely(!try_grab_compound_head(head, refs, flags))) {
                        if (PageHead(head))
                                ClearPageReferenced(head);
                        else
                                undo_dev_pagemap(nr, nr_start, flags, pages);
                        ret = 0;
                        break;
                }

                record_subpages(page + *nr, addr,
                                addr + (refs << PAGE_SHIFT), pages + *nr);
                *nr += refs;
                addr += (refs << PAGE_SHIFT);
        }


But it looks to be a tad more complex and doesn't really align with the
rest of gup-fast.

All in all, I am dealing with the fact that 1) devmap pmds/puds may not
be represented with compound pages and 2) we temporarily grab a dev_pagemap reference
prior to pinning the page. Those two items are what make this different from THPs/HugeTLB
(which otherwise share the same logic). And that is what led me to *slightly* improve
gup_device_huge().

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-08-31 12:34         ` Joao Martins
@ 2021-08-31 17:05           ` Jason Gunthorpe
  2021-09-23 16:51             ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2021-08-31 17:05 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On Tue, Aug 31, 2021 at 01:34:04PM +0100, Joao Martins wrote:
> On 8/30/21 2:07 PM, Jason Gunthorpe wrote:
> > On Fri, Aug 27, 2021 at 07:34:54PM +0100, Joao Martins wrote:
> >> On 8/27/21 5:25 PM, Jason Gunthorpe wrote:
> >>> On Fri, Aug 27, 2021 at 03:58:13PM +0100, Joao Martins wrote:
> >>>
> >>>>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> >>>>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> >>>>  			     unsigned long end, unsigned int flags,
> >>>>  			     struct page **pages, int *nr)
> >>>>  {
> >>>> -	int nr_start = *nr;
> >>>> +	int refs, nr_start = *nr;
> >>>>  	struct dev_pagemap *pgmap = NULL;
> >>>>  	int ret = 1;
> >>>>  
> >>>>  	do {
> >>>> -		struct page *page = pfn_to_page(pfn);
> >>>> +		struct page *head, *page = pfn_to_page(pfn);
> >>>> +		unsigned long next = addr + PAGE_SIZE;
> >>>>  
> >>>>  		pgmap = get_dev_pagemap(pfn, pgmap);
> >>>>  		if (unlikely(!pgmap)) {
> >>>> @@ -2252,16 +2265,25 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> >>>>  			ret = 0;
> >>>>  			break;
> >>>>  		}
> >>>> -		SetPageReferenced(page);
> >>>> -		pages[*nr] = page;
> >>>> -		if (unlikely(!try_grab_page(page, flags))) {
> >>>> -			undo_dev_pagemap(nr, nr_start, flags, pages);
> >>>> +
> >>>> +		head = compound_head(page);
> >>>> +		/* @end is assumed to be limited at most one compound page */
> >>>> +		if (PageHead(head))
> >>>> +			next = end;
> >>>> +		refs = record_subpages(page, addr, next, pages + *nr);
> >>>> +
> >>>> +		SetPageReferenced(head);
> >>>> +		if (unlikely(!try_grab_compound_head(head, refs, flags))) {
> >>>> +			if (PageHead(head))
> >>>> +				ClearPageReferenced(head);
> >>>> +			else
> >>>> +				undo_dev_pagemap(nr, nr_start, flags, pages);
> >>>>  			ret = 0;
> >>>>  			break;
> >>>
> >>> Why is this special cased for devmap?
> >>>
> >>> Shouldn't everything processing pud/pmds/etc use the same basic loop
> >>> that is similar in idea to the 'for_each_compound_head' scheme in
> >>> unpin_user_pages_dirty_lock()?
> >>>
> >>> Doesn't that work for all the special page type cases here?
> >>
> >> We are iterating over PFNs to create an array of base pages (regardless of page table
> >> type), rather than iterating over an array of pages to work on. 
> > 
> > That is part of it, yes, but the slow bit here is to minimally find
> > the head pages and do the atomics on them, much like the
> > unpin_user_pages_dirty_lock()
> > 
> > I would think this should be designed similar to how things work on
> > the unpin side.
> > 
> I don't think it's the same thing. The bit you say 'minimally find the
> head pages' carries a visible overhead in unpin_user_pages() as we are
> checking each of the pages belongs to the same head page -- because you
> can pass an arbitrary set of pages. This does have a cost which is not
> in gup-fast right now AIUI. Whereas in our gup-fast 'handler' you
> already know that you are processing a contiguous chunk of pages.
> If anything, we are closer to unpin_user_page_range*()
> than unpin_user_pages().

Yes, that is what I mean, it is very similar to the range case as we
don't even know that a single compound spans a pud/pmd. So you end up
doing the same loop to find the compound boundaries.

Under GUP slow we can also aggregate multiple page table entries, eg a
split huge page could be processed as a single 2M range operation even
if it is broken into 4K PTEs.

> 
> Switching to similar iteration logic to unpin would look something like
> this (still untested):
> 
>         for_each_compound_range(index, &page, npages, head, refs) {
>                 pgmap = get_dev_pagemap(pfn + *nr, pgmap);

I recall talking to DanW about this and we agreed it was unnecessary
here to hold the pgmap and should be deleted.

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-08-27 16:00     ` Joao Martins
@ 2021-09-01  9:44       ` Christoph Hellwig
  2021-09-09  9:38         ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2021-09-01  9:44 UTC (permalink / raw)
  To: Joao Martins
  Cc: Christoph Hellwig, linux-mm, Dan Williams, Vishal Verma,
	Dave Jiang, Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe,
	John Hubbard, Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, nvdimm, linux-doc

On Fri, Aug 27, 2021 at 05:00:11PM +0100, Joao Martins wrote:
> So it felt simpler to just do it inline right away when calling percpu_ref_get_many():
> 	
> 	(pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id)) / pgmap_geometry(pgmap);
> 
> I can switch to a shift if you prefer:
> 
> 	(pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id))
> 		>> pgmap_geometry_order(pgmap);

Yes.  A shift is less overhead than a branch.

> > Also geometry sounds a bit strange, even if I can't really
> > offer anything better offhand.
> > 
> We started with @align (like in device dax core), and then we switched
> to @geometry because these are slightly different things (one relates
> to vmemmap metadata structure (number of pages) and the other is how
> the mmap is aligned to a page size. I couldn't suggest anything else,
> besides a more verbose name like vmemmap_align maybe.

It for sure isn't an alignment.  I think the term that comes closest
is a granularity.  But something like vmemmap_shift if switching to
a shift might be descriptive enough for the struct member name.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-09-01  9:44       ` Christoph Hellwig
@ 2021-09-09  9:38         ` Joao Martins
  0 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2021-09-09  9:38 UTC (permalink / raw)
  To: Christoph Hellwig, Dan Williams
  Cc: linux-mm, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	nvdimm, linux-doc

On 9/1/21 10:44 AM, Christoph Hellwig wrote:
> On Fri, Aug 27, 2021 at 05:00:11PM +0100, Joao Martins wrote:
>> So it felt simpler to just do it inline right away when calling percpu_ref_get_many():
>> 	
>> 	(pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id)) / pgmap_geometry(pgmap);
>>
>> I can switch to a shift if you prefer:
>>
>> 	(pfn_end(pgmap, range_id) - pfn_first(pgmap, range_id))
>> 		>> pgmap_geometry_order(pgmap);
> 
> Yes.  A shift is less overhead than a branch.
> 
OK.

But FWIW, the overhead of this shift or branch doesn't matter much
compared to the rest of memremap_pages().

>>> Also geometry sounds a bit strange, even if I can't really
>>> offer anything better offhand.
>>>
>> We started with @align (like in device dax core), and then we switched
>> to @geometry because these are slightly different things (one relates
>> to vmemmap metadata structure (number of pages) and the other is how
>> the mmap is aligned to a page size. I couldn't suggest anything else,
>> besides a more verbose name like vmemmap_align maybe.
> 
> It for sure isn't an alignment.  I think the term that comes closest
> is a granularity.  But something like vmemmap_shift if switching to
> a shift might be descriptive enough for the struct member name.
> 
Hmmm, it would be a 'shift related to vmemmap' in the literal sense, but
while descriptive of the field it doesn't say much otherwise. 'geometry'
is probably more widely used as a block device term, which is likely why
Dan suggested it. Alternatively, vmemmap_compound_nr / vmemmap_npages (if it
represents a number of pages per compound page) or vmemmap_order (which is
clearer about what it is than vmemmap_shift?). Changing to an 'order'/'shift'
representation also gets the 'must be a power of 2' enforcement for free.
And compound pages IIUC always represent a power-of-2 number of pages.
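
To make that concrete, with a shift/order-based field the derived values
become trivial to express. A sketch (field and helper names here are just
the candidates above, nothing settled):

/* if @geometry were replaced by e.g. @vmemmap_shift (order of the compound page) */
static inline unsigned long pgmap_vmemmap_nr(struct dev_pagemap *pgmap)
{
	return 1UL << pgmap->vmemmap_shift;
}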

	Joao

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-08-31 17:05           ` Jason Gunthorpe
@ 2021-09-23 16:51             ` Joao Martins
  2021-09-28 18:01               ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2021-09-23 16:51 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: linux-mm, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, John Hubbard, Jane Chu, Muchun Song,
	Mike Kravetz, Andrew Morton, Jonathan Corbet, Christoph Hellwig,
	nvdimm, linux-doc

On 8/31/21 6:05 PM, Jason Gunthorpe wrote:
> On Tue, Aug 31, 2021 at 01:34:04PM +0100, Joao Martins wrote:
>> On 8/30/21 2:07 PM, Jason Gunthorpe wrote:
>>> On Fri, Aug 27, 2021 at 07:34:54PM +0100, Joao Martins wrote:
>>>> On 8/27/21 5:25 PM, Jason Gunthorpe wrote:
>>>>> On Fri, Aug 27, 2021 at 03:58:13PM +0100, Joao Martins wrote:
>>>>>
>>>>>>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>>>>>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>>>>>  			     unsigned long end, unsigned int flags,
>>>>>>  			     struct page **pages, int *nr)
>>>>>>  {
>>>>>> -	int nr_start = *nr;
>>>>>> +	int refs, nr_start = *nr;
>>>>>>  	struct dev_pagemap *pgmap = NULL;
>>>>>>  	int ret = 1;
>>>>>>  
>>>>>>  	do {
>>>>>> -		struct page *page = pfn_to_page(pfn);
>>>>>> +		struct page *head, *page = pfn_to_page(pfn);
>>>>>> +		unsigned long next = addr + PAGE_SIZE;
>>>>>>  
>>>>>>  		pgmap = get_dev_pagemap(pfn, pgmap);
>>>>>>  		if (unlikely(!pgmap)) {
>>>>>> @@ -2252,16 +2265,25 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>>>>>  			ret = 0;
>>>>>>  			break;
>>>>>>  		}
>>>>>> -		SetPageReferenced(page);
>>>>>> -		pages[*nr] = page;
>>>>>> -		if (unlikely(!try_grab_page(page, flags))) {
>>>>>> -			undo_dev_pagemap(nr, nr_start, flags, pages);
>>>>>> +
>>>>>> +		head = compound_head(page);
>>>>>> +		/* @end is assumed to be limited at most one compound page */
>>>>>> +		if (PageHead(head))
>>>>>> +			next = end;
>>>>>> +		refs = record_subpages(page, addr, next, pages + *nr);
>>>>>> +
>>>>>> +		SetPageReferenced(head);
>>>>>> +		if (unlikely(!try_grab_compound_head(head, refs, flags))) {
>>>>>> +			if (PageHead(head))
>>>>>> +				ClearPageReferenced(head);
>>>>>> +			else
>>>>>> +				undo_dev_pagemap(nr, nr_start, flags, pages);
>>>>>>  			ret = 0;
>>>>>>  			break;
>>>>>
>>>>> Why is this special cased for devmap?
>>>>>
>>>>> Shouldn't everything processing pud/pmds/etc use the same basic loop
>>>>> that is similar in idea to the 'for_each_compound_head' scheme in
>>>>> unpin_user_pages_dirty_lock()?
>>>>>
>>>>> Doesn't that work for all the special page type cases here?
>>>>
>>>> We are iterating over PFNs to create an array of base pages (regardless of page table
>>>> type), rather than iterating over an array of pages to work on. 
>>>
>>> That is part of it, yes, but the slow bit here is to minimally find
>>> the head pages and do the atomics on them, much like the
>>> unpin_user_pages_dirty_lock()
>>>
>>> I would think this should be designed similar to how things work on
>>> the unpin side.
>>>
>> I don't think it's the same thing. The bit you say 'minimally find the
>> head pages' carries a visible overhead in unpin_user_pages() as we are
>> checking each of the pages belongs to the same head page -- because you
>> can pass an arbitrary set of pages. This does have a cost which is not
>> in gup-fast right now AIUI. Whereas in our gup-fast 'handler' you
>> already know that you are processing a contiguous chunk of pages.
>> If anything, we are closer to unpin_user_page_range*()
>> than unpin_user_pages().
> 
> Yes, that is what I mean, it is very similar to the range case as we
> don't even know that a single compound spans a pud/pmd. So you end up
> doing the same loop to find the compound boundaries.
> 
> Under GUP slow we can also aggregate multiple page table entries, eg a
> split huge page could be processed as a single 2M range operation even
> if it is broken into 4K PTEs.

/me nods

FWIW, I have a follow-up patch pursuing similar optimization (to fix
gup-slow case) that I need to put in better shape -- I probably won't wait
until this series is done contrary to what the cover letter says.

>> Switching to similar iteration logic to unpin would look something like
>> this (still untested):
>>
>>         for_each_compound_range(index, &page, npages, head, refs) {
>>                 pgmap = get_dev_pagemap(pfn + *nr, pgmap);
> 
> I recall talking to DanW about this and we agreed it was unnecessary
> here to hold the pgmap and should be deleted.

Yeap, I remember that conversation[0]. It was a long time ago, and I am
not sure what progress was made there since the last posting? Dan, any
thoughts there?

[0]
https://lore.kernel.org/linux-mm/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com/

So ... if pgmap accounting was removed from gup-fast then this patch
would be a lot simpler and we could perhaps just fall back to the regular
hugepage case (THP, HugeTLB) like your suggestion at the top. See below
the scissors mark for a ballpark of the changes.

So far my options seem to be: 1) this patch, which leverages the existing
iteration logic; 2) switching to for_each_compound_range() -- see my previous
reply; or 3) waiting for Dan to remove the @pgmap accounting in gup-fast and
using something similar to what is below the scissors mark.

What do you think would be the best course of action?

--->8---

++static int __gup_device_compound(unsigned long addr, unsigned long pfn,
++                               unsigned long mask)
++{
++      pfn += ((addr & ~mask) >> PAGE_SHIFT);
++
++      return PageCompound(pfn_to_page(pfn));
++}
++
  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
                                 unsigned long end, unsigned int flags,
                                 struct page **pages, int *nr)
@@@ -2428,8 -2428,8 +2433,10 @@@ static int gup_huge_pmd(pmd_t orig, pmd
        if (pmd_devmap(orig)) {
                if (unlikely(flags & FOLL_LONGTERM))
                        return 0;
--              return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
--                                           pages, nr);
++
++              if (!__gup_device_compound(addr, pmd_pfn(orig), PMD_MASK))
++                      return __gup_device_huge_pmd(orig, pmdp, addr, end,
++                                                   flags, pages, nr);
        }

        page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
@@@ -2462,8 -2462,8 +2469,10 @@@ static int gup_huge_pud(pud_t orig, pud
        if (pud_devmap(orig)) {
                if (unlikely(flags & FOLL_LONGTERM))
                        return 0;
--              return __gup_device_huge_pud(orig, pudp, addr, end, flags,
--                                           pages, nr);
++
++              if (!__gup_device_compound(addr, pud_pfn(orig), PUD_MASK))
++                      return __gup_device_huge_pud(orig, pudp, addr, end,
++                                                   flags, pages, nr);
        }

        page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-09-23 16:51             ` Joao Martins
@ 2021-09-28 18:01               ` Jason Gunthorpe
  2021-09-29 11:50                 ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2021-09-28 18:01 UTC (permalink / raw)
  To: Joao Martins
  Cc: Dan Williams, linux-mm, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On Thu, Sep 23, 2021 at 05:51:04PM +0100, Joao Martins wrote:
> On 8/31/21 6:05 PM, Jason Gunthorpe wrote:

> >> Switching to similar iteration logic to unpin would look something like
> >> this (still untested):
> >>
> >>         for_each_compound_range(index, &page, npages, head, refs) {
> >>                 pgmap = get_dev_pagemap(pfn + *nr, pgmap);
> > 
> > I recall talking to DanW about this and we agreed it was unnecessary
> > here to hold the pgmap and should be deleted.
> 
> Yeap, I remember that conversation[0]. It was a long time ago, and I am
> not sure what progress was made there since the last posting? Dan, any
> thoughts there?
> 
> [0]
> https://lore.kernel.org/linux-mm/161604050866.1463742.7759521510383551055.stgit@dwillia2-desk3.amr.corp.intel.com/

I would really like to see that finished :\

> So ... if pgmap accounting was removed from gup-fast then this patch
> would be a lot simpler and we could perhaps just fallback to the regular
> hugepage case (THP, HugeTLB) like your suggestion at the top. See at the
> end below scissors mark as the ballpark of changes.
> 
> So far my options seem to be: 1) this patch which leverages the existing
> iteration logic or 2) switching to for_each_compound_range() -- see my previous
> reply 3) waiting for Dan to remove @pgmap accounting in gup-fast and use
> something similar to below scissors mark.
> 
> What do you think would be the best course of action?

I still think the basic algorithm should be to accumulate physicaly
contiguous addresses when walking the page table and then flush them
back to struct pages once we can't accumulate any more.

That works for both the walkers and all the page types?

If the get_dev_pagemap has to remain then it just means we have to
flush before changing pagemap pointers

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-09-28 18:01               ` Jason Gunthorpe
@ 2021-09-29 11:50                 ` Joao Martins
  2021-09-29 19:34                   ` Jason Gunthorpe
  2021-10-18 18:37                   ` Jason Gunthorpe
  0 siblings, 2 replies; 48+ messages in thread
From: Joao Martins @ 2021-09-29 11:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, linux-mm, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On 9/28/21 19:01, Jason Gunthorpe wrote:
> On Thu, Sep 23, 2021 at 05:51:04PM +0100, Joao Martins wrote:
>> So ... if pgmap accounting was removed from gup-fast then this patch
>> would be a lot simpler and we could perhaps just fallback to the regular
>> hugepage case (THP, HugeTLB) like your suggestion at the top. See at the
>> end below scissors mark as the ballpark of changes.
>>
>> So far my options seem to be: 1) this patch which leverages the existing
>> iteration logic or 2) switching to for_each_compound_range() -- see my previous
>> reply 3) waiting for Dan to remove @pgmap accounting in gup-fast and use
>> something similar to below scissors mark.
>>
>> What do you think would be the best course of action?
> 
> I still think the basic algorithm should be to accumulate physicaly
> contiguous addresses when walking the page table and then flush them
> back to struct pages once we can't accumulate any more.
> 
> That works for both the walkers and all the page types?
> 

The logic already handles all page types -- I was trying to avoid the extra
complexity in the regular hugetlb/THP path by not merging in the handling of the
oddball case that is devmap (or, fundamentally, the devmap
non-compound case in the future).

In the context of this patch, is your suggestion above that, instead of
changing __gup_device_huge(), we uplevel/merge it all into
gup_huge_{pud,pmd}() to cover devmap?

static int __gup_huge_range(orig_head, ...)
{
	...
	page = orig_head + ((addr & ~mask) >> PAGE_SHIFT);
	refs = record_subpages(page, addr, end, pages + *nr);

	head = try_grab_compound_head(orig_head, refs, flags);
	if (!head)
		return 0;

	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
		put_compound_head(head, refs, flags);
		return 0;
	}

	SetPageReferenced(head);
	return 1;
}

static int gup_huge_pmd(...)
{
	...
	for_each_compound_range(index, page, npages, head, refs) {
		if (pmd_devmap(orig))
			pgmap = get_dev_pagemap(pmd_pfn(orig), pgmap);

		if (!__gup_huge_range(pmd_page(orig), refs)) {
			undo_dev_pagemap(...);	
			return 0;
		}
	}

	put_dev_pagemap(pgmap);
}

But all this gup_huge_{pmd,pud}() rework is just for the trouble of
trying to merge in the basepage-on-PMD/PUD case of devmap. It feels more
complex (and it affects other page types) compared to leaving the devmap
oddity siloed like option 1. If the pgmap refcount wasn't there and
there were no users of basepages-on-PMD/PUD but just compound pages
on PMDs/PUDs ... then we would be talking about code removal rather
than added complexity. But I don't know how realistic that is for
other devmap users (besides device-dax).

> If the get_dev_pagemap has to remain then it just means we have to
> flush before changing pagemap pointers
Right -- I don't think we should need it as that discussion on the other
thread goes.

OTOH, using @pgmap might be useful to unblock gup-fast FOLL_LONGTERM
for certain devmap types[0] (like MEMORY_DEVICE_GENERIC [device-dax]
can support it but not MEMORY_DEVICE_FSDAX [fsdax]).

[0] https://lore.kernel.org/linux-mm/6a18179e-65f7-367d-89a9-d5162f10fef0@oracle.com/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-09-29 11:50                 ` Joao Martins
@ 2021-09-29 19:34                   ` Jason Gunthorpe
  2021-09-30  3:01                     ` Alistair Popple
  2021-10-18 18:37                   ` Jason Gunthorpe
  1 sibling, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2021-09-29 19:34 UTC (permalink / raw)
  To: Joao Martins, Dan Williams, Alistair Popple, Felix Kuehling
  Cc: linux-mm, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, John Hubbard, Jane Chu, Muchun Song,
	Mike Kravetz, Andrew Morton, Jonathan Corbet, Christoph Hellwig,
	nvdimm, linux-doc

On Wed, Sep 29, 2021 at 12:50:15PM +0100, Joao Martins wrote:

> > If the get_dev_pagemap has to remain then it just means we have to
> > flush before changing pagemap pointers
> Right -- I don't think we should need it as that discussion on the other
> thread goes.
> 
> OTOH, using @pgmap might be useful to unblock gup-fast FOLL_LONGTERM
> for certain devmap types[0] (like MEMORY_DEVICE_GENERIC [device-dax]
> can support it but not MEMORY_DEVICE_FSDAX [fsdax]).

When looking at Logan's patches I think it is pretty clear to me that
page->pgmap must never be a dangling pointer if the caller has a
legitimate refcount on the page.

For instance the migrate and stuff all blindly calls
is_device_private_page() on the struct page expecting a valid
page->pgmap.

This also looks like it is happening, ie

void __put_page(struct page *page)
{
	if (is_zone_device_page(page)) {
		put_dev_pagemap(page->pgmap);

Is indeed putting the pgmap ref back when the page becomes ungettable.

This properly happens when the page refcount goes to zero and so it
should fully interlock with __page_cache_add_speculative():

	if (unlikely(!page_ref_add_unless(page, count, 0))) {

Thus, in gup.c, if we succeed at try_grab_compound_head() then
page->pgmap is a stable pointer with a valid refcount.

So, all the external pgmap stuff in gup.c is completely pointless.
try_grab_compound_head() provides us with an equivalent protection at
lower cost. Remember gup.c doesn't deref the pgmap at all.

Dan/Alistair/Felix do you see any hole in that argument??

So lets just delete it!

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-09-29 19:34                   ` Jason Gunthorpe
@ 2021-09-30  3:01                     ` Alistair Popple
  2021-09-30 17:54                       ` Joao Martins
  2021-10-18 18:36                       ` Jason Gunthorpe
  0 siblings, 2 replies; 48+ messages in thread
From: Alistair Popple @ 2021-09-30  3:01 UTC (permalink / raw)
  To: Joao Martins, Dan Williams, Felix Kuehling, Jason Gunthorpe
  Cc: linux-mm, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, John Hubbard, Jane Chu, Muchun Song,
	Mike Kravetz, Andrew Morton, Jonathan Corbet, Christoph Hellwig,
	nvdimm, linux-doc

On Thursday, 30 September 2021 5:34:05 AM AEST Jason Gunthorpe wrote:
> On Wed, Sep 29, 2021 at 12:50:15PM +0100, Joao Martins wrote:
> 
> > > If the get_dev_pagemap has to remain then it just means we have to
> > > flush before changing pagemap pointers
> > Right -- I don't think we should need it as that discussion on the other
> > thread goes.
> > 
> > OTOH, using @pgmap might be useful to unblock gup-fast FOLL_LONGTERM
> > for certain devmap types[0] (like MEMORY_DEVICE_GENERIC [device-dax]
> > can support it but not MEMORY_DEVICE_FSDAX [fsdax]).
> 
> When looking at Logan's patches I think it is pretty clear to me that
> page->pgmap must never be a dangling pointer if the caller has a
> legitimate refcount on the page.
> 
> For instance the migrate and stuff all blindly calls
> is_device_private_page() on the struct page expecting a valid
> page->pgmap.
> 
> This also looks like it is happening, ie
> 
> void __put_page(struct page *page)
> {
> 	if (is_zone_device_page(page)) {
> 		put_dev_pagemap(page->pgmap);
> 
> Is indeed putting the pgmap ref back when the page becomes ungettable.
> 
> This properly happens when the page refcount goes to zero and so it
> should fully interlock with __page_cache_add_speculative():
> 
> 	if (unlikely(!page_ref_add_unless(page, count, 0))) {
> 
> Thus, in gup.c, if we succeed at try_grab_compound_head() then
> page->pgmap is a stable pointer with a valid refcount.
> 
> So, all the external pgmap stuff in gup.c is completely pointless.
> try_grab_compound_head() provides us with an equivalent protection at
> lower cost. Remember gup.c doesn't deref the pgmap at all.
> 
> Dan/Alistair/Felix do you see any hole in that argument??

As background, note that device pages are currently considered free when
refcount == 1, but the pgmap reference is dropped when the refcount transitions
1->0. The final pgmap reference is typically dropped when a driver calls
memunmap_pages() and put_page() drops the last page reference:

void memunmap_pages(struct dev_pagemap *pgmap)
{
        unsigned long pfn;
        int i;

        dev_pagemap_kill(pgmap);
        for (i = 0; i < pgmap->nr_range; i++)
                for_each_device_pfn(pfn, pgmap, i)
                        put_page(pfn_to_page(pfn));
        dev_pagemap_cleanup(pgmap);

If there are still pgmap references, dev_pagemap_cleanup(pgmap) will block until
the final reference is dropped. So I think your argument holds at least for
DEVICE_PRIVATE and DEVICE_GENERIC. DEVICE_FS_DAX defines its own pagemap
cleanup, but I can't see why the same argument wouldn't hold there - if a page
has a valid refcount it must have a reference on the pagemap too.

 - Alistair

> So lets just delete it!
> 
> Jason
> 





^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-09-30  3:01                     ` Alistair Popple
@ 2021-09-30 17:54                       ` Joao Martins
  2021-09-30 21:55                         ` Jason Gunthorpe
  2021-10-18 18:36                       ` Jason Gunthorpe
  1 sibling, 1 reply; 48+ messages in thread
From: Joao Martins @ 2021-09-30 17:54 UTC (permalink / raw)
  To: Alistair Popple, Dan Williams, Felix Kuehling, Jason Gunthorpe
  Cc: linux-mm, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, John Hubbard, Jane Chu, Muchun Song,
	Mike Kravetz, Andrew Morton, Jonathan Corbet, Christoph Hellwig,
	nvdimm, linux-doc

On 9/30/21 04:01, Alistair Popple wrote:
> On Thursday, 30 September 2021 5:34:05 AM AEST Jason Gunthorpe wrote:
>> On Wed, Sep 29, 2021 at 12:50:15PM +0100, Joao Martins wrote:
>>
>>>> If the get_dev_pagemap has to remain then it just means we have to
>>>> flush before changing pagemap pointers
>>> Right -- I don't think we should need it as that discussion on the other
>>> thread goes.
>>>
>>> OTOH, using @pgmap might be useful to unblock gup-fast FOLL_LONGTERM
>>> for certain devmap types[0] (like MEMORY_DEVICE_GENERIC [device-dax]
>>> can support it but not MEMORY_DEVICE_FSDAX [fsdax]).
>>
>> When looking at Logan's patches I think it is pretty clear to me that
>> page->pgmap must never be a dangling pointer if the caller has a
>> legitimate refcount on the page.
>>
>> For instance the migrate and stuff all blindly calls
>> is_device_private_page() on the struct page expecting a valid
>> page->pgmap.
>>
>> This also looks like it is happening, ie
>>
>> void __put_page(struct page *page)
>> {
>> 	if (is_zone_device_page(page)) {
>> 		put_dev_pagemap(page->pgmap);
>>
>> Is indeed putting the pgmap ref back when the page becomes ungettable.
>>
>> This properly happens when the page refcount goes to zero and so it
>> should fully interlock with __page_cache_add_speculative():
>>
>> 	if (unlikely(!page_ref_add_unless(page, count, 0))) {
>>
>> Thus, in gup.c, if we succeed at try_grab_compound_head() then
>> page->pgmap is a stable pointer with a valid refcount.
>>
>> So, all the external pgmap stuff in gup.c is completely pointless.
>> try_grab_compound_head() provides us with an equivalent protection at
>> lower cost. Remember gup.c doesn't deref the pgmap at all.
>>
>> Dan/Alistair/Felix do you see any hole in that argument??
> 
> As background note that device pages are currently considered free when
> refcount == 1 but the pgmap reference is dropped when the refcount transitions
> 1->0. The final pgmap reference is typically dropped when a driver calls
> memunmap_pages() and put_page() drops the last page reference:
> 
> void memunmap_pages(struct dev_pagemap *pgmap)
> {
>         unsigned long pfn;
>         int i;
> 
>         dev_pagemap_kill(pgmap);
>         for (i = 0; i < pgmap->nr_range; i++)
>                 for_each_device_pfn(pfn, pgmap, i)
>                         put_page(pfn_to_page(pfn));
>         dev_pagemap_cleanup(pgmap);
> 
> If there are still pgmap references dev_pagemap_cleanup(pgmap) will block until
> the final reference is dropped. So I think your argument holds at least for
> DEVICE_PRIVATE and DEVICE_GENERIC. DEVICE_FS_DAX defines its own pagemap
> cleanup but I can't see why the same argument wouldn't hold there - if a page
> has a valid refcount it must have a reference on the pagemap too.

IIUC Dan's reasoning was that fsdax wasn't able to deal with surprise removal [1], so his
patches were to ensure fsdax (or the pmem block device) poisons/kills the pages as a way
to notify the filesystem/dm that the page is to be kept unmapped:

[1]
https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/

But if fsdax doesn't wait for all the pgmap references[*] in its pagemap cleanup callback,
then what does the pgmap ref in __gup_device_huge() pair with / protect us against that is
specific to fsdax?

I am not sure I follow how the fsdax-specific issue ties in with this pgmap ref being
there above.

	Joao

[*] or at least fsdax_pagemap_ops doesn't suggest the contrary ... compared to
dev_pagemap_{kill,cleanup}

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-09-30 17:54                       ` Joao Martins
@ 2021-09-30 21:55                         ` Jason Gunthorpe
  0 siblings, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2021-09-30 21:55 UTC (permalink / raw)
  To: Joao Martins
  Cc: Alistair Popple, Dan Williams, Felix Kuehling, linux-mm,
	Vishal Verma, Dave Jiang, Naoya Horiguchi, Matthew Wilcox,
	John Hubbard, Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, Christoph Hellwig, nvdimm, linux-doc

On Thu, Sep 30, 2021 at 06:54:05PM +0100, Joao Martins wrote:
> On 9/30/21 04:01, Alistair Popple wrote:
> > On Thursday, 30 September 2021 5:34:05 AM AEST Jason Gunthorpe wrote:
> >> On Wed, Sep 29, 2021 at 12:50:15PM +0100, Joao Martins wrote:
> >>
> >>>> If the get_dev_pagemap has to remain then it just means we have to
> >>>> flush before changing pagemap pointers
> >>> Right -- I don't think we should need it as that discussion on the other
> >>> thread goes.
> >>>
> >>> OTOH, using @pgmap might be useful to unblock gup-fast FOLL_LONGTERM
> >>> for certain devmap types[0] (like MEMORY_DEVICE_GENERIC [device-dax]
> >>> can support it but not MEMORY_DEVICE_FSDAX [fsdax]).
> >>
> >> When looking at Logan's patches I think it is pretty clear to me that
> >> page->pgmap must never be a dangling pointer if the caller has a
> >> legitimate refcount on the page.
> >>
> >> For instance the migrate and stuff all blindly calls
> >> is_device_private_page() on the struct page expecting a valid
> >> page->pgmap.
> >>
> >> This also looks like it is happening, ie
> >>
> >> void __put_page(struct page *page)
> >> {
> >> 	if (is_zone_device_page(page)) {
> >> 		put_dev_pagemap(page->pgmap);
> >>
> >> Is indeed putting the pgmap ref back when the page becomes ungettable.
> >>
> >> This properly happens when the page refcount goes to zero and so it
> >> should fully interlock with __page_cache_add_speculative():
> >>
> >> 	if (unlikely(!page_ref_add_unless(page, count, 0))) {
> >>
> >> Thus, in gup.c, if we succeed at try_grab_compound_head() then
> >> page->pgmap is a stable pointer with a valid refcount.
> >>
> >> So, all the external pgmap stuff in gup.c is completely pointless.
> >> try_grab_compound_head() provides us with an equivalent protection at
> >> lower cost. Remember gup.c doesn't deref the pgmap at all.
> >>
> >> Dan/Alistair/Felix do you see any hole in that argument??
> > 
> > As background note that device pages are currently considered free when
> > refcount == 1 but the pgmap reference is dropped when the refcount transitions
> > 1->0. The final pgmap reference is typically dropped when a driver calls
> > memunmap_pages() and put_page() drops the last page reference:
> > 
> > void memunmap_pages(struct dev_pagemap *pgmap)
> > {
> >         unsigned long pfn;
> >         int i;
> > 
> >         dev_pagemap_kill(pgmap);
> >         for (i = 0; i < pgmap->nr_range; i++)
> >                 for_each_device_pfn(pfn, pgmap, i)
> >                         put_page(pfn_to_page(pfn));
> >         dev_pagemap_cleanup(pgmap);
> > 
> > If there are still pgmap references dev_pagemap_cleanup(pgmap) will block until
> > the final reference is dropped. So I think your argument holds at least for
> > DEVICE_PRIVATE and DEVICE_GENERIC. DEVICE_FS_DAX defines its own pagemap
> > cleanup but I can't see why the same argument wouldn't hold there - if a page
> > has a valid refcount it must have a reference on the pagemap too.
>
> IIUC Dan's reasoning was that fsdax wasn't able to deal with
> surprise removal [1] so his patches were to ensure fsdax (or the
> pmem block device) poisons/kills the pages as a way to notify
> filesystem/dm that the page was to be kept unmapped:

Sure, but that has nothing to do with GUP, that is between the
filesytem and fsdax
 
> But if fsdax doesn't wait for all the pgmap references[*] in its
> pagemap cleanup callback, then what does the pgmap ref in
> __gup_device_huge() pair with / protect us against that is specific to
> fsdax?

It does wait for refs

It sets the pgmap.ref to:

	pmem->pgmap.ref = &q->q_usage_counter;

And that ref is incremented for the struct page's lifetime - the matching
decrement is in __put_page() above.

fsdax_pagemap_ops does pmem_pagemap_kill() which calls
blk_freeze_queue_start() which does percpu_ref_kill(). Then the
pmem_pagemap_cleanup() eventually does blk_mq_freeze_queue_wait()
which will sleep until the percpu ref reaches zero.

In other words fsdax cannot complete cleanup while a struct page exists
with a non-zero refcount, which answers Alistair's question about how
fsdax's cleanup works.

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-08-27 14:58 ` [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages Joao Martins
  2021-08-27 16:25   ` Jason Gunthorpe
@ 2021-10-08 11:54   ` Jason Gunthorpe
  2021-10-11 15:53     ` Joao Martins
  1 sibling, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2021-10-08 11:54 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On Fri, Aug 27, 2021 at 03:58:13PM +0100, Joao Martins wrote:
> @@ -2252,16 +2265,25 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>  			ret = 0;
>  			break;
>  		}
> -		SetPageReferenced(page);
> -		pages[*nr] = page;
> -		if (unlikely(!try_grab_page(page, flags))) {
> -			undo_dev_pagemap(nr, nr_start, flags, pages);
> +
> +		head = compound_head(page);
> +		/* @end is assumed to be limited at most one compound page */
> +		if (PageHead(head))
> +			next = end;
> +		refs = record_subpages(page, addr, next, pages + *nr);
> +
> +		SetPageReferenced(head);
> +		if (unlikely(!try_grab_compound_head(head, refs, flags))) {

I was thinking about this some more, and this ordering doesn't seem
like a good idea. We shouldn't be looking at any part of the struct
page without holding the refcount, certainly not the compound_head().

The only optimization that might work here is to grab the head, then
compute the extent of tail pages and amalgamate them. Holding a ref on
the head also secures the tails.

Which also means most of what I was suggesting isn't going to work
anyhow.

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-10-08 11:54   ` Jason Gunthorpe
@ 2021-10-11 15:53     ` Joao Martins
  2021-10-13 17:41       ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2021-10-11 15:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On 10/8/21 12:54, Jason Gunthorpe wrote:
> On Fri, Aug 27, 2021 at 03:58:13PM +0100, Joao Martins wrote:
>> @@ -2252,16 +2265,25 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>  			ret = 0;
>>  			break;
>>  		}
>> -		SetPageReferenced(page);
>> -		pages[*nr] = page;
>> -		if (unlikely(!try_grab_page(page, flags))) {
>> -			undo_dev_pagemap(nr, nr_start, flags, pages);
>> +
>> +		head = compound_head(page);
>> +		/* @end is assumed to be limited at most one compound page */
>> +		if (PageHead(head))
>> +			next = end;
>> +		refs = record_subpages(page, addr, next, pages + *nr);
>> +
>> +		SetPageReferenced(head);
>> +		if (unlikely(!try_grab_compound_head(head, refs, flags))) {
> 
> I was thinking about this some more, and this ordering doesn't seem
> like a good idea. We shouldn't be looking at any part of the struct
> page without holding the refcount, certainly not the compound_head()
> 
> The only optimization that might work here is to grab the head, then
> compute the extent of tail pages and amalgamate them. Holding a ref on
> the head also secures the tails.
> 

How about pmd_page(orig) / pud_page(orig) like what the rest of hugetlb/thp
checks do? i.e. we would pass pmd_page(orig)/pud_page(orig) to __gup_device_huge()
as an added @head argument. While keeping the same structure of counting tail pages
between @addr .. @end if we have a head page.

Albeit this lingers on whether it's OK to call PageHead() at all. The PageHead policy is for
any page (PF_ANY), so there are no hidden calls to compound_head() when testing that page flag. But
in the end it accesses the struct page flags, which is, well, still struct page data.

We could also check pgmap for a non-zero geometry (which tells us that pmd_page(orig) does
represent a head page). And that would save us from looking at struct page data today, but
it would introduce problems later whenever we remove the pgmap ref grab in gup_device_huge().

So the only viable option might be to do the grab, count tails and fix up the ref like you suggest
above, and take the perf hit of one extra atomic op :(
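
A minimal sketch of that ordering (untested; it reuses the names from the
patch hunk above, and the second try_grab_compound_head() call is the extra
atomic op):

		/* take one ref via the head first; that also pins the tails */
		head = try_grab_compound_head(page, 1, flags);
		if (unlikely(!head)) {
			undo_dev_pagemap(nr, nr_start, flags, pages);
			break;
		}
		/* now it is safe to look at the compound extent */
		if (PageHead(head))
			next = end;
		refs = record_subpages(page, addr, next, pages + *nr);
		SetPageReferenced(head);
		/* fix up the count for the remaining subpages */
		if (refs > 1)
			try_grab_compound_head(head, refs - 1, flags);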

It's interesting how THP (in gup_huge_pmd()) unilaterally computes tails assuming
pmd_page(orig) is the head page.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-10-11 15:53     ` Joao Martins
@ 2021-10-13 17:41       ` Jason Gunthorpe
  2021-10-13 19:18         ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2021-10-13 17:41 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On Mon, Oct 11, 2021 at 04:53:29PM +0100, Joao Martins wrote:
> On 10/8/21 12:54, Jason Gunthorpe wrote:

> > The only optimization that might work here is to grab the head, then
> > compute the extent of tail pages and amalgamate them. Holding a ref on
> > the head also secures the tails.
> 
> How about pmd_page(orig) / pud_page(orig) like what the rest of hugetlb/thp
> checks do? i.e. we would pass pmd_page(orig)/pud_page(orig) to __gup_device_huge()
> as an added @head argument. While keeping the same structure of counting tail pages
> between @addr .. @end if we have a head page.

The right logic is what everything else does:

	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
	refs = record_subpages(page, addr, end, pages + *nr);
	head = try_grab_compound_head(pud_page(orig), refs, flags);

If you can use this, or not, depends entirely on answering the
question of why does  __gup_device_huge() exist at all.

This I don't fully know:

1) As discussed quite a few times now, the entire get_dev_pagemap
   stuff looks useless and should just be deleted. If you care about
   optimizing this I would pursue doing that as it will give the
   biggest single win.

2) It breaks up the PUD/PMD into tail pages and scans them all
   Why? Can devmap combine multiple compound_head's into the same PTE?
   Does devmap guarantee that the PUD/PMD points to the head page? (I
   assume no)

But I'm looking at this some more and I see try_get_compound_head() is
reading the compound_head with no locking, just READ_ONCE, so that
must be OK under GUP.

It still seems to me the same generic algorithm should work
everywhere, once we get rid of the get_dev_pagemap

  start_pfn = pud/pmd_pfn() + pud/pmd_page_offset(addr)
  end_pfn = start_pfn + (end - addr) // fixme
  if (THP)
     refs = end_pfn - start_pfn
  if (devmap)
     refs = 1

  do {
     page = pfn_to_page(start_pfn)
     head_page = try_grab_compound_head(page, refs, flags)
     if (pud/pmd_val() != orig)
        err

     npages = 1 << compound_order(head_page)
     npages = min(npages, end_pfn - start_pfn)
     for (i = 0, iter = page; i != npages) {
     	 *pages++ = iter;
         mem_map_next(iter, page, i)
     }

     if (devmap && npages > 2)
         grab_compound_head(head_page, npages - 1, flags)
     start_pfn += npages;
  } while (start_pfn != end_pfn)

Above needs to be cleaned up quite a bit, but is the general idea.

There is an further optimization we can put in where we can know that
'page' is still in a currently grab'd compound (eg last_page+1 = page,
not past compound_order) and defer the refcount work.
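
A rough C-flavoured rendition of the sketch above (purely illustrative and
untested; the try_grab_compound_head()/put_compound_head() names follow what
mm/gup.c already has, while the start_pfn/end_pfn bookkeeping is assumed):

	do {
		struct page *page = pfn_to_page(start_pfn);
		struct page *head;
		unsigned long npages, i;

		/* one ref on the compound head pins the whole compound */
		head = try_grab_compound_head(page, 1, flags);
		if (!head)
			return 0;
		if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
			put_compound_head(head, 1, flags);
			return 0;
		}

		npages = min_t(unsigned long, 1UL << compound_order(head),
			       end_pfn - start_pfn);
		for (i = 0; i < npages; i++)
			pages[(*nr)++] = pfn_to_page(start_pfn + i);

		/* amalgamate the remaining refs for this compound */
		if (npages > 1)
			try_grab_compound_head(head, npages - 1, flags);
		start_pfn += npages;
	} while (start_pfn != end_pfn);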

> It's interesting how THP (in gup_huge_pmd()) unilaterally computes
> tails assuming pmd_page(orig) is the head page.

I think this is an integral property of THP, probably not devmap/dax
though?

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-10-13 17:41       ` Jason Gunthorpe
@ 2021-10-13 19:18         ` Joao Martins
  2021-10-13 19:43           ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2021-10-13 19:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On 10/13/21 18:41, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 04:53:29PM +0100, Joao Martins wrote:
>> On 10/8/21 12:54, Jason Gunthorpe wrote:
> 
>>> The only optimization that might work here is to grab the head, then
>>> compute the extent of tail pages and amalgamate them. Holding a ref on
>>> the head also secures the tails.
>>
>> How about pmd_page(orig) / pud_page(orig) like what the rest of hugetlb/thp
>> checks do? i.e. we would pass pmd_page(orig)/pud_page(orig) to __gup_device_huge()
>> as an added @head argument. While keeping the same structure of counting tail pages
>> between @addr .. @end if we have a head page.
> 
> The right logic is what everything else does:
> 
> 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> 	refs = record_subpages(page, addr, end, pages + *nr);
> 	head = try_grab_compound_head(pud_page(orig), refs, flags);
> 
> If you can use this, or not, depends entirely on answering the
> question of why does  __gup_device_huge() exist at all.
> 
So for device-dax it seems to be an untackled oversight[*], probably
inherited from when fsdax devmap was introduced. What I don't know
is the other devmap users :(

[*] it has all the same properties as hugetlbfs AFAIU (after this series)

Certainly if any devmap PMD/PUD was represented by a single compound
page like THP/hugetlbfs, then this patch would be a matter of removing the
pgmap ref grab (and nuking the __gup_device_huge() function entirely, as I
suggested earlier).

> This I don't fully know:
> 
> 1) As discussed quite a few times now, the entire get_dev_pagemap
>    stuff looks usless and should just be deleted. If you care about
>    optimizing this I would persue doing that as it will give the
>    biggest single win.
> 
I am not questioning the well-deserved improvement -- but from a pure
optimization perspective the get_dev_pagemap() cost is not
visible and quite negligible. It is done once and only once, and
subsequent calls to get_dev_pagemap with a non-NULL pgmap don't alter
the refcount and just return the pgmap object. And the xarray storing
the ranges -> pgmap won't be that big ... perhaps 12 pgmaps on
a large >1TB pmem system, depending on your DIMM size.

The refcount update of the individual 4K page is what introduces
a seriously prohibitive cost: I am seeing 10x the cost with DRAM-located
struct pages (pmem-located struct pages are even more ludicrous).
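
(Concretely: a 2M PMD pin covers 2M / 4K = 512 base pages, i.e. 512 atomic
refcount updates versus a single update on a compound head, whereas the
get_dev_pagemap() lookup happens once per GUP call.)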

> 2) It breaks up the PUD/PMD into tail pages and scans them all
>    Why? Can devmap combine multiple compound_head's into the same PTE?


I am not aware of any other usage of compound pages for devmap struct pages
than this series. At least I haven't seen device-dax or fsdax using this.
Unless HMM does this stuff, or some sort of devmap page migration? P2PDMA
doesn't seem to be (yet?) caught by any of the GUP paths, at least before
Logan's series lands. Or am I misunderstanding things here?

>    Does devmap guarentee that the PUD/PMD points to the head page? (I
>    assume no)
> 
For device-dax yes.


> But I'm looking at this some more and I see try_get_compound_head() is
> reading the compound_head with no locking, just READ_ONCE, so that
> must be OK under GUP.
> 
I suppose one other way to get around the double atomic op would be to make
the try_get_compound_head() fail by comparing the first tail page's compound_head()
after grabbing the head with @refs, instead of comparing the grabbed head page
against the passed page argument.

> It still seems to me the same generic algorithm should work
> everywhere, once we get rid of the get_dev_pagemap
> 
>   start_pfn = pud/pmd_pfn() + pud/pmd_page_offset(addr)
>   end_pfn = start_pfn + (end - addr) // fixme
>   if (THP)
>      refs = end_pfn - start_pfn
>   if (devmap)
>      refs = 1
> 
>   do {
>      page = pfn_to_page(start_pfn)
>      head_page = try_grab_compound_head(page, refs, flags)
>      if (pud/pmd_val() != orig)
>         err
> 
>      npages = 1 << compound_order(head_page)
>      npages = min(npages, end_pfn - start_pfn)
>      for (i = 0, iter = page; i != npages) {
>      	 *pages++ = iter;
>          mem_map_next(iter, page, i)
>      }
> 
>      if (devmap && npages > 2)
>          grab_compound_head(head_page, npages - 1, flags)
>      start_pfn += npages;
>   } while (start_pfn != end_pfn)
> 
> Above needs to be cleaned up quite a bit, but is the general idea.
> 
> There is an further optimization we can put in where we can know that
> 'page' is still in a currently grab'd compound (eg last_page+1 = page,
> not past compound_order) and defer the refcount work.
> 
I was changing __gup_device_huge() into something similar to the above, but yeah
it follows the algorithm inside your do { } while() (thanks!). I can
turn __gup_device_huge() into another helper (renamed to something like
try_grab_pages()) and replace the callsites of gup_huge_{pud,pmd} with the
THP/hugetlbfs equivalent handling.

>> It's interesting how THP (in gup_huge_pmd()) unilaterally computes
>> tails assuming pmd_page(orig) is the head page.
> 
> I think this is an integral property of THP, probably not devmap/dax
> though?

I think the right answer is "depends on the devmap type". device-dax with
PMD/PUDs (i.e. 2M pagesize pmem or 1G pagesize pmem) works with the same
rules as hugetlbfs. fsdax not so much (as you say above), but follow-up
changes could perhaps make it adhere to a similar scheme (not exactly sure
how to deal with holes). HMM I am not sure what the rules are there.
P2PDMA seems not applicable?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-10-13 19:18         ` Joao Martins
@ 2021-10-13 19:43           ` Jason Gunthorpe
  2021-10-14 17:56             ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2021-10-13 19:43 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On Wed, Oct 13, 2021 at 08:18:08PM +0100, Joao Martins wrote:
> On 10/13/21 18:41, Jason Gunthorpe wrote:
> > On Mon, Oct 11, 2021 at 04:53:29PM +0100, Joao Martins wrote:
> >> On 10/8/21 12:54, Jason Gunthorpe wrote:
> > 
> >>> The only optimization that might work here is to grab the head, then
> >>> compute the extent of tail pages and amalgamate them. Holding a ref on
> >>> the head also secures the tails.
> >>
> >> How about pmd_page(orig) / pud_page(orig) like what the rest of hugetlb/thp
> >> checks do? i.e. we would pass pmd_page(orig)/pud_page(orig) to __gup_device_huge()
> >> as an added @head argument. While keeping the same structure of counting tail pages
> >> between @addr .. @end if we have a head page.
> > 
> > The right logic is what everything else does:
> > 
> > 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> > 	refs = record_subpages(page, addr, end, pages + *nr);
> > 	head = try_grab_compound_head(pud_page(orig), refs, flags);
> > 
> > If you can use this, or not, depends entirely on answering the
> > question of why does  __gup_device_huge() exist at all.
> > 
> So for device-dax it seems to be an untackled oversight[*], probably
> inherited from when fsdax devmap was introduced. What I don't know
> is the other devmap users :(

devmap generic infrastructure waits until all page refcounts go to
zero, and it should wait until any fast GUP is serialized as part of
the TLB shootdown - otherwise it is leaking access to the memory it
controls beyond its shutdown

So, I don't think the different devmap users can break this?

> > This I don't fully know:
> > 
> > 1) As discussed quite a few times now, the entire get_dev_pagemap
> >    stuff looks usless and should just be deleted. If you care about
> >    optimizing this I would persue doing that as it will give the
> >    biggest single win.
> 
> I am not questioning the well-deserved improvement -- but from a pure
> optimization perspective the get_dev_pagemap() cost is not
> visible and quite negligeble.

You are doing large enough GUPs then that the expensive xarray search
is small compared to the rest?

> > 2) It breaks up the PUD/PMD into tail pages and scans them all
> >    Why? Can devmap combine multiple compound_head's into the same PTE?
> 
> I am not aware of any other usage of compound pages for devmap struct pages
> than this series. At least I haven't seen device-dax or fsdax using
> this.

Let me ask this question differently, is this assertion OK?

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -808,8 +808,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
        }
 
        entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
-       if (pfn_t_devmap(pfn))
+       if (pfn_t_devmap(pfn)) {
+               struct page *page = pfn_t_to_page(pfn);
+
+               WARN_ON(compound_head(page) != page);
+               WARN_ON(compound_order(page) != PMD_SHIFT - PAGE_SHIFT);
                entry = pmd_mkdevmap(entry);
+       }
        if (write) {
                entry = pmd_mkyoung(pmd_mkdirty(entry));
                entry = maybe_pmd_mkwrite(entry, vma);

(and same for insert_pfn_pud)

You said it is OK for device/dax/device.c?

And not for fs/dax.c?


> Unless HMM does this stuff, or some sort of devmap page migration? P2PDMA
> doesn't seem to be (yet?) caught by any of the GUP path at least before
> Logan's series lands. Or am I misunderstanding things here?

Of the places that call the insert_pfn_pmd/pud call chains I only see
device/dax/device.c and fs/dax.c as being linked to devmap. So other
devmap users don't use this stuff.

> I was changing __gup_device_huge() with similar to the above, but yeah
> it follows that algorithm as inside your do { } while() (thanks!). I can
> turn __gup_device_huge() into another (renamed to like try_grab_pages())
> helper and replace the callsites of gup_huge_{pud,pmd} for the THP/hugetlbfs
> equivalent handling.

I suppose it should be some #define because the (pmd_val != orig) logic
is not sharable

But, yes, a general call that the walker should make at any level to
record a pfn -> npages range efficiently.

> I think the right answer is "depends on the devmap" type. device-dax with
> PMD/PUDs (i.e. 2m pagesize PMEM or 1G pagesize pmem) works with the same
> rules as hugetlbfs. fsdax not so much (as you say above) but it would
> follow up changes to perhaps adhere to similar scheme (not exactly sure
> how do deal with holes). HMM I am not sure what the rules are there.
> P2PDMA seems not applicable?

I would say, not "depends on the devmap", but what are the rules for
calling insert_pfn_pmd in the first place.

If users are allowed to create pmds that span many compound_head's
then we need logic as I showed. Otherwise we do not.

And I would document this relationship in the GUP side "This do/while
is required because insert_pfn_pmd/pud() is used with compound pages
smaller than the PUD/PMD size" so it isn't so confused with just
"devmap"

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-10-13 19:43           ` Jason Gunthorpe
@ 2021-10-14 17:56             ` Joao Martins
  2021-10-14 18:06               ` Jason Gunthorpe
  0 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2021-10-14 17:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On 10/13/21 20:43, Jason Gunthorpe wrote:
> On Wed, Oct 13, 2021 at 08:18:08PM +0100, Joao Martins wrote:
>> On 10/13/21 18:41, Jason Gunthorpe wrote:
>>> On Mon, Oct 11, 2021 at 04:53:29PM +0100, Joao Martins wrote:
>>>> On 10/8/21 12:54, Jason Gunthorpe wrote:
>>>
>>>>> The only optimization that might work here is to grab the head, then
>>>>> compute the extent of tail pages and amalgamate them. Holding a ref on
>>>>> the head also secures the tails.
>>>>
>>>> How about pmd_page(orig) / pud_page(orig) like what the rest of hugetlb/thp
>>>> checks do? i.e. we would pass pmd_page(orig)/pud_page(orig) to __gup_device_huge()
>>>> as an added @head argument. While keeping the same structure of counting tail pages
>>>> between @addr .. @end if we have a head page.
>>>
>>> The right logic is what everything else does:
>>>
>>> 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
>>> 	refs = record_subpages(page, addr, end, pages + *nr);
>>> 	head = try_grab_compound_head(pud_page(orig), refs, flags);
>>>
>>> If you can use this, or not, depends entirely on answering the
>>> question of why does  __gup_device_huge() exist at all.
>>>
>> So for device-dax it seems to be an untackled oversight[*], probably
>> inherited from when fsdax devmap was introduced. What I don't know
>> is the other devmap users :(
> 
> devmap generic infrastructure waits until all page refcounts go to
> zero, and it should wait until any fast GUP is serialized as part of
> the TLB shootdown - otherwise it is leaking access to the memory it
> controls beyond it's shutdown
> 
> So, I don't think the different devmap users can break this?
> 
Maybe fsdax may not honor that after removing the pgmap refcount, as we
talked about earlier in the thread, without the memory-failure stuff.

But my point earlier on "oversight" was about the fact that we describe
PMD/PUDs with base pages. And hence why we can't do this:

 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 	head = try_grab_compound_head(pud_page(orig), refs, flags);

... For devmap.

>>> This I don't fully know:
>>>
>>> 1) As discussed quite a few times now, the entire get_dev_pagemap
>>>    stuff looks usless and should just be deleted. If you care about
>>>    optimizing this I would persue doing that as it will give the
>>>    biggest single win.
>>
>> I am not questioning the well-deserved improvement -- but from a pure
>> optimization perspective the get_dev_pagemap() cost is not
>> visible and quite negligeble.
> 
> You are doing large enough GUPs then that the expensive xarray seach
> is small compared to the rest?
> 

The tests I showed are on 16G and 128G (I usually test with > 1TB).
But we are comparing 1 pgmap lookup, and 512 refcount atomic updates for
a PMD sized pin (2M).

It depends on what you think small is. Maybe for 4K-16K of memory pinned,
the pgmap lookup is probably more expensive.

A couple of months ago I remember measuring xarray lookups, and it was
around ~150ns per lookup with few entries, excluding the ref update. I can
be pedantic and give you a more concrete measurement for get_dev_pagemap(),
but quite honestly I don't think it will come close when we pin a hugepage.

But this is orthogonal to the pgmap ref's existence; I was merely commenting on
the performance aspect.

>>> 2) It breaks up the PUD/PMD into tail pages and scans them all
>>>    Why? Can devmap combine multiple compound_head's into the same PTE?
>>
>> I am not aware of any other usage of compound pages for devmap struct pages
>> than this series. At least I haven't seen device-dax or fsdax using
>> this.
> 
> Let me ask this question differently, is this assertion OK?
> 
I understood the question. I was more trying to learn from you about HMM/P2PDMA
use-cases, as it sounded like compound pages exist elsewhere in devmap :)

> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -808,8 +808,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>         }
>  
>         entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
> -       if (pfn_t_devmap(pfn))
> +       if (pfn_t_devmap(pfn)) {
> +               struct page *page = pfn_t_to_page(pfn);
> +
> +               WARN_ON(compound_head(page) != page);
> +               WARN_ON(compound_order(page) != PMD_SHIFT - PAGE_SHIFT);
>                 entry = pmd_mkdevmap(entry);
> +       }
>         if (write) {
>                 entry = pmd_mkyoung(pmd_mkdirty(entry));
>                 entry = maybe_pmd_mkwrite(entry, vma);
> 
> (and same for insert_pfn_pud)
> 
> You said it is OK for device/dax/device.c?
> 
Yes, correct. (I also verified with similar snippet above to be
extra-pedantic)

I would like to emphasize that it is only OK *after this series*.

Without this series there is neither a compound order nor a head, nor any
compound page whatsoever.

> And not for fs/dax.c?
> 
IIUC, correct. The fs/dax code is a little fuzzy on the sector-to-PFN
translation, but still:

There are no compound pages in fsdax (nor do I add them). So
compound_order() doesn't apply (hence order 0) and there's no head,
which also means that compound_head(@page) returns @page (a tad
misleading as a "head" definition, since a PMD subpage would just
return itself rather than the PMD's first page).

> 
>> Unless HMM does this stuff, or some sort of devmap page migration? P2PDMA
>> doesn't seem to be (yet?) caught by any of the GUP path at least before
>> Logan's series lands. Or am I misunderstanding things here?
> 
> Of the places that call the insert_pfn_pmd/pud call chains I only see
> device/dax/device.c and fs/dax.c as being linked to devmap. So other
> devmap users don't use this stuff.
> 
>> I was changing __gup_device_huge() with similar to the above, but yeah
>> it follows that algorithm as inside your do { } while() (thanks!). I can
>> turn __gup_device_huge() into another (renamed to like try_grab_pages())
>> helper and replace the callsites of gup_huge_{pud,pmd} for the THP/hugetlbfs
>> equivalent handling.
> 
> I suppose it should be some #define because the (pmd_val != orig) logic
> is not sharable
> 
I wasn't going to have that pmd_val() in there. Just this refcount part:

 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 	head = try_grab_compound_head(pud_page(orig), refs, flags);

and __gup_device_huge() melded in.

> But, yes, a general call that the walker should make at any level to
> record a pfn -> npages range efficiently.
> 
>> I think the right answer is "depends on the devmap" type. device-dax with
>> PMD/PUDs (i.e. 2m pagesize PMEM or 1G pagesize pmem) works with the same
>> rules as hugetlbfs. fsdax not so much (as you say above) but it would
>> follow up changes to perhaps adhere to similar scheme (not exactly sure
>> how do deal with holes). HMM I am not sure what the rules are there.
>> P2PDMA seems not applicable?
> 
> I would say, not "depends on the devmap", but what are the rules for
> calling insert_pfn_pmd in the first place.
> 
I am going to translate that into: HMM and P2PDMA so far aren't affected, as you
also said earlier above. Even after Logan's P2PDMA NVMe series it appears it is
always PTEs. So fsdax and device-dax are the only pmd_devmap() users we
should care about, and for pud_devmap() only device-dax.

> If users are allowed the create pmds that span many compound_head's
> then we need logic as I showed. Otherwise we do not.
> 
/me nods.

> And I would document this relationship in the GUP side "This do/while
> is required because insert_pfn_pmd/pud() is used with compound pages
> smaller than the PUD/PMD size" so it isn't so confused with just
> "devmap"

Also, it's not that PMDs span compound heads, it's that PMDs/PUDs use
just base pages. Compound pages/heads in devmap are only introduced by
this series, and for device-dax only.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-10-14 17:56             ` Joao Martins
@ 2021-10-14 18:06               ` Jason Gunthorpe
  0 siblings, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2021-10-14 18:06 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On Thu, Oct 14, 2021 at 06:56:51PM +0100, Joao Martins wrote:

> > And I would document this relationship in the GUP side "This do/while
> > is required because insert_pfn_pmd/pud() is used with compound pages
> > smaller than the PUD/PMD size" so it isn't so confused with just
> > "devmap"
> 
> Also, it's not that PMDs span compound heads, it's that PMDs/PUDs use
> just base pages. Compound pages/head in devmap are only introduced by
> series and for device-dax only.

I think it all makes sense, I just want to clarify what I mean by
compound_head:

  compound_head(base_page) == base_page

I.e. a PMD consisting only of order-0 pages would have N 'compound_heads'
within it even though nothing is a compound page.

Which is relevant because the GUP algorithms act on the compound_head
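
A tiny illustration of that identity (a sketch relying only on the generic
compound_head()/compound_order() semantics for a non-compound, order-0 page):

	struct page *page = pfn_to_page(pfn);	/* some order-0 base page */

	VM_BUG_ON_PAGE(compound_head(page) != page, page);	/* its own head */
	VM_BUG_ON_PAGE(compound_order(page) != 0, page);	/* order 0 */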

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-09-30  3:01                     ` Alistair Popple
  2021-09-30 17:54                       ` Joao Martins
@ 2021-10-18 18:36                       ` Jason Gunthorpe
  1 sibling, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2021-10-18 18:36 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Joao Martins, Dan Williams, Felix Kuehling, linux-mm,
	Vishal Verma, Dave Jiang, Naoya Horiguchi, Matthew Wilcox,
	John Hubbard, Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, Christoph Hellwig, nvdimm, linux-doc

On Thu, Sep 30, 2021 at 01:01:14PM +1000, Alistair Popple wrote:
> On Thursday, 30 September 2021 5:34:05 AM AEST Jason Gunthorpe wrote:
> > On Wed, Sep 29, 2021 at 12:50:15PM +0100, Joao Martins wrote:
> > 
> > > > If the get_dev_pagemap has to remain then it just means we have to
> > > > flush before changing pagemap pointers
> > > Right -- I don't think we should need it as that discussion on the other
> > > thread goes.
> > > 
> > > OTOH, using @pgmap might be useful to unblock gup-fast FOLL_LONGTERM
> > > for certain devmap types[0] (like MEMORY_DEVICE_GENERIC [device-dax]
> > > can support it but not MEMORY_DEVICE_FSDAX [fsdax]).
> > 
> > When looking at Logan's patches I think it is pretty clear to me that
> > page->pgmap must never be a dangling pointer if the caller has a
> > legitimate refcount on the page.
> > 
> > For instance the migrate and stuff all blindly calls
> > is_device_private_page() on the struct page expecting a valid
> > page->pgmap.
> > 
> > This also looks like it is happening, ie
> > 
> > void __put_page(struct page *page)
> > {
> > 	if (is_zone_device_page(page)) {
> > 		put_dev_pagemap(page->pgmap);
> > 
> > Is indeed putting the pgmap ref back when the page becomes ungettable.
> > 
> > This properly happens when the page refcount goes to zero and so it
> > should fully interlock with __page_cache_add_speculative():
> > 
> > 	if (unlikely(!page_ref_add_unless(page, count, 0))) {
> > 
> > Thus, in gup.c, if we succeed at try_grab_compound_head() then
> > page->pgmap is a stable pointer with a valid refcount.
> > 
> > So, all the external pgmap stuff in gup.c is completely pointless.
> > try_grab_compound_head() provides us with an equivalent protection at
> > lower cost. Remember gup.c doesn't deref the pgmap at all.
> > 
> > Dan/Alistair/Felix do you see any hole in that argument??
> 
> As background note that device pages are currently considered free when
> refcount == 1 but the pgmap reference is dropped when the refcount transitions
> 1->0. The final pgmap reference is typically dropped when a driver calls
> memunmap_pages() and put_page() drops the last page reference:
> 
> void memunmap_pages(struct dev_pagemap *pgmap)
> {
>         unsigned long pfn;
>         int i;
> 
>         dev_pagemap_kill(pgmap);
>         for (i = 0; i < pgmap->nr_range; i++)
>                 for_each_device_pfn(pfn, pgmap, i)
>                         put_page(pfn_to_page(pfn));
>         dev_pagemap_cleanup(pgmap);
> 
> If there are still pgmap references dev_pagemap_cleanup(pgmap) will block until
> the final reference is dropped. So I think your argument holds at least for
> DEVICE_PRIVATE and DEVICE_GENERIC. DEVICE_FS_DAX defines it's own pagemap
> cleanup but I can't see why the same argument wouldn't hold there - if a page
> has a valid refcount it must have a reference on the pagemap too.

To close this circle - the issue is a use-after-free on the struct page
entry while it has a zero ref.

memunmap_pages() does wait for the refcount to go to zero, but it then
goes on to free the memory under the struct pages.

However there are possibly still untracked references to this memory
in the page tables.

This is the bug Dan has been working on - to shootdown page table
mappings before getting to memunmap_pages()

Getting the page map ref will make the use-after-free never crash,
just be a silent security problem. :(

Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages
  2021-09-29 11:50                 ` Joao Martins
  2021-09-29 19:34                   ` Jason Gunthorpe
@ 2021-10-18 18:37                   ` Jason Gunthorpe
  1 sibling, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2021-10-18 18:37 UTC (permalink / raw)
  To: Joao Martins
  Cc: Dan Williams, linux-mm, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On Wed, Sep 29, 2021 at 12:50:15PM +0100, Joao Martins wrote:
> On 9/28/21 19:01, Jason Gunthorpe wrote:
> > On Thu, Sep 23, 2021 at 05:51:04PM +0100, Joao Martins wrote:
> >> So ... if pgmap accounting was removed from gup-fast then this patch
> >> would be a lot simpler and we could perhaps just fallback to the regular
> >> hugepage case (THP, HugeTLB) like your suggestion at the top. See at the
> >> end below scissors mark as the ballpark of changes.
> >>
> >> So far my options seem to be: 1) this patch which leverages the existing
> >> iteration logic or 2) switching to for_each_compound_range() -- see my previous
> >> reply 3) waiting for Dan to remove @pgmap accounting in gup-fast and use
> >> something similar to below scissors mark.
> >>
> >> What do you think would be the best course of action?
> > 
> > I still think the basic algorithm should be to accumulate physicaly
> > contiguous addresses when walking the page table and then flush them
> > back to struct pages once we can't accumulate any more.
> > 
> > That works for both the walkers and all the page types?
> > 
> 
> The logic already handles all page types -- I was trying to avoid the extra
> complexity in regular hugetlb/THP path by not merging the handling of the
> oddball case that is devmap (or fundamentally devmap
> non-compound case in the future).

FYI, this untested thing is what I came up with when I tried to make
something like this:

/*
 * A large page entry such as PUD/PMD can point to a struct page. In cases like
 * THP this struct page will be a compound page of the same order as the page
 * table level. However, in cases like DAX or more generally pgmap ZONE_DEVICE,
 * the PUD/PMD may point at the first pfn in a string of pages.
 *
 * This helper iterates over all head pages or all the non-compound base pages.
 */
struct pt_entry_iter_state {
	struct page *head;
	unsigned long compound_nr;
	unsigned long pfn;
	unsigned long end_pfn;
};

static inline struct page *__pt_start_iter(struct pt_entry_iter_state *state,
					   struct page *page, unsigned long pfn,
					   unsigned int entry_size)
{
	state->head = compound_head(page);
	state->compound_nr = compound_nr(page);
	state->pfn = pfn & (~(state->compound_nr - 1));
	state->end_pfn = pfn + entry_size / PAGE_SIZE;
	return state->head;
}

static inline struct page *__pt_next_page(struct pt_entry_iter_state *state)
{
	state->pfn += state->compound_nr;
	if (state->end_pfn <= state->pfn)
		return NULL;
	state->head = pfn_to_page(state->pfn);
	state->compound_nr = compound_nr(state->head);
	return state->head;
}

#define for_each_page_in_pt_entry(state, page, pfn, entry_size)                \
	for (page = __pt_start_iter(state, page, pfn, entry_size); page;       \
	     page = __pt_next_page(state))

static void remove_pages_from_page_table(struct vm_area_struct *vma,
					 struct page *page, unsigned long pfn,
					 unsigned int entry_size, bool is_dirty,
					 bool is_young)
{
	struct pt_entry_iter_state state;

	for_each_page_in_pt_entry(&state, page, pfn, entry_size)
		remove_page_from_page_table(vma, page, is_dirty, is_young);
}


Jason

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 06/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  2021-08-27 14:58 ` [PATCH v4 06/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
@ 2021-11-05  0:31   ` Dan Williams
  2021-11-05 12:09     ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Dan Williams @ 2021-11-05  0:31 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, Linux NVDIMM, Linux Doc Mailing List

On Fri, Aug 27, 2021 at 7:59 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Right now, only static dax regions have a valid @pgmap pointer in its
> struct dev_dax. Dynamic dax case however, do not.
>
> In preparation for device-dax compound devmap support, make sure that
> dev_dax pgmap field is set after it has been allocated and initialized.
>
> dynamic dax device have the @pgmap is allocated at probe() and it's
> managed by devm (contrast to static dax region which a pgmap is provided
> and dax core kfrees it). So in addition to ensure a valid @pgmap, clear
> the pgmap when the dynamic dax device is released to avoid the same
> pgmap ranges to be re-requested across multiple region device reconfigs.
>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/dax/bus.c    | 8 ++++++++
>  drivers/dax/device.c | 2 ++
>  2 files changed, 10 insertions(+)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 6cc4da4c713d..49dbff9ba609 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -363,6 +363,14 @@ void kill_dev_dax(struct dev_dax *dev_dax)
>
>         kill_dax(dax_dev);
>         unmap_mapping_range(inode->i_mapping, 0, 0, 1);
> +
> +       /*
> +        * Dynamic dax region have the pgmap allocated via dev_kzalloc()
> +        * and thus freed by devm. Clear the pgmap to not have stale pgmap
> +        * ranges on probe() from previous reconfigurations of region devices.
> +        */
> +       if (!is_static(dev_dax->region))
> +               dev_dax->pgmap = NULL;
>  }
>  EXPORT_SYMBOL_GPL(kill_dev_dax);
>
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 0b82159b3564..6e348b5f9d45 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -426,6 +426,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>         }
>
>         pgmap->type = MEMORY_DEVICE_GENERIC;
> +       dev_dax->pgmap = pgmap;

So I think I'd rather see a bigger patch that replaces some of the
implicit dev_dax->pgmap == NULL checks with explicit is_static()
checks. Something like the following only compile and boot tested...
Note the struct_size() change probably wants to be its own cleanup,
and the EXPORT_SYMBOL_NS_GPL(..., DAX) probably wants to be its own
patch converting over the entirety of drivers/dax/. Thoughts?


diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 6cc4da4c713d..67ab7e05b340 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -134,6 +134,12 @@ static bool is_static(struct dax_region *dax_region)
        return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
 }

+bool static_dev_dax(struct dev_dax *dev_dax)
+{
+       return is_static(dev_dax->region);
+}
+EXPORT_SYMBOL_NS_GPL(static_dev_dax, DAX);
+
 static u64 dev_dax_size(struct dev_dax *dev_dax)
 {
        u64 size = 0;
@@ -363,6 +369,8 @@ void kill_dev_dax(struct dev_dax *dev_dax)

        kill_dax(dax_dev);
        unmap_mapping_range(inode->i_mapping, 0, 0, 1);
+       if (static_dev_dax(dev_dax))
+               dev_dax->pgmap = NULL;
 }
 EXPORT_SYMBOL_GPL(kill_dev_dax);

diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 1e946ad7780a..4acdfee7dd59 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -48,6 +48,7 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
        __dax_driver_register(driver, THIS_MODULE, KBUILD_MODNAME)
 void dax_driver_unregister(struct dax_device_driver *dax_drv);
 void kill_dev_dax(struct dev_dax *dev_dax);
+bool static_dev_dax(struct dev_dax *dev_dax);

 #if IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT)
 int dev_dax_probe(struct dev_dax *dev_dax);
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index dd8222a42808..87507aff2b10 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -398,31 +398,43 @@ int dev_dax_probe(struct dev_dax *dev_dax)
        void *addr;
        int rc, i;

-       pgmap = dev_dax->pgmap;
-       if (dev_WARN_ONCE(dev, pgmap && dev_dax->nr_range > 1,
-                       "static pgmap / multi-range device conflict\n"))
+       if (static_dev_dax(dev_dax) && dev_dax->nr_range > 1) {
+               dev_warn(dev, "static pgmap / multi-range device conflict\n");
                return -EINVAL;
+       }

-       if (!pgmap) {
-               pgmap = devm_kzalloc(dev, sizeof(*pgmap) + sizeof(struct range)
-                               * (dev_dax->nr_range - 1), GFP_KERNEL);
+       if (static_dev_dax(dev_dax)) {
+               pgmap = dev_dax->pgmap;
+       } else {
+               if (dev_dax->pgmap) {
+                       dev_warn(dev,
+                                "dynamic-dax with pre-populated page map!?\n");
+                       return -EINVAL;
+               }
+               pgmap = devm_kzalloc(
+                       dev, struct_size(pgmap, ranges, dev_dax->nr_range - 1),
+                       GFP_KERNEL);
                if (!pgmap)
                        return -ENOMEM;
                pgmap->nr_range = dev_dax->nr_range;
+               dev_dax->pgmap = pgmap;
+               for (i = 0; i < dev_dax->nr_range; i++) {
+                       struct range *range = &dev_dax->ranges[i].range;
+
+                       pgmap->ranges[i] = *range;
+               }
        }

        for (i = 0; i < dev_dax->nr_range; i++) {
                struct range *range = &dev_dax->ranges[i].range;

-               if (!devm_request_mem_region(dev, range->start,
-                                       range_len(range), dev_name(dev))) {
-                       dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve range\n",
-                                       i, range->start, range->end);
-                       return -EBUSY;
-               }
-               /* don't update the range for static pgmap */
-               if (!dev_dax->pgmap)
-                       pgmap->ranges[i] = *range;
+               if (devm_request_mem_region(dev, range->start, range_len(range),
+                                           dev_name(dev)))
+                       continue;
+               dev_warn(dev,
+                        "mapping%d: %#llx-%#llx could not reserve range\n", i,
+                        range->start, range->end);
+               return -EBUSY;
        }

        pgmap->type = MEMORY_DEVICE_GENERIC;
@@ -473,3 +485,4 @@ MODULE_LICENSE("GPL v2");
 module_init(dax_init);
 module_exit(dax_exit);
 MODULE_ALIAS_DAX_DEVICE(0);
+MODULE_IMPORT_NS(DAX);



> +
>         addr = devm_memremap_pages(dev, pgmap);
>         if (IS_ERR(addr))
>                 return PTR_ERR(addr);
> --
> 2.17.1
>

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 07/14] device-dax: compound devmap support
  2021-08-27 14:58 ` [PATCH v4 07/14] device-dax: compound devmap support Joao Martins
@ 2021-11-05  0:38   ` Dan Williams
  2021-11-05 14:10     ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Dan Williams @ 2021-11-05  0:38 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, Linux NVDIMM, Linux Doc Mailing List

On Fri, Aug 27, 2021 at 7:59 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Use the newly added compound devmap facility which maps the assigned dax
> ranges as compound pages at a page size of @align. Currently, this means,
> that region/namespace bootstrap would take considerably less, given that
> you would initialize considerably less pages.
>
> On setups with 128G NVDIMMs the initialization with DRAM stored struct
> pages improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less
> than a 1msec with 1G pages.
>
> dax devices are created with a fixed @align (huge page size) which is
> enforced through as well at mmap() of the device. Faults, consequently
> happen too at the specified @align specified at the creation, and those
> don't change through out dax device lifetime.

s/through out/throughout/

> MCEs poisons a whole dax huge page, as well as splits occurring at the configured page size.

A clarification here: MCEs trigger memory_failure() to *unmap* a whole
dax huge page; the poison stays limited to a single cacheline.

Otherwise the patch looks good to me.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  drivers/dax/device.c | 56 ++++++++++++++++++++++++++++++++++----------
>  1 file changed, 43 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 6e348b5f9d45..5d23128f9a60 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -192,6 +192,42 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
>  }
>  #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>
> +static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
> +                            unsigned long fault_size,
> +                            struct address_space *f_mapping)
> +{
> +       unsigned long i;
> +       pgoff_t pgoff;
> +
> +       pgoff = linear_page_index(vmf->vma, ALIGN(vmf->address, fault_size));
> +
> +       for (i = 0; i < fault_size / PAGE_SIZE; i++) {
> +               struct page *page;
> +
> +               page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
> +               if (page->mapping)
> +                       continue;
> +               page->mapping = f_mapping;
> +               page->index = pgoff + i;
> +       }
> +}
> +
> +static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
> +                                unsigned long fault_size,
> +                                struct address_space *f_mapping)
> +{
> +       struct page *head;
> +
> +       head = pfn_to_page(pfn_t_to_pfn(pfn));
> +       head = compound_head(head);
> +       if (head->mapping)
> +               return;
> +
> +       head->mapping = f_mapping;
> +       head->index = linear_page_index(vmf->vma,
> +                       ALIGN(vmf->address, fault_size));
> +}
> +
>  static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>                 enum page_entry_size pe_size)
>  {
> @@ -225,8 +261,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>         }
>
>         if (rc == VM_FAULT_NOPAGE) {
> -               unsigned long i;
> -               pgoff_t pgoff;
> +               struct dev_pagemap *pgmap = dev_dax->pgmap;
>
>                 /*
>                  * In the device-dax case the only possibility for a
> @@ -234,17 +269,10 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>                  * mapped. No need to consider the zero page, or racing
>                  * conflicting mappings.
>                  */
> -               pgoff = linear_page_index(vmf->vma,
> -                               ALIGN(vmf->address, fault_size));
> -               for (i = 0; i < fault_size / PAGE_SIZE; i++) {
> -                       struct page *page;
> -
> -                       page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
> -                       if (page->mapping)
> -                               continue;
> -                       page->mapping = filp->f_mapping;
> -                       page->index = pgoff + i;
> -               }
> +               if (pgmap_geometry(pgmap) > 1)
> +                       set_compound_mapping(vmf, pfn, fault_size, filp->f_mapping);
> +               else
> +                       set_page_mapping(vmf, pfn, fault_size, filp->f_mapping);
>         }
>         dax_read_unlock(id);
>
> @@ -426,6 +454,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>         }
>
>         pgmap->type = MEMORY_DEVICE_GENERIC;
> +       if (dev_dax->align > PAGE_SIZE)
> +               pgmap->geometry = dev_dax->align >> PAGE_SHIFT;
>         dev_dax->pgmap = pgmap;
>
>         addr = devm_memremap_pages(dev, pgmap);
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 06/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  2021-11-05  0:31   ` Dan Williams
@ 2021-11-05 12:09     ` Joao Martins
  2021-11-05 16:14       ` Joao Martins
  2021-11-05 16:46       ` Dan Williams
  0 siblings, 2 replies; 48+ messages in thread
From: Joao Martins @ 2021-11-05 12:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, Linux NVDIMM, Linux Doc Mailing List

On 11/5/21 00:31, Dan Williams wrote:
> On Fri, Aug 27, 2021 at 7:59 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Right now, only static dax regions have a valid @pgmap pointer in its
>> struct dev_dax. Dynamic dax case however, do not.
>>
>> In preparation for device-dax compound devmap support, make sure that
>> dev_dax pgmap field is set after it has been allocated and initialized.
>>
>> dynamic dax device have the @pgmap is allocated at probe() and it's
>> managed by devm (contrast to static dax region which a pgmap is provided
>> and dax core kfrees it). So in addition to ensure a valid @pgmap, clear
>> the pgmap when the dynamic dax device is released to avoid the same
>> pgmap ranges to be re-requested across multiple region device reconfigs.
>>
>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  drivers/dax/bus.c    | 8 ++++++++
>>  drivers/dax/device.c | 2 ++
>>  2 files changed, 10 insertions(+)
>>
>> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
>> index 6cc4da4c713d..49dbff9ba609 100644
>> --- a/drivers/dax/bus.c
>> +++ b/drivers/dax/bus.c
>> @@ -363,6 +363,14 @@ void kill_dev_dax(struct dev_dax *dev_dax)
>>
>>         kill_dax(dax_dev);
>>         unmap_mapping_range(inode->i_mapping, 0, 0, 1);
>> +
>> +       /*
>> +        * Dynamic dax region have the pgmap allocated via dev_kzalloc()
>> +        * and thus freed by devm. Clear the pgmap to not have stale pgmap
>> +        * ranges on probe() from previous reconfigurations of region devices.
>> +        */
>> +       if (!is_static(dev_dax->region))
>> +               dev_dax->pgmap = NULL;
>>  }
>>  EXPORT_SYMBOL_GPL(kill_dev_dax);
>>
>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>> index 0b82159b3564..6e348b5f9d45 100644
>> --- a/drivers/dax/device.c
>> +++ b/drivers/dax/device.c
>> @@ -426,6 +426,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>>         }
>>
>>         pgmap->type = MEMORY_DEVICE_GENERIC;
>> +       dev_dax->pgmap = pgmap;
> 
> So I think I'd rather see a bigger patch that replaces some of the
> implicit dev_dax->pgmap == NULL checks with explicit is_static()
> checks. Something like the following only compile and boot tested...
> Note the struct_size() change probably wants to be its own cleanup,
> and the EXPORT_SYMBOL_NS_GPL(..., DAX) probably wants to be its own
> patch converting over the entirety of drivers/dax/. Thoughts?
> 
It's a good idea. Certainly the implicit pgmap == NULL check made it harder
than necessary to find where the problem was. So turning those checks
into explicit checks that differentiate static vs dynamic dax will help.

With respect to this series, converting those pgmap == NULL checks is going to
require me to export the symbol (provided dax core and dax device can be built
as modules). So I don't know how this can be a patch converting the entirety of
dax. Perhaps you mean that I would just EXPORT_SYMBOL() here, and then a bigger
patch introduces the MODULE_IMPORT_NS() and EXPORT_SYMBOL_NS*() separately.

The struct_size, yeah, should be a separate patch much like commit 7d18dd75a8af
("device-dax/kmem: use struct_size()").

minor comment below on your snippet.

> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 6cc4da4c713d..67ab7e05b340 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -134,6 +134,12 @@ static bool is_static(struct dax_region *dax_region)
>         return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
>  }
> 
> +bool static_dev_dax(struct dev_dax *dev_dax)
> +{
> +       return is_static(dev_dax->region);
> +}
> +EXPORT_SYMBOL_NS_GPL(static_dev_dax, DAX);
> +
>  static u64 dev_dax_size(struct dev_dax *dev_dax)
>  {
>         u64 size = 0;
> @@ -363,6 +369,8 @@ void kill_dev_dax(struct dev_dax *dev_dax)
> 
>         kill_dax(dax_dev);
>         unmap_mapping_range(inode->i_mapping, 0, 0, 1);
> +       if (static_dev_dax(dev_dax))
> +               dev_dax->pgmap = NULL;
>  }

Here you probably meant !static_dev_dax() per my patch.

>  EXPORT_SYMBOL_GPL(kill_dev_dax);
> 
> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> index 1e946ad7780a..4acdfee7dd59 100644
> --- a/drivers/dax/bus.h
> +++ b/drivers/dax/bus.h
> @@ -48,6 +48,7 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
>         __dax_driver_register(driver, THIS_MODULE, KBUILD_MODNAME)
>  void dax_driver_unregister(struct dax_device_driver *dax_drv);
>  void kill_dev_dax(struct dev_dax *dev_dax);
> +bool static_dev_dax(struct dev_dax *dev_dax);
> 
>  #if IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT)
>  int dev_dax_probe(struct dev_dax *dev_dax);
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index dd8222a42808..87507aff2b10 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -398,31 +398,43 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>         void *addr;
>         int rc, i;
> 
> -       pgmap = dev_dax->pgmap;
> -       if (dev_WARN_ONCE(dev, pgmap && dev_dax->nr_range > 1,
> -                       "static pgmap / multi-range device conflict\n"))
> +       if (static_dev_dax(dev_dax) && dev_dax->nr_range > 1) {
> +               dev_warn(dev, "static pgmap / multi-range device conflict\n");
>                 return -EINVAL;
> +       }
> 
> -       if (!pgmap) {
> -               pgmap = devm_kzalloc(dev, sizeof(*pgmap) + sizeof(struct range)
> -                               * (dev_dax->nr_range - 1), GFP_KERNEL);
> +       if (static_dev_dax(dev_dax)) {
> +               pgmap = dev_dax->pgmap;
> +       } else {
> +               if (dev_dax->pgmap) {
> +                       dev_warn(dev,
> +                                "dynamic-dax with pre-populated page map!?\n");
> +                       return -EINVAL;
> +               }
> +               pgmap = devm_kzalloc(
> +                       dev, struct_size(pgmap, ranges, dev_dax->nr_range - 1),
> +                       GFP_KERNEL);
>                 if (!pgmap)
>                         return -ENOMEM;
>                 pgmap->nr_range = dev_dax->nr_range;
> +               dev_dax->pgmap = pgmap;
> +               for (i = 0; i < dev_dax->nr_range; i++) {
> +                       struct range *range = &dev_dax->ranges[i].range;
> +
> +                       pgmap->ranges[i] = *range;
> +               }
>         }
> 
This code move is probably not needed, unless your point is to have a clearer
separation between what's initialization versus the mem region request (that's
applicable to both dynamic and static).

>         for (i = 0; i < dev_dax->nr_range; i++) {
>                 struct range *range = &dev_dax->ranges[i].range;
> 
> -               if (!devm_request_mem_region(dev, range->start,
> -                                       range_len(range), dev_name(dev))) {
> -                       dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve range\n",
> -                                       i, range->start, range->end);
> -                       return -EBUSY;
> -               }
> -               /* don't update the range for static pgmap */
> -               if (!dev_dax->pgmap)
> -                       pgmap->ranges[i] = *range;
> +               if (devm_request_mem_region(dev, range->start, range_len(range),
> +                                           dev_name(dev)))
> +                       continue;
> +               dev_warn(dev,
> +                        "mapping%d: %#llx-%#llx could not reserve range\n", i,
> +                        range->start, range->end);
> +               return -EBUSY;
>         }
> 
>         pgmap->type = MEMORY_DEVICE_GENERIC;
> @@ -473,3 +485,4 @@ MODULE_LICENSE("GPL v2");
>  module_init(dax_init);
>  module_exit(dax_exit);
>  MODULE_ALIAS_DAX_DEVICE(0);
> +MODULE_IMPORT_NS(DAX);


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v4 07/14] device-dax: compound devmap support
  2021-11-05  0:38   ` Dan Williams
@ 2021-11-05 14:10     ` Joao Martins
  2021-11-05 16:41       ` Dan Williams
  0 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2021-11-05 14:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, Linux NVDIMM, Linux Doc Mailing List

On 11/5/21 00:38, Dan Williams wrote:
> On Fri, Aug 27, 2021 at 7:59 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Use the newly added compound devmap facility which maps the assigned dax
>> ranges as compound pages at a page size of @align. Currently, this means,
>> that region/namespace bootstrap would take considerably less, given that
>> you would initialize considerably less pages.
>>
>> On setups with 128G NVDIMMs the initialization with DRAM stored struct
>> pages improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less
>> than a 1msec with 1G pages.
>>
>> dax devices are created with a fixed @align (huge page size) which is
>> enforced through as well at mmap() of the device. Faults, consequently
>> happen too at the specified @align specified at the creation, and those
>> don't change through out dax device lifetime.
> 
> s/through out/throughout/
> 
>> MCEs poisons a whole dax huge page, as well as splits occurring at the configured page size.
> 
> A clarification here, MCEs trigger memory_failure() to *unmap* a whole
> dax huge page, the poison stays limited to a single cacheline.
> 
Ah, yes. I'll fix it for v5.

> Otherwise the patch looks good to me.
> 
Thanks!

Btw, does 'looks good' == Reviewed-by (with the commit message clarification above), or is
it that 'should be good with the amendment above and you get the tag in the next round'?

Asking as IIRC you mentioned this too some other time(s) (in the simpler sparse-vmemmap
patches) hence just clarifying to understand your expected 'process' better.

Also, I will be splitting this series as mentioned in the other discussion ...

https://lore.kernel.org/linux-mm/20211019160136.GH3686969@ziepe.ca/

... So this patch and the previous one should be the last two patches of the series
and the rest (gup, sparse-vmemmap) will go in parallel.


* Re: [PATCH v4 06/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  2021-11-05 12:09     ` Joao Martins
@ 2021-11-05 16:14       ` Joao Martins
  2021-11-05 16:46       ` Dan Williams
  1 sibling, 0 replies; 48+ messages in thread
From: Joao Martins @ 2021-11-05 16:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, Linux NVDIMM, Linux Doc Mailing List

On 11/5/21 12:09, Joao Martins wrote:
> On 11/5/21 00:31, Dan Williams wrote:
>> On Fri, Aug 27, 2021 at 7:59 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>
>>> Right now, only static dax regions have a valid @pgmap pointer in its
>>> struct dev_dax. Dynamic dax case however, do not.
>>>
>>> In preparation for device-dax compound devmap support, make sure that
>>> dev_dax pgmap field is set after it has been allocated and initialized.
>>>
>>> dynamic dax device have the @pgmap is allocated at probe() and it's
>>> managed by devm (contrast to static dax region which a pgmap is provided
>>> and dax core kfrees it). So in addition to ensure a valid @pgmap, clear
>>> the pgmap when the dynamic dax device is released to avoid the same
>>> pgmap ranges to be re-requested across multiple region device reconfigs.
>>>
>>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>>  drivers/dax/bus.c    | 8 ++++++++
>>>  drivers/dax/device.c | 2 ++
>>>  2 files changed, 10 insertions(+)
>>>
>>> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
>>> index 6cc4da4c713d..49dbff9ba609 100644
>>> --- a/drivers/dax/bus.c
>>> +++ b/drivers/dax/bus.c
>>> @@ -363,6 +363,14 @@ void kill_dev_dax(struct dev_dax *dev_dax)
>>>
>>>         kill_dax(dax_dev);
>>>         unmap_mapping_range(inode->i_mapping, 0, 0, 1);
>>> +
>>> +       /*
>>> +        * Dynamic dax region have the pgmap allocated via dev_kzalloc()
>>> +        * and thus freed by devm. Clear the pgmap to not have stale pgmap
>>> +        * ranges on probe() from previous reconfigurations of region devices.
>>> +        */
>>> +       if (!is_static(dev_dax->region))
>>> +               dev_dax->pgmap = NULL;
>>>  }
>>>  EXPORT_SYMBOL_GPL(kill_dev_dax);
>>>
>>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>>> index 0b82159b3564..6e348b5f9d45 100644
>>> --- a/drivers/dax/device.c
>>> +++ b/drivers/dax/device.c
>>> @@ -426,6 +426,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>>>         }
>>>
>>>         pgmap->type = MEMORY_DEVICE_GENERIC;
>>> +       dev_dax->pgmap = pgmap;
>>
>> So I think I'd rather see a bigger patch that replaces some of the
>> implicit dev_dax->pgmap == NULL checks with explicit is_static()
>> checks. Something like the following only compile and boot tested...
>> Note the struct_size() change probably wants to be its own cleanup,
>> and the EXPORT_SYMBOL_NS_GPL(..., DAX) probably wants to be its own
>> patch converting over the entirety of drivers/dax/. Thoughts?
>>
> It's a good idea. Certainly the implicit pgmap == NULL check made it harder
> than necessary to find where the problem was. So turning those checks
> into explicit checks that differentiate static vs dynamic dax will help.
> 
> With respect to this series, converting those pgmap == NULL checks is going to
> require me to export the symbol (provided dax core and dax device can be built
> as modules). So I don't know how this can be a patch converting the entirety of
> drivers/dax. Perhaps you mean that I would just EXPORT_SYMBOL() and then a bigger
> patch introduces the MODULE_IMPORT_NS() and EXPORT_SYMBOL_NS*() separately.
> 
To be clear, by "then a bigger patch" I actually meant that the
EXPORT_SYMBOL_NS()/MODULE_IMPORT_NS() conversion would be a separate patch not
included in this series. For this series I would just use EXPORT_SYMBOL().
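
Spelling that out for anyone following along, the two sides of the namespaced
export pair up roughly like below. This is only a sketch of the mechanism,
reusing the static_dev_dax() helper from your snippet, not the eventual patch:

/*
 * Exporting side (e.g. drivers/dax/bus.c). For this series a plain
 * EXPORT_SYMBOL_GPL(static_dev_dax) is enough; a later drivers/dax-wide
 * conversion would switch to the namespaced variant instead:
 */
EXPORT_SYMBOL_NS_GPL(static_dev_dax, DAX);

/*
 * Consuming side (e.g. drivers/dax/device.c). Only needed once the symbol
 * is namespaced: modpost warns at build time and the module loader refuses
 * to resolve the symbol at load time if the namespace is not imported.
 */
MODULE_IMPORT_NS(DAX);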

> The struct_size, yeah, should be a separate patch much like commit 7d18dd75a8af
> ("device-dax/kmem: use struct_size()").
> 
This cleanup would still be included as a predecessor patch, as I will be
touching this area in this patch.
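
As an aside on why struct_size() is the better form there: it computes the
size of a struct plus a trailing array while checking for integer overflow,
rather than open-coding the sizeof() arithmetic. A minimal, generic sketch
with a made-up struct (not the real dev_pagemap layout; the dax code passes
dev_dax->nr_range - 1 because sizeof(*pgmap) already covers the first range):

#include <linux/device.h>
#include <linux/overflow.h>
#include <linux/range.h>
#include <linux/slab.h>

/* Hypothetical container with a trailing flexible array of ranges. */
struct foo_map {
        int nr_range;
        struct range ranges[];
};

static struct foo_map *foo_map_alloc(struct device *dev, int nr_range)
{
        struct foo_map *map;

        /*
         * Equivalent to sizeof(*map) + nr_range * sizeof(struct range),
         * but saturating to SIZE_MAX on overflow so an oversized request
         * fails the allocation instead of silently under-sizing it.
         */
        map = devm_kzalloc(dev, struct_size(map, ranges, nr_range), GFP_KERNEL);
        if (map)
                map->nr_range = nr_range;
        return map;
}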

> minor comment below on your snippet.
> 
>>
>> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
>> index 6cc4da4c713d..67ab7e05b340 100644
>> --- a/drivers/dax/bus.c
>> +++ b/drivers/dax/bus.c
>> @@ -134,6 +134,12 @@ static bool is_static(struct dax_region *dax_region)
>>         return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
>>  }
>>
>> +bool static_dev_dax(struct dev_dax *dev_dax)
>> +{
>> +       return is_static(dev_dax->region);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(static_dev_dax, DAX);
>> +
>>  static u64 dev_dax_size(struct dev_dax *dev_dax)
>>  {
>>         u64 size = 0;
>> @@ -363,6 +369,8 @@ void kill_dev_dax(struct dev_dax *dev_dax)
>>
>>         kill_dax(dax_dev);
>>         unmap_mapping_range(inode->i_mapping, 0, 0, 1);
>> +       if (static_dev_dax(dev_dax))
>> +               dev_dax->pgmap = NULL;
>>  }
> 
> Here you probably meant !static_dev_dax() per my patch.
> 
>>  EXPORT_SYMBOL_GPL(kill_dev_dax);
>>
>> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
>> index 1e946ad7780a..4acdfee7dd59 100644
>> --- a/drivers/dax/bus.h
>> +++ b/drivers/dax/bus.h
>> @@ -48,6 +48,7 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
>>         __dax_driver_register(driver, THIS_MODULE, KBUILD_MODNAME)
>>  void dax_driver_unregister(struct dax_device_driver *dax_drv);
>>  void kill_dev_dax(struct dev_dax *dev_dax);
>> +bool static_dev_dax(struct dev_dax *dev_dax);
>>
>>  #if IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT)
>>  int dev_dax_probe(struct dev_dax *dev_dax);
>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>> index dd8222a42808..87507aff2b10 100644
>> --- a/drivers/dax/device.c
>> +++ b/drivers/dax/device.c
>> @@ -398,31 +398,43 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>>         void *addr;
>>         int rc, i;
>>
>> -       pgmap = dev_dax->pgmap;
>> -       if (dev_WARN_ONCE(dev, pgmap && dev_dax->nr_range > 1,
>> -                       "static pgmap / multi-range device conflict\n"))
>> +       if (static_dev_dax(dev_dax) && dev_dax->nr_range > 1) {
>> +               dev_warn(dev, "static pgmap / multi-range device conflict\n");
>>                 return -EINVAL;
>> +       }
>>
>> -       if (!pgmap) {
>> -               pgmap = devm_kzalloc(dev, sizeof(*pgmap) + sizeof(struct range)
>> -                               * (dev_dax->nr_range - 1), GFP_KERNEL);
>> +       if (static_dev_dax(dev_dax)) {
>> +               pgmap = dev_dax->pgmap;
>> +       } else {
>> +               if (dev_dax->pgmap) {
>> +                       dev_warn(dev,
>> +                                "dynamic-dax with pre-populated page map!?\n");
>> +                       return -EINVAL;
>> +               }
>> +               pgmap = devm_kzalloc(
>> +                       dev, struct_size(pgmap, ranges, dev_dax->nr_range - 1),
>> +                       GFP_KERNEL);
>>                 if (!pgmap)
>>                         return -ENOMEM;
>>                 pgmap->nr_range = dev_dax->nr_range;
>> +               dev_dax->pgmap = pgmap;
>> +               for (i = 0; i < dev_dax->nr_range; i++) {
>> +                       struct range *range = &dev_dax->ranges[i].range;
>> +
>> +                       pgmap->ranges[i] = *range;
>> +               }
>>         }
>>
> This code move is probably not needed unless your point is to have a more clear
> separation on what's initialization versus the mem region request (that's
> applicable to both dynamic and static).
> 
>>         for (i = 0; i < dev_dax->nr_range; i++) {
>>                 struct range *range = &dev_dax->ranges[i].range;
>>
>> -               if (!devm_request_mem_region(dev, range->start,
>> -                                       range_len(range), dev_name(dev))) {
>> -                       dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve range\n",
>> -                                       i, range->start, range->end);
>> -                       return -EBUSY;
>> -               }
>> -               /* don't update the range for static pgmap */
>> -               if (!dev_dax->pgmap)
>> -                       pgmap->ranges[i] = *range;
>> +               if (devm_request_mem_region(dev, range->start, range_len(range),
>> +                                           dev_name(dev)))
>> +                       continue;
>> +               dev_warn(dev,
>> +                        "mapping%d: %#llx-%#llx could not reserve range\n", i,
>> +                        range->start, range->end);
>> +               return -EBUSY;
>>         }
>>
>>         pgmap->type = MEMORY_DEVICE_GENERIC;
>> @@ -473,3 +485,4 @@ MODULE_LICENSE("GPL v2");
>>  module_init(dax_init);
>>  module_exit(dax_exit);
>>  MODULE_ALIAS_DAX_DEVICE(0);
>> +MODULE_IMPORT_NS(DAX);
> 


* Re: [PATCH v4 07/14] device-dax: compound devmap support
  2021-11-05 14:10     ` Joao Martins
@ 2021-11-05 16:41       ` Dan Williams
  0 siblings, 0 replies; 48+ messages in thread
From: Dan Williams @ 2021-11-05 16:41 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, Linux NVDIMM, Linux Doc Mailing List

On Fri, Nov 5, 2021 at 7:10 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 11/5/21 00:38, Dan Williams wrote:
> > On Fri, Aug 27, 2021 at 7:59 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> Use the newly added compound devmap facility which maps the assigned dax
> >> ranges as compound pages at a page size of @align. Currently, this means,
> >> that region/namespace bootstrap would take considerably less, given that
> >> you would initialize considerably less pages.
> >>
> >> On setups with 128G NVDIMMs the initialization with DRAM stored struct
> >> pages improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less
> >> than a 1msec with 1G pages.
> >>
> >> dax devices are created with a fixed @align (huge page size) which is
> >> enforced through as well at mmap() of the device. Faults, consequently
> >> happen too at the specified @align specified at the creation, and those
> >> don't change through out dax device lifetime.
> >
> > s/through out/throughout/
> >
> >> MCEs poisons a whole dax huge page, as well as splits occurring at the configured page size.
> >
> > A clarification here, MCEs trigger memory_failure() to *unmap* a whole
> > dax huge page, the poison stays limited to a single cacheline.
> >
> Ah, yes. I'll fix it for v5.
>
> > Otherwise the patch looks good to me.
> >
> Thanks!
>
> Btw, does 'looks good' == Reviewed-by (with the commit message clarification above), or is
> it that 'should be good with the amendment above and you get the tag in the next round'?

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

> Asking as IIRC you mentioned this too some other time(s) (in the simpler sparse-vmemmap
> patches) hence just clarifying to understand your expected 'process' better.
>
> Also, I will be splitting this series as mentioned in the other discussion ...
>
> https://lore.kernel.org/linux-mm/20211019160136.GH3686969@ziepe.ca/
>
> ... So this patch and the previous one should be the last two patches of the series
> and the rest (gup, sparse-vmemmap) will go in parallel.

Sounds good.


* Re: [PATCH v4 06/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  2021-11-05 12:09     ` Joao Martins
  2021-11-05 16:14       ` Joao Martins
@ 2021-11-05 16:46       ` Dan Williams
  2021-11-05 18:11         ` Joao Martins
  1 sibling, 1 reply; 48+ messages in thread
From: Dan Williams @ 2021-11-05 16:46 UTC (permalink / raw)
  To: Joao Martins
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, Linux NVDIMM, Linux Doc Mailing List

On Fri, Nov 5, 2021 at 5:10 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 11/5/21 00:31, Dan Williams wrote:
> > On Fri, Aug 27, 2021 at 7:59 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> Right now, only static dax regions have a valid @pgmap pointer in its
> >> struct dev_dax. Dynamic dax case however, do not.
> >>
> >> In preparation for device-dax compound devmap support, make sure that
> >> dev_dax pgmap field is set after it has been allocated and initialized.
> >>
> >> dynamic dax device have the @pgmap is allocated at probe() and it's
> >> managed by devm (contrast to static dax region which a pgmap is provided
> >> and dax core kfrees it). So in addition to ensure a valid @pgmap, clear
> >> the pgmap when the dynamic dax device is released to avoid the same
> >> pgmap ranges to be re-requested across multiple region device reconfigs.
> >>
> >> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >>  drivers/dax/bus.c    | 8 ++++++++
> >>  drivers/dax/device.c | 2 ++
> >>  2 files changed, 10 insertions(+)
> >>
> >> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> >> index 6cc4da4c713d..49dbff9ba609 100644
> >> --- a/drivers/dax/bus.c
> >> +++ b/drivers/dax/bus.c
> >> @@ -363,6 +363,14 @@ void kill_dev_dax(struct dev_dax *dev_dax)
> >>
> >>         kill_dax(dax_dev);
> >>         unmap_mapping_range(inode->i_mapping, 0, 0, 1);
> >> +
> >> +       /*
> >> +        * Dynamic dax region have the pgmap allocated via dev_kzalloc()
> >> +        * and thus freed by devm. Clear the pgmap to not have stale pgmap
> >> +        * ranges on probe() from previous reconfigurations of region devices.
> >> +        */
> >> +       if (!is_static(dev_dax->region))
> >> +               dev_dax->pgmap = NULL;
> >>  }
> >>  EXPORT_SYMBOL_GPL(kill_dev_dax);
> >>
> >> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> >> index 0b82159b3564..6e348b5f9d45 100644
> >> --- a/drivers/dax/device.c
> >> +++ b/drivers/dax/device.c
> >> @@ -426,6 +426,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
> >>         }
> >>
> >>         pgmap->type = MEMORY_DEVICE_GENERIC;
> >> +       dev_dax->pgmap = pgmap;
> >
> > So I think I'd rather see a bigger patch that replaces some of the
> > implicit dev_dax->pgmap == NULL checks with explicit is_static()
> > checks. Something like the following only compile and boot tested...
> > Note the struct_size() change probably wants to be its own cleanup,
> > and the EXPORT_SYMBOL_NS_GPL(..., DAX) probably wants to be its own
> > patch converting over the entirety of drivers/dax/. Thoughts?
> >
> It's a good idea. Certainly the implicit pgmap == NULL check made it harder
> than necessary to find where the problem was. So turning those checks
> into explicit checks that differentiate static vs dynamic dax will help.
>
> With respect to this series, converting those pgmap == NULL checks is going to
> require me to export the symbol (provided dax core and dax device can be built
> as modules). So I don't know how this can be a patch converting the entirety of
> drivers/dax. Perhaps you mean that I would just EXPORT_SYMBOL() and then a bigger
> patch introduces the MODULE_IMPORT_NS() and EXPORT_SYMBOL_NS*() separately.

Yeah, either a lead-in patch to do the conversion, or a follow on to
convert everything after the fact. Either way works for me, but I have
a small preference for the lead-in patch.

>
> The struct_size, yeah, should be a separate patch much like commit 7d18dd75a8af
> ("device-dax/kmem: use struct_size()").

Yeah.

>
> minor comment below on your snippet.
>
> >
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 6cc4da4c713d..67ab7e05b340 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -134,6 +134,12 @@ static bool is_static(struct dax_region *dax_region)
> >         return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
> >  }
> >
> > +bool static_dev_dax(struct dev_dax *dev_dax)
> > +{
> > +       return is_static(dev_dax->region);
> > +}
> > +EXPORT_SYMBOL_NS_GPL(static_dev_dax, DAX);
> > +
> >  static u64 dev_dax_size(struct dev_dax *dev_dax)
> >  {
> >         u64 size = 0;
> > @@ -363,6 +369,8 @@ void kill_dev_dax(struct dev_dax *dev_dax)
> >
> >         kill_dax(dax_dev);
> >         unmap_mapping_range(inode->i_mapping, 0, 0, 1);
> > +       if (static_dev_dax(dev_dax))
> > +               dev_dax->pgmap = NULL;
> >  }
>
> Here you probably meant !static_dev_dax() per my patch.

Oops, yes.

>
> >  EXPORT_SYMBOL_GPL(kill_dev_dax);
> >
> > diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
> > index 1e946ad7780a..4acdfee7dd59 100644
> > --- a/drivers/dax/bus.h
> > +++ b/drivers/dax/bus.h
> > @@ -48,6 +48,7 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
> >         __dax_driver_register(driver, THIS_MODULE, KBUILD_MODNAME)
> >  void dax_driver_unregister(struct dax_device_driver *dax_drv);
> >  void kill_dev_dax(struct dev_dax *dev_dax);
> > +bool static_dev_dax(struct dev_dax *dev_dax);
> >
> >  #if IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT)
> >  int dev_dax_probe(struct dev_dax *dev_dax);
> > diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> > index dd8222a42808..87507aff2b10 100644
> > --- a/drivers/dax/device.c
> > +++ b/drivers/dax/device.c
> > @@ -398,31 +398,43 @@ int dev_dax_probe(struct dev_dax *dev_dax)
> >         void *addr;
> >         int rc, i;
> >
> > -       pgmap = dev_dax->pgmap;
> > -       if (dev_WARN_ONCE(dev, pgmap && dev_dax->nr_range > 1,
> > -                       "static pgmap / multi-range device conflict\n"))
> > +       if (static_dev_dax(dev_dax) && dev_dax->nr_range > 1) {
> > +               dev_warn(dev, "static pgmap / multi-range device conflict\n");
> >                 return -EINVAL;
> > +       }
> >
> > -       if (!pgmap) {
> > -               pgmap = devm_kzalloc(dev, sizeof(*pgmap) + sizeof(struct range)
> > -                               * (dev_dax->nr_range - 1), GFP_KERNEL);
> > +       if (static_dev_dax(dev_dax)) {
> > +               pgmap = dev_dax->pgmap;
> > +       } else {
> > +               if (dev_dax->pgmap) {
> > +                       dev_warn(dev,
> > +                                "dynamic-dax with pre-populated page map!?\n");
> > +                       return -EINVAL;
> > +               }
> > +               pgmap = devm_kzalloc(
> > +                       dev, struct_size(pgmap, ranges, dev_dax->nr_range - 1),
> > +                       GFP_KERNEL);
> >                 if (!pgmap)
> >                         return -ENOMEM;
> >                 pgmap->nr_range = dev_dax->nr_range;
> > +               dev_dax->pgmap = pgmap;
> > +               for (i = 0; i < dev_dax->nr_range; i++) {
> > +                       struct range *range = &dev_dax->ranges[i].range;
> > +
> > +                       pgmap->ranges[i] = *range;
> > +               }
> >         }
> >
> This code move is probably not needed unless your point is to have a more clear
> separation on what's initialization versus the mem region request (that's
> applicable to both dynamic and static).

It was more of an RFC cleanup idea and yes, should be its own patch if
you think it helps make the init path clearer.


* Re: [PATCH v4 06/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  2021-11-05 16:46       ` Dan Williams
@ 2021-11-05 18:11         ` Joao Martins
  0 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2021-11-05 18:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: Linux MM, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, Linux NVDIMM, Linux Doc Mailing List

On 11/5/21 16:46, Dan Williams wrote:
> On Fri, Nov 5, 2021 at 5:10 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> On 11/5/21 00:31, Dan Williams wrote:
>>> On Fri, Aug 27, 2021 at 7:59 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>
>>>> Right now, only static dax regions have a valid @pgmap pointer in its
>>>> struct dev_dax. Dynamic dax case however, do not.
>>>>
>>>> In preparation for device-dax compound devmap support, make sure that
>>>> dev_dax pgmap field is set after it has been allocated and initialized.
>>>>
>>>> dynamic dax device have the @pgmap is allocated at probe() and it's
>>>> managed by devm (contrast to static dax region which a pgmap is provided
>>>> and dax core kfrees it). So in addition to ensure a valid @pgmap, clear
>>>> the pgmap when the dynamic dax device is released to avoid the same
>>>> pgmap ranges to be re-requested across multiple region device reconfigs.
>>>>
>>>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> ---
>>>>  drivers/dax/bus.c    | 8 ++++++++
>>>>  drivers/dax/device.c | 2 ++
>>>>  2 files changed, 10 insertions(+)
>>>>
>>>> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
>>>> index 6cc4da4c713d..49dbff9ba609 100644
>>>> --- a/drivers/dax/bus.c
>>>> +++ b/drivers/dax/bus.c
>>>> @@ -363,6 +363,14 @@ void kill_dev_dax(struct dev_dax *dev_dax)
>>>>
>>>>         kill_dax(dax_dev);
>>>>         unmap_mapping_range(inode->i_mapping, 0, 0, 1);
>>>> +
>>>> +       /*
>>>> +        * Dynamic dax region have the pgmap allocated via dev_kzalloc()
>>>> +        * and thus freed by devm. Clear the pgmap to not have stale pgmap
>>>> +        * ranges on probe() from previous reconfigurations of region devices.
>>>> +        */
>>>> +       if (!is_static(dev_dax->region))
>>>> +               dev_dax->pgmap = NULL;
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(kill_dev_dax);
>>>>
>>>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>>>> index 0b82159b3564..6e348b5f9d45 100644
>>>> --- a/drivers/dax/device.c
>>>> +++ b/drivers/dax/device.c
>>>> @@ -426,6 +426,8 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>>>>         }
>>>>
>>>>         pgmap->type = MEMORY_DEVICE_GENERIC;
>>>> +       dev_dax->pgmap = pgmap;
>>>
>>> So I think I'd rather see a bigger patch that replaces some of the
>>> implicit dev_dax->pgmap == NULL checks with explicit is_static()
>>> checks. Something like the following only compile and boot tested...
>>> Note the struct_size() change probably wants to be its own cleanup,
>>> and the EXPORT_SYMBOL_NS_GPL(..., DAX) probably wants to be its own
>>> patch converting over the entirety of drivers/dax/. Thoughts?
>>>
>> It's a good idea. Certainly the implicit pgmap == NULL check made it harder
>> than necessary to find where the problem was. So turning those checks
>> into explicit checks that differentiate static vs dynamic dax will help.
>>
>> With respect to this series, converting those pgmap == NULL checks is going to
>> require me to export the symbol (provided dax core and dax device can be built
>> as modules). So I don't know how this can be a patch converting the entirety of
>> drivers/dax. Perhaps you mean that I would just EXPORT_SYMBOL() and then a bigger
>> patch introduces the MODULE_IMPORT_NS() and EXPORT_SYMBOL_NS*() separately.
> 
> Yeah, either a lead-in patch to do the conversion, or a follow on to
> convert everything after the fact. Either way works for me, but I have
> a small preference for the lead-in patch.
> 

The one reason I am leaning towards the after-the-fact conversion is that the
addition of a new exported symbol looks unrelated to a drivers/dax-wide
conversion to namespaced symbols. It looks like a separate cleanup on top of
this series, as opposed to a dependency cleanup.

>>>  EXPORT_SYMBOL_GPL(kill_dev_dax);
>>>
>>> diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
>>> index 1e946ad7780a..4acdfee7dd59 100644
>>> --- a/drivers/dax/bus.h
>>> +++ b/drivers/dax/bus.h
>>> @@ -48,6 +48,7 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
>>>         __dax_driver_register(driver, THIS_MODULE, KBUILD_MODNAME)
>>>  void dax_driver_unregister(struct dax_device_driver *dax_drv);
>>>  void kill_dev_dax(struct dev_dax *dev_dax);
>>> +bool static_dev_dax(struct dev_dax *dev_dax);
>>>
>>>  #if IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT)
>>>  int dev_dax_probe(struct dev_dax *dev_dax);
>>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>>> index dd8222a42808..87507aff2b10 100644
>>> --- a/drivers/dax/device.c
>>> +++ b/drivers/dax/device.c
>>> @@ -398,31 +398,43 @@ int dev_dax_probe(struct dev_dax *dev_dax)
>>>         void *addr;
>>>         int rc, i;
>>>
>>> -       pgmap = dev_dax->pgmap;
>>> -       if (dev_WARN_ONCE(dev, pgmap && dev_dax->nr_range > 1,
>>> -                       "static pgmap / multi-range device conflict\n"))
>>> +       if (static_dev_dax(dev_dax) && dev_dax->nr_range > 1) {
>>> +               dev_warn(dev, "static pgmap / multi-range device conflict\n");
>>>                 return -EINVAL;
>>> +       }
>>>
>>> -       if (!pgmap) {
>>> -               pgmap = devm_kzalloc(dev, sizeof(*pgmap) + sizeof(struct range)
>>> -                               * (dev_dax->nr_range - 1), GFP_KERNEL);
>>> +       if (static_dev_dax(dev_dax)) {
>>> +               pgmap = dev_dax->pgmap;
>>> +       } else {
>>> +               if (dev_dax->pgmap) {
>>> +                       dev_warn(dev,
>>> +                                "dynamic-dax with pre-populated page map!?\n");
>>> +                       return -EINVAL;
>>> +               }
>>> +               pgmap = devm_kzalloc(
>>> +                       dev, struct_size(pgmap, ranges, dev_dax->nr_range - 1),
>>> +                       GFP_KERNEL);
>>>                 if (!pgmap)
>>>                         return -ENOMEM;
>>>                 pgmap->nr_range = dev_dax->nr_range;
>>> +               dev_dax->pgmap = pgmap;
>>> +               for (i = 0; i < dev_dax->nr_range; i++) {
>>> +                       struct range *range = &dev_dax->ranges[i].range;
>>> +
>>> +                       pgmap->ranges[i] = *range;
>>> +               }
>>>         }
>>>
>> This code move is probably not needed unless your point is to have a more clear
>> separation on what's initialization versus the mem region request (that's
>> applicable to both dynamic and static).
> 
> It was more of an RFC cleanup idea and yes, should be its own patch if
> you think it helps make the init path clearer.
> 
OK.


Thread overview: 48+ messages
2021-08-27 14:58 [PATCH v4 00/14] mm, sparse-vmemmap: Introduce compound devmaps for device-dax Joao Martins
2021-08-27 14:58 ` [PATCH v4 01/14] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
2021-08-27 14:58 ` [PATCH v4 02/14] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
2021-08-27 14:58 ` [PATCH v4 03/14] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
2021-08-27 14:58 ` [PATCH v4 04/14] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
2021-08-27 15:33   ` Christoph Hellwig
2021-08-27 16:00     ` Joao Martins
2021-09-01  9:44       ` Christoph Hellwig
2021-09-09  9:38         ` Joao Martins
2021-08-27 14:58 ` [PATCH v4 05/14] device-dax: use ALIGN() for determining pgoff Joao Martins
2021-08-27 14:58 ` [PATCH v4 06/14] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
2021-11-05  0:31   ` Dan Williams
2021-11-05 12:09     ` Joao Martins
2021-11-05 16:14       ` Joao Martins
2021-11-05 16:46       ` Dan Williams
2021-11-05 18:11         ` Joao Martins
2021-08-27 14:58 ` [PATCH v4 07/14] device-dax: compound devmap support Joao Martins
2021-11-05  0:38   ` Dan Williams
2021-11-05 14:10     ` Joao Martins
2021-11-05 16:41       ` Dan Williams
2021-08-27 14:58 ` [PATCH v4 08/14] mm/gup: grab head page refcount once for group of subpages Joao Martins
2021-08-27 16:25   ` Jason Gunthorpe
2021-08-27 18:34     ` Joao Martins
2021-08-30 13:07       ` Jason Gunthorpe
2021-08-31 12:34         ` Joao Martins
2021-08-31 17:05           ` Jason Gunthorpe
2021-09-23 16:51             ` Joao Martins
2021-09-28 18:01               ` Jason Gunthorpe
2021-09-29 11:50                 ` Joao Martins
2021-09-29 19:34                   ` Jason Gunthorpe
2021-09-30  3:01                     ` Alistair Popple
2021-09-30 17:54                       ` Joao Martins
2021-09-30 21:55                         ` Jason Gunthorpe
2021-10-18 18:36                       ` Jason Gunthorpe
2021-10-18 18:37                   ` Jason Gunthorpe
2021-10-08 11:54   ` Jason Gunthorpe
2021-10-11 15:53     ` Joao Martins
2021-10-13 17:41       ` Jason Gunthorpe
2021-10-13 19:18         ` Joao Martins
2021-10-13 19:43           ` Jason Gunthorpe
2021-10-14 17:56             ` Joao Martins
2021-10-14 18:06               ` Jason Gunthorpe
2021-08-27 14:58 ` [PATCH v4 09/14] mm/sparse-vmemmap: add a pgmap argument to section activation Joao Martins
2021-08-27 14:58 ` [PATCH v4 10/14] mm/sparse-vmemmap: refactor core of vmemmap_populate_basepages() to helper Joao Martins
2021-08-27 14:58 ` [PATCH v4 11/14] mm/hugetlb_vmemmap: move comment block to Documentation/vm Joao Martins
2021-08-27 14:58 ` [PATCH v4 12/14] mm/sparse-vmemmap: populate compound devmaps Joao Martins
2021-08-27 14:58 ` [PATCH v4 13/14] mm/page_alloc: reuse tail struct pages for " Joao Martins
2021-08-27 14:58 ` [PATCH v4 14/14] mm/sparse-vmemmap: improve memory savings for compound pud geometry Joao Martins
