* [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap
@ 2021-11-12 15:08 Joao Martins
  2021-11-12 15:08 ` [PATCH v5 1/8] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
                   ` (8 more replies)
  0 siblings, 9 replies; 26+ messages in thread
From: Joao Martins @ 2021-11-12 15:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Changes since v4[4]:

* Remove patches 8-14 as they will go in 2 separate (parallel) series;
* Rename @geometry to @vmemmap_shift (Christoph Hellwig)
* Make @vmemmap_shift an order rather than nr of pages (Christoph Hellwig)
* Consequently remove the pgmap_geometry_order() helper as it's no longer
needed; access the structure member directly instead [Patch 4 and 8]
* Rename pgmap_geometry() to pgmap_vmemmap_nr() in patches 4 and 8;
* Remove usage of pgmap_geometry() in favour of testing
  @vmemmap_shift for non-zero directly in patch 8;
* Patch 5 is new for using `struct_size()` (Dan Williams)
* Add a 'static_dev_dax()' helper for testing pgmap == NULL handling
for dynamic dax devices.
* Expand patch 6 to be explicit about those !pgmap cases, and replace
those with static_dev_dax().
* Add gup/pin_user_pages() performance numbers with this series to patch 8.
* Massage commit description to remove mentions of @geometry.
* Add Dan's Reviewed-by on patch 8 (Dan Williams)

---

This series converts device-dax to use compound pages, and moves away from the
'struct page per basepage on PMD/PUD' that is done today. Doing so unlocks a
few noticeable improvements in unpin_user_pages() and makes the device-dax+altmap
case 4x faster in pinning (numbers below and in the last patch).

I've split the compound-pages-in-devmap part from the rest based on recent
discussions on pending and future devmap work[5][6]. There is consensus that
device-dax should be using compound pages to represent its PMDs/PUDs just
like HugeTLB and THP, which could lead to less specialization of the dax parts.
I will pursue the rest of the work in parallel once this part is merged, in
particular the GUP-{slow,fast} improvements[7] and the tail struct page
deduplication memory savings part[8].

To summarize what the series does:

Patch 1: Prepare hwpoisoning to work with dax compound pages.

Patches 2-3: Split the current utility function prep_compound_page()
into head and tail counterparts, and use those two helpers where
appropriate to take advantage of caches being warm after
__init_single_page(). This is used when initializing ZONE_DEVICE pages
as we bring up device-dax namespaces.
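
For reference, a condensed sketch of the resulting helpers (the pincount
handling is omitted here; the full version is in patch 2):

    static void prep_compound_head(struct page *page, unsigned int order)
    {
            /* the head page carries the destructor, order and mapcount */
            set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
            set_compound_order(page, order);
            atomic_set(compound_mapcount_ptr(page), -1);
    }

    static void prep_compound_tail(struct page *head, int tail_idx)
    {
            struct page *p = head + tail_idx;

            /* each tail page links back to its head */
            p->mapping = TAIL_MAPPING;
            set_compound_head(p, head);
    }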

Patches 4-7: Add devmap support for compound pages in device-dax.
memmap_init_zone_device() initializes its metadata as compound pages, and a
new devmap property named @vmemmap_shift describes how the vmemmap is
structured (defaulting to base pages as done today). The property is
essentially the page order of the metadata. Finally, device-dax sets the
devmap @vmemmap_shift to a value based on its own @align property.
@vmemmap_shift is 0 by default (today's case of base pages in devmap, as in
fsdax and the others), so usage of a compound devmap is optional. Starting
with device-dax (*not* fsdax) we enable it by default. There are a few pinning
improvements, particularly in the unpinning case and with an altmap, and
unpin_user_page_range_dirty_lock() becomes just as effective as for
THP/hugetlb[0] pages.

    $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
    (pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
    [altmap]
    (pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms
    
    $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
    (pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
    [altmap with -m 127004]
    (pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms
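
To make the @vmemmap_shift semantics concrete, a minimal sketch of the
arithmetic patch 8 applies (example_vmemmap_shift() is an illustrative
name, not a helper in the series; worked values assume PAGE_SHIFT == 12):

    /*
     * device-dax derives @vmemmap_shift from its @align:
     *   2M align: 2M >> PAGE_SHIFT == 512, order_base_2(512) == 9,
     *             so pgmap_vmemmap_nr() == 1 << 9 == 512 pages per compound page
     *   1G align: order_base_2(262144) == 18, i.e. 262144 pages per compound page
     */
    static unsigned long example_vmemmap_shift(unsigned long align)
    {
            return align > PAGE_SIZE ? order_base_2(align >> PAGE_SHIFT) : 0;
    }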

Tested on x86 with 1TB+ of pmem (including registering it with RDMA, with and
without an altmap), alongside gup_test selftests with dynamic and static dax
regions, coupled with ndctl unit tests for dynamic dax devices that exercise
all of this. Note that for dynamic dax regions I had to revert
commit 8aa83e6395 ("x86/setup: Call early_reserve_memory() earlier"); it
is a known issue that this commit broke efi_fake_mem=.

Patches apply on top of linux-next tag next-20211112 (commit f2e19fd15bd7).

Thanks for all the review.

Comments and suggestions very much appreciated!

Older Changelog,

v3[3] -> v4[4]:

 * Collect Dan's Reviewed-by on patches 1-5,8,9,11
 * Collect Muchun's Reviewed-by on patches 1, 2, 11
 * Reorder patches to first introduce compound pages in ZONE_DEVICE with
 device-dax (for pmem) as the first user (patches 1-8), followed by the
 sparse-vmemmap changes to minimize struct page overhead for devmap (patches 9-14)
 * Eliminate remnant @align references to use @geometry (Dan)
 * Convert mentions of 'compound pagemap' to 'compound devmap' throughout
   the series to avoid confusions of this work conflicting/referring to
   anything Folio or pagemap related.
 * Delete pgmap_pfn_geometry() on patch 4
   and rework other patches to use pgmap_geometry() instead (Dan)
 * Convert @geometry to be a number of pages rather than page size in patch 4 (Dan)
 * Make pgmap_geometry() more readable (Christoph)
 * Simplify pgmap refcount pfn computation in memremap_pages() (Christoph)
 * Rework memmap_init_compound() in patch 4 to use the same style as
 memmap_init_zone_device i.e. iterating over PFNs, rather than struct pages (Dan)
 * Add comment on devmap prep_compound_head callsite explaining why it needs
 to be used after first+second tail pages have been initialized (Dan, Jane)
 * Initialize tail page refcount to zero in patch 4
 * Make sure pfn_next() iterates over compound pages (rather than base pages) in
 patch 4 to tackle the zone_device elevated page refcount.
 [ Note these last two bullet points above are unneeded once this patch is merged:
   https://lore.kernel.org/linux-mm/20210825034828.12927-3-alex.sierra@amd.com/ ]
 * Remove usage of ternary operator when computing @end in gup_device_huge() in patch 8 (Dan)
 * Remove pinned_head variable in patch 8
 * Remove put_dev_pagemap() need for compound case as that is now fixed for the general case
 in patch 8
 * Switch to PageHead() instead of PageCompound() as we only work with either base pages
 or head pages in patch 8 (Matthew)
 * Fix kdoc of @altmap and improve kdoc for @pgmap in patch 9 (Dan)
 * Fix up missing return in vmemmap_populate_address() in patch 10
 * Change error handling style in all patches (Dan)
 * Change title of vmemmap_dedup.rst to be more representative of the purpose in patch 12 (Dan)
 * Move some of the section and subsection tail page reuse code into helpers
 reuse_compound_section() and compound_section_tail_page() for readability in patch 12 (Dan)
 * Commit description fixes for clarity in various patches (Dan)
 * Add pgmap_geometry_order() helper and
   drop unneeded geometry_size, order variables in patch 12
 * Drop unneeded byte based computation to be PFN in patch 12
 * Handle the dynamic dax region properly when ensuring a stable dev_dax->pgmap in patch 6.
 * Add a compound_nr_pages() helper and use it in memmap_init_zone_device to calculate
 the number of unique struct pages to initialize depending on @altmap existence in patch 13 (Dan)
 * Add compound_section_tail_huge_page() for the tail page PMD reuse in patch 14 (Dan)
 * Reword cover letter.

v2 -> v3[3]:
 * Collect Mike's Ack on patch 2 (Mike)
 * Collect Naoya's Reviewed-by on patch 1 (Naoya)
 * Rename compound_pagemaps.rst doc page (and its mentions) to vmemmap_dedup.rst (Mike, Muchun)
 * Rebased to next-20210714

v1[1] -> v2[2]:

 (New patches 7, 10, 11)
 * Remove occurrences of 'we' in the commit descriptions (now for real) [Dan]
 * Add comment on top of compound_head() for fsdax (Patch 1) [Dan]
 * Massage commit descriptions of cleanup/refactor patches to reflect
 that it's in preparation for bigger infra in sparse-vmemmap. (Patch 2,3,5) [Dan]
 * Greatly improve all commit messages in terms of grammar/wording and clarity. [Dan]
 * Rename variable/helpers from dev_pagemap::align to @geometry, reflecting
 that it's not the same thing as dev_dax->align, Patch 4 [Dan]
 * Move compound page init logic into separate memmap_init_compound() helper, Patch 4 [Dan]
 * Simplify patch 9 as a result of having compound initialization differently [Dan]
 * Rename @pfn_align variable in memmap_init_zone_device to @pfns_per_compound [Dan]
 * Rename Subject of patch 6 [Dan]
 * Move hugetlb_vmemmap.c comment block to Documentation/vm Patch 7 [Dan]
 * Add some type-safety to @block and use 'struct page *' rather than
 void, Patch 8 [Dan]
 * Add some comments to less obvious parts on 1G compound page case, Patch 8 [Dan]
 * Remove vmemmap lookup function in place of
 pmd_off_k() + pte_offset_kernel() given some guarantees on section onlining
 serialization, Patch 8
 * Add a comment to get_page() mentioning where/how it is freed, Patch 8 [Dan]
 * Add docs about device-dax usage of tail dedup technique in newly added
 compound_pagemaps.rst doc entry.
 * Add cleanup patch for device-dax for ensuring dev_dax::pgmap is always set [Dan]
 * Add cleanup patch for device-dax for using ALIGN() [Dan]
 * Store pinned head in separate @pinned_head variable and fix error case, patch 13 [Dan]
 * Add comment on difference of @next value for PageCompound(), patch 13 [Dan]
 * Move PUD compound page to be last patch [Dan]
 * Add vmemmap layout for PUD compound geometry in compound_pagemaps.rst doc, patch 14 [Dan]
 * Rebased to next-20210617 

 RFC[0] -> v1:
 (New patches 1-3, 5-8 but the diffstat isn't that different)
 * Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
 * Fix/Massage commit messages to be more clear and remove the 'we' occurrences (Dan, John, Matthew)
 * Use pfn_align to make clear it's the nr of pages for the @align value (John, Dan)
 * Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
 * Remove the gup_device_compound_huge special path and have the same code
   work both ways while special casing when devmap page is compound (Jason, John)
 * Avoid usage of vmemmap_populate_basepages() and introduce a first class
   loop that doesn't care about passing an altmap for memmap reuse. (Dan)
 * Completely rework vmemmap_populate_compound() to avoid the hack of passing
   @block across sparse_add_section() calls. It's a lot easier to
   follow and more explicit in what it does.
 * Replace the vmemmap refactoring with adding a @pgmap argument and moving
   parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a result)
 * Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
 * Improve memmap_init_zone_device() to initialize compound pages when
   struct pages are cache warm. That led to an even further speedup over
   the RFC series, from 190ms -> 80-120ms. Patches 2 and 3 are the new ones
   as a result (Dan)
 * Remove PGMAP_COMPOUND and use @align as the property to detect whether
   or not to reuse vmemmap areas (Dan)

[0] https://lore.kernel.org/linux-mm/20201208172901.17384-1-joao.m.martins@oracle.com/
[1] https://lore.kernel.org/linux-mm/20210325230938.30752-1-joao.m.martins@oracle.com/
[2] https://lore.kernel.org/linux-mm/20210617184507.3662-1-joao.m.martins@oracle.com/
[3] https://lore.kernel.org/linux-mm/20210714193542.21857-1-joao.m.martins@oracle.com/
[4] https://lore.kernel.org/linux-mm/20210827145819.16471-1-joao.m.martins@oracle.com/
[5] https://lore.kernel.org/linux-mm/20211018182559.GC3686969@ziepe.ca/
[6] https://lore.kernel.org/linux-mm/499043a0-b3d8-7a42-4aee-84b81f5b633f@oracle.com/
[7] https://lore.kernel.org/linux-mm/20210827145819.16471-9-joao.m.martins@oracle.com/
[8] https://lore.kernel.org/linux-mm/20210827145819.16471-13-joao.m.martins@oracle.com/

Joao Martins (8):
  memory-failure: fetch compound_head after pgmap_pfn_valid()
  mm/page_alloc: split prep_compound_page into head and tail subparts
  mm/page_alloc: refactor memmap_init_zone_device() page init
  mm/memremap: add ZONE_DEVICE support for compound pages
  device-dax: use ALIGN() for determining pgoff
  device-dax: use struct_size()
  device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  device-dax: compound devmap support

 drivers/dax/bus.c        |  14 ++++
 drivers/dax/bus.h        |   1 +
 drivers/dax/device.c     |  88 ++++++++++++++++++-------
 include/linux/memremap.h |  11 ++++
 mm/memory-failure.c      |   6 ++
 mm/memremap.c            |  12 ++--
 mm/page_alloc.c          | 138 +++++++++++++++++++++++++++------------
 7 files changed, 200 insertions(+), 70 deletions(-)

-- 
2.17.2



* [PATCH v5 1/8] memory-failure: fetch compound_head after pgmap_pfn_valid()
  2021-11-12 15:08 [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap Joao Martins
@ 2021-11-12 15:08 ` Joao Martins
  2021-11-12 15:08 ` [PATCH v5 2/8] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2021-11-12 15:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

memory_failure_dev_pagemap() at the moment assumes base pages (e.g.
dax_lock_page()).  For devmap with compound pages, fetch the
compound_head in case a tail-page memory failure is being handled.

Currently this is a nop, but with the advent of compound pages in
dev_pagemap it allows memory_failure_dev_pagemap() to keep working.

Reported-by: Jane Chu <jane.chu@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/memory-failure.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f64ebb6226cb..0c4b08da1a02 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1565,6 +1565,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		goto out;
 	}
 
+	/*
+	 * Pages instantiated by device-dax (not filesystem-dax)
+	 * may be compound pages.
+	 */
+	page = compound_head(page);
+
 	/*
 	 * Prevent the inode from being freed while we are interrogating
 	 * the address_space, typically this would be handled by
-- 
2.17.2



* [PATCH v5 2/8] mm/page_alloc: split prep_compound_page into head and tail subparts
  2021-11-12 15:08 [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap Joao Martins
  2021-11-12 15:08 ` [PATCH v5 1/8] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
@ 2021-11-12 15:08 ` Joao Martins
  2021-11-12 15:08 ` [PATCH v5 3/8] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2021-11-12 15:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Split the utility function prep_compound_page() into head and tail
counterparts, and use them accordingly.

This is in preparation for sharing the storage for compound page
metadata.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/page_alloc.c | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5952749ad40..20b9db0cf97c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -726,23 +726,33 @@ void free_compound_page(struct page *page)
 	free_the_page(page, compound_order(page));
 }
 
+static void prep_compound_head(struct page *page, unsigned int order)
+{
+	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
+	set_compound_order(page, order);
+	atomic_set(compound_mapcount_ptr(page), -1);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+}
+
+static void prep_compound_tail(struct page *head, int tail_idx)
+{
+	struct page *p = head + tail_idx;
+
+	p->mapping = TAIL_MAPPING;
+	set_compound_head(p, head);
+}
+
 void prep_compound_page(struct page *page, unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
 
 	__SetPageHead(page);
-	for (i = 1; i < nr_pages; i++) {
-		struct page *p = page + i;
-		p->mapping = TAIL_MAPPING;
-		set_compound_head(p, page);
-	}
+	for (i = 1; i < nr_pages; i++)
+		prep_compound_tail(page, i);
 
-	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
-	set_compound_order(page, order);
-	atomic_set(compound_mapcount_ptr(page), -1);
-	if (hpage_pincount_available(page))
-		atomic_set(compound_pincount_ptr(page), 0);
+	prep_compound_head(page, order);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
-- 
2.17.2



* [PATCH v5 3/8] mm/page_alloc: refactor memmap_init_zone_device() page init
  2021-11-12 15:08 [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap Joao Martins
  2021-11-12 15:08 ` [PATCH v5 1/8] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
  2021-11-12 15:08 ` [PATCH v5 2/8] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
@ 2021-11-12 15:08 ` Joao Martins
  2021-11-12 15:08 ` [PATCH v5 4/8] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2021-11-12 15:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Move struct page init to a helper function __init_zone_device_page().

This is in preparation for sharing the storage for compound page
metadata.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/page_alloc.c | 74 +++++++++++++++++++++++++++----------------------
 1 file changed, 41 insertions(+), 33 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20b9db0cf97c..23045a2a1339 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6572,6 +6572,46 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
 }
 
 #ifdef CONFIG_ZONE_DEVICE
+static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
+					  unsigned long zone_idx, int nid,
+					  struct dev_pagemap *pgmap)
+{
+
+	__init_single_page(page, pfn, zone_idx, nid);
+
+	/*
+	 * Mark page reserved as it will need to wait for onlining
+	 * phase for it to be fully associated with a zone.
+	 *
+	 * We can use the non-atomic __set_bit operation for setting
+	 * the flag as we are still initializing the pages.
+	 */
+	__SetPageReserved(page);
+
+	/*
+	 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
+	 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
+	 * ever freed or placed on a driver-private list.
+	 */
+	page->pgmap = pgmap;
+	page->zone_device_data = NULL;
+
+	/*
+	 * Mark the block movable so that blocks are reserved for
+	 * movable at startup. This will force kernel allocations
+	 * to reserve their blocks rather than leaking throughout
+	 * the address space during boot when many long-lived
+	 * kernel allocations are made.
+	 *
+	 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
+	 * because this is done early in section_activate()
+	 */
+	if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
+		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+		cond_resched();
+	}
+}
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6600,39 +6640,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
-		__init_single_page(page, pfn, zone_idx, nid);
-
-		/*
-		 * Mark page reserved as it will need to wait for onlining
-		 * phase for it to be fully associated with a zone.
-		 *
-		 * We can use the non-atomic __set_bit operation for setting
-		 * the flag as we are still initializing the pages.
-		 */
-		__SetPageReserved(page);
-
-		/*
-		 * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
-		 * and zone_device_data.  It is a bug if a ZONE_DEVICE page is
-		 * ever freed or placed on a driver-private list.
-		 */
-		page->pgmap = pgmap;
-		page->zone_device_data = NULL;
-
-		/*
-		 * Mark the block movable so that blocks are reserved for
-		 * movable at startup. This will force kernel allocations
-		 * to reserve their blocks rather than leaking throughout
-		 * the address space during boot when many long-lived
-		 * kernel allocations are made.
-		 *
-		 * Please note that MEMINIT_HOTPLUG path doesn't clear memmap
-		 * because this is done early in section_activate()
-		 */
-		if (IS_ALIGNED(pfn, pageblock_nr_pages)) {
-			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-			cond_resched();
-		}
+		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.2



* [PATCH v5 4/8] mm/memremap: add ZONE_DEVICE support for compound pages
  2021-11-12 15:08 [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap Joao Martins
                   ` (2 preceding siblings ...)
  2021-11-12 15:08 ` [PATCH v5 3/8] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
@ 2021-11-12 15:08 ` Joao Martins
  2021-11-12 15:08 ` [PATCH v5 5/8] device-dax: use ALIGN() for determining pgoff Joao Martins
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2021-11-12 15:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Add a new @vmemmap_shift property for struct dev_pagemap which specifies that a
devmap is composed of a set of compound pages of order @vmemmap_shift, instead of
base pages. When a compound page devmap is requested, all but the first
page are initialised as tail pages instead of order-0 pages.

For certain ZONE_DEVICE users like device-dax which have a fixed page size,
this creates an opportunity to optimize GUP and GUP-fast walkers, treating
it the same way as THP or hugetlb pages.

Additionally, commit 7118fc2906e2 ("hugetlb: address ref count racing in
prep_compound_gigantic_page") removed set_page_count() because the
setting of the page refcount to zero was redundant. devmap pages don't
come from the page allocator, though, and only the head page refcount is
used for compound pages, hence initialize the tail page count to zero.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/memremap.h | 11 +++++++++++
 mm/memremap.c            | 12 ++++++------
 mm/page_alloc.c          | 38 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 54 insertions(+), 7 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 119f130ef8f1..aaf85bda093b 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -99,6 +99,11 @@ struct dev_pagemap_ops {
  * @done: completion for @internal_ref
  * @type: memory type: see MEMORY_* in memory_hotplug.h
  * @flags: PGMAP_* flags to specify defailed behavior
+ * @vmemmap_shift: structural definition of how the vmemmap page metadata
+ *      is populated, specifically the metadata page order.
+ *	A zero value (default) uses base pages as the vmemmap metadata
+ *	representation. A bigger value will set up compound struct pages
+ *	of the requested order value.
  * @ops: method table
  * @owner: an opaque pointer identifying the entity that manages this
  *	instance.  Used by various helpers to make sure that no
@@ -114,6 +119,7 @@ struct dev_pagemap {
 	struct completion done;
 	enum memory_type type;
 	unsigned int flags;
+	unsigned long vmemmap_shift;
 	const struct dev_pagemap_ops *ops;
 	void *owner;
 	int nr_range;
@@ -130,6 +136,11 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
 	return NULL;
 }
 
+static inline unsigned long pgmap_vmemmap_nr(struct dev_pagemap *pgmap)
+{
+	return 1 << pgmap->vmemmap_shift;
+}
+
 #ifdef CONFIG_ZONE_DEVICE
 bool pfn_zone_device_reserved(unsigned long pfn);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
diff --git a/mm/memremap.c b/mm/memremap.c
index 84de22c14567..3afa246eb1ab 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -102,11 +102,11 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
 	return (range->start + range_len(range)) >> PAGE_SHIFT;
 }
 
-static unsigned long pfn_next(unsigned long pfn)
+static unsigned long pfn_next(struct dev_pagemap *pgmap, unsigned long pfn)
 {
-	if (pfn % 1024 == 0)
+	if (pfn % (1024 << pgmap->vmemmap_shift))
 		cond_resched();
-	return pfn + 1;
+	return pfn + pgmap_vmemmap_nr(pgmap);
 }
 
 /*
@@ -130,7 +130,7 @@ bool pfn_zone_device_reserved(unsigned long pfn)
 }
 
 #define for_each_device_pfn(pfn, map, i) \
-	for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
+	for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(map, pfn))
 
 static void dev_pagemap_kill(struct dev_pagemap *pgmap)
 {
@@ -315,8 +315,8 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
 				PHYS_PFN(range->start),
 				PHYS_PFN(range_len(range)), pgmap);
-	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
-			- pfn_first(pgmap, range_id));
+	percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
+			- pfn_first(pgmap, range_id)) >> pgmap->vmemmap_shift);
 	return 0;
 
 err_add_memory:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 23045a2a1339..d59023a676ed 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6612,6 +6612,35 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	}
 }
 
+static void __ref memmap_init_compound(struct page *head,
+				       unsigned long head_pfn,
+				       unsigned long zone_idx, int nid,
+				       struct dev_pagemap *pgmap,
+				       unsigned long nr_pages)
+{
+	unsigned long pfn, end_pfn = head_pfn + nr_pages;
+	unsigned int order = pgmap->vmemmap_shift;
+
+	__SetPageHead(head);
+	for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+		prep_compound_tail(head, pfn - head_pfn);
+		set_page_count(page, 0);
+
+		/*
+		 * The first tail page stores compound_mapcount_ptr() and
+		 * compound_order() and the second tail page stores
+		 * compound_pincount_ptr(). Call prep_compound_head() after
+		 * the first and second tail pages have been initialized to
+		 * not have the data overwritten.
+		 */
+		if (pfn == head_pfn + 2)
+			prep_compound_head(head, order);
+	}
+}
+
 void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long start_pfn,
 				   unsigned long nr_pages,
@@ -6620,6 +6649,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	unsigned long pfn, end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
+	unsigned int pfns_per_compound = pgmap_vmemmap_nr(pgmap);
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
@@ -6637,10 +6667,16 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		nr_pages = end_pfn - start_pfn;
 	}
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+	for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
 		struct page *page = pfn_to_page(pfn);
 
 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+
+		if (pfns_per_compound == 1)
+			continue;
+
+		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
+				     pfns_per_compound);
 	}
 
 	pr_info("%s initialised %lu pages in %ums\n", __func__,
-- 
2.17.2



* [PATCH v5 5/8] device-dax: use ALIGN() for determining pgoff
  2021-11-12 15:08 [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap Joao Martins
                   ` (3 preceding siblings ...)
  2021-11-12 15:08 ` [PATCH v5 4/8] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
@ 2021-11-12 15:08 ` Joao Martins
  2021-11-12 15:08 ` [PATCH v5 6/8] device-dax: use struct_size() Joao Martins
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2021-11-12 15:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Rather than calculating @pgoff manually, switch to ALIGN() instead.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index dd8222a42808..0b82159b3564 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -234,8 +234,8 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		 * mapped. No need to consider the zero page, or racing
 		 * conflicting mappings.
 		 */
-		pgoff = linear_page_index(vmf->vma, vmf->address
-				& ~(fault_size - 1));
+		pgoff = linear_page_index(vmf->vma,
+				ALIGN(vmf->address, fault_size));
 		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
 			struct page *page;
 
-- 
2.17.2



* [PATCH v5 6/8] device-dax: use struct_size()
  2021-11-12 15:08 [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap Joao Martins
                   ` (4 preceding siblings ...)
  2021-11-12 15:08 ` [PATCH v5 5/8] device-dax: use ALIGN() for determining pgoff Joao Martins
@ 2021-11-12 15:08 ` Joao Martins
  2021-11-17  9:37   ` Christoph Hellwig
  2021-11-12 15:08 ` [PATCH v5 7/8] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Joao Martins @ 2021-11-12 15:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Use the struct_size() helper for the size of a struct with a variable
array member at the end, rather than manually calculating it.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/device.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 0b82159b3564..d6796a3115a6 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -404,8 +404,9 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 		return -EINVAL;
 
 	if (!pgmap) {
-		pgmap = devm_kzalloc(dev, sizeof(*pgmap) + sizeof(struct range)
-				* (dev_dax->nr_range - 1), GFP_KERNEL);
+		pgmap = devm_kzalloc(
+                       dev, struct_size(pgmap, ranges, dev_dax->nr_range - 1),
+                       GFP_KERNEL);
 		if (!pgmap)
 			return -ENOMEM;
 		pgmap->nr_range = dev_dax->nr_range;
-- 
2.17.2



* [PATCH v5 7/8] device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  2021-11-12 15:08 [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap Joao Martins
                   ` (5 preceding siblings ...)
  2021-11-12 15:08 ` [PATCH v5 6/8] device-dax: use struct_size() Joao Martins
@ 2021-11-12 15:08 ` Joao Martins
  2021-11-17  9:37   ` Christoph Hellwig
  2021-11-12 15:08 ` [PATCH v5 8/8] device-dax: compound devmap support Joao Martins
  2021-11-12 15:40 ` [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap Jason Gunthorpe
  8 siblings, 1 reply; 26+ messages in thread
From: Joao Martins @ 2021-11-12 15:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Right now, only static dax regions have a valid @pgmap pointer in their
struct dev_dax. Dynamic dax devices, however, do not.

In preparation for device-dax compound devmap support, make sure that
the dev_dax pgmap field is set after it has been allocated and
initialized.

For dynamic dax devices the @pgmap is allocated at probe() and managed
by devm (in contrast to static dax regions, where a pgmap is provided
and the dax core kfrees it). So, in addition to ensuring a valid @pgmap,
clear the pgmap when the dynamic dax device is released, to avoid the
same pgmap ranges being re-requested across multiple region device
reconfigs.

Add a static_dev_dax() helper and use it in dev_dax_probe() to make the
initialization differences between dynamic and static regions more
explicit. While at it, consolidate the ranges initialization where we
allocate the @pgmap for the dynamic dax region case.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/bus.c    | 14 ++++++++++++++
 drivers/dax/bus.h    |  1 +
 drivers/dax/device.c | 26 +++++++++++++++++++-------
 3 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 6cc4da4c713d..19dd83d3f3ea 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -134,6 +134,12 @@ static bool is_static(struct dax_region *dax_region)
 	return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
 }
 
+bool static_dev_dax(struct dev_dax *dev_dax)
+{
+	return is_static(dev_dax->region);
+}
+EXPORT_SYMBOL_GPL(static_dev_dax);
+
 static u64 dev_dax_size(struct dev_dax *dev_dax)
 {
 	u64 size = 0;
@@ -363,6 +369,14 @@ void kill_dev_dax(struct dev_dax *dev_dax)
 
 	kill_dax(dax_dev);
 	unmap_mapping_range(inode->i_mapping, 0, 0, 1);
+
+	/*
+	 * Dynamic dax region have the pgmap allocated via dev_kzalloc()
+	 * and thus freed by devm. Clear the pgmap to not have stale pgmap
+	 * ranges on probe() from previous reconfigurations of region devices.
+	 */
+	if (!static_dev_dax(dev_dax))
+		dev_dax->pgmap = NULL;
 }
 EXPORT_SYMBOL_GPL(kill_dev_dax);
 
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 1e946ad7780a..4acdfee7dd59 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -48,6 +48,7 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
 	__dax_driver_register(driver, THIS_MODULE, KBUILD_MODNAME)
 void dax_driver_unregister(struct dax_device_driver *dax_drv);
 void kill_dev_dax(struct dev_dax *dev_dax);
+bool static_dev_dax(struct dev_dax *dev_dax);
 
 #if IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT)
 int dev_dax_probe(struct dev_dax *dev_dax);
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index d6796a3115a6..a65c67ab5ee0 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -398,18 +398,33 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 	void *addr;
 	int rc, i;
 
-	pgmap = dev_dax->pgmap;
-	if (dev_WARN_ONCE(dev, pgmap && dev_dax->nr_range > 1,
-			"static pgmap / multi-range device conflict\n"))
+	if (static_dev_dax(dev_dax) && dev_dax->nr_range > 1) {
+		dev_warn(dev, "static pgmap / multi-range device conflict\n");
 		return -EINVAL;
+	}
+
+	if (static_dev_dax(dev_dax))  {
+		pgmap = dev_dax->pgmap;
+	} else {
+		if (dev_dax->pgmap) {
+			dev_warn(dev,
+				 "dynamic-dax with pre-populated page map\n");
+			return -EINVAL;
+		}
 
-	if (!pgmap) {
 		pgmap = devm_kzalloc(
                        dev, struct_size(pgmap, ranges, dev_dax->nr_range - 1),
                        GFP_KERNEL);
 		if (!pgmap)
 			return -ENOMEM;
+
 		pgmap->nr_range = dev_dax->nr_range;
+		dev_dax->pgmap = pgmap;
+
+		for (i = 0; i < dev_dax->nr_range; i++) {
+			struct range *range = &dev_dax->ranges[i].range;
+			pgmap->ranges[i] = *range;
+		}
 	}
 
 	for (i = 0; i < dev_dax->nr_range; i++) {
@@ -421,9 +436,6 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 					i, range->start, range->end);
 			return -EBUSY;
 		}
-		/* don't update the range for static pgmap */
-		if (!dev_dax->pgmap)
-			pgmap->ranges[i] = *range;
 	}
 
 	pgmap->type = MEMORY_DEVICE_GENERIC;
-- 
2.17.2



* [PATCH v5 8/8] device-dax: compound devmap support
  2021-11-12 15:08 [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap Joao Martins
                   ` (6 preceding siblings ...)
  2021-11-12 15:08 ` [PATCH v5 7/8] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
@ 2021-11-12 15:08 ` Joao Martins
  2021-11-12 15:34   ` Jason Gunthorpe
  2021-11-17  9:43   ` Christoph Hellwig
  2021-11-12 15:40 ` [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap Jason Gunthorpe
  8 siblings, 2 replies; 26+ messages in thread
From: Joao Martins @ 2021-11-12 15:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Dan Williams, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, Jason Gunthorpe, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc, Joao Martins

Use the newly added compound devmap facility which maps the assigned dax
ranges as compound pages at a page size of @align.

dax devices are created with a fixed @align (huge page size) which is
also enforced at mmap() of the device. Faults consequently happen at
the @align specified at creation time, and that doesn't change
throughout the dax device's lifetime. MCEs unmap a whole dax huge page,
and splits likewise occur at the configured page size.

Performance measured by gup_test improves considerably for
unpin_user_pages() and altmap with NVDIMMs:

$ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
(pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
[altmap]
(pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms

 $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
(pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
[altmap with -m 127004]
(pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms

.. as well as unpin_user_page_range_dirty_lock() being just as effective
as THP/hugetlb[0] pages.

[0] https://lore.kernel.org/linux-mm/20210212130843.13865-5-joao.m.martins@oracle.com/

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c | 57 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 44 insertions(+), 13 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index a65c67ab5ee0..0c2ac97d397d 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -192,6 +192,42 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 }
 #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
+static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
+			     unsigned long fault_size,
+			     struct address_space *f_mapping)
+{
+	unsigned long i;
+	pgoff_t pgoff;
+
+	pgoff = linear_page_index(vmf->vma, ALIGN(vmf->address, fault_size));
+
+	for (i = 0; i < fault_size / PAGE_SIZE; i++) {
+		struct page *page;
+
+		page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+		if (page->mapping)
+			continue;
+		page->mapping = f_mapping;
+		page->index = pgoff + i;
+	}
+}
+
+static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
+				 unsigned long fault_size,
+				 struct address_space *f_mapping)
+{
+	struct page *head;
+
+	head = pfn_to_page(pfn_t_to_pfn(pfn));
+	head = compound_head(head);
+	if (head->mapping)
+		return;
+
+	head->mapping = f_mapping;
+	head->index = linear_page_index(vmf->vma,
+			ALIGN(vmf->address, fault_size));
+}
+
 static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size)
 {
@@ -225,8 +261,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 	}
 
 	if (rc == VM_FAULT_NOPAGE) {
-		unsigned long i;
-		pgoff_t pgoff;
+		struct dev_pagemap *pgmap = dev_dax->pgmap;
 
 		/*
 		 * In the device-dax case the only possibility for a
@@ -234,17 +269,10 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		 * mapped. No need to consider the zero page, or racing
 		 * conflicting mappings.
 		 */
-		pgoff = linear_page_index(vmf->vma,
-				ALIGN(vmf->address, fault_size));
-		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
-			struct page *page;
-
-			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
-			if (page->mapping)
-				continue;
-			page->mapping = filp->f_mapping;
-			page->index = pgoff + i;
-		}
+		if (pgmap->vmemmap_shift)
+			set_compound_mapping(vmf, pfn, fault_size, filp->f_mapping);
+		else
+			set_page_mapping(vmf, pfn, fault_size, filp->f_mapping);
 	}
 	dax_read_unlock(id);
 
@@ -439,6 +467,9 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 	}
 
 	pgmap->type = MEMORY_DEVICE_GENERIC;
+	if (dev_dax->align > PAGE_SIZE)
+		pgmap->vmemmap_shift =
+			order_base_2(dev_dax->align >> PAGE_SHIFT);
 	addr = devm_memremap_pages(dev, pgmap);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
-- 
2.17.2



* Re: [PATCH v5 8/8] device-dax: compound devmap support
  2021-11-12 15:08 ` [PATCH v5 8/8] device-dax: compound devmap support Joao Martins
@ 2021-11-12 15:34   ` Jason Gunthorpe
  2021-11-15 12:11     ` Joao Martins
  2021-11-17  9:43   ` Christoph Hellwig
  1 sibling, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2021-11-12 15:34 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On Fri, Nov 12, 2021 at 04:08:24PM +0100, Joao Martins wrote:

> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index a65c67ab5ee0..0c2ac97d397d 100644
> +++ b/drivers/dax/device.c
> @@ -192,6 +192,42 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
>  }
>  #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>  
> +static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
> +			     unsigned long fault_size,
> +			     struct address_space *f_mapping)
> +{
> +	unsigned long i;
> +	pgoff_t pgoff;
> +
> +	pgoff = linear_page_index(vmf->vma, ALIGN(vmf->address, fault_size));
> +
> +	for (i = 0; i < fault_size / PAGE_SIZE; i++) {
> +		struct page *page;
> +
> +		page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
> +		if (page->mapping)
> +			continue;
> +		page->mapping = f_mapping;
> +		page->index = pgoff + i;
> +	}
> +}
> +
> +static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
> +				 unsigned long fault_size,
> +				 struct address_space *f_mapping)
> +{
> +	struct page *head;
> +
> +	head = pfn_to_page(pfn_t_to_pfn(pfn));
> +	head = compound_head(head);
> +	if (head->mapping)
> +		return;
> +
> +	head->mapping = f_mapping;
> +	head->index = linear_page_index(vmf->vma,
> +			ALIGN(vmf->address, fault_size));
> +}

Should this stuff be setup before doing vmf_insert_pfn_XX?

In normal cases the page should be returned in the vmf and populated
to the page tables by the core code after all this is done. 

dax can't do that because of the refcount mess, but I would think the
basic ordering should be the same.

Jason


* Re: [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap
  2021-11-12 15:08 [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap Joao Martins
                   ` (7 preceding siblings ...)
  2021-11-12 15:08 ` [PATCH v5 8/8] device-dax: compound devmap support Joao Martins
@ 2021-11-12 15:40 ` Jason Gunthorpe
  2021-11-15 11:00   ` Joao Martins
  8 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2021-11-12 15:40 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On Fri, Nov 12, 2021 at 04:08:16PM +0100, Joao Martins wrote:

> This series converts device-dax to use compound pages, and moves away from the
> 'struct page per basepage on PMD/PUD' that is done today. Doing so unlocks a
> few noticeable improvements in unpin_user_pages() and makes the device-dax+altmap
> case 4x faster in pinning (numbers below and in the last patch).

I like it - aside from performance this series is important to clean
up the ZONE_DEVICE refcounting mess as it means that only fsdax will
be installing tail pages as PMD entries.

Thanks,
Jason


* Re: [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap
  2021-11-12 15:40 ` [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap Jason Gunthorpe
@ 2021-11-15 11:00   ` Joao Martins
  0 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2021-11-15 11:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On 11/12/21 16:40, Jason Gunthorpe wrote:
> On Fri, Nov 12, 2021 at 04:08:16PM +0100, Joao Martins wrote:
> 
>> This series converts device-dax to use compound pages, and moves away from the
>> 'struct page per basepage on PMD/PUD' that is done today. Doing so unlocks a
>> few noticeable improvements in unpin_user_pages() and makes the device-dax+altmap
>> case 4x faster in pinning (numbers below and in the last patch).
> 
> I like it - aside from performance this series is important to clean
> up the ZONE_DEVICE refcounting mess as it means that only fsdax will
> be installing tail pages as PMD entries.
> 
Yes, indeed. I should have emphasized that more in the cover letter.

Will fix for v6 (if there's a new respin).

> Thanks,
> Jason
> 


* Re: [PATCH v5 8/8] device-dax: compound devmap support
  2021-11-12 15:34   ` Jason Gunthorpe
@ 2021-11-15 12:11     ` Joao Martins
  2021-11-15 16:49       ` Jason Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Joao Martins @ 2021-11-15 12:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On 11/12/21 16:34, Jason Gunthorpe wrote:
> On Fri, Nov 12, 2021 at 04:08:24PM +0100, Joao Martins wrote:
> 
>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>> index a65c67ab5ee0..0c2ac97d397d 100644
>> +++ b/drivers/dax/device.c
>> @@ -192,6 +192,42 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
>>  }
>>  #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>>  
>> +static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
>> +			     unsigned long fault_size,
>> +			     struct address_space *f_mapping)
>> +{
>> +	unsigned long i;
>> +	pgoff_t pgoff;
>> +
>> +	pgoff = linear_page_index(vmf->vma, ALIGN(vmf->address, fault_size));
>> +
>> +	for (i = 0; i < fault_size / PAGE_SIZE; i++) {
>> +		struct page *page;
>> +
>> +		page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
>> +		if (page->mapping)
>> +			continue;
>> +		page->mapping = f_mapping;
>> +		page->index = pgoff + i;
>> +	}
>> +}
>> +
>> +static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
>> +				 unsigned long fault_size,
>> +				 struct address_space *f_mapping)
>> +{
>> +	struct page *head;
>> +
>> +	head = pfn_to_page(pfn_t_to_pfn(pfn));
>> +	head = compound_head(head);
>> +	if (head->mapping)
>> +		return;
>> +
>> +	head->mapping = f_mapping;
>> +	head->index = linear_page_index(vmf->vma,
>> +			ALIGN(vmf->address, fault_size));
>> +}
> 
> Should this stuff be setup before doing vmf_insert_pfn_XX?
> 

Interestingly filesystem-dax does this, but not device-dax.

set_page_mapping/set_compound_mapping() could be moved to before and then torn down
on @rc != VM_FAULT_NOPAGE (failure). I am not sure what's the benefit in this series..
besides the ordering (that you hinted below) ?

> In normal cases the page should be returned in the vmf and populated
> to the page tables by the core code after all this is done. 
> 

So I suppose, by way of call-site examples, 'core code' is either the hugetlbfs
call to __filemap_add_folio() (in the hugetlbfs fault handler),
shmem_add_to_page_cache() or the anon equivalent.


* Re: [PATCH v5 8/8] device-dax: compound devmap support
  2021-11-15 12:11     ` Joao Martins
@ 2021-11-15 16:49       ` Jason Gunthorpe
  2021-11-16 16:38         ` Joao Martins
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2021-11-15 16:49 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On Mon, Nov 15, 2021 at 01:11:32PM +0100, Joao Martins wrote:
> On 11/12/21 16:34, Jason Gunthorpe wrote:
> > On Fri, Nov 12, 2021 at 04:08:24PM +0100, Joao Martins wrote:
> > 
> >> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> >> index a65c67ab5ee0..0c2ac97d397d 100644
> >> +++ b/drivers/dax/device.c
> >> @@ -192,6 +192,42 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
> >>  }
> >>  #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> >>  
> >> +static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
> >> +			     unsigned long fault_size,
> >> +			     struct address_space *f_mapping)
> >> +{
> >> +	unsigned long i;
> >> +	pgoff_t pgoff;
> >> +
> >> +	pgoff = linear_page_index(vmf->vma, ALIGN(vmf->address, fault_size));
> >> +
> >> +	for (i = 0; i < fault_size / PAGE_SIZE; i++) {
> >> +		struct page *page;
> >> +
> >> +		page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
> >> +		if (page->mapping)
> >> +			continue;
> >> +		page->mapping = f_mapping;
> >> +		page->index = pgoff + i;
> >> +	}
> >> +}
> >> +
> >> +static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
> >> +				 unsigned long fault_size,
> >> +				 struct address_space *f_mapping)
> >> +{
> >> +	struct page *head;
> >> +
> >> +	head = pfn_to_page(pfn_t_to_pfn(pfn));
> >> +	head = compound_head(head);
> >> +	if (head->mapping)
> >> +		return;
> >> +
> >> +	head->mapping = f_mapping;
> >> +	head->index = linear_page_index(vmf->vma,
> >> +			ALIGN(vmf->address, fault_size));
> >> +}
> > 
> > Should this stuff be setup before doing vmf_insert_pfn_XX?
> > 
> 
> Interestingly filesystem-dax does this, but not device-dax.

I think it may be a bug ?

> set_page_mapping/set_compound_mapping() could be moved to before and
> then torn down on @rc != VM_FAULT_NOPAGE (failure). I am not sure
> what's the benefit in this series..  besides the ordering (that you
> hinted below) ?

Well, it should probably be fixed in a precursor patch.

I think the general idea is that page->mapping/index are stable once
the page is published in a PTE?

> > In normal cases the page should be returned in the vmf and populated
> > to the page tables by the core code after all this is done.
>
> So I suppose, by way of call-site examples, 'core code' is either the hugetlbfs
> call to __filemap_add_folio() (in the hugetlbfs fault handler),
> shmem_add_to_page_cache() or the anon equivalent.

I was talking more about the normal page insertion flow which is done
by setting vmf->page and then returning. finish_fault() will install
the PTE
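
Roughly (a hypothetical handler as a sketch, not code from this series;
example_lookup_page() stands in for whatever finds the page):

	static vm_fault_t example_fault(struct vm_fault *vmf)
	{
		struct page *page = example_lookup_page(vmf->vma, vmf->pgoff);

		if (!page)
			return VM_FAULT_SIGBUS;
		get_page(page);
		vmf->page = page;
		return 0;	/* core code installs the PTE in finish_fault() */
	}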

If this is the best way then I would expect some future where maybe
there is a vmf->folio and finish_fault() will install a PUD/PMD/PTE
and we don't call vmf_insert_pfnxx in DAX.

Jason


* Re: [PATCH v5 8/8] device-dax: compound devmap support
  2021-11-15 16:49       ` Jason Gunthorpe
@ 2021-11-16 16:38         ` Joao Martins
  2021-11-19 16:12           ` Joao Martins
  0 siblings, 1 reply; 26+ messages in thread
From: Joao Martins @ 2021-11-16 16:38 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: linux-mm, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, John Hubbard, Jane Chu, Muchun Song,
	Mike Kravetz, Andrew Morton, Jonathan Corbet, Christoph Hellwig,
	nvdimm, linux-doc

On 11/15/21 17:49, Jason Gunthorpe wrote:
> On Mon, Nov 15, 2021 at 01:11:32PM +0100, Joao Martins wrote:
>> On 11/12/21 16:34, Jason Gunthorpe wrote:
>>> On Fri, Nov 12, 2021 at 04:08:24PM +0100, Joao Martins wrote:
>>>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>>>> index a65c67ab5ee0..0c2ac97d397d 100644
>>>> +++ b/drivers/dax/device.c
>>>> @@ -192,6 +192,42 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
>>>>  }
>>>>  #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>>>>  
>>>> +static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
>>>> +			     unsigned long fault_size,
>>>> +			     struct address_space *f_mapping)
>>>> +{
>>>> +	unsigned long i;
>>>> +	pgoff_t pgoff;
>>>> +
>>>> +	pgoff = linear_page_index(vmf->vma, ALIGN(vmf->address, fault_size));
>>>> +
>>>> +	for (i = 0; i < fault_size / PAGE_SIZE; i++) {
>>>> +		struct page *page;
>>>> +
>>>> +		page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
>>>> +		if (page->mapping)
>>>> +			continue;
>>>> +		page->mapping = f_mapping;
>>>> +		page->index = pgoff + i;
>>>> +	}
>>>> +}
>>>> +
>>>> +static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
>>>> +				 unsigned long fault_size,
>>>> +				 struct address_space *f_mapping)
>>>> +{
>>>> +	struct page *head;
>>>> +
>>>> +	head = pfn_to_page(pfn_t_to_pfn(pfn));
>>>> +	head = compound_head(head);
>>>> +	if (head->mapping)
>>>> +		return;
>>>> +
>>>> +	head->mapping = f_mapping;
>>>> +	head->index = linear_page_index(vmf->vma,
>>>> +			ALIGN(vmf->address, fault_size));
>>>> +}
>>>
>>> Should this stuff be setup before doing vmf_insert_pfn_XX?
>>>
>>
>> Interestingly filesystem-dax does this, but not device-dax.
> 
> I think it may be a bug ?
> 
Possibly.

Dan, any thoughts (see also below) ? You probably hold all that
history since its inception on commit 2232c6382a4 ("device-dax: Enable page_mapping()")
and commit 35de299547d1 ("device-dax: Set page->index").

>> set_page_mapping/set_compound_mapping() could be moved to before and
>> then torn down on @rc != VM_FAULT_NOPAGE (failure). I am not sure
>> what's the benefit in this series..  besides the ordering (that you
>> hinted below) ?
> 
> Well, it should probably be fixed in a precursor patch.
> 
Yeap -- I would move the page_mapping change prior to introducing
set_compound_mapping(). Now I am thinking again, and with that logic it makes
more sense to amend inside set_page_mapping() -- should mean less nesting in
the fault handler.

> I think the general idea is that page->mapping/index are stable once
> the page is published in a PTE?
> 
/me nods

>>> In normal cases the page should be returned in the vmf and populated
>>> to the page tables by the core code after all this is done.
>>
>> So I suppose, by way of call-site examples, 'core code' is either the hugetlbfs call to
>> __filemap_add_folio() (in the hugetlbfs fault handler), shmem_add_to_page_cache(), or
>> the anon-equivalent.
> 
> I was talking more about the normal page insertion flow, which is done
> by setting vmf->page and then returning; finish_fault() will install
> the PTE.
> 
I misunderstood you earlier -- I thought you were suggesting I look at
how mapping/index are set (in the context of the flow you just described).

	Joao

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 6/8] device-dax: use struct_size()
  2021-11-12 15:08 ` [PATCH v5 6/8] device-dax: use struct_size() Joao Martins
@ 2021-11-17  9:37   ` Christoph Hellwig
  2021-11-17 10:15     ` Joao Martins
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2021-11-17  9:37 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe, John Hubbard,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, Christoph Hellwig, nvdimm, linux-doc

> +		pgmap = devm_kzalloc(
> +                       dev, struct_size(pgmap, ranges, dev_dax->nr_range - 1),
> +                       GFP_KERNEL);

Keeping the dev argument on the previous line would not only make this
much more readable but also avoid the overly long line.
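
I.e., something like:

		pgmap = devm_kzalloc(dev,
				struct_size(pgmap, ranges, dev_dax->nr_range - 1),
				GFP_KERNEL);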

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 7/8] device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  2021-11-12 15:08 ` [PATCH v5 7/8] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
@ 2021-11-17  9:37   ` Christoph Hellwig
  2021-11-17 10:15     ` Joao Martins
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2021-11-17  9:37 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe, John Hubbard,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, Christoph Hellwig, nvdimm, linux-doc

> +bool static_dev_dax(struct dev_dax *dev_dax)
> +{
> +	return is_static(dev_dax->region);
> +}
> +EXPORT_SYMBOL_GPL(static_dev_dax);

This function would massively benefit from documenting what a static
DAX region is and why someone would want to care, because even as
someone occasionally dabbling with the DAX code I have no idea at all
what that means.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 8/8] device-dax: compound devmap support
  2021-11-12 15:08 ` [PATCH v5 8/8] device-dax: compound devmap support Joao Martins
  2021-11-12 15:34   ` Jason Gunthorpe
@ 2021-11-17  9:43   ` Christoph Hellwig
  2021-11-17 10:22     ` Joao Martins
  1 sibling, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2021-11-17  9:43 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe, John Hubbard,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, Christoph Hellwig, nvdimm, linux-doc

On Fri, Nov 12, 2021 at 04:08:24PM +0100, Joao Martins wrote:
> Use the newly added compound devmap facility which maps the assigned dax
> ranges as compound pages at a page size of @align.
> 
> dax devices are created with a fixed @align (huge page size) which is
> also enforced at mmap() of the device. Faults consequently happen at
> the @align specified at creation time, and that doesn't change
> throughout the dax device lifetime. MCEs unmap a whole dax huge page,
> and splits occur at the configured page size.
> 
> Performance measured by gup_test improves considerably for
> unpin_user_pages() and altmap with NVDIMMs:
> 
> $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
> (pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
> [altmap]
> (pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms
> 
>  $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
> (pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
> [altmap with -m 127004]
> (pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms
> 
> .. as well as unpin_user_page_range_dirty_lock() being just as effective
> as THP/hugetlb[0] pages.
> 
> [0] https://lore.kernel.org/linux-mm/20210212130843.13865-5-joao.m.martins@oracle.com/
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/dax/device.c | 57 ++++++++++++++++++++++++++++++++++----------
>  1 file changed, 44 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index a65c67ab5ee0..0c2ac97d397d 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -192,6 +192,42 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
>  }
>  #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>  
> +static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
> +			     unsigned long fault_size,
> +			     struct address_space *f_mapping)
> +{
> +	unsigned long i;
> +	pgoff_t pgoff;
> +
> +	pgoff = linear_page_index(vmf->vma, ALIGN(vmf->address, fault_size));
> +
> +	for (i = 0; i < fault_size / PAGE_SIZE; i++) {
> +		struct page *page;
> +
> +		page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
> +		if (page->mapping)
> +			continue;
> +		page->mapping = f_mapping;
> +		page->index = pgoff + i;
> +	}
> +}

No need to pass f_mapping here, it must be vmf->vma->vm_file->f_mapping.
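
I.e., presumably something like this inside the helper, so the
parameter can be dropped (a sketch):

	struct address_space *f_mapping = vmf->vma->vm_file->f_mapping;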

> +static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
> +				 unsigned long fault_size,
> +				 struct address_space *f_mapping)
> +{
> +	struct page *head;
> +
> +	head = pfn_to_page(pfn_t_to_pfn(pfn));
> +	head = compound_head(head);
> +	if (head->mapping)
> +		return;
> +
> +	head->mapping = f_mapping;
> +	head->index = linear_page_index(vmf->vma,
> +			ALIGN(vmf->address, fault_size));
> +}

Same here.

>  	if (rc == VM_FAULT_NOPAGE) {
> -		unsigned long i;
> -		pgoff_t pgoff;
> +		struct dev_pagemap *pgmap = dev_dax->pgmap;

If you're touching this anyway:  why not do an early return here for
the error case to simplify the flow?
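
For instance, a rough sketch of the early return (assuming the
dax_read_lock() taken earlier in the handler still has to be dropped):

	if (rc != VM_FAULT_NOPAGE) {
		dax_read_unlock(id);
		return rc;
	}

	/* ... VM_FAULT_NOPAGE: set up page->mapping/index ... */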

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 6/8] device-dax: use struct_size()
  2021-11-17  9:37   ` Christoph Hellwig
@ 2021-11-17 10:15     ` Joao Martins
  0 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2021-11-17 10:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe, John Hubbard,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, nvdimm, linux-doc

On 11/17/21 10:37, Christoph Hellwig wrote:
>> +		pgmap = devm_kzalloc(
>> +                       dev, struct_size(pgmap, ranges, dev_dax->nr_range - 1),
>> +                       GFP_KERNEL);
> 
> Keeping the dev argument on the previous line would not only make this
> much more readable but also avoid the overly long line.
> 
Fixed.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 7/8] device-dax: ensure dev_dax->pgmap is valid for dynamic devices
  2021-11-17  9:37   ` Christoph Hellwig
@ 2021-11-17 10:15     ` Joao Martins
  0 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2021-11-17 10:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe, John Hubbard,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, nvdimm, linux-doc



On 11/17/21 10:37, Christoph Hellwig wrote:
>> +bool static_dev_dax(struct dev_dax *dev_dax)
>> +{
>> +	return is_static(dev_dax->region);
>> +}
>> +EXPORT_SYMBOL_GPL(static_dev_dax);
> 
> This function would massively benefit from documenting what a static
> DAX region is and why someone would want to care, because even as
> someone occasionally dabbling with the DAX code I have no idea at all
> what that means.
> 
Good idea.

Maybe something like this:

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 19dd83d3f3ea..8be6ec1ba193 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -129,6 +129,19 @@ ATTRIBUTE_GROUPS(dax_drv);

 static int dax_bus_match(struct device *dev, struct device_driver *drv);

+/*
+ * Static dax regions (PMEM) are regions partitioned by an external entity
+ * like nvdimm, where a single range is assigned and whose boundaries are
+ * defined by NVDIMM namespace boundaries (i.e. the range must be contiguous).
+ * In dynamic dax regions (HMEM), the assigned region can be subdivided by
+ * the dax core into multiple devices (i.e. "partitions"), each composed of
+ * one or more discontiguous ranges.
+ * When allocating a dax region, drivers must set whether it is static.
+ * On static dax devices, the @pgmap is pre-assigned to dax when calling
+ * devm_create_dev_dax(), whereas on dynamic dax devices it is allocated
+ * on device ->probe(). Care is needed to make sure that non-static devices
+ * are killed with a cleared @pgmap field (see kill_dev_dax()).
+ */
 static bool is_static(struct dax_region *dax_region)
 {
        return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 8/8] device-dax: compound devmap support
  2021-11-17  9:43   ` Christoph Hellwig
@ 2021-11-17 10:22     ` Joao Martins
  0 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2021-11-17 10:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, Dan Williams, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, Jason Gunthorpe, John Hubbard,
	Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton,
	Jonathan Corbet, nvdimm, linux-doc

On 11/17/21 10:43, Christoph Hellwig wrote:
> On Fri, Nov 12, 2021 at 04:08:24PM +0100, Joao Martins wrote:
>> Use the newly added compound devmap facility which maps the assigned dax
>> ranges as compound pages at a page size of @align.
>>
>> dax devices are created with a fixed @align (huge page size) which is
>> also enforced at mmap() of the device. Faults consequently happen at
>> the @align specified at creation time, and that doesn't change
>> throughout the dax device lifetime. MCEs unmap a whole dax huge page,
>> and splits occur at the configured page size.
>>
>> Performance measured by gup_test improves considerably for
>> unpin_user_pages() and altmap with NVDIMMs:
>>
>> $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
>> (pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
>> [altmap]
>> (pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms
>>
>>  $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
>> (pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
>> [altmap with -m 127004]
>> (pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms
>>
>> .. as well as unpin_user_page_range_dirty_lock() being just as effective
>> as THP/hugetlb[0] pages.
>>
>> [0] https://lore.kernel.org/linux-mm/20210212130843.13865-5-joao.m.martins@oracle.com/
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  drivers/dax/device.c | 57 ++++++++++++++++++++++++++++++++++----------
>>  1 file changed, 44 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>> index a65c67ab5ee0..0c2ac97d397d 100644
>> --- a/drivers/dax/device.c
>> +++ b/drivers/dax/device.c
>> @@ -192,6 +192,42 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
>>  }
>>  #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>>  
>> +static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
>> +			     unsigned long fault_size,
>> +			     struct address_space *f_mapping)
>> +{
>> +	unsigned long i;
>> +	pgoff_t pgoff;
>> +
>> +	pgoff = linear_page_index(vmf->vma, ALIGN(vmf->address, fault_size));
>> +
>> +	for (i = 0; i < fault_size / PAGE_SIZE; i++) {
>> +		struct page *page;
>> +
>> +		page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
>> +		if (page->mapping)
>> +			continue;
>> +		page->mapping = f_mapping;
>> +		page->index = pgoff + i;
>> +	}
>> +}
> 
> No need to pass f_mapping here, it must be vmf->vma->vm_file->f_mapping.
> 
Hmmm, good point -- if I keep this structure, yeah, I will nuke @f_mapping.

If I move the @mapping setting to before vmf_insert_pfn*() (as Jason suggests),
then the @f_mapping argument might be useful for clearing it on @rc != VM_FAULT_NOPAGE.

>> +static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
>> +				 unsigned long fault_size,
>> +				 struct address_space *f_mapping)
>> +{
>> +	struct page *head;
>> +
>> +	head = pfn_to_page(pfn_t_to_pfn(pfn));
>> +	head = compound_head(head);
>> +	if (head->mapping)
>> +		return;
>> +
>> +	head->mapping = f_mapping;
>> +	head->index = linear_page_index(vmf->vma,
>> +			ALIGN(vmf->address, fault_size));
>> +}
> 
> Same here.
> 
/me nods

>>  	if (rc == VM_FAULT_NOPAGE) {
>> -		unsigned long i;
>> -		pgoff_t pgoff;
>> +		struct dev_pagemap *pgmap = dev_dax->pgmap;
> 
> If you're touching this anyway:  why not do an early return here for
> the error case to simplify the flow?
> 
Yeah. I was thinking of doing that, i.e. calling set_compound_mapping() earlier
or even inside set_page_mapping(), albeit that would need a new argument (the dev_dax).

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 8/8] device-dax: compound devmap support
  2021-11-16 16:38         ` Joao Martins
@ 2021-11-19 16:12           ` Joao Martins
  2021-11-19 16:55             ` Jason Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Joao Martins @ 2021-11-19 16:12 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: linux-mm, Vishal Verma, Dave Jiang, Naoya Horiguchi,
	Matthew Wilcox, John Hubbard, Jane Chu, Muchun Song,
	Mike Kravetz, Andrew Morton, Jonathan Corbet, Christoph Hellwig,
	nvdimm, linux-doc

On 11/16/21 16:38, Joao Martins wrote:
> On 11/15/21 17:49, Jason Gunthorpe wrote:
>> On Mon, Nov 15, 2021 at 01:11:32PM +0100, Joao Martins wrote:
>>> On 11/12/21 16:34, Jason Gunthorpe wrote:
>>>> On Fri, Nov 12, 2021 at 04:08:24PM +0100, Joao Martins wrote:
>>>>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>>>>> index a65c67ab5ee0..0c2ac97d397d 100644
>>>>> +++ b/drivers/dax/device.c
>>>>> @@ -192,6 +192,42 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
>>>>>  }
>>>>>  #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>>>>>  
>>>>> +static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn,
>>>>> +			     unsigned long fault_size,
>>>>> +			     struct address_space *f_mapping)
>>>>> +{
>>>>> +	unsigned long i;
>>>>> +	pgoff_t pgoff;
>>>>> +
>>>>> +	pgoff = linear_page_index(vmf->vma, ALIGN(vmf->address, fault_size));
>>>>> +
>>>>> +	for (i = 0; i < fault_size / PAGE_SIZE; i++) {
>>>>> +		struct page *page;
>>>>> +
>>>>> +		page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
>>>>> +		if (page->mapping)
>>>>> +			continue;
>>>>> +		page->mapping = f_mapping;
>>>>> +		page->index = pgoff + i;
>>>>> +	}
>>>>> +}
>>>>> +
>>>>> +static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn,
>>>>> +				 unsigned long fault_size,
>>>>> +				 struct address_space *f_mapping)
>>>>> +{
>>>>> +	struct page *head;
>>>>> +
>>>>> +	head = pfn_to_page(pfn_t_to_pfn(pfn));
>>>>> +	head = compound_head(head);
>>>>> +	if (head->mapping)
>>>>> +		return;
>>>>> +
>>>>> +	head->mapping = f_mapping;
>>>>> +	head->index = linear_page_index(vmf->vma,
>>>>> +			ALIGN(vmf->address, fault_size));
>>>>> +}
>>>>
>>>> Should this stuff be set up before doing vmf_insert_pfn_XX?
>>>>
>>>
>>> Interestingly filesystem-dax does this, but not device-dax.
>>
>> I think it may be a bug ?
>>
> Possibly.
> 
> Dan, any thoughts (see also below)? You probably hold all that
> history since its inception in commit 2232c6382a4 ("device-dax: Enable page_mapping()")
> and commit 35de299547d1 ("device-dax: Set page->index").
> 
Below is what I have staged so far as a precursor patch (see the scissors mark below).

It also lets me simplify the compound page case for __dax_set_mapping() in this patch,
like the diff below.

But I still wonder whether this ordering adjustment of the @mapping setting is best placed
as a precursor patch whenever pgmap/page refcount changes happen. Anyway, it's just a
thought.

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 80824e460fbf..35706214778e 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -78,15 +78,21 @@ static void __dax_set_mapping(struct vm_fault *vmf, pfn_t pfn,
                              struct address_space *f_mapping)
 {
        struct address_space *c_mapping = vmf->vma->vm_file->f_mapping;
+       struct dev_dax *dev_dax = vmf->vma->vm_file->private_data;
        unsigned long i, nr_pages = fault_size / PAGE_SIZE;
        pgoff_t pgoff;

+       /* mapping is only set on the head */
+       if (dev_dax->pgmap->vmemmap_shift)
+               nr_pages = 1;
+
        pgoff = linear_page_index(vmf->vma,
                        ALIGN(vmf->address, fault_size));

        for (i = 0; i < nr_pages; i++) {
                struct page *page = pfn_to_page(pfn_t_to_pfn(pfn) + i);

+               page = compound_head(page);
                if (page->mapping &&
                    (!f_mapping && page->mapping != c_mapping))
                        continue;
@@ -473,6 +479,9 @@ int dev_dax_probe(struct dev_dax *dev_dax)
        }

        pgmap->type = MEMORY_DEVICE_GENERIC;
+       if (dev_dax->align > PAGE_SIZE)
+               pgmap->vmemmap_shift =
+                       order_base_2(dev_dax->align >> PAGE_SHIFT);
        addr = devm_memremap_pages(dev, pgmap);
        if (IS_ERR(addr))
                return PTR_ERR(addr);
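
(As a worked example, assuming 4K base pages: a 2M @align gives
dev_dax->align >> PAGE_SHIFT == 512, so @vmemmap_shift becomes
order_base_2(512) == 9, i.e. each compound page spans 2^9 = 512 base
pages.)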

----------------------------------->8-------------------------------------

From: Joao Martins <joao.m.martins@oracle.com>
Subject: [PATCH] device-dax: set mapping prior to vmf_insert_pfn{,_pmd,pud}()

Normally, the @page mapping is set prior to inserting the page into a
page table entry. Make device-dax adhere to the same ordering, rather
than setting the mapping after the PTE is inserted.

Care is taken to clear the mapping on a vmf_insert_pfn* failure (rc !=
VM_FAULT_NOPAGE). Thus the mapping is only cleared when we have a valid
@pfn -- which is set right before we call vmf_insert_pfn*() -- and only
if the mapping set on the page is the one recorded in the fault handler
data (@vmf).

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 drivers/dax/device.c | 79 +++++++++++++++++++++++++++++++-------------
 1 file changed, 56 insertions(+), 23 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 630de5a795b0..80824e460fbf 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -73,6 +73,43 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 	return -1;
 }

+static void __dax_set_mapping(struct vm_fault *vmf, pfn_t pfn,
+			      unsigned long fault_size,
+			      struct address_space *f_mapping)
+{
+	struct address_space *c_mapping = vmf->vma->vm_file->f_mapping;
+	unsigned long i, nr_pages = fault_size / PAGE_SIZE;
+	pgoff_t pgoff;
+
+	pgoff = linear_page_index(vmf->vma,
+			ALIGN(vmf->address, fault_size));
+
+	for (i = 0; i < nr_pages; i++) {
+		struct page *page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
+
+		if (page->mapping &&
+		    (!f_mapping && page->mapping != c_mapping))
+			continue;
+
+		page->mapping = f_mapping;
+		page->index = pgoff + i;
+	}
+}
+
+static void dax_set_mapping(struct vm_fault *vmf, pfn_t pfn,
+			    unsigned long fault_size)
+{
+	struct address_space *c_mapping = vmf->vma->vm_file->f_mapping;
+
+	__dax_set_mapping(vmf, pfn, fault_size, c_mapping);
+}
+
+static void dax_clear_mapping(struct vm_fault *vmf, pfn_t pfn,
+			      unsigned long fault_size)
+{
+	__dax_set_mapping(vmf, pfn, fault_size, NULL);
+}
+
 static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf, pfn_t *pfn)
 {
@@ -100,6 +137,8 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,

 	*pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);

+	dax_set_mapping(vmf, *pfn, fault_size);
+
 	return vmf_insert_mixed(vmf->vma, vmf->address, *pfn);
 }

@@ -140,6 +179,8 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,

 	*pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);

+	dax_set_mapping(vmf, *pfn, fault_size);
+
 	return vmf_insert_pfn_pmd(vmf, *pfn, vmf->flags & FAULT_FLAG_WRITE);
 }

@@ -182,6 +223,8 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,

 	*pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);

+	dax_set_mapping(vmf, *pfn, fault_size);
+
 	return vmf_insert_pfn_pud(vmf, *pfn, vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
@@ -199,7 +242,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 	unsigned long fault_size;
 	vm_fault_t rc = VM_FAULT_SIGBUS;
 	int id;
-	pfn_t pfn;
+	pfn_t pfn = { .val = 0 };
 	struct dev_dax *dev_dax = filp->private_data;

 	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm,
@@ -224,28 +267,18 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 		rc = VM_FAULT_SIGBUS;
 	}

-	if (rc == VM_FAULT_NOPAGE) {
-		unsigned long i;
-		pgoff_t pgoff;
-
-		/*
-		 * In the device-dax case the only possibility for a
-		 * VM_FAULT_NOPAGE result is when device-dax capacity is
-		 * mapped. No need to consider the zero page, or racing
-		 * conflicting mappings.
-		 */
-		pgoff = linear_page_index(vmf->vma,
-				ALIGN(vmf->address, fault_size));
-		for (i = 0; i < fault_size / PAGE_SIZE; i++) {
-			struct page *page;
-
-			page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
-			if (page->mapping)
-				continue;
-			page->mapping = filp->f_mapping;
-			page->index = pgoff + i;
-		}
-	}
+	/*
+	 * In the device-dax case the only possibility for a
+	 * VM_FAULT_NOPAGE result is when device-dax capacity is
+	 * mapped. No need to consider the zero page, or racing
+	 * conflicting mappings.
+	 * We could get VM_FAULT_FALLBACK without even attempting
+	 * to insert the page table entry. So make sure we test
+	 * for the error code with a devmap @pfn value which is
+	 * set right before vmf_insert_pfn*().
+	 */
+	if (rc != VM_FAULT_NOPAGE && pfn_t_devmap(pfn))
+		dax_clear_mapping(vmf, pfn, fault_size);
 	dax_read_unlock(id);

 	return rc;
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 8/8] device-dax: compound devmap support
  2021-11-19 16:12           ` Joao Martins
@ 2021-11-19 16:55             ` Jason Gunthorpe
  2021-11-19 19:26               ` Joao Martins
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2021-11-19 16:55 UTC (permalink / raw)
  To: Joao Martins
  Cc: Dan Williams, linux-mm, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On Fri, Nov 19, 2021 at 04:12:18PM +0000, Joao Martins wrote:

> > Dan, any thoughts (see also below)? You probably hold all that
> > history since its inception in commit 2232c6382a4 ("device-dax: Enable page_mapping()")
> > and commit 35de299547d1 ("device-dax: Set page->index").
> > 
> Below is what I have staged so far as a precursor patch (see the scissors mark below).
> 
> It also lets me simplify the compound page case for __dax_set_mapping() in this patch,
> like the diff below.
> 
> But I still wonder whether this ordering adjustment of the @mapping setting is best placed
> as a precursor patch whenever pgmap/page refcount changes happen. Anyway, it's just a
> thought.

Naively I would have thought you'd set the mapping on all pages when
you create the address_space and allocate pages into it. AFAIK devmap
assigns all pages to a single address_space, so shouldn't the mapping
just be done once?

Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 8/8] device-dax: compound devmap support
  2021-11-19 16:55             ` Jason Gunthorpe
@ 2021-11-19 19:26               ` Joao Martins
  2021-11-19 19:53                 ` Jason Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Joao Martins @ 2021-11-19 19:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, linux-mm, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On 11/19/21 16:55, Jason Gunthorpe wrote:
> On Fri, Nov 19, 2021 at 04:12:18PM +0000, Joao Martins wrote:
> 
>>> Dan, any thoughts (see also below)? You probably hold all that
>>> history since its inception in commit 2232c6382a4 ("device-dax: Enable page_mapping()")
>>> and commit 35de299547d1 ("device-dax: Set page->index").
>>>
>> Below is what I have staged so far as a precursor patch (see the scissors mark below).
>>
>> It also lets me simplify the compound page case for __dax_set_mapping() in this patch,
>> like the diff below.
>>
>> But I still wonder whether this ordering adjustment of the @mapping setting is best placed
>> as a precursor patch whenever pgmap/page refcount changes happen. Anyway, it's just a
>> thought.
> 
> Naively I would have thought you'd set the mapping on all pages when
> you create the address_space and allocate pages into it. 

Today in fsdax/device-dax (hugetlb too) this is set on fault, and set only
once (as you say) on the mapped pages. fsdax WARN_ON()s when you clear
a page mapping that was not set to the expected address_space (similar to
what I did here).

Similar to what I do in the previous snippet I pasted, except that maybe
I should just clear the mapping rather than only clearing it if f_mapping
is the expected one.

> AFAIK devmap
> assigns all pages to a single address_space, so shouldn't the mapping
> just be done once?
Isn't it a bit more efficient to set it only when you try to map a page?

I take it that's why it is being done this way.

	Joao

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 8/8] device-dax: compound devmap support
  2021-11-19 19:26               ` Joao Martins
@ 2021-11-19 19:53                 ` Jason Gunthorpe
  2021-11-19 20:13                   ` Joao Martins
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2021-11-19 19:53 UTC (permalink / raw)
  To: Joao Martins
  Cc: Dan Williams, linux-mm, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On Fri, Nov 19, 2021 at 07:26:44PM +0000, Joao Martins wrote:
> On 11/19/21 16:55, Jason Gunthorpe wrote:
> > On Fri, Nov 19, 2021 at 04:12:18PM +0000, Joao Martins wrote:
> > 
> >>> Dan, any thoughts (see also below)? You probably hold all that
> >>> history since its inception in commit 2232c6382a4 ("device-dax: Enable page_mapping()")
> >>> and commit 35de299547d1 ("device-dax: Set page->index").
> >>>
> >> Below is what I have staged so far as a precursor patch (see the scissors mark below).
> >>
> >> It also lets me simplify the compound page case for __dax_set_mapping() in this patch,
> >> like the diff below.
> >>
> >> But I still wonder whether this ordering adjustment of the @mapping setting is best placed
> >> as a precursor patch whenever pgmap/page refcount changes happen. Anyway, it's just a
> >> thought.
> > 
> > Naively I would have thought you'd set the mapping on all pages when
> > you create the address_space and allocate pages into it. 
> 
> Today in fsdax/device-dax (hugetlb too) this is set on fault, and set only
> once (as you say) on the mapped pages. fsdax WARN_ON()s when you clear
> a page mapping that was not set to the expected address_space (similar to
> what I did here).

I would imagine that a normal FS case is to allocate some new memory
and then join it to the address_space and set mapping, so that makes
sense.

For fsdax, logically the DAX pages on the medium with struct pages
could be in the address_space as soon as the inode is created. That
would improve fault performance at the cost of making address_space
creation a lot slower, so I can see why not to do that.

> > AFAIK devmap
> > assigns all pages to a single address_space, so shouldn't the mapping
> > just be done once?
> Isn't it a bit more efficient to set it only when you try to map a page?

For devdax if you can set the address space as part of initializing
each struct page and setting the compounds it would probably be a net
win?
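
A very rough sketch of that idea -- hypothetical helper, assuming a
single assigned range, compound pages already initialized, and only
the heads carrying @mapping/@index:

static void dax_init_compound_mappings(struct dev_dax *dev_dax,
				       struct address_space *mapping)
{
	unsigned long nr = 1UL << dev_dax->pgmap->vmemmap_shift;
	unsigned long pfn = PHYS_PFN(dev_dax->ranges[0].range.start);
	unsigned long end_pfn = PHYS_PFN(dev_dax->ranges[0].range.end + 1);
	pgoff_t pgoff = 0;

	/* walk one compound page (2^vmemmap_shift pfns) at a time */
	for (; pfn < end_pfn; pfn += nr, pgoff += nr) {
		struct page *head = pfn_to_page(pfn);

		head->mapping = mapping;
		head->index = pgoff;
	}
}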

Anyhow, I think what you did here is OK? Maybe I don't understand the
question 'whenever pgmap/page refcount changes happen'

Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 8/8] device-dax: compound devmap support
  2021-11-19 19:53                 ` Jason Gunthorpe
@ 2021-11-19 20:13                   ` Joao Martins
  0 siblings, 0 replies; 26+ messages in thread
From: Joao Martins @ 2021-11-19 20:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, linux-mm, Vishal Verma, Dave Jiang,
	Naoya Horiguchi, Matthew Wilcox, John Hubbard, Jane Chu,
	Muchun Song, Mike Kravetz, Andrew Morton, Jonathan Corbet,
	Christoph Hellwig, nvdimm, linux-doc

On 11/19/21 19:53, Jason Gunthorpe wrote:
> On Fri, Nov 19, 2021 at 07:26:44PM +0000, Joao Martins wrote:
>> On 11/19/21 16:55, Jason Gunthorpe wrote:
>>> On Fri, Nov 19, 2021 at 04:12:18PM +0000, Joao Martins wrote:
>>>
>>>>> Dan, any thoughts (see also below)? You probably hold all that
>>>>> history since its inception in commit 2232c6382a4 ("device-dax: Enable page_mapping()")
>>>>> and commit 35de299547d1 ("device-dax: Set page->index").
>>>>>
>>>> Below is what I have staged so far as a precursor patch (see the scissors mark below).
>>>>
>>>> It also lets me simplify the compound page case for __dax_set_mapping() in this patch,
>>>> like the diff below.
>>>>
>>>> But I still wonder whether this ordering adjustment of the @mapping setting is best placed
>>>> as a precursor patch whenever pgmap/page refcount changes happen. Anyway, it's just a
>>>> thought.
>>>
>>> Naively I would have thought you'd set the mapping on all pages when
>>> you create the address_space and allocate pages into it. 
>>
>> Today in fsdax/device-dax (hugetlb too) this is set on fault, and set only
>> once (as you say) on the mapped pages. fsdax WARN_ON()s when you clear
>> a page mapping that was not set to the expected address_space (similar to
>> what I did here).
> 
> I would imagine that a normal FS case is to allocate some new memory
> and then join it to the address_space and set mapping, so that makes
> sense.
> 
> For fsdax, logically the DAX pages on the medium with struct pages
> could be in the address_space as soon as the inode is created. That
> would improve fault performance at the cost of making address_space
> creation a lot slower, so I can see why not to do that.
> 
>>> AFAIK devmap
>>> assigns all pages to a single address_space, so shouldn't the mapping
>>> just be done once?
>> Isn't it a bit more efficient to set it only when you try to map a page?
> 
> For devdax if you can set the address space as part of initializing
> each struct page and setting the compounds it would probably be a net
> win?
> 

Provided that we only set it on the head, yes, it would have a negligible cost
over region bringup as it only touches the head mapping.

Now with base pages on device-dax, the zone init time would probably
jump considerably.

> Anyhow, I think what you did here is OK? 
Yeah, I wanted to hear Dan's thoughts -- or maybe I should just respin
the series with the added cleanup.

	Joao

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-11-19 20:13 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-12 15:08 [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap Joao Martins
2021-11-12 15:08 ` [PATCH v5 1/8] memory-failure: fetch compound_head after pgmap_pfn_valid() Joao Martins
2021-11-12 15:08 ` [PATCH v5 2/8] mm/page_alloc: split prep_compound_page into head and tail subparts Joao Martins
2021-11-12 15:08 ` [PATCH v5 3/8] mm/page_alloc: refactor memmap_init_zone_device() page init Joao Martins
2021-11-12 15:08 ` [PATCH v5 4/8] mm/memremap: add ZONE_DEVICE support for compound pages Joao Martins
2021-11-12 15:08 ` [PATCH v5 5/8] device-dax: use ALIGN() for determining pgoff Joao Martins
2021-11-12 15:08 ` [PATCH v5 6/8] device-dax: use struct_size() Joao Martins
2021-11-17  9:37   ` Christoph Hellwig
2021-11-17 10:15     ` Joao Martins
2021-11-12 15:08 ` [PATCH v5 7/8] device-dax: ensure dev_dax->pgmap is valid for dynamic devices Joao Martins
2021-11-17  9:37   ` Christoph Hellwig
2021-11-17 10:15     ` Joao Martins
2021-11-12 15:08 ` [PATCH v5 8/8] device-dax: compound devmap support Joao Martins
2021-11-12 15:34   ` Jason Gunthorpe
2021-11-15 12:11     ` Joao Martins
2021-11-15 16:49       ` Jason Gunthorpe
2021-11-16 16:38         ` Joao Martins
2021-11-19 16:12           ` Joao Martins
2021-11-19 16:55             ` Jason Gunthorpe
2021-11-19 19:26               ` Joao Martins
2021-11-19 19:53                 ` Jason Gunthorpe
2021-11-19 20:13                   ` Joao Martins
2021-11-17  9:43   ` Christoph Hellwig
2021-11-17 10:22     ` Joao Martins
2021-11-12 15:40 ` [PATCH v5 0/8] mm, dax: Introduce compound pages in devmap Jason Gunthorpe
2021-11-15 11:00   ` Joao Martins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).