nvdimm.lists.linux.dev archive mirror
* [PATCH v2 00/18] Fix the DAX-gup mistake
@ 2022-09-16  3:35 Dan Williams
  2022-09-16  3:35 ` [PATCH v2 01/18] fsdax: Wait on @page not @page->_refcount Dan Williams
                   ` (19 more replies)
  0 siblings, 20 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:35 UTC (permalink / raw)
  To: akpm
  Cc: Jason Gunthorpe, Jan Kara, Christoph Hellwig, Darrick J. Wong,
	Matthew Wilcox, John Hubbard, linux-fsdevel, nvdimm, linux-xfs,
	linux-mm, linux-ext4

Changes since v1 [1]:
- Jason rightly pointed out that the approach taken in v1 still did not
  properly handle the case of waiting for all page pins to drop to zero.
  The new approach in this set fixes that and more closely mirrors what
  happens for typical pages, details below.

[1]: https://lore.kernel.org/nvdimm/166225775968.2351842.11156458342486082012.stgit@dwillia2-xfh.jf.intel.com/
---

Typical pages have their reference count elevated when they are
allocated and installed in the page cache, elevated again when they are
mapped into userspace, and elevated for gup. The DAX-gup mistake is that
page-references were only ever taken for gup and the device backing the
memory was only pinned (get_dev_pagemap()) at gup time. That leaves a
hole where the page is mapped for userspace access without a pin on the
device.

Rework the DAX page reference scheme to be more like that of typical
pages. DAX pages start life at reference count 0 and have their
reference count elevated at map time and at gup time. Unlike typical
pages, which can be safely truncated from files while they are pinned
for gup, DAX pages can only be truncated while their reference count is
0. The device is pinned via get_dev_pagemap() whenever a DAX page makes
the _refcount 0 -> 1 transition, and unpinned only after the 1 -> 0
transition once the page has been truncated from its host inode.
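
As a rough sketch of that pin/unpin pairing (illustrative only, with
made-up helper names; the real transitions are implemented by
dax_insert_entry(), dax_zap_mappings(), and dax_delete_mapping_entry()
in the patches below, under the Xarray lock):

	/* first mapping of the page: pin the hosting device */
	static vm_fault_t dax_take_initial_ref(struct page *page)
	{
		if (page_ref_count(page) == 0 &&
		    !get_dev_pagemap(page_to_pfn(page), NULL))
			return VM_FAULT_SIGBUS;
		page_ref_inc(page);
		return 0;
	}

	/* last reference dropped and entry truncated: unpin the device */
	static void dax_drop_final_ref(struct page *page)
	{
		if (page_ref_dec_and_test(page))
			put_dev_pagemap(page->pgmap);
	}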

To facilitate this reference counting and synchronization, a new
dax_zap_pages() operation is introduced before any truncate event. That
dax_zap_pages() operation is carried out as a side effect of any 'break
layouts' event. Effectively, dax_zap_pages() and the new DAX_ZAP flag
(in the DAX-inode i_pages entries) mimic what _mapcount tracks for
typical pages. The zap state allows the Xarray to cache page->mapping
information for entries until the page _refcount drops to zero and the
page is finally truncated from the file / no longer in use.

This hackery continues the status of DAX pages as special cases in the
VM. The thought is that carrying the Xarray / mapping infrastructure
forward still allows for the continuation of the page-less DAX effort.
Otherwise, the work to convert DAX pages to behave like typical
vm_normal_page() pages needs more investigation to untangle transparent
huge page assumptions.

This passes the "ndctl:dax" suite of tests from the ndctl project.
Thanks to Jason for the discussion on v1 that led to this new approach.

---

Dan Williams (18):
      fsdax: Wait on @page not @page->_refcount
      fsdax: Use dax_page_idle() to document DAX busy page checking
      fsdax: Include unmapped inodes for page-idle detection
      ext4: Add ext4_break_layouts() to the inode eviction path
      xfs: Add xfs_break_layouts() to the inode eviction path
      fsdax: Rework dax_layout_busy_page() to dax_zap_mappings()
      fsdax: Update dax_insert_entry() calling convention to return an error
      fsdax: Cleanup dax_associate_entry()
      fsdax: Rework dax_insert_entry() calling convention
      fsdax: Manage pgmap references at entry insertion and deletion
      devdax: Minor warning fixups
      devdax: Move address_space helpers to the DAX core
      dax: Prep mapping helpers for compound pages
      devdax: add PUD support to the DAX mapping infrastructure
      devdax: Use dax_insert_entry() + dax_delete_mapping_entry()
      mm/memremap_pages: Support initializing pages to a zero reference count
      fsdax: Delete put_devmap_managed_page_refs()
      mm/gup: Drop DAX pgmap accounting


 .clang-format             |    1 
 drivers/Makefile          |    2 
 drivers/dax/Kconfig       |    5 
 drivers/dax/Makefile      |    1 
 drivers/dax/bus.c         |   15 +
 drivers/dax/dax-private.h |    2 
 drivers/dax/device.c      |   74 ++-
 drivers/dax/mapping.c     | 1055 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/dax/super.c       |    6 
 drivers/nvdimm/Kconfig    |    1 
 drivers/nvdimm/pmem.c     |    2 
 fs/dax.c                  | 1049 ++-------------------------------------------
 fs/ext4/inode.c           |   17 +
 fs/fuse/dax.c             |    9 
 fs/xfs/xfs_file.c         |   16 -
 fs/xfs/xfs_inode.c        |    7 
 fs/xfs/xfs_inode.h        |    6 
 fs/xfs/xfs_super.c        |   22 +
 include/linux/dax.h       |  128 ++++-
 include/linux/huge_mm.h   |   23 -
 include/linux/memremap.h  |   29 +
 include/linux/mm.h        |   30 -
 mm/gup.c                  |   89 +---
 mm/huge_memory.c          |   54 --
 mm/memremap.c             |   46 +-
 mm/page_alloc.c           |    2 
 mm/swap.c                 |    2 
 27 files changed, 1415 insertions(+), 1278 deletions(-)
 create mode 100644 drivers/dax/mapping.c

base-commit: 1c23f9e627a7b412978b4e852793c5e3c3efc555


* [PATCH v2 01/18] fsdax: Wait on @page not @page->_refcount
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
@ 2022-09-16  3:35 ` Dan Williams
  2022-09-20 14:30   ` Jason Gunthorpe
  2022-09-16  3:35 ` [PATCH v2 02/18] fsdax: Use dax_page_idle() to document DAX busy page checking Dan Williams
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:35 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

The __wait_var_event facility calculates a wait queue from a hash of the
address of the variable being passed. Use the @page argument directly as
it is less to type and is the object that is being waited upon.
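
The only requirement of the wait_var_event machinery is that the waiter
and the waker hash the same key; a minimal illustration of the pairing
this patch switches to (not taken from the patch itself):

	/* waiter: keyed on the page itself, not &page->_refcount */
	___wait_var_event(page, page_ref_count(page) == 1,
			  TASK_INTERRUPTIBLE, 0, 0, schedule());

	/* waker: must pass the same key that the waiter hashed */
	if (page_ref_sub_return(page, 1) == 1)
		wake_up_var(page);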

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/ext4/inode.c   |    8 ++++----
 fs/fuse/dax.c     |    6 +++---
 fs/xfs/xfs_file.c |    6 +++---
 mm/memremap.c     |    2 +-
 4 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 601214453c3a..b028a4413bea 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3961,10 +3961,10 @@ int ext4_break_layouts(struct inode *inode)
 		if (!page)
 			return 0;
 
-		error = ___wait_var_event(&page->_refcount,
-				atomic_read(&page->_refcount) == 1,
-				TASK_INTERRUPTIBLE, 0, 0,
-				ext4_wait_dax_page(inode));
+		error = ___wait_var_event(page,
+					  atomic_read(&page->_refcount) == 1,
+					  TASK_INTERRUPTIBLE, 0, 0,
+					  ext4_wait_dax_page(inode));
 	} while (error == 0);
 
 	return error;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index e23e802a8013..4e12108c68af 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -676,9 +676,9 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(&page->_refcount,
-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-			0, 0, fuse_wait_dax_page(inode));
+	return ___wait_var_event(page, atomic_read(&page->_refcount) == 1,
+				 TASK_INTERRUPTIBLE, 0, 0,
+				 fuse_wait_dax_page(inode));
 }
 
 /* dmap_end == 0 leads to unmapping of whole file */
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c6c80265c0b2..73e7b7ec0a4c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -827,9 +827,9 @@ xfs_break_dax_layouts(
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(&page->_refcount,
-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-			0, 0, xfs_wait_dax_page(inode));
+	return ___wait_var_event(page, atomic_read(&page->_refcount) == 1,
+				 TASK_INTERRUPTIBLE, 0, 0,
+				 xfs_wait_dax_page(inode));
 }
 
 int
diff --git a/mm/memremap.c b/mm/memremap.c
index 58b20c3c300b..95f6ffe9cb0f 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -520,7 +520,7 @@ bool __put_devmap_managed_page_refs(struct page *page, int refs)
 	 * stable because nobody holds a reference on the page.
 	 */
 	if (page_ref_sub_return(page, refs) == 1)
-		wake_up_var(&page->_refcount);
+		wake_up_var(page);
 	return true;
 }
 EXPORT_SYMBOL(__put_devmap_managed_page_refs);



* [PATCH v2 02/18] fsdax: Use dax_page_idle() to document DAX busy page checking
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
  2022-09-16  3:35 ` [PATCH v2 01/18] fsdax: Wait on @page not @page->_refcount Dan Williams
@ 2022-09-16  3:35 ` Dan Williams
  2022-09-20 14:31   ` Jason Gunthorpe
  2022-09-16  3:35 ` [PATCH v2 03/18] fsdax: Include unmapped inodes for page-idle detection Dan Williams
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:35 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

In advance of converting DAX pages to be 0-based, use a new
dax_page_idle() helper both to simplify that future conversion and to
document all the kernel locations that are watching for DAX page idle
events.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c            |    4 ++--
 fs/ext4/inode.c     |    3 +--
 fs/fuse/dax.c       |    5 ++---
 fs/xfs/xfs_file.c   |    5 ++---
 include/linux/dax.h |    9 +++++++++
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index c440dcef4b1b..e762b9c04fb4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -395,7 +395,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(trunc && !dax_page_idle(page));
 		if (dax_mapping_is_cow(page->mapping)) {
 			/* keep the CoW flag if this page is still shared */
 			if (page->index-- > 0)
@@ -414,7 +414,7 @@ static struct page *dax_busy_page(void *entry)
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		if (page_ref_count(page) > 1)
+		if (!dax_page_idle(page))
 			return page;
 	}
 	return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b028a4413bea..478ec6bc0935 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3961,8 +3961,7 @@ int ext4_break_layouts(struct inode *inode)
 		if (!page)
 			return 0;
 
-		error = ___wait_var_event(page,
-					  atomic_read(&page->_refcount) == 1,
+		error = ___wait_var_event(page, dax_page_idle(page),
 					  TASK_INTERRUPTIBLE, 0, 0,
 					  ext4_wait_dax_page(inode));
 	} while (error == 0);
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 4e12108c68af..ae52ef7dbabe 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -676,9 +676,8 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(page, atomic_read(&page->_refcount) == 1,
-				 TASK_INTERRUPTIBLE, 0, 0,
-				 fuse_wait_dax_page(inode));
+	return ___wait_var_event(page, dax_page_idle(page), TASK_INTERRUPTIBLE,
+				 0, 0, fuse_wait_dax_page(inode));
 }
 
 /* dmap_end == 0 leads to unmapping of whole file */
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 73e7b7ec0a4c..556e28d06788 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -827,9 +827,8 @@ xfs_break_dax_layouts(
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(page, atomic_read(&page->_refcount) == 1,
-				 TASK_INTERRUPTIBLE, 0, 0,
-				 xfs_wait_dax_page(inode));
+	return ___wait_var_event(page, dax_page_idle(page), TASK_INTERRUPTIBLE,
+				 0, 0, xfs_wait_dax_page(inode));
 }
 
 int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index ba985333e26b..04987d14d7e0 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -210,6 +210,15 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 		const struct iomap_ops *ops);
 
+/*
+ * Document all the code locations that want to know when a dax page is
+ * unreferenced.
+ */
+static inline bool dax_page_idle(struct page *page)
+{
+	return page_ref_count(page) == 1;
+}
+
 #if IS_ENABLED(CONFIG_DAX)
 int dax_read_lock(void);
 void dax_read_unlock(int id);



* [PATCH v2 03/18] fsdax: Include unmapped inodes for page-idle detection
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
  2022-09-16  3:35 ` [PATCH v2 01/18] fsdax: Wait on @page not @page->_refcount Dan Williams
  2022-09-16  3:35 ` [PATCH v2 02/18] fsdax: Use dax_page_idle() to document DAX busy page checking Dan Williams
@ 2022-09-16  3:35 ` Dan Williams
  2022-09-16  3:35 ` [PATCH v2 04/18] ext4: Add ext4_break_layouts() to the inode eviction path Dan Williams
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:35 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

A page can remain pinned even after it has been unmapped from userspace
/ removed from the rmap. In advance of requiring that all
dax_insert_entry() events are followed up by 'break layouts' before a
truncate event, make sure that 'break layouts' can find unmapped
entries.
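
For illustration, a minimal userspace sketch of that scenario (not part
of this series; the paths are assumptions: /mnt/dax/file lives on a
DAX-capable filesystem, /data/src on a regular block-device-backed
filesystem; requires liburing, error handling elided):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <liburing.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;
		int dax_fd, src_fd;
		void *buf;

		dax_fd = open("/mnt/dax/file", O_RDWR | O_CREAT, 0600);
		ftruncate(dax_fd, 4096);
		src_fd = open("/data/src", O_RDONLY | O_DIRECT);

		/* MAP_SHARED of a DAX file maps device pages directly */
		buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
			   dax_fd, 0);

		io_uring_queue_init(8, &ring, 0);
		sqe = io_uring_get_sqe(&ring);
		/* O_DIRECT read into the DAX mapping: gup pins the page */
		io_uring_prep_read(sqe, src_fd, buf, 4096, 0);
		io_uring_submit(&ring);

		/*
		 * Tear down the mapping while the read may still be in
		 * flight: the rmap no longer references the DAX page, but
		 * its _refcount stays elevated until the I/O completes.
		 */
		munmap(buf, 4096);

		io_uring_wait_cqe(&ring, &cqe);	/* pin dropped by now */
		io_uring_cqe_seen(&ring, cqe);
		io_uring_queue_exit(&ring);
		return 0;
	}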

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index e762b9c04fb4..76bad1c095c0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -698,7 +698,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return NULL;
 
-	if (!dax_mapping(mapping) || !mapping_mapped(mapping))
+	if (!dax_mapping(mapping))
 		return NULL;
 
 	/* If end == LLONG_MAX, all pages from start to till end of file */



* [PATCH v2 04/18] ext4: Add ext4_break_layouts() to the inode eviction path
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (2 preceding siblings ...)
  2022-09-16  3:35 ` [PATCH v2 03/18] fsdax: Include unmapped inodes for page-idle detection Dan Williams
@ 2022-09-16  3:35 ` Dan Williams
  2022-09-16  3:35 ` [PATCH v2 05/18] xfs: Add xfs_break_layouts() " Dan Williams
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:35 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

In preparation for moving DAX pages to be 0-based rather than 1-based
for the idle refcount, the fsdax core wants to have all mappings in a
"zapped" state before truncate. For typical pages this happens
naturally via unmap_mapping_range(); for DAX pages some help is needed
to record this state in the 'struct address_space' of the inode(s)
where the page is mapped.

That "zapped" state is recorded in DAX entries as a side effect of
ext4_break_layouts(). Arrange for it to be called before all truncation
events, which already happens for truncate() and PUNCH_HOLE, but not
for truncate_inode_pages_final(). Arrange for ext4_break_layouts() to
also be called before truncate_inode_pages_final().

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/ext4/inode.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 478ec6bc0935..326269ad3961 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -207,7 +207,11 @@ void ext4_evict_inode(struct inode *inode)
 			jbd2_complete_transaction(journal, commit_tid);
 			filemap_write_and_wait(&inode->i_data);
 		}
+
+		filemap_invalidate_lock(inode->i_mapping);
+		ext4_break_layouts(inode);
 		truncate_inode_pages_final(&inode->i_data);
+		filemap_invalidate_unlock(inode->i_mapping);
 
 		goto no_delete;
 	}
@@ -218,7 +222,11 @@ void ext4_evict_inode(struct inode *inode)
 
 	if (ext4_should_order_data(inode))
 		ext4_begin_ordered_truncate(inode, 0);
+
+	filemap_invalidate_lock(inode->i_mapping);
+	ext4_break_layouts(inode);
 	truncate_inode_pages_final(&inode->i_data);
+	filemap_invalidate_unlock(inode->i_mapping);
 
 	/*
 	 * For inodes with journalled data, transaction commit could have



* [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (3 preceding siblings ...)
  2022-09-16  3:35 ` [PATCH v2 04/18] ext4: Add ext4_break_layouts() to the inode eviction path Dan Williams
@ 2022-09-16  3:35 ` Dan Williams
  2022-09-18 22:57   ` Dave Chinner
  2022-09-16  3:35 ` [PATCH v2 06/18] fsdax: Rework dax_layout_busy_page() to dax_zap_mappings() Dan Williams
                   ` (14 subsequent siblings)
  19 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:35 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

In preparation for moving DAX pages to be 0-based rather than 1-based
for the idle refcount, the fsdax core wants to have all mappings in a
"zapped" state before truncate. For typical pages this happens
naturally via unmap_mapping_range(); for DAX pages some help is needed
to record this state in the 'struct address_space' of the inode(s)
where the page is mapped.

That "zapped" state is recorded in DAX entries as a side effect of
xfs_break_layouts(). Arrange for it to be called before all truncation
events, which already happens for truncate() and PUNCH_HOLE, but not
for truncate_inode_pages_final(). Arrange for xfs_break_layouts() to
also be called before truncate_inode_pages_final().

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_file.c  |   13 +++++++++----
 fs/xfs/xfs_inode.c |    3 ++-
 fs/xfs/xfs_inode.h |    6 ++++--
 fs/xfs/xfs_super.c |   22 ++++++++++++++++++++++
 4 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 556e28d06788..d3ff692d5546 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -816,7 +816,8 @@ xfs_wait_dax_page(
 int
 xfs_break_dax_layouts(
 	struct inode		*inode,
-	bool			*retry)
+	bool			*retry,
+	int			state)
 {
 	struct page		*page;
 
@@ -827,8 +828,8 @@ xfs_break_dax_layouts(
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(page, dax_page_idle(page), TASK_INTERRUPTIBLE,
-				 0, 0, xfs_wait_dax_page(inode));
+	return ___wait_var_event(page, dax_page_idle(page), state, 0, 0,
+				 xfs_wait_dax_page(inode));
 }
 
 int
@@ -839,14 +840,18 @@ xfs_break_layouts(
 {
 	bool			retry;
 	int			error;
+	int			state = TASK_INTERRUPTIBLE;
 
 	ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
 
 	do {
 		retry = false;
 		switch (reason) {
+		case BREAK_UNMAP_FINAL:
+			state = TASK_UNINTERRUPTIBLE;
+			fallthrough;
 		case BREAK_UNMAP:
-			error = xfs_break_dax_layouts(inode, &retry);
+			error = xfs_break_dax_layouts(inode, &retry, state);
 			if (error || retry)
 				break;
 			fallthrough;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 28493c8e9bb2..72ce1cb72736 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3452,6 +3452,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
 	struct xfs_inode	*ip1,
 	struct xfs_inode	*ip2)
 {
+	int			state = TASK_INTERRUPTIBLE;
 	int			error;
 	bool			retry;
 	struct page		*page;
@@ -3463,7 +3464,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
 	retry = false;
 	/* Lock the first inode */
 	xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
-	error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
+	error = xfs_break_dax_layouts(VFS_I(ip1), &retry, state);
 	if (error || retry) {
 		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
 		if (error == 0 && retry)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index fa780f08dc89..e4994eb6e521 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -454,11 +454,13 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
  * layout-holder has a consistent view of the file's extent map. While
  * BREAK_WRITE breaks can be satisfied by recalling FL_LAYOUT leases,
  * BREAK_UNMAP breaks additionally require waiting for busy dax-pages to
- * go idle.
+ * go idle. BREAK_UNMAP_FINAL is an uninterruptible version of
+ * BREAK_UNMAP.
  */
 enum layout_break_reason {
         BREAK_WRITE,
         BREAK_UNMAP,
+        BREAK_UNMAP_FINAL,
 };
 
 /*
@@ -531,7 +533,7 @@ xfs_itruncate_extents(
 }
 
 /* from xfs_file.c */
-int	xfs_break_dax_layouts(struct inode *inode, bool *retry);
+int	xfs_break_dax_layouts(struct inode *inode, bool *retry, int state);
 int	xfs_break_layouts(struct inode *inode, uint *iolock,
 		enum layout_break_reason reason);
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 9ac59814bbb6..ebb4a6eba3fc 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -725,6 +725,27 @@ xfs_fs_drop_inode(
 	return generic_drop_inode(inode);
 }
 
+STATIC void
+xfs_fs_evict_inode(
+	struct inode		*inode)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
+	long			error;
+
+	xfs_ilock(ip, iolock);
+
+	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP_FINAL);
+
+	/* The final layout break is uninterruptible */
+	ASSERT_ALWAYS(!error);
+
+	truncate_inode_pages_final(&inode->i_data);
+	clear_inode(inode);
+
+	xfs_iunlock(ip, iolock);
+}
+
 static void
 xfs_mount_free(
 	struct xfs_mount	*mp)
@@ -1144,6 +1165,7 @@ static const struct super_operations xfs_super_operations = {
 	.destroy_inode		= xfs_fs_destroy_inode,
 	.dirty_inode		= xfs_fs_dirty_inode,
 	.drop_inode		= xfs_fs_drop_inode,
+	.evict_inode		= xfs_fs_evict_inode,
 	.put_super		= xfs_fs_put_super,
 	.sync_fs		= xfs_fs_sync_fs,
 	.freeze_fs		= xfs_fs_freeze,



* [PATCH v2 06/18] fsdax: Rework dax_layout_busy_page() to dax_zap_mappings()
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (4 preceding siblings ...)
  2022-09-16  3:35 ` [PATCH v2 05/18] xfs: Add xfs_break_layouts() " Dan Williams
@ 2022-09-16  3:35 ` Dan Williams
  2022-09-16  3:35 ` [PATCH v2 07/18] fsdax: Update dax_insert_entry() calling convention to return an error Dan Williams
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:35 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

In preparation for moving the truncate vs DAX-busy-page detection from
detecting _refcount == 1 to _refcount == 0, change the busy page
tracking to take references at dax_insert_entry(), drop references at
dax_zap_mappings() time, and finally clean out the entries at
dax_delete_mapping_entry() time.

This approach relies on all paths that call truncate_inode_pages()
first calling dax_zap_mappings(). This mirrors the zapped state of
pages after unmap_mapping_range(), but since DAX pages do not maintain
_mapcount this DAX-specific flow is introduced. This approach helps
address the immediate _refcount problem, but continues to kick the "DAX
without pages?" question down the road.
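
The ordering this imposes on filesystems, sketched (illustrative only;
fs_break_layouts() is a hypothetical stand-in for the per-fs helper,
i.e. ext4_break_layouts() / xfs_break_layouts() elsewhere in this
series):

	/*
	 * Zap entries (drop the mapping references, set DAX_ZAP), wait
	 * for any gup pins to drain, and only then truncate the
	 * now-idle entries.
	 */
	filemap_invalidate_lock(inode->i_mapping);
	fs_break_layouts(inode);	/* loops over dax_zap_mappings() */
	truncate_inode_pages_final(&inode->i_data);
	filemap_invalidate_unlock(inode->i_mapping);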

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c            |   82 ++++++++++++++++++++++++++++++++++++---------------
 fs/ext4/inode.c     |    2 +
 fs/fuse/dax.c       |    4 +-
 fs/xfs/xfs_file.c   |    2 +
 fs/xfs/xfs_inode.c  |    4 +-
 include/linux/dax.h |   11 ++++---
 6 files changed, 71 insertions(+), 34 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 76bad1c095c0..616bac4b7df3 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -74,11 +74,12 @@ fs_initcall(init_dax_wait_table);
  * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
  * block allocation.
  */
-#define DAX_SHIFT	(4)
+#define DAX_SHIFT	(5)
 #define DAX_LOCKED	(1UL << 0)
 #define DAX_PMD		(1UL << 1)
 #define DAX_ZERO_PAGE	(1UL << 2)
 #define DAX_EMPTY	(1UL << 3)
+#define DAX_ZAP		(1UL << 4)
 
 static unsigned long dax_to_pfn(void *entry)
 {
@@ -95,6 +96,11 @@ static bool dax_is_locked(void *entry)
 	return xa_to_value(entry) & DAX_LOCKED;
 }
 
+static bool dax_is_zapped(void *entry)
+{
+	return xa_to_value(entry) & DAX_ZAP;
+}
+
 static unsigned int dax_entry_order(void *entry)
 {
 	if (xa_to_value(entry) & DAX_PMD)
@@ -380,6 +386,7 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
 			WARN_ON_ONCE(page->mapping);
 			page->mapping = mapping;
 			page->index = index + i++;
+			page_ref_inc(page);
 		}
 	}
 }
@@ -395,31 +402,20 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		WARN_ON_ONCE(trunc && !dax_page_idle(page));
 		if (dax_mapping_is_cow(page->mapping)) {
 			/* keep the CoW flag if this page is still shared */
 			if (page->index-- > 0)
 				continue;
-		} else
+		} else {
+			WARN_ON_ONCE(trunc && !dax_is_zapped(entry));
+			WARN_ON_ONCE(trunc && !dax_page_idle(page));
 			WARN_ON_ONCE(page->mapping && page->mapping != mapping);
+		}
 		page->mapping = NULL;
 		page->index = 0;
 	}
 }
 
-static struct page *dax_busy_page(void *entry)
-{
-	unsigned long pfn;
-
-	for_each_mapped_pfn(entry, pfn) {
-		struct page *page = pfn_to_page(pfn);
-
-		if (!dax_page_idle(page))
-			return page;
-	}
-	return NULL;
-}
-
 /*
  * dax_lock_page - Lock the DAX entry corresponding to a page
  * @page: The page whose entry we want to lock
@@ -664,8 +660,46 @@ static void *grab_mapping_entry(struct xa_state *xas,
 	return xa_mk_internal(VM_FAULT_FALLBACK);
 }
 
+static void *dax_zap_entry(struct xa_state *xas, void *entry)
+{
+	unsigned long v = xa_to_value(entry);
+
+	return xas_store(xas, xa_mk_value(v | DAX_ZAP));
+}
+
+/**
+ * Return NULL if the entry is zapped and all pages in the entry are
+ * idle, otherwise return the non-idle page in the entry
+ */
+static struct page *dax_zap_pages(struct xa_state *xas, void *entry)
+{
+	struct page *ret = NULL;
+	unsigned long pfn;
+	bool zap;
+
+	if (!dax_entry_size(entry))
+		return NULL;
+
+	zap = !dax_is_zapped(entry);
+
+	for_each_mapped_pfn(entry, pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (zap)
+			page_ref_dec(page);
+
+		if (!ret && !dax_page_idle(page))
+			ret = page;
+	}
+
+	if (zap)
+		dax_zap_entry(xas, entry);
+
+	return ret;
+}
+
 /**
- * dax_layout_busy_page_range - find first pinned page in @mapping
+ * dax_zap_mappings_range - find first pinned page in @mapping
  * @mapping: address space to scan for a page with ref count > 1
  * @start: Starting offset. Page containing 'start' is included.
  * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
@@ -682,8 +716,8 @@ static void *grab_mapping_entry(struct xa_state *xas,
  * to be able to run unmap_mapping_range() and subsequently not race
  * mapping_mapped() becoming true.
  */
-struct page *dax_layout_busy_page_range(struct address_space *mapping,
-					loff_t start, loff_t end)
+struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
+				    loff_t end)
 {
 	void *entry;
 	unsigned int scanned = 0;
@@ -727,7 +761,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
 		if (unlikely(dax_is_locked(entry)))
 			entry = get_unlocked_entry(&xas, 0);
 		if (entry)
-			page = dax_busy_page(entry);
+			page = dax_zap_pages(&xas, entry);
 		put_unlocked_entry(&xas, entry, WAKE_NEXT);
 		if (page)
 			break;
@@ -742,13 +776,13 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
 	xas_unlock_irq(&xas);
 	return page;
 }
-EXPORT_SYMBOL_GPL(dax_layout_busy_page_range);
+EXPORT_SYMBOL_GPL(dax_zap_mappings_range);
 
-struct page *dax_layout_busy_page(struct address_space *mapping)
+struct page *dax_zap_mappings(struct address_space *mapping)
 {
-	return dax_layout_busy_page_range(mapping, 0, LLONG_MAX);
+	return dax_zap_mappings_range(mapping, 0, LLONG_MAX);
 }
-EXPORT_SYMBOL_GPL(dax_layout_busy_page);
+EXPORT_SYMBOL_GPL(dax_zap_mappings);
 
 static int __dax_invalidate_entry(struct address_space *mapping,
 					  pgoff_t index, bool trunc)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 326269ad3961..0ce73af69c49 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3965,7 +3965,7 @@ int ext4_break_layouts(struct inode *inode)
 		return -EINVAL;
 
 	do {
-		page = dax_layout_busy_page(inode->i_mapping);
+		page = dax_zap_mappings(inode->i_mapping);
 		if (!page)
 			return 0;
 
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index ae52ef7dbabe..8cdc9402e8f7 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -443,7 +443,7 @@ static int fuse_setup_new_dax_mapping(struct inode *inode, loff_t pos,
 
 	/*
 	 * Can't do inline reclaim in fault path. We call
-	 * dax_layout_busy_page() before we free a range. And
+	 * dax_zap_mappings() before we free a range. And
 	 * fuse_wait_dax_page() drops mapping->invalidate_lock and requires it.
 	 * In fault path we enter with mapping->invalidate_lock held and can't
 	 * drop it. Also in fault path we hold mapping->invalidate_lock shared
@@ -671,7 +671,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
 {
 	struct page *page;
 
-	page = dax_layout_busy_page_range(inode->i_mapping, start, end);
+	page = dax_zap_mappings_range(inode->i_mapping, start, end);
 	if (!page)
 		return 0;
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index d3ff692d5546..918ab9130c96 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -823,7 +823,7 @@ xfs_break_dax_layouts(
 
 	ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));
 
-	page = dax_layout_busy_page(inode->i_mapping);
+	page = dax_zap_mappings(inode->i_mapping);
 	if (!page)
 		return 0;
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 72ce1cb72736..9bbc68500cec 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3482,8 +3482,8 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
 	 * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
 	 * for this nested lock case.
 	 */
-	page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
-	if (page && page_ref_count(page) != 1) {
+	page = dax_zap_mappings(VFS_I(ip2)->i_mapping);
+	if (page) {
 		xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
 		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
 		goto again;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 04987d14d7e0..f6acb4ed73cb 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -157,8 +157,9 @@ static inline void fs_put_dax(struct dax_device *dax_dev, void *holder)
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct dax_device *dax_dev, struct writeback_control *wbc);
 
-struct page *dax_layout_busy_page(struct address_space *mapping);
-struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
+struct page *dax_zap_mappings(struct address_space *mapping);
+struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
+				    loff_t end);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
 dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
@@ -166,12 +167,14 @@ dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
 void dax_unlock_mapping_entry(struct address_space *mapping,
 		unsigned long index, dax_entry_t cookie);
 #else
-static inline struct page *dax_layout_busy_page(struct address_space *mapping)
+static inline struct page *dax_zap_mappings(struct address_space *mapping)
 {
 	return NULL;
 }
 
-static inline struct page *dax_layout_busy_page_range(struct address_space *mapping, pgoff_t start, pgoff_t nr_pages)
+static inline struct page *dax_zap_mappings_range(struct address_space *mapping,
+						  pgoff_t start,
+						  pgoff_t nr_pages)
 {
 	return NULL;
 }



* [PATCH v2 07/18] fsdax: Update dax_insert_entry() calling convention to return an error
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (5 preceding siblings ...)
  2022-09-16  3:35 ` [PATCH v2 06/18] fsdax: Rework dax_layout_busy_page() to dax_zap_mappings() Dan Williams
@ 2022-09-16  3:35 ` Dan Williams
  2022-09-16  3:35 ` [PATCH v2 08/18] fsdax: Cleanup dax_associate_entry() Dan Williams
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:35 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

In preparation for teaching dax_insert_entry() to take live @pgmap
references, enable it to return errors. Given the observation that all
callers overwrite the passed in entry with the return value, just update
@entry in place and convert the return code to a vm_fault_t status.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 616bac4b7df3..8382aab0d2f7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -887,14 +887,15 @@ static bool dax_fault_is_cow(const struct iomap_iter *iter)
  * already in the tree, we will skip the insertion and just dirty the PMD as
  * appropriate.
  */
-static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
-		const struct iomap_iter *iter, void *entry, pfn_t pfn,
-		unsigned long flags)
+static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
+				   const struct iomap_iter *iter, void **pentry,
+				   pfn_t pfn, unsigned long flags)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	void *new_entry = dax_make_entry(pfn, flags);
 	bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
 	bool cow = dax_fault_is_cow(iter);
+	void *entry = *pentry;
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -940,7 +941,8 @@ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
 
 	xas_unlock_irq(xas);
-	return entry;
+	*pentry = entry;
+	return 0;
 }
 
 static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
@@ -1188,9 +1190,12 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 	pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
 	vm_fault_t ret;
 
-	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
+	ret = dax_insert_entry(xas, vmf, iter, entry, pfn, DAX_ZERO_PAGE);
+	if (ret)
+		goto out;
 
 	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
+out:
 	trace_dax_load_hole(inode, vmf, ret);
 	return ret;
 }
@@ -1207,6 +1212,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 	struct page *zero_page;
 	spinlock_t *ptl;
 	pmd_t pmd_entry;
+	vm_fault_t ret;
 	pfn_t pfn;
 
 	zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm);
@@ -1215,8 +1221,10 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 		goto fallback;
 
 	pfn = page_to_pfn_t(zero_page);
-	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn,
-				  DAX_PMD | DAX_ZERO_PAGE);
+	ret = dax_insert_entry(xas, vmf, iter, entry, pfn,
+			       DAX_PMD | DAX_ZERO_PAGE);
+	if (ret)
+		return ret;
 
 	if (arch_needs_pgtable_deposit()) {
 		pgtable = pte_alloc_one(vma->vm_mm);
@@ -1568,6 +1576,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 	loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
 	bool write = iter->flags & IOMAP_WRITE;
 	unsigned long entry_flags = pmd ? DAX_PMD : 0;
+	vm_fault_t ret;
 	int err = 0;
 	pfn_t pfn;
 	void *kaddr;
@@ -1592,7 +1601,9 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 	if (err)
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 
-	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, entry_flags);
+	ret = dax_insert_entry(xas, vmf, iter, entry, pfn, entry_flags);
+	if (ret)
+		return ret;
 
 	if (write &&
 	    srcmap->type != IOMAP_HOLE && srcmap->addr != iomap->addr) {



* [PATCH v2 08/18] fsdax: Cleanup dax_associate_entry()
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (6 preceding siblings ...)
  2022-09-16  3:35 ` [PATCH v2 07/18] fsdax: Update dax_insert_entry() calling convention to return an error Dan Williams
@ 2022-09-16  3:35 ` Dan Williams
  2022-09-16  3:36 ` [PATCH v2 09/18] fsdax: Rework dax_insert_entry() calling convention Dan Williams
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:35 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Pass @vmf to drop the separate @vma and @address arguments to
dax_associate_entry(), use the existing DAX flags to convey the @cow
argument, and replace the open-coded ALIGN().
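
For reference, the rounding semantics of the alignment helpers for a
power-of-two size (illustrative compile-time checks only, not from the
patch; dax_align_example() is a made-up name):

	static inline void dax_align_example(void)
	{
		/* include/linux/align.h: ALIGN() rounds up ... */
		BUILD_BUG_ON(ALIGN(0x201000UL, SZ_2M) != 0x400000UL);
		/* ... ALIGN_DOWN() and the open-coded mask round down */
		BUILD_BUG_ON(ALIGN_DOWN(0x201000UL, SZ_2M) != 0x200000UL);
		BUILD_BUG_ON((0x201000UL & ~(SZ_2M - 1)) != 0x200000UL);
	}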

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 8382aab0d2f7..bd5c6b6e371e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -368,7 +368,7 @@ static inline void dax_mapping_set_cow(struct page *page)
  * FS_DAX_MAPPING_COW, and use page->index as refcount.
  */
 static void dax_associate_entry(void *entry, struct address_space *mapping,
-		struct vm_area_struct *vma, unsigned long address, bool cow)
+				struct vm_fault *vmf, unsigned long flags)
 {
 	unsigned long size = dax_entry_size(entry), pfn, index;
 	int i = 0;
@@ -376,11 +376,11 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return;
 
-	index = linear_page_index(vma, address & ~(size - 1));
+	index = linear_page_index(vmf->vma, ALIGN(vmf->address, size));
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		if (cow) {
+		if (flags & DAX_COW) {
 			dax_mapping_set_cow(page);
 		} else {
 			WARN_ON_ONCE(page->mapping);
@@ -916,8 +916,7 @@ static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 		void *old;
 
 		dax_disassociate_entry(entry, mapping, false);
-		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
-				cow);
+		dax_associate_entry(new_entry, mapping, vmf, flags);
 		/*
 		 * Only swap our new entry into the page cache if the current
 		 * entry is a zero page or an empty entry.  If a normal PTE or



* [PATCH v2 09/18] fsdax: Rework dax_insert_entry() calling convention
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (7 preceding siblings ...)
  2022-09-16  3:35 ` [PATCH v2 08/18] fsdax: Cleanup dax_associate_entry() Dan Williams
@ 2022-09-16  3:36 ` Dan Williams
  2022-09-16  3:36 ` [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion Dan Williams
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:36 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Move the determination of @dirty and @cow in dax_insert_entry() to
flags (DAX_DIRTY and DAX_COW) that are passed in. This allows the
iomap-related code to remain in fs/dax.c in preparation for the Xarray
infrastructure to move to drivers/dax/mapping.c.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   44 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 35 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index bd5c6b6e371e..5d9f30105db4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -75,12 +75,20 @@ fs_initcall(init_dax_wait_table);
  * block allocation.
  */
 #define DAX_SHIFT	(5)
+#define DAX_MASK	((1UL << DAX_SHIFT) - 1)
 #define DAX_LOCKED	(1UL << 0)
 #define DAX_PMD		(1UL << 1)
 #define DAX_ZERO_PAGE	(1UL << 2)
 #define DAX_EMPTY	(1UL << 3)
 #define DAX_ZAP		(1UL << 4)
 
+/*
+ * These flags are not conveyed in Xarray value entries, they are just
+ * modifiers to dax_insert_entry().
+ */
+#define DAX_DIRTY (1UL << (DAX_SHIFT + 0))
+#define DAX_COW   (1UL << (DAX_SHIFT + 1))
+
 static unsigned long dax_to_pfn(void *entry)
 {
 	return xa_to_value(entry) >> DAX_SHIFT;
@@ -88,7 +96,8 @@ static unsigned long dax_to_pfn(void *entry)
 
 static void *dax_make_entry(pfn_t pfn, unsigned long flags)
 {
-	return xa_mk_value(flags | (pfn_t_to_pfn(pfn) << DAX_SHIFT));
+	return xa_mk_value((flags & DAX_MASK) |
+			   (pfn_t_to_pfn(pfn) << DAX_SHIFT));
 }
 
 static bool dax_is_locked(void *entry)
@@ -880,6 +889,20 @@ static bool dax_fault_is_cow(const struct iomap_iter *iter)
 		(iter->iomap.flags & IOMAP_F_SHARED);
 }
 
+static unsigned long dax_iter_flags(const struct iomap_iter *iter,
+				    struct vm_fault *vmf)
+{
+	unsigned long flags = 0;
+
+	if (!dax_fault_is_synchronous(iter, vmf->vma))
+		flags |= DAX_DIRTY;
+
+	if (dax_fault_is_cow(iter))
+		flags |= DAX_COW;
+
+	return flags;
+}
+
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
  * of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -888,13 +911,13 @@ static bool dax_fault_is_cow(const struct iomap_iter *iter)
  * appropriate.
  */
 static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
-				   const struct iomap_iter *iter, void **pentry,
-				   pfn_t pfn, unsigned long flags)
+				   void **pentry, pfn_t pfn,
+				   unsigned long flags)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	void *new_entry = dax_make_entry(pfn, flags);
-	bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
-	bool cow = dax_fault_is_cow(iter);
+	bool dirty = flags & DAX_DIRTY;
+	bool cow = flags & DAX_COW;
 	void *entry = *pentry;
 
 	if (dirty)
@@ -1189,7 +1212,8 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 	pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
 	vm_fault_t ret;
 
-	ret = dax_insert_entry(xas, vmf, iter, entry, pfn, DAX_ZERO_PAGE);
+	ret = dax_insert_entry(xas, vmf, entry, pfn,
+			       DAX_ZERO_PAGE | dax_iter_flags(iter, vmf));
 	if (ret)
 		goto out;
 
@@ -1220,8 +1244,9 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 		goto fallback;
 
 	pfn = page_to_pfn_t(zero_page);
-	ret = dax_insert_entry(xas, vmf, iter, entry, pfn,
-			       DAX_PMD | DAX_ZERO_PAGE);
+	ret = dax_insert_entry(xas, vmf, entry, pfn,
+			       DAX_PMD | DAX_ZERO_PAGE |
+				       dax_iter_flags(iter, vmf));
 	if (ret)
 		return ret;
 
@@ -1600,7 +1625,8 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 	if (err)
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 
-	ret = dax_insert_entry(xas, vmf, iter, entry, pfn, entry_flags);
+	ret = dax_insert_entry(xas, vmf, entry, pfn,
+			       entry_flags | dax_iter_flags(iter, vmf));
 	if (ret)
 		return ret;
 



* [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (8 preceding siblings ...)
  2022-09-16  3:36 ` [PATCH v2 09/18] fsdax: Rework dax_insert_entry() calling convention Dan Williams
@ 2022-09-16  3:36 ` Dan Williams
  2022-09-21 14:03   ` Jason Gunthorpe
  2022-09-16  3:36 ` [PATCH v2 11/18] devdax: Minor warning fixups Dan Williams
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:36 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

The percpu_ref in 'struct dev_pagemap' is used to coordinate active
mappings of device-memory with the device-removal / unbind path. It
enables the semantic that initiating device-removal (or
device-driver-unbind) blocks new mapping and DMA attempts, and waits for
mapping revocation or inflight DMA to complete.

Expand the scope of the reference count to pin the DAX device at
mapping time rather than later at the first gup event. With a device
reference held while any page on that device is mapped, the need to
manage pgmap reference counts in the gup code is eliminated. That
cleanup is saved for a follow-on change.

For now, teach dax_insert_entry() and dax_delete_mapping_entry() to
take and drop pgmap references respectively: dax_insert_entry() is
called to take the initial reference on the page, and
dax_delete_mapping_entry() is called once there are no outstanding
references to the given page(s).
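
A minimal sketch of the pairing this establishes, using the _many()
helpers added below (illustrative only; dax_entry_pin_pgmap() /
dax_entry_unpin_pgmap() are made-up names, and @nr_pages stands in for
the number of pfns covered by a DAX entry):

	/* dax_insert_entry() side: one pgmap reference per covered pfn */
	static vm_fault_t dax_entry_pin_pgmap(unsigned long pfn, int nr_pages)
	{
		struct dev_pagemap *pgmap;

		pgmap = get_dev_pagemap_many(pfn, NULL, nr_pages);
		if (!pgmap)
			return VM_FAULT_SIGBUS;
		return 0;
	}

	/* dax_delete_mapping_entry() side: drop them once the entry goes */
	static void dax_entry_unpin_pgmap(struct page *page, int nr_pages)
	{
		put_dev_pagemap_many(page->pgmap, nr_pages);
	}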

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c                 |   34 ++++++++++++++++++++++++++++------
 include/linux/memremap.h |   18 ++++++++++++++----
 mm/memremap.c            |   13 ++++++++-----
 3 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5d9f30105db4..ee2568c8b135 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -376,14 +376,26 @@ static inline void dax_mapping_set_cow(struct page *page)
  * whether this entry is shared by multiple files.  If so, set the page->mapping
  * FS_DAX_MAPPING_COW, and use page->index as refcount.
  */
-static void dax_associate_entry(void *entry, struct address_space *mapping,
-				struct vm_fault *vmf, unsigned long flags)
+static vm_fault_t dax_associate_entry(void *entry,
+				      struct address_space *mapping,
+				      struct vm_fault *vmf, unsigned long flags)
 {
 	unsigned long size = dax_entry_size(entry), pfn, index;
+	struct dev_pagemap *pgmap;
 	int i = 0;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-		return;
+		return 0;
+
+	if (!size)
+		return 0;
+
+	if (!(flags & DAX_COW)) {
+		pfn = dax_to_pfn(entry);
+		pgmap = get_dev_pagemap_many(pfn, NULL, PHYS_PFN(size));
+		if (!pgmap)
+			return VM_FAULT_SIGBUS;
+	}
 
 	index = linear_page_index(vmf->vma, ALIGN(vmf->address, size));
 	for_each_mapped_pfn(entry, pfn) {
@@ -398,19 +410,24 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
 			page_ref_inc(page);
 		}
 	}
+
+	return 0;
 }
 
 static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		bool trunc)
 {
-	unsigned long pfn;
+	unsigned long size = dax_entry_size(entry), pfn;
+	struct page *page;
 
 	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
 		return;
 
-	for_each_mapped_pfn(entry, pfn) {
-		struct page *page = pfn_to_page(pfn);
+	if (!size)
+		return;
 
+	for_each_mapped_pfn(entry, pfn) {
+		page = pfn_to_page(pfn);
 		if (dax_mapping_is_cow(page->mapping)) {
 			/* keep the CoW flag if this page is still shared */
 			if (page->index-- > 0)
@@ -423,6 +440,11 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		page->mapping = NULL;
 		page->index = 0;
 	}
+
+	if (trunc && !dax_mapping_is_cow(page->mapping)) {
+		page = pfn_to_page(dax_to_pfn(entry));
+		put_dev_pagemap_many(page->pgmap, PHYS_PFN(size));
+	}
 }
 
 /*
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index c3b4cc84877b..fd57407e7f3d 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -191,8 +191,13 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid);
 void memunmap_pages(struct dev_pagemap *pgmap);
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
-struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
-		struct dev_pagemap *pgmap);
+struct dev_pagemap *get_dev_pagemap_many(unsigned long pfn,
+					 struct dev_pagemap *pgmap, int refs);
+static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
+						  struct dev_pagemap *pgmap)
+{
+	return get_dev_pagemap_many(pfn, pgmap, 1);
+}
 bool pgmap_pfn_valid(struct dev_pagemap *pgmap, unsigned long pfn);
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
@@ -244,10 +249,15 @@ static inline unsigned long memremap_compat_align(void)
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
-static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
+static inline void put_dev_pagemap_many(struct dev_pagemap *pgmap, int refs)
 {
 	if (pgmap)
-		percpu_ref_put(&pgmap->ref);
+		percpu_ref_put_many(&pgmap->ref, refs);
+}
+
+static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
+{
+	put_dev_pagemap_many(pgmap, 1);
 }
 
 #endif /* _LINUX_MEMREMAP_H_ */
diff --git a/mm/memremap.c b/mm/memremap.c
index 95f6ffe9cb0f..83c5e6fafd84 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -430,15 +430,16 @@ void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
 }
 
 /**
- * get_dev_pagemap() - take a new live reference on the dev_pagemap for @pfn
+ * get_dev_pagemap_many() - take new live references(s) on the dev_pagemap for @pfn
  * @pfn: page frame number to lookup page_map
  * @pgmap: optional known pgmap that already has a reference
+ * @refs: number of references to take
  *
  * If @pgmap is non-NULL and covers @pfn it will be returned as-is.  If @pgmap
  * is non-NULL but does not cover @pfn the reference to it will be released.
  */
-struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
-		struct dev_pagemap *pgmap)
+struct dev_pagemap *get_dev_pagemap_many(unsigned long pfn,
+					 struct dev_pagemap *pgmap, int refs)
 {
 	resource_size_t phys = PFN_PHYS(pfn);
 
@@ -454,13 +455,15 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 	/* fall back to slow path lookup */
 	rcu_read_lock();
 	pgmap = xa_load(&pgmap_array, PHYS_PFN(phys));
-	if (pgmap && !percpu_ref_tryget_live(&pgmap->ref))
+	if (pgmap && !percpu_ref_tryget_live_rcu(&pgmap->ref))
 		pgmap = NULL;
+	if (pgmap && refs > 1)
+		percpu_ref_get_many(&pgmap->ref, refs - 1);
 	rcu_read_unlock();
 
 	return pgmap;
 }
-EXPORT_SYMBOL_GPL(get_dev_pagemap);
+EXPORT_SYMBOL_GPL(get_dev_pagemap_many);
 
 void free_zone_device_page(struct page *page)
 {



* [PATCH v2 11/18] devdax: Minor warning fixups
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (9 preceding siblings ...)
  2022-09-16  3:36 ` [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion Dan Williams
@ 2022-09-16  3:36 ` Dan Williams
  2022-09-16  3:36 ` [PATCH v2 12/18] devdax: Move address_space helpers to the DAX core Dan Williams
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:36 UTC (permalink / raw)
  To: akpm; +Cc: hch, linux-fsdevel, nvdimm, linux-xfs, linux-mm, linux-ext4

Fix a missing prototype warning for dev_dax_probe(), and fix
dax_holder() comment block format.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/dax-private.h |    1 +
 drivers/dax/super.c       |    2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 1c974b7caae6..202cafd836e8 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -87,6 +87,7 @@ static inline struct dax_mapping *to_dax_mapping(struct device *dev)
 }
 
 phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, unsigned long size);
+int dev_dax_probe(struct dev_dax *dev_dax);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline bool dax_align_valid(unsigned long align)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9b5e2a5eb0ae..4909ad945a49 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -475,7 +475,7 @@ EXPORT_SYMBOL_GPL(put_dax);
 /**
  * dax_holder() - obtain the holder of a dax device
  * @dax_dev: a dax_device instance
-
+ *
  * Return: the holder's data which represents the holder if registered,
  * otherwize NULL.
  */



* [PATCH v2 12/18] devdax: Move address_space helpers to the DAX core
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (10 preceding siblings ...)
  2022-09-16  3:36 ` [PATCH v2 11/18] devdax: Minor warning fixups Dan Williams
@ 2022-09-16  3:36 ` Dan Williams
  2022-09-27  6:20   ` Alistair Popple
  2022-09-16  3:36 ` [PATCH v2 13/18] dax: Prep mapping helpers for compound pages Dan Williams
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:36 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

In preparation for evicting get_dev_pagemap() and
put_devmap_managed_page() from code paths outside of DAX, device-dax
needs to track mapping references similarly to the tracking done for
fsdax. Reuse the same infrastructure as fsdax (dax_insert_entry() and
dax_delete_mapping_entry()). For now, just move that infrastructure
into a common location with no other code changes.

The move involves splitting iomap and supporting helpers into fs/dax.c
and all 'struct address_space' and DAX-entry manipulation into
drivers/dax/mapping.c. grab_mapping_entry() is renamed
dax_grab_mapping_entry(), and some common definitions and declarations
are moved to include/linux/dax.h.

No functional change is intended, just code movement.

The interactions between drivers/dax/mapping.o and mm/memory-failure.o
result in drivers/dax/mapping.o, and the rest of the dax core, losing
the option to be compiled as a module. That can be addressed later,
given that CONFIG_FS_DAX has always forced the dax core to be compiled
in. I.e. this is only a vmlinux size regression for CONFIG_FS_DAX=n and
CONFIG_DEV_DAX=m builds.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 .clang-format             |    1 
 drivers/Makefile          |    2 
 drivers/dax/Kconfig       |    4 
 drivers/dax/Makefile      |    1 
 drivers/dax/dax-private.h |    1 
 drivers/dax/mapping.c     | 1010 +++++++++++++++++++++++++++++++++++++++++
 drivers/dax/super.c       |    4 
 drivers/nvdimm/Kconfig    |    1 
 fs/dax.c                  | 1109 +--------------------------------------------
 include/linux/dax.h       |  110 +++-
 include/linux/memremap.h  |    6 
 11 files changed, 1143 insertions(+), 1106 deletions(-)
 create mode 100644 drivers/dax/mapping.c

diff --git a/.clang-format b/.clang-format
index 1247d54f9e49..336fa266386e 100644
--- a/.clang-format
+++ b/.clang-format
@@ -269,6 +269,7 @@ ForEachMacros:
   - 'for_each_link_cpus'
   - 'for_each_link_platforms'
   - 'for_each_lru'
+  - 'for_each_mapped_pfn'
   - 'for_each_matching_node'
   - 'for_each_matching_node_and_match'
   - 'for_each_mem_pfn_range'
diff --git a/drivers/Makefile b/drivers/Makefile
index 057857258bfd..ec6c4146b966 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -71,7 +71,7 @@ obj-$(CONFIG_FB_INTEL)          += video/fbdev/intelfb/
 obj-$(CONFIG_PARPORT)		+= parport/
 obj-y				+= base/ block/ misc/ mfd/ nfc/
 obj-$(CONFIG_LIBNVDIMM)		+= nvdimm/
-obj-$(CONFIG_DAX)		+= dax/
+obj-y				+= dax/
 obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
 obj-$(CONFIG_NUBUS)		+= nubus/
 obj-y				+= cxl/
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 5fdf269a822e..205e9dda8928 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -1,8 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0-only
 menuconfig DAX
-	tristate "DAX: direct access to differentiated memory"
+	bool "DAX: direct access to differentiated memory"
+	depends on MMU
 	select SRCU
-	default m if NVDIMM_DAX
 
 if DAX
 
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 90a56ca3b345..3546bca7adbf 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -6,6 +6,7 @@ obj-$(CONFIG_DEV_DAX_PMEM) += dax_pmem.o
 
 dax-y := super.o
 dax-y += bus.o
+dax-y += mapping.o
 device_dax-y := device.o
 dax_pmem-y := pmem.o
 
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 202cafd836e8..19076f9d5c51 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -15,6 +15,7 @@ struct dax_device *inode_dax(struct inode *inode);
 struct inode *dax_inode(struct dax_device *dax_dev);
 int dax_bus_init(void);
 void dax_bus_exit(void);
+void dax_mapping_init(void);
 
 /**
  * struct dax_region - mapping infrastructure for dax devices
diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
new file mode 100644
index 000000000000..70576aa02148
--- /dev/null
+++ b/drivers/dax/mapping.c
@@ -0,0 +1,1010 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Direct Access mapping infrastructure split from fs/dax.c
+ * Copyright (c) 2013-2014 Intel Corporation
+ * Author: Matthew Wilcox <matthew.r.wilcox@intel.com>
+ * Author: Ross Zwisler <ross.zwisler@linux.intel.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/dax.h>
+#include <linux/rmap.h>
+#include <linux/pfn_t.h>
+#include <linux/sizes.h>
+#include <linux/pagemap.h>
+
+#include "dax-private.h"
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/fs_dax.h>
+
+/* We choose 4096 entries - same as per-zone page wait tables */
+#define DAX_WAIT_TABLE_BITS 12
+#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
+
+static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+
+void __init dax_mapping_init(void)
+{
+	int i;
+
+	for (i = 0; i < DAX_WAIT_TABLE_ENTRIES; i++)
+		init_waitqueue_head(wait_table + i);
+}
+
+static unsigned long dax_to_pfn(void *entry)
+{
+	return xa_to_value(entry) >> DAX_SHIFT;
+}
+
+static void *dax_make_entry(pfn_t pfn, unsigned long flags)
+{
+	return xa_mk_value((flags & DAX_MASK) |
+			   (pfn_t_to_pfn(pfn) << DAX_SHIFT));
+}
+
+static bool dax_is_locked(void *entry)
+{
+	return xa_to_value(entry) & DAX_LOCKED;
+}
+
+static bool dax_is_zapped(void *entry)
+{
+	return xa_to_value(entry) & DAX_ZAP;
+}
+
+static unsigned int dax_entry_order(void *entry)
+{
+	if (xa_to_value(entry) & DAX_PMD)
+		return PMD_ORDER;
+	return 0;
+}
+
+static unsigned long dax_is_pmd_entry(void *entry)
+{
+	return xa_to_value(entry) & DAX_PMD;
+}
+
+static bool dax_is_pte_entry(void *entry)
+{
+	return !(xa_to_value(entry) & DAX_PMD);
+}
+
+static int dax_is_zero_entry(void *entry)
+{
+	return xa_to_value(entry) & DAX_ZERO_PAGE;
+}
+
+static int dax_is_empty_entry(void *entry)
+{
+	return xa_to_value(entry) & DAX_EMPTY;
+}
+
+/*
+ * true if the entry that was found is of a smaller order than the entry
+ * we were looking for
+ */
+static bool dax_is_conflict(void *entry)
+{
+	return entry == XA_RETRY_ENTRY;
+}
+
+/*
+ * DAX page cache entry locking
+ */
+struct exceptional_entry_key {
+	struct xarray *xa;
+	pgoff_t entry_start;
+};
+
+struct wait_exceptional_entry_queue {
+	wait_queue_entry_t wait;
+	struct exceptional_entry_key key;
+};
+
+/**
+ * enum dax_wake_mode: waitqueue wakeup behaviour
+ * @WAKE_ALL: wake all waiters in the waitqueue
+ * @WAKE_NEXT: wake only the first waiter in the waitqueue
+ */
+enum dax_wake_mode {
+	WAKE_ALL,
+	WAKE_NEXT,
+};
+
+static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas, void *entry,
+					      struct exceptional_entry_key *key)
+{
+	unsigned long hash;
+	unsigned long index = xas->xa_index;
+
+	/*
+	 * If 'entry' is a PMD, align the 'index' that we use for the wait
+	 * queue to the start of that PMD.  This ensures that all offsets in
+	 * the range covered by the PMD map to the same bit lock.
+	 */
+	if (dax_is_pmd_entry(entry))
+		index &= ~PG_PMD_COLOUR;
+	key->xa = xas->xa;
+	key->entry_start = index;
+
+	hash = hash_long((unsigned long)xas->xa ^ index, DAX_WAIT_TABLE_BITS);
+	return wait_table + hash;
+}
+
+static int wake_exceptional_entry_func(wait_queue_entry_t *wait,
+				       unsigned int mode, int sync, void *keyp)
+{
+	struct exceptional_entry_key *key = keyp;
+	struct wait_exceptional_entry_queue *ewait =
+		container_of(wait, struct wait_exceptional_entry_queue, wait);
+
+	if (key->xa != ewait->key.xa ||
+	    key->entry_start != ewait->key.entry_start)
+		return 0;
+	return autoremove_wake_function(wait, mode, sync, NULL);
+}
+
+/*
+ * @entry may no longer be the entry at the index in the mapping.
+ * The important information it's conveying is whether the entry at
+ * this index used to be a PMD entry.
+ */
+static void dax_wake_entry(struct xa_state *xas, void *entry,
+			   enum dax_wake_mode mode)
+{
+	struct exceptional_entry_key key;
+	wait_queue_head_t *wq;
+
+	wq = dax_entry_waitqueue(xas, entry, &key);
+
+	/*
+	 * Checking for locked entry and prepare_to_wait_exclusive() happens
+	 * under the i_pages lock, ditto for entry handling in our callers.
+	 * So at this point all tasks that could have seen our entry locked
+	 * must be in the waitqueue and the following check will see them.
+	 */
+	if (waitqueue_active(wq))
+		__wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key);
+}
+
+/*
+ * Look up entry in page cache, wait for it to become unlocked if it
+ * is a DAX entry and return it.  The caller must subsequently call
+ * put_unlocked_entry() if it did not lock the entry or dax_unlock_entry()
+ * if it did.  The entry returned may have a larger order than @order.
+ * If @order is larger than the order of the entry found in i_pages, this
+ * function returns a dax_is_conflict entry.
+ *
+ * Must be called with the i_pages lock held.
+ */
+static void *get_unlocked_entry(struct xa_state *xas, unsigned int order)
+{
+	void *entry;
+	struct wait_exceptional_entry_queue ewait;
+	wait_queue_head_t *wq;
+
+	init_wait(&ewait.wait);
+	ewait.wait.func = wake_exceptional_entry_func;
+
+	for (;;) {
+		entry = xas_find_conflict(xas);
+		if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
+			return entry;
+		if (dax_entry_order(entry) < order)
+			return XA_RETRY_ENTRY;
+		if (!dax_is_locked(entry))
+			return entry;
+
+		wq = dax_entry_waitqueue(xas, entry, &ewait.key);
+		prepare_to_wait_exclusive(wq, &ewait.wait,
+					  TASK_UNINTERRUPTIBLE);
+		xas_unlock_irq(xas);
+		xas_reset(xas);
+		schedule();
+		finish_wait(wq, &ewait.wait);
+		xas_lock_irq(xas);
+	}
+}
+
+/*
+ * The only thing keeping the address space around is the i_pages lock
+ * (it's cycled in clear_inode() after removing the entries from i_pages)
+ * After we call xas_unlock_irq(), we cannot touch xas->xa.
+ */
+static void wait_entry_unlocked(struct xa_state *xas, void *entry)
+{
+	struct wait_exceptional_entry_queue ewait;
+	wait_queue_head_t *wq;
+
+	init_wait(&ewait.wait);
+	ewait.wait.func = wake_exceptional_entry_func;
+
+	wq = dax_entry_waitqueue(xas, entry, &ewait.key);
+	/*
+	 * Unlike get_unlocked_entry() there is no guarantee that this
+	 * path ever successfully retrieves an unlocked entry before an
+	 * inode dies. Perform a non-exclusive wait in case this path
+	 * never successfully performs its own wake up.
+	 */
+	prepare_to_wait(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
+	xas_unlock_irq(xas);
+	schedule();
+	finish_wait(wq, &ewait.wait);
+}
+
+static void put_unlocked_entry(struct xa_state *xas, void *entry,
+			       enum dax_wake_mode mode)
+{
+	if (entry && !dax_is_conflict(entry))
+		dax_wake_entry(xas, entry, mode);
+}
+
+/*
+ * We used the xa_state to get the entry, but then we locked the entry and
+ * dropped the xa_lock, so we know the xa_state is stale and must be reset
+ * before use.
+ */
+void dax_unlock_entry(struct xa_state *xas, void *entry)
+{
+	void *old;
+
+	WARN_ON(dax_is_locked(entry));
+	xas_reset(xas);
+	xas_lock_irq(xas);
+	old = xas_store(xas, entry);
+	xas_unlock_irq(xas);
+	WARN_ON(!dax_is_locked(old));
+	dax_wake_entry(xas, entry, WAKE_NEXT);
+}
+
+/*
+ * Return: The entry stored at this location before it was locked.
+ */
+static void *dax_lock_entry(struct xa_state *xas, void *entry)
+{
+	unsigned long v = xa_to_value(entry);
+
+	return xas_store(xas, xa_mk_value(v | DAX_LOCKED));
+}
+
+static unsigned long dax_entry_size(void *entry)
+{
+	if (dax_is_zero_entry(entry))
+		return 0;
+	else if (dax_is_empty_entry(entry))
+		return 0;
+	else if (dax_is_pmd_entry(entry))
+		return PMD_SIZE;
+	else
+		return PAGE_SIZE;
+}
+
+static unsigned long dax_end_pfn(void *entry)
+{
+	return dax_to_pfn(entry) + dax_entry_size(entry) / PAGE_SIZE;
+}
+
+/*
+ * Iterate through all mapped pfns represented by an entry, i.e. skip
+ * 'empty' and 'zero' entries.
+ */
+#define for_each_mapped_pfn(entry, pfn) \
+	for (pfn = dax_to_pfn(entry); pfn < dax_end_pfn(entry); pfn++)
+
+static bool dax_mapping_is_cow(struct address_space *mapping)
+{
+	return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
+}
+
+/*
+ * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
+ */
+static void dax_mapping_set_cow(struct page *page)
+{
+	if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) {
+		/*
+		 * Reset the index if the page was already mapped
+		 * regularly before.
+		 */
+		if (page->mapping)
+			page->index = 1;
+		page->mapping = (void *)PAGE_MAPPING_DAX_COW;
+	}
+	page->index++;
+}
+
+/*
+ * When called from dax_insert_entry(), the cow flag indicates whether this
+ * entry is shared by multiple files.  If so, set page->mapping to
+ * FS_DAX_MAPPING_COW and use page->index as a refcount.
+ */
+static vm_fault_t dax_associate_entry(void *entry,
+				      struct address_space *mapping,
+				      struct vm_fault *vmf, unsigned long flags)
+{
+	unsigned long size = dax_entry_size(entry), pfn, index;
+	struct dev_pagemap *pgmap;
+	int i = 0;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return 0;
+
+	if (!size)
+		return 0;
+
+	if (!(flags & DAX_COW)) {
+		pfn = dax_to_pfn(entry);
+		pgmap = get_dev_pagemap_many(pfn, NULL, PHYS_PFN(size));
+		if (!pgmap)
+			return VM_FAULT_SIGBUS;
+	}
+
+	index = linear_page_index(vmf->vma, ALIGN(vmf->address, size));
+	for_each_mapped_pfn(entry, pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (flags & DAX_COW) {
+			dax_mapping_set_cow(page);
+		} else {
+			WARN_ON_ONCE(page->mapping);
+			page->mapping = mapping;
+			page->index = index + i++;
+			page_ref_inc(page);
+		}
+	}
+
+	return 0;
+}
+
+static void dax_disassociate_entry(void *entry, struct address_space *mapping,
+		bool trunc)
+{
+	unsigned long size = dax_entry_size(entry), pfn;
+	struct page *page;
+
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return;
+
+	if (!size)
+		return;
+
+	for_each_mapped_pfn(entry, pfn) {
+		page = pfn_to_page(pfn);
+		if (dax_mapping_is_cow(page->mapping)) {
+			/* keep the CoW flag if this page is still shared */
+			if (page->index-- > 0)
+				continue;
+		} else {
+			WARN_ON_ONCE(trunc && !dax_is_zapped(entry));
+			WARN_ON_ONCE(trunc && !dax_page_idle(page));
+			WARN_ON_ONCE(page->mapping && page->mapping != mapping);
+		}
+		page->mapping = NULL;
+		page->index = 0;
+	}
+
+	if (trunc && !dax_mapping_is_cow(page->mapping)) {
+		page = pfn_to_page(dax_to_pfn(entry));
+		put_dev_pagemap_many(page->pgmap, PHYS_PFN(size));
+	}
+}
+
+/*
+ * dax_lock_page - Lock the DAX entry corresponding to a page
+ * @page: The page whose entry we want to lock
+ *
+ * Context: Process context.
+ * Return: A cookie to pass to dax_unlock_page() or 0 if the entry could
+ * not be locked.
+ */
+dax_entry_t dax_lock_page(struct page *page)
+{
+	XA_STATE(xas, NULL, 0);
+	void *entry;
+
+	/* Ensure page->mapping isn't freed while we look at it */
+	rcu_read_lock();
+	for (;;) {
+		struct address_space *mapping = READ_ONCE(page->mapping);
+
+		entry = NULL;
+		if (!mapping || !dax_mapping(mapping))
+			break;
+
+		/*
+		 * In the device-dax case there's no need to lock, a
+		 * struct dev_pagemap pin is sufficient to keep the
+		 * inode alive, and we assume we have dev_pagemap pin
+		 * otherwise we would not have a valid pfn_to_page()
+		 * translation.
+		 */
+		entry = (void *)~0UL;
+		if (S_ISCHR(mapping->host->i_mode))
+			break;
+
+		xas.xa = &mapping->i_pages;
+		xas_lock_irq(&xas);
+		if (mapping != page->mapping) {
+			xas_unlock_irq(&xas);
+			continue;
+		}
+		xas_set(&xas, page->index);
+		entry = xas_load(&xas);
+		if (dax_is_locked(entry)) {
+			rcu_read_unlock();
+			wait_entry_unlocked(&xas, entry);
+			rcu_read_lock();
+			continue;
+		}
+		dax_lock_entry(&xas, entry);
+		xas_unlock_irq(&xas);
+		break;
+	}
+	rcu_read_unlock();
+	return (dax_entry_t)entry;
+}
+
+void dax_unlock_page(struct page *page, dax_entry_t cookie)
+{
+	struct address_space *mapping = page->mapping;
+	XA_STATE(xas, &mapping->i_pages, page->index);
+
+	if (S_ISCHR(mapping->host->i_mode))
+		return;
+
+	dax_unlock_entry(&xas, (void *)cookie);
+}
+
+/*
+ * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping
+ * @mapping: the file's mapping whose entry we want to lock
+ * @index: the offset within this file
+ * @page: output the dax page corresponding to this dax entry
+ *
+ * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry
+ * could not be locked.
+ */
+dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t index,
+				   struct page **page)
+{
+	XA_STATE(xas, NULL, 0);
+	void *entry;
+
+	rcu_read_lock();
+	for (;;) {
+		entry = NULL;
+		if (!dax_mapping(mapping))
+			break;
+
+		xas.xa = &mapping->i_pages;
+		xas_lock_irq(&xas);
+		xas_set(&xas, index);
+		entry = xas_load(&xas);
+		if (dax_is_locked(entry)) {
+			rcu_read_unlock();
+			wait_entry_unlocked(&xas, entry);
+			rcu_read_lock();
+			continue;
+		}
+		if (!entry || dax_is_zero_entry(entry) ||
+		    dax_is_empty_entry(entry)) {
+			/*
+			 * Because we are looking up the entry by the file's
+			 * mapping and index, it may not have been inserted
+			 * yet, or it may be a zero/empty entry.  That is not
+			 * an error case, so return a special value and do
+			 * not output @page.
+			 */
+			entry = (void *)~0UL;
+		} else {
+			*page = pfn_to_page(dax_to_pfn(entry));
+			dax_lock_entry(&xas, entry);
+		}
+		xas_unlock_irq(&xas);
+		break;
+	}
+	rcu_read_unlock();
+	return (dax_entry_t)entry;
+}
+
+void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index,
+			      dax_entry_t cookie)
+{
+	XA_STATE(xas, &mapping->i_pages, index);
+
+	if (cookie == ~0UL)
+		return;
+
+	dax_unlock_entry(&xas, (void *)cookie);
+}
+
+/*
+ * Find page cache entry at given index. If it is a DAX entry, return it
+ * with the entry locked. If the page cache doesn't contain an entry at
+ * that index, add a locked empty entry.
+ *
+ * When requesting an entry with size DAX_PMD, dax_grab_mapping_entry() will
+ * either return that locked entry or will return VM_FAULT_FALLBACK.
+ * This will happen if there are any PTE entries within the PMD range
+ * that we are requesting.
+ *
+ * We always favor PTE entries over PMD entries. There isn't a flow where we
+ * evict PTE entries in order to 'upgrade' them to a PMD entry.  A PMD
+ * insertion will fail if it finds any PTE entries already in the tree, and a
+ * PTE insertion will cause an existing PMD entry to be unmapped and
+ * downgraded to PTE entries.  This happens for both PMD zero pages as
+ * well as PMD empty entries.
+ *
+ * The exception to this downgrade path is for PMD entries that have
+ * real storage backing them.  We will leave these real PMD entries in
+ * the tree, and PTE writes will simply dirty the entire PMD entry.
+ *
+ * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
+ * persistent memory the benefit is doubtful. We can add that later if we can
+ * show it helps.
+ *
+ * On error, this function does not return an ERR_PTR.  Instead it returns
+ * a VM_FAULT code, encoded as an xarray internal entry.  The ERR_PTR values
+ * overlap with xarray value entries.
+ */
+void *dax_grab_mapping_entry(struct xa_state *xas,
+			     struct address_space *mapping, unsigned int order)
+{
+	unsigned long index = xas->xa_index;
+	bool pmd_downgrade; /* splitting PMD entry into PTE entries? */
+	void *entry;
+
+retry:
+	pmd_downgrade = false;
+	xas_lock_irq(xas);
+	entry = get_unlocked_entry(xas, order);
+
+	if (entry) {
+		if (dax_is_conflict(entry))
+			goto fallback;
+		if (!xa_is_value(entry)) {
+			xas_set_err(xas, -EIO);
+			goto out_unlock;
+		}
+
+		if (order == 0) {
+			if (dax_is_pmd_entry(entry) &&
+			    (dax_is_zero_entry(entry) ||
+			     dax_is_empty_entry(entry))) {
+				pmd_downgrade = true;
+			}
+		}
+	}
+
+	if (pmd_downgrade) {
+		/*
+		 * Make sure 'entry' remains valid while we drop
+		 * the i_pages lock.
+		 */
+		dax_lock_entry(xas, entry);
+
+		/*
+		 * Besides huge zero pages the only other thing that gets
+		 * downgraded are empty entries which don't need to be
+		 * unmapped.
+		 */
+		if (dax_is_zero_entry(entry)) {
+			xas_unlock_irq(xas);
+			unmap_mapping_pages(mapping,
+					    xas->xa_index & ~PG_PMD_COLOUR,
+					    PG_PMD_NR, false);
+			xas_reset(xas);
+			xas_lock_irq(xas);
+		}
+
+		dax_disassociate_entry(entry, mapping, false);
+		xas_store(xas, NULL); /* undo the PMD join */
+		dax_wake_entry(xas, entry, WAKE_ALL);
+		mapping->nrpages -= PG_PMD_NR;
+		entry = NULL;
+		xas_set(xas, index);
+	}
+
+	if (entry) {
+		dax_lock_entry(xas, entry);
+	} else {
+		unsigned long flags = DAX_EMPTY;
+
+		if (order > 0)
+			flags |= DAX_PMD;
+		entry = dax_make_entry(pfn_to_pfn_t(0), flags);
+		dax_lock_entry(xas, entry);
+		if (xas_error(xas))
+			goto out_unlock;
+		mapping->nrpages += 1UL << order;
+	}
+
+out_unlock:
+	xas_unlock_irq(xas);
+	if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
+		goto retry;
+	if (xas->xa_node == XA_ERROR(-ENOMEM))
+		return xa_mk_internal(VM_FAULT_OOM);
+	if (xas_error(xas))
+		return xa_mk_internal(VM_FAULT_SIGBUS);
+	return entry;
+fallback:
+	xas_unlock_irq(xas);
+	return xa_mk_internal(VM_FAULT_FALLBACK);
+}
+
+static void *dax_zap_entry(struct xa_state *xas, void *entry)
+{
+	unsigned long v = xa_to_value(entry);
+
+	return xas_store(xas, xa_mk_value(v | DAX_ZAP));
+}
+
+/*
+ * Return NULL if the entry is zapped and all pages in the entry are
+ * idle, otherwise return the first non-idle page in the entry.
+ */
+static struct page *dax_zap_pages(struct xa_state *xas, void *entry)
+{
+	struct page *ret = NULL;
+	unsigned long pfn;
+	bool zap;
+
+	if (!dax_entry_size(entry))
+		return NULL;
+
+	zap = !dax_is_zapped(entry);
+
+	for_each_mapped_pfn(entry, pfn) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (zap)
+			page_ref_dec(page);
+
+		if (!ret && !dax_page_idle(page))
+			ret = page;
+	}
+
+	if (zap)
+		dax_zap_entry(xas, entry);
+
+	return ret;
+}
+
+/**
+ * dax_zap_mappings_range - find first pinned page in @mapping
+ * @mapping: address space to scan for a page with ref count > 1
+ * @start: Starting offset. Page containing 'start' is included.
+ * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
+ *       pages from 'start' till the end of file are included.
+ *
+ * DAX requires ZONE_DEVICE mapped pages. These pages are never
+ * 'onlined' to the page allocator so they are considered idle when
+ * page->count == 1. A filesystem uses this interface to determine if
+ * any page in the mapping is busy, i.e. for DMA, or other
+ * get_user_pages() usages.
+ *
+ * It is expected that the filesystem is holding locks to block the
+ * establishment of new mappings in this address_space. I.e. it expects
+ * to be able to run unmap_mapping_range() and subsequently not race
+ * mapping_mapped() becoming true.
+ */
+struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
+				    loff_t end)
+{
+	void *entry;
+	unsigned int scanned = 0;
+	struct page *page = NULL;
+	pgoff_t start_idx = start >> PAGE_SHIFT;
+	pgoff_t end_idx;
+	XA_STATE(xas, &mapping->i_pages, start_idx);
+
+	/*
+	 * In the 'limited' case get_user_pages() for dax is disabled.
+	 */
+	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+		return NULL;
+
+	if (!dax_mapping(mapping))
+		return NULL;
+
+	/* If end == LLONG_MAX, all pages from start till end of file */
+	if (end == LLONG_MAX)
+		end_idx = ULONG_MAX;
+	else
+		end_idx = end >> PAGE_SHIFT;
+	/*
+	 * If we race get_user_pages_fast() here either we'll see the
+	 * elevated page count in the iteration and wait, or
+	 * get_user_pages_fast() will see that the page it took a reference
+	 * against is no longer mapped in the page tables and bail to the
+	 * get_user_pages() slow path.  The slow path is protected by
+	 * pte_lock() and pmd_lock(). New references are not taken without
+	 * holding those locks, and unmap_mapping_pages() will not zero the
+	 * pte or pmd without holding the respective lock, so we are
+	 * guaranteed to either see new references or prevent new
+	 * references from being established.
+	 */
+	unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
+
+	xas_lock_irq(&xas);
+	xas_for_each(&xas, entry, end_idx) {
+		if (WARN_ON_ONCE(!xa_is_value(entry)))
+			continue;
+		if (unlikely(dax_is_locked(entry)))
+			entry = get_unlocked_entry(&xas, 0);
+		if (entry)
+			page = dax_zap_pages(&xas, entry);
+		put_unlocked_entry(&xas, entry, WAKE_NEXT);
+		if (page)
+			break;
+		if (++scanned % XA_CHECK_SCHED)
+			continue;
+
+		xas_pause(&xas);
+		xas_unlock_irq(&xas);
+		cond_resched();
+		xas_lock_irq(&xas);
+	}
+	xas_unlock_irq(&xas);
+	return page;
+}
+EXPORT_SYMBOL_GPL(dax_zap_mappings_range);
+
+struct page *dax_zap_mappings(struct address_space *mapping)
+{
+	return dax_zap_mappings_range(mapping, 0, LLONG_MAX);
+}
+EXPORT_SYMBOL_GPL(dax_zap_mappings);
+
+static int __dax_invalidate_entry(struct address_space *mapping, pgoff_t index,
+				  bool trunc)
+{
+	XA_STATE(xas, &mapping->i_pages, index);
+	int ret = 0;
+	void *entry;
+
+	xas_lock_irq(&xas);
+	entry = get_unlocked_entry(&xas, 0);
+	if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
+		goto out;
+	if (!trunc && (xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) ||
+		       xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE)))
+		goto out;
+	dax_disassociate_entry(entry, mapping, trunc);
+	xas_store(&xas, NULL);
+	mapping->nrpages -= 1UL << dax_entry_order(entry);
+	ret = 1;
+out:
+	put_unlocked_entry(&xas, entry, WAKE_ALL);
+	xas_unlock_irq(&xas);
+	return ret;
+}
+
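+/*
+ * Invalidate DAX entry if it is clean.
+ */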
+int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
+				      pgoff_t index)
+{
+	return __dax_invalidate_entry(mapping, index, false);
+}
+
+/*
+ * Delete DAX entry at @index from @mapping.  Wait for it
+ * to be unlocked before deleting it.
+ */
+int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+	int ret = __dax_invalidate_entry(mapping, index, true);
+
+	/*
+	 * This gets called from truncate / punch_hole path. As such, the caller
+	 * must hold locks protecting against concurrent modifications of the
+	 * page cache (usually fs-private i_mmap_sem for writing). Since the
+	 * caller has seen a DAX entry for this index, we better find it
+	 * at that index as well...
+	 */
+	WARN_ON_ONCE(!ret);
+	return ret;
+}
+
+/*
+ * By this point dax_grab_mapping_entry() has ensured that we have a locked entry
+ * of the appropriate size so we don't have to worry about downgrading PMDs to
+ * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
+ * already in the tree, we will skip the insertion and just dirty the PMD as
+ * appropriate.
+ */
+vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
+			    void **pentry, pfn_t pfn, unsigned long flags)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	void *new_entry = dax_make_entry(pfn, flags);
+	bool dirty = flags & DAX_DIRTY;
+	bool cow = flags & DAX_COW;
+	void *entry = *pentry;
+
+	if (dirty)
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
+		unsigned long index = xas->xa_index;
+		/* we are replacing a zero page with block mapping */
+		if (dax_is_pmd_entry(entry))
+			unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
+					    PG_PMD_NR, false);
+		else /* pte entry */
+			unmap_mapping_pages(mapping, index, 1, false);
+	}
+
+	xas_reset(xas);
+	xas_lock_irq(xas);
+	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+		void *old;
+
+		dax_disassociate_entry(entry, mapping, false);
+		dax_associate_entry(new_entry, mapping, vmf, flags);
+		/*
+		 * Only swap our new entry into the page cache if the current
+		 * entry is a zero page or an empty entry.  If a normal PTE or
+		 * PMD entry is already in the cache, we leave it alone.  This
+		 * means that if we are trying to insert a PTE and the
+		 * existing entry is a PMD, we will just leave the PMD in the
+		 * tree and dirty it if necessary.
+		 */
+		old = dax_lock_entry(xas, new_entry);
+		WARN_ON_ONCE(old !=
+			     xa_mk_value(xa_to_value(entry) | DAX_LOCKED));
+		entry = new_entry;
+	} else {
+		xas_load(xas); /* Walk the xa_state */
+	}
+
+	if (dirty)
+		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
+
+	if (cow)
+		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
+	xas_unlock_irq(xas);
+	*pentry = entry;
+	return 0;
+}
+
+int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
+		      struct address_space *mapping, void *entry)
+{
+	unsigned long pfn, index, count, end;
+	long ret = 0;
+	struct vm_area_struct *vma;
+
+	/*
+	 * A page got tagged dirty in DAX mapping? Something is seriously
+	 * wrong.
+	 */
+	if (WARN_ON(!xa_is_value(entry)))
+		return -EIO;
+
+	if (unlikely(dax_is_locked(entry))) {
+		void *old_entry = entry;
+
+		entry = get_unlocked_entry(xas, 0);
+
+		/* Entry got punched out / reallocated? */
+		if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
+			goto put_unlocked;
+		/*
+		 * Entry got reallocated elsewhere? No need to writeback.
+		 * We have to compare pfns as we must not bail out due to
+		 * difference in lockbit or entry type.
+		 */
+		if (dax_to_pfn(old_entry) != dax_to_pfn(entry))
+			goto put_unlocked;
+		if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
+					dax_is_zero_entry(entry))) {
+			ret = -EIO;
+			goto put_unlocked;
+		}
+
+		/* Another fsync thread may have already done this entry */
+		if (!xas_get_mark(xas, PAGECACHE_TAG_TOWRITE))
+			goto put_unlocked;
+	}
+
+	/* Lock the entry to serialize with page faults */
+	dax_lock_entry(xas, entry);
+
+	/*
+	 * We can clear the tag now but we have to be careful so that concurrent
+	 * dax_writeback_one() calls for the same index cannot finish before we
+	 * actually flush the caches. This is achieved as the calls will look
+	 * at the entry only under the i_pages lock and once they do that
+	 * they will see the entry locked and wait for it to unlock.
+	 */
+	xas_clear_mark(xas, PAGECACHE_TAG_TOWRITE);
+	xas_unlock_irq(xas);
+
+	/*
+	 * If dax_writeback_mapping_range() was given a wbc->range_start
+	 * in the middle of a PMD, the 'index' we use needs to be
+	 * aligned to the start of the PMD.
+	 * This allows us to flush for PMD_SIZE and not have to worry about
+	 * partial PMD writebacks.
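+	 * For example, with 2MiB PMDs (count = 512), an xa_index of
+	 * 0x203 is flushed as index 0x200 .. end 0x3ff.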
+	 */
+	pfn = dax_to_pfn(entry);
+	count = 1UL << dax_entry_order(entry);
+	index = xas->xa_index & ~(count - 1);
+	end = index + count - 1;
+
+	/* Walk all mappings of a given index of a file and writeprotect them */
+	i_mmap_lock_read(mapping);
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, end) {
+		pfn_mkclean_range(pfn, count, index, vma);
+		cond_resched();
+	}
+	i_mmap_unlock_read(mapping);
+
+	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
+	/*
+	 * After we have flushed the cache, we can clear the dirty tag. There
+	 * cannot be new dirty data in the pfn after the flush has completed as
+	 * the pfn mappings are writeprotected and fault waits for mapping
+	 * entry lock.
+	 */
+	xas_reset(xas);
+	xas_lock_irq(xas);
+	xas_store(xas, entry);
+	xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
+	dax_wake_entry(xas, entry, WAKE_NEXT);
+
+	trace_dax_writeback_one(mapping->host, index, count);
+	return ret;
+
+ put_unlocked:
+	put_unlocked_entry(xas, entry, WAKE_NEXT);
+	return ret;
+}
+
+/*
+ * dax_insert_pfn_mkwrite - insert PTE or PMD entry into page tables
+ * @vmf: The description of the fault
+ * @pfn: PFN to insert
+ * @order: Order of entry to insert.
+ *
+ * This function inserts a writeable PTE or PMD entry into the page tables
+ * for an mmaped DAX file.  It also marks the page cache entry as dirty.
+ */
+vm_fault_t dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn,
+				  unsigned int order)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, order);
+	void *entry;
+	vm_fault_t ret;
+
+	xas_lock_irq(&xas);
+	entry = get_unlocked_entry(&xas, order);
+	/* Did we race with someone splitting entry or so? */
+	if (!entry || dax_is_conflict(entry) ||
+	    (order == 0 && !dax_is_pte_entry(entry))) {
+		put_unlocked_entry(&xas, entry, WAKE_NEXT);
+		xas_unlock_irq(&xas);
+		trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
+						      VM_FAULT_NOPAGE);
+		return VM_FAULT_NOPAGE;
+	}
+	xas_set_mark(&xas, PAGECACHE_TAG_DIRTY);
+	dax_lock_entry(&xas, entry);
+	xas_unlock_irq(&xas);
+	if (order == 0)
+		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
+#ifdef CONFIG_FS_DAX_PMD
+	else if (order == PMD_ORDER)
+		ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
+#endif
+	else
+		ret = VM_FAULT_FALLBACK;
+	dax_unlock_entry(&xas, entry);
+	trace_dax_insert_pfn_mkwrite(mapping->host, vmf, ret);
+	return ret;
+}
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 4909ad945a49..0976857ec7f2 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -564,6 +564,8 @@ static int __init dax_core_init(void)
 	if (rc)
 		return rc;
 
+	dax_mapping_init();
+
 	rc = alloc_chrdev_region(&dax_devt, 0, MINORMASK+1, "dax");
 	if (rc)
 		goto err_chrdev;
@@ -590,5 +592,5 @@ static void __exit dax_core_exit(void)
 
 MODULE_AUTHOR("Intel Corporation");
 MODULE_LICENSE("GPL v2");
-subsys_initcall(dax_core_init);
+fs_initcall(dax_core_init);
 module_exit(dax_core_exit);
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 5a29046e3319..3bb17448d1c8 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -78,6 +78,7 @@ config NVDIMM_DAX
 	bool "NVDIMM DAX: Raw access to persistent memory"
 	default LIBNVDIMM
 	depends on NVDIMM_PFN
+	depends on DAX
 	help
 	  Support raw device dax access to a persistent memory
 	  namespace.  For environments that want to hard partition
diff --git a/fs/dax.c b/fs/dax.c
index ee2568c8b135..79e49e718d33 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -27,847 +27,8 @@
 #include <linux/rmap.h>
 #include <asm/pgalloc.h>
 
-#define CREATE_TRACE_POINTS
 #include <trace/events/fs_dax.h>
 
-static inline unsigned int pe_order(enum page_entry_size pe_size)
-{
-	if (pe_size == PE_SIZE_PTE)
-		return PAGE_SHIFT - PAGE_SHIFT;
-	if (pe_size == PE_SIZE_PMD)
-		return PMD_SHIFT - PAGE_SHIFT;
-	if (pe_size == PE_SIZE_PUD)
-		return PUD_SHIFT - PAGE_SHIFT;
-	return ~0;
-}
-
-/* We choose 4096 entries - same as per-zone page wait tables */
-#define DAX_WAIT_TABLE_BITS 12
-#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
-
-/* The 'colour' (ie low bits) within a PMD of a page offset.  */
-#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
-#define PG_PMD_NR	(PMD_SIZE >> PAGE_SHIFT)
-
-/* The order of a PMD entry */
-#define PMD_ORDER	(PMD_SHIFT - PAGE_SHIFT)
-
-static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
-
-static int __init init_dax_wait_table(void)
-{
-	int i;
-
-	for (i = 0; i < DAX_WAIT_TABLE_ENTRIES; i++)
-		init_waitqueue_head(wait_table + i);
-	return 0;
-}
-fs_initcall(init_dax_wait_table);
-
-/*
- * DAX pagecache entries use XArray value entries so they can't be mistaken
- * for pages.  We use one bit for locking, one bit for the entry size (PMD)
- * and two more to tell us if the entry is a zero page or an empty entry that
- * is just used for locking.  In total four special bits.
- *
- * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE
- * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
- * block allocation.
- */
-#define DAX_SHIFT	(5)
-#define DAX_MASK	((1UL << DAX_SHIFT) - 1)
-#define DAX_LOCKED	(1UL << 0)
-#define DAX_PMD		(1UL << 1)
-#define DAX_ZERO_PAGE	(1UL << 2)
-#define DAX_EMPTY	(1UL << 3)
-#define DAX_ZAP		(1UL << 4)
-
-/*
- * These flags are not conveyed in Xarray value entries, they are just
- * modifiers to dax_insert_entry().
- */
-#define DAX_DIRTY (1UL << (DAX_SHIFT + 0))
-#define DAX_COW   (1UL << (DAX_SHIFT + 1))
-
-static unsigned long dax_to_pfn(void *entry)
-{
-	return xa_to_value(entry) >> DAX_SHIFT;
-}
-
-static void *dax_make_entry(pfn_t pfn, unsigned long flags)
-{
-	return xa_mk_value((flags & DAX_MASK) |
-			   (pfn_t_to_pfn(pfn) << DAX_SHIFT));
-}
-
-static bool dax_is_locked(void *entry)
-{
-	return xa_to_value(entry) & DAX_LOCKED;
-}
-
-static bool dax_is_zapped(void *entry)
-{
-	return xa_to_value(entry) & DAX_ZAP;
-}
-
-static unsigned int dax_entry_order(void *entry)
-{
-	if (xa_to_value(entry) & DAX_PMD)
-		return PMD_ORDER;
-	return 0;
-}
-
-static unsigned long dax_is_pmd_entry(void *entry)
-{
-	return xa_to_value(entry) & DAX_PMD;
-}
-
-static bool dax_is_pte_entry(void *entry)
-{
-	return !(xa_to_value(entry) & DAX_PMD);
-}
-
-static int dax_is_zero_entry(void *entry)
-{
-	return xa_to_value(entry) & DAX_ZERO_PAGE;
-}
-
-static int dax_is_empty_entry(void *entry)
-{
-	return xa_to_value(entry) & DAX_EMPTY;
-}
-
-/*
- * true if the entry that was found is of a smaller order than the entry
- * we were looking for
- */
-static bool dax_is_conflict(void *entry)
-{
-	return entry == XA_RETRY_ENTRY;
-}
-
-/*
- * DAX page cache entry locking
- */
-struct exceptional_entry_key {
-	struct xarray *xa;
-	pgoff_t entry_start;
-};
-
-struct wait_exceptional_entry_queue {
-	wait_queue_entry_t wait;
-	struct exceptional_entry_key key;
-};
-
-/**
- * enum dax_wake_mode: waitqueue wakeup behaviour
- * @WAKE_ALL: wake all waiters in the waitqueue
- * @WAKE_NEXT: wake only the first waiter in the waitqueue
- */
-enum dax_wake_mode {
-	WAKE_ALL,
-	WAKE_NEXT,
-};
-
-static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
-		void *entry, struct exceptional_entry_key *key)
-{
-	unsigned long hash;
-	unsigned long index = xas->xa_index;
-
-	/*
-	 * If 'entry' is a PMD, align the 'index' that we use for the wait
-	 * queue to the start of that PMD.  This ensures that all offsets in
-	 * the range covered by the PMD map to the same bit lock.
-	 */
-	if (dax_is_pmd_entry(entry))
-		index &= ~PG_PMD_COLOUR;
-	key->xa = xas->xa;
-	key->entry_start = index;
-
-	hash = hash_long((unsigned long)xas->xa ^ index, DAX_WAIT_TABLE_BITS);
-	return wait_table + hash;
-}
-
-static int wake_exceptional_entry_func(wait_queue_entry_t *wait,
-		unsigned int mode, int sync, void *keyp)
-{
-	struct exceptional_entry_key *key = keyp;
-	struct wait_exceptional_entry_queue *ewait =
-		container_of(wait, struct wait_exceptional_entry_queue, wait);
-
-	if (key->xa != ewait->key.xa ||
-	    key->entry_start != ewait->key.entry_start)
-		return 0;
-	return autoremove_wake_function(wait, mode, sync, NULL);
-}
-
-/*
- * @entry may no longer be the entry at the index in the mapping.
- * The important information it's conveying is whether the entry at
- * this index used to be a PMD entry.
- */
-static void dax_wake_entry(struct xa_state *xas, void *entry,
-			   enum dax_wake_mode mode)
-{
-	struct exceptional_entry_key key;
-	wait_queue_head_t *wq;
-
-	wq = dax_entry_waitqueue(xas, entry, &key);
-
-	/*
-	 * Checking for locked entry and prepare_to_wait_exclusive() happens
-	 * under the i_pages lock, ditto for entry handling in our callers.
-	 * So at this point all tasks that could have seen our entry locked
-	 * must be in the waitqueue and the following check will see them.
-	 */
-	if (waitqueue_active(wq))
-		__wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key);
-}
-
-/*
- * Look up entry in page cache, wait for it to become unlocked if it
- * is a DAX entry and return it.  The caller must subsequently call
- * put_unlocked_entry() if it did not lock the entry or dax_unlock_entry()
- * if it did.  The entry returned may have a larger order than @order.
- * If @order is larger than the order of the entry found in i_pages, this
- * function returns a dax_is_conflict entry.
- *
- * Must be called with the i_pages lock held.
- */
-static void *get_unlocked_entry(struct xa_state *xas, unsigned int order)
-{
-	void *entry;
-	struct wait_exceptional_entry_queue ewait;
-	wait_queue_head_t *wq;
-
-	init_wait(&ewait.wait);
-	ewait.wait.func = wake_exceptional_entry_func;
-
-	for (;;) {
-		entry = xas_find_conflict(xas);
-		if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
-			return entry;
-		if (dax_entry_order(entry) < order)
-			return XA_RETRY_ENTRY;
-		if (!dax_is_locked(entry))
-			return entry;
-
-		wq = dax_entry_waitqueue(xas, entry, &ewait.key);
-		prepare_to_wait_exclusive(wq, &ewait.wait,
-					  TASK_UNINTERRUPTIBLE);
-		xas_unlock_irq(xas);
-		xas_reset(xas);
-		schedule();
-		finish_wait(wq, &ewait.wait);
-		xas_lock_irq(xas);
-	}
-}
-
-/*
- * The only thing keeping the address space around is the i_pages lock
- * (it's cycled in clear_inode() after removing the entries from i_pages)
- * After we call xas_unlock_irq(), we cannot touch xas->xa.
- */
-static void wait_entry_unlocked(struct xa_state *xas, void *entry)
-{
-	struct wait_exceptional_entry_queue ewait;
-	wait_queue_head_t *wq;
-
-	init_wait(&ewait.wait);
-	ewait.wait.func = wake_exceptional_entry_func;
-
-	wq = dax_entry_waitqueue(xas, entry, &ewait.key);
-	/*
-	 * Unlike get_unlocked_entry() there is no guarantee that this
-	 * path ever successfully retrieves an unlocked entry before an
-	 * inode dies. Perform a non-exclusive wait in case this path
-	 * never successfully performs its own wake up.
-	 */
-	prepare_to_wait(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
-	xas_unlock_irq(xas);
-	schedule();
-	finish_wait(wq, &ewait.wait);
-}
-
-static void put_unlocked_entry(struct xa_state *xas, void *entry,
-			       enum dax_wake_mode mode)
-{
-	if (entry && !dax_is_conflict(entry))
-		dax_wake_entry(xas, entry, mode);
-}
-
-/*
- * We used the xa_state to get the entry, but then we locked the entry and
- * dropped the xa_lock, so we know the xa_state is stale and must be reset
- * before use.
- */
-static void dax_unlock_entry(struct xa_state *xas, void *entry)
-{
-	void *old;
-
-	BUG_ON(dax_is_locked(entry));
-	xas_reset(xas);
-	xas_lock_irq(xas);
-	old = xas_store(xas, entry);
-	xas_unlock_irq(xas);
-	BUG_ON(!dax_is_locked(old));
-	dax_wake_entry(xas, entry, WAKE_NEXT);
-}
-
-/*
- * Return: The entry stored at this location before it was locked.
- */
-static void *dax_lock_entry(struct xa_state *xas, void *entry)
-{
-	unsigned long v = xa_to_value(entry);
-	return xas_store(xas, xa_mk_value(v | DAX_LOCKED));
-}
-
-static unsigned long dax_entry_size(void *entry)
-{
-	if (dax_is_zero_entry(entry))
-		return 0;
-	else if (dax_is_empty_entry(entry))
-		return 0;
-	else if (dax_is_pmd_entry(entry))
-		return PMD_SIZE;
-	else
-		return PAGE_SIZE;
-}
-
-static unsigned long dax_end_pfn(void *entry)
-{
-	return dax_to_pfn(entry) + dax_entry_size(entry) / PAGE_SIZE;
-}
-
-/*
- * Iterate through all mapped pfns represented by an entry, i.e. skip
- * 'empty' and 'zero' entries.
- */
-#define for_each_mapped_pfn(entry, pfn) \
-	for (pfn = dax_to_pfn(entry); \
-			pfn < dax_end_pfn(entry); pfn++)
-
-static inline bool dax_mapping_is_cow(struct address_space *mapping)
-{
-	return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
-}
-
-/*
- * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
- */
-static inline void dax_mapping_set_cow(struct page *page)
-{
-	if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) {
-		/*
-		 * Reset the index if the page was already mapped
-		 * regularly before.
-		 */
-		if (page->mapping)
-			page->index = 1;
-		page->mapping = (void *)PAGE_MAPPING_DAX_COW;
-	}
-	page->index++;
-}
-
-/*
- * When it is called in dax_insert_entry(), the cow flag will indicate that
- * whether this entry is shared by multiple files.  If so, set the page->mapping
- * FS_DAX_MAPPING_COW, and use page->index as refcount.
- */
-static vm_fault_t dax_associate_entry(void *entry,
-				      struct address_space *mapping,
-				      struct vm_fault *vmf, unsigned long flags)
-{
-	unsigned long size = dax_entry_size(entry), pfn, index;
-	struct dev_pagemap *pgmap;
-	int i = 0;
-
-	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-		return 0;
-
-	if (!size)
-		return 0;
-
-	if (!(flags & DAX_COW)) {
-		pfn = dax_to_pfn(entry);
-		pgmap = get_dev_pagemap_many(pfn, NULL, PHYS_PFN(size));
-		if (!pgmap)
-			return VM_FAULT_SIGBUS;
-	}
-
-	index = linear_page_index(vmf->vma, ALIGN(vmf->address, size));
-	for_each_mapped_pfn(entry, pfn) {
-		struct page *page = pfn_to_page(pfn);
-
-		if (flags & DAX_COW) {
-			dax_mapping_set_cow(page);
-		} else {
-			WARN_ON_ONCE(page->mapping);
-			page->mapping = mapping;
-			page->index = index + i++;
-			page_ref_inc(page);
-		}
-	}
-
-	return 0;
-}
-
-static void dax_disassociate_entry(void *entry, struct address_space *mapping,
-		bool trunc)
-{
-	unsigned long size = dax_entry_size(entry), pfn;
-	struct page *page;
-
-	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-		return;
-
-	if (!size)
-		return;
-
-	for_each_mapped_pfn(entry, pfn) {
-		page = pfn_to_page(pfn);
-		if (dax_mapping_is_cow(page->mapping)) {
-			/* keep the CoW flag if this page is still shared */
-			if (page->index-- > 0)
-				continue;
-		} else {
-			WARN_ON_ONCE(trunc && !dax_is_zapped(entry));
-			WARN_ON_ONCE(trunc && !dax_page_idle(page));
-			WARN_ON_ONCE(page->mapping && page->mapping != mapping);
-		}
-		page->mapping = NULL;
-		page->index = 0;
-	}
-
-	if (trunc && !dax_mapping_is_cow(page->mapping)) {
-		page = pfn_to_page(dax_to_pfn(entry));
-		put_dev_pagemap_many(page->pgmap, PHYS_PFN(size));
-	}
-}
-
-/*
- * dax_lock_page - Lock the DAX entry corresponding to a page
- * @page: The page whose entry we want to lock
- *
- * Context: Process context.
- * Return: A cookie to pass to dax_unlock_page() or 0 if the entry could
- * not be locked.
- */
-dax_entry_t dax_lock_page(struct page *page)
-{
-	XA_STATE(xas, NULL, 0);
-	void *entry;
-
-	/* Ensure page->mapping isn't freed while we look at it */
-	rcu_read_lock();
-	for (;;) {
-		struct address_space *mapping = READ_ONCE(page->mapping);
-
-		entry = NULL;
-		if (!mapping || !dax_mapping(mapping))
-			break;
-
-		/*
-		 * In the device-dax case there's no need to lock, a
-		 * struct dev_pagemap pin is sufficient to keep the
-		 * inode alive, and we assume we have dev_pagemap pin
-		 * otherwise we would not have a valid pfn_to_page()
-		 * translation.
-		 */
-		entry = (void *)~0UL;
-		if (S_ISCHR(mapping->host->i_mode))
-			break;
-
-		xas.xa = &mapping->i_pages;
-		xas_lock_irq(&xas);
-		if (mapping != page->mapping) {
-			xas_unlock_irq(&xas);
-			continue;
-		}
-		xas_set(&xas, page->index);
-		entry = xas_load(&xas);
-		if (dax_is_locked(entry)) {
-			rcu_read_unlock();
-			wait_entry_unlocked(&xas, entry);
-			rcu_read_lock();
-			continue;
-		}
-		dax_lock_entry(&xas, entry);
-		xas_unlock_irq(&xas);
-		break;
-	}
-	rcu_read_unlock();
-	return (dax_entry_t)entry;
-}
-
-void dax_unlock_page(struct page *page, dax_entry_t cookie)
-{
-	struct address_space *mapping = page->mapping;
-	XA_STATE(xas, &mapping->i_pages, page->index);
-
-	if (S_ISCHR(mapping->host->i_mode))
-		return;
-
-	dax_unlock_entry(&xas, (void *)cookie);
-}
-
-/*
- * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping
- * @mapping: the file's mapping whose entry we want to lock
- * @index: the offset within this file
- * @page: output the dax page corresponding to this dax entry
- *
- * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry
- * could not be locked.
- */
-dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t index,
-		struct page **page)
-{
-	XA_STATE(xas, NULL, 0);
-	void *entry;
-
-	rcu_read_lock();
-	for (;;) {
-		entry = NULL;
-		if (!dax_mapping(mapping))
-			break;
-
-		xas.xa = &mapping->i_pages;
-		xas_lock_irq(&xas);
-		xas_set(&xas, index);
-		entry = xas_load(&xas);
-		if (dax_is_locked(entry)) {
-			rcu_read_unlock();
-			wait_entry_unlocked(&xas, entry);
-			rcu_read_lock();
-			continue;
-		}
-		if (!entry ||
-		    dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
-			/*
-			 * Because we are looking for entry from file's mapping
-			 * and index, so the entry may not be inserted for now,
-			 * or even a zero/empty entry.  We don't think this is
-			 * an error case.  So, return a special value and do
-			 * not output @page.
-			 */
-			entry = (void *)~0UL;
-		} else {
-			*page = pfn_to_page(dax_to_pfn(entry));
-			dax_lock_entry(&xas, entry);
-		}
-		xas_unlock_irq(&xas);
-		break;
-	}
-	rcu_read_unlock();
-	return (dax_entry_t)entry;
-}
-
-void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index,
-		dax_entry_t cookie)
-{
-	XA_STATE(xas, &mapping->i_pages, index);
-
-	if (cookie == ~0UL)
-		return;
-
-	dax_unlock_entry(&xas, (void *)cookie);
-}
-
-/*
- * Find page cache entry at given index. If it is a DAX entry, return it
- * with the entry locked. If the page cache doesn't contain an entry at
- * that index, add a locked empty entry.
- *
- * When requesting an entry with size DAX_PMD, grab_mapping_entry() will
- * either return that locked entry or will return VM_FAULT_FALLBACK.
- * This will happen if there are any PTE entries within the PMD range
- * that we are requesting.
- *
- * We always favor PTE entries over PMD entries. There isn't a flow where we
- * evict PTE entries in order to 'upgrade' them to a PMD entry.  A PMD
- * insertion will fail if it finds any PTE entries already in the tree, and a
- * PTE insertion will cause an existing PMD entry to be unmapped and
- * downgraded to PTE entries.  This happens for both PMD zero pages as
- * well as PMD empty entries.
- *
- * The exception to this downgrade path is for PMD entries that have
- * real storage backing them.  We will leave these real PMD entries in
- * the tree, and PTE writes will simply dirty the entire PMD entry.
- *
- * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
- * persistent memory the benefit is doubtful. We can add that later if we can
- * show it helps.
- *
- * On error, this function does not return an ERR_PTR.  Instead it returns
- * a VM_FAULT code, encoded as an xarray internal entry.  The ERR_PTR values
- * overlap with xarray value entries.
- */
-static void *grab_mapping_entry(struct xa_state *xas,
-		struct address_space *mapping, unsigned int order)
-{
-	unsigned long index = xas->xa_index;
-	bool pmd_downgrade;	/* splitting PMD entry into PTE entries? */
-	void *entry;
-
-retry:
-	pmd_downgrade = false;
-	xas_lock_irq(xas);
-	entry = get_unlocked_entry(xas, order);
-
-	if (entry) {
-		if (dax_is_conflict(entry))
-			goto fallback;
-		if (!xa_is_value(entry)) {
-			xas_set_err(xas, -EIO);
-			goto out_unlock;
-		}
-
-		if (order == 0) {
-			if (dax_is_pmd_entry(entry) &&
-			    (dax_is_zero_entry(entry) ||
-			     dax_is_empty_entry(entry))) {
-				pmd_downgrade = true;
-			}
-		}
-	}
-
-	if (pmd_downgrade) {
-		/*
-		 * Make sure 'entry' remains valid while we drop
-		 * the i_pages lock.
-		 */
-		dax_lock_entry(xas, entry);
-
-		/*
-		 * Besides huge zero pages the only other thing that gets
-		 * downgraded are empty entries which don't need to be
-		 * unmapped.
-		 */
-		if (dax_is_zero_entry(entry)) {
-			xas_unlock_irq(xas);
-			unmap_mapping_pages(mapping,
-					xas->xa_index & ~PG_PMD_COLOUR,
-					PG_PMD_NR, false);
-			xas_reset(xas);
-			xas_lock_irq(xas);
-		}
-
-		dax_disassociate_entry(entry, mapping, false);
-		xas_store(xas, NULL);	/* undo the PMD join */
-		dax_wake_entry(xas, entry, WAKE_ALL);
-		mapping->nrpages -= PG_PMD_NR;
-		entry = NULL;
-		xas_set(xas, index);
-	}
-
-	if (entry) {
-		dax_lock_entry(xas, entry);
-	} else {
-		unsigned long flags = DAX_EMPTY;
-
-		if (order > 0)
-			flags |= DAX_PMD;
-		entry = dax_make_entry(pfn_to_pfn_t(0), flags);
-		dax_lock_entry(xas, entry);
-		if (xas_error(xas))
-			goto out_unlock;
-		mapping->nrpages += 1UL << order;
-	}
-
-out_unlock:
-	xas_unlock_irq(xas);
-	if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
-		goto retry;
-	if (xas->xa_node == XA_ERROR(-ENOMEM))
-		return xa_mk_internal(VM_FAULT_OOM);
-	if (xas_error(xas))
-		return xa_mk_internal(VM_FAULT_SIGBUS);
-	return entry;
-fallback:
-	xas_unlock_irq(xas);
-	return xa_mk_internal(VM_FAULT_FALLBACK);
-}
-
-static void *dax_zap_entry(struct xa_state *xas, void *entry)
-{
-	unsigned long v = xa_to_value(entry);
-
-	return xas_store(xas, xa_mk_value(v | DAX_ZAP));
-}
-
-/**
- * Return NULL if the entry is zapped and all pages in the entry are
- * idle, otherwise return the non-idle page in the entry
- */
-static struct page *dax_zap_pages(struct xa_state *xas, void *entry)
-{
-	struct page *ret = NULL;
-	unsigned long pfn;
-	bool zap;
-
-	if (!dax_entry_size(entry))
-		return NULL;
-
-	zap = !dax_is_zapped(entry);
-
-	for_each_mapped_pfn(entry, pfn) {
-		struct page *page = pfn_to_page(pfn);
-
-		if (zap)
-			page_ref_dec(page);
-
-		if (!ret && !dax_page_idle(page))
-			ret = page;
-	}
-
-	if (zap)
-		dax_zap_entry(xas, entry);
-
-	return ret;
-}
-
-/**
- * dax_zap_mappings_range - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
- * @start: Starting offset. Page containing 'start' is included.
- * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
- *       pages from 'start' till the end of file are included.
- *
- * DAX requires ZONE_DEVICE mapped pages. These pages are never
- * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
- * any page in the mapping is busy, i.e. for DMA, or other
- * get_user_pages() usages.
- *
- * It is expected that the filesystem is holding locks to block the
- * establishment of new mappings in this address_space. I.e. it expects
- * to be able to run unmap_mapping_range() and subsequently not race
- * mapping_mapped() becoming true.
- */
-struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
-				    loff_t end)
-{
-	void *entry;
-	unsigned int scanned = 0;
-	struct page *page = NULL;
-	pgoff_t start_idx = start >> PAGE_SHIFT;
-	pgoff_t end_idx;
-	XA_STATE(xas, &mapping->i_pages, start_idx);
-
-	/*
-	 * In the 'limited' case get_user_pages() for dax is disabled.
-	 */
-	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-		return NULL;
-
-	if (!dax_mapping(mapping))
-		return NULL;
-
-	/* If end == LLONG_MAX, all pages from start to till end of file */
-	if (end == LLONG_MAX)
-		end_idx = ULONG_MAX;
-	else
-		end_idx = end >> PAGE_SHIFT;
-	/*
-	 * If we race get_user_pages_fast() here either we'll see the
-	 * elevated page count in the iteration and wait, or
-	 * get_user_pages_fast() will see that the page it took a reference
-	 * against is no longer mapped in the page tables and bail to the
-	 * get_user_pages() slow path.  The slow path is protected by
-	 * pte_lock() and pmd_lock(). New references are not taken without
-	 * holding those locks, and unmap_mapping_pages() will not zero the
-	 * pte or pmd without holding the respective lock, so we are
-	 * guaranteed to either see new references or prevent new
-	 * references from being established.
-	 */
-	unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
-
-	xas_lock_irq(&xas);
-	xas_for_each(&xas, entry, end_idx) {
-		if (WARN_ON_ONCE(!xa_is_value(entry)))
-			continue;
-		if (unlikely(dax_is_locked(entry)))
-			entry = get_unlocked_entry(&xas, 0);
-		if (entry)
-			page = dax_zap_pages(&xas, entry);
-		put_unlocked_entry(&xas, entry, WAKE_NEXT);
-		if (page)
-			break;
-		if (++scanned % XA_CHECK_SCHED)
-			continue;
-
-		xas_pause(&xas);
-		xas_unlock_irq(&xas);
-		cond_resched();
-		xas_lock_irq(&xas);
-	}
-	xas_unlock_irq(&xas);
-	return page;
-}
-EXPORT_SYMBOL_GPL(dax_zap_mappings_range);
-
-struct page *dax_zap_mappings(struct address_space *mapping)
-{
-	return dax_zap_mappings_range(mapping, 0, LLONG_MAX);
-}
-EXPORT_SYMBOL_GPL(dax_zap_mappings);
-
-static int __dax_invalidate_entry(struct address_space *mapping,
-					  pgoff_t index, bool trunc)
-{
-	XA_STATE(xas, &mapping->i_pages, index);
-	int ret = 0;
-	void *entry;
-
-	xas_lock_irq(&xas);
-	entry = get_unlocked_entry(&xas, 0);
-	if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
-		goto out;
-	if (!trunc &&
-	    (xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) ||
-	     xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE)))
-		goto out;
-	dax_disassociate_entry(entry, mapping, trunc);
-	xas_store(&xas, NULL);
-	mapping->nrpages -= 1UL << dax_entry_order(entry);
-	ret = 1;
-out:
-	put_unlocked_entry(&xas, entry, WAKE_ALL);
-	xas_unlock_irq(&xas);
-	return ret;
-}
-
-/*
- * Delete DAX entry at @index from @mapping.  Wait for it
- * to be unlocked before deleting it.
- */
-int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
-{
-	int ret = __dax_invalidate_entry(mapping, index, true);
-
-	/*
-	 * This gets called from truncate / punch_hole path. As such, the caller
-	 * must hold locks protecting against concurrent modifications of the
-	 * page cache (usually fs-private i_mmap_sem for writing). Since the
-	 * caller has seen a DAX entry for this index, we better find it
-	 * at that index as well...
-	 */
-	WARN_ON_ONCE(!ret);
-	return ret;
-}
-
-/*
- * Invalidate DAX entry if it is clean.
- */
-int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
-				      pgoff_t index)
-{
-	return __dax_invalidate_entry(mapping, index, false);
-}
-
 static pgoff_t dax_iomap_pgoff(const struct iomap *iomap, loff_t pos)
 {
 	return PHYS_PFN(iomap->addr + (pos & PAGE_MASK) - iomap->offset);
@@ -894,195 +55,6 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter *iter
 	return 0;
 }
 
-/*
- * MAP_SYNC on a dax mapping guarantees dirty metadata is
- * flushed on write-faults (non-cow), but not read-faults.
- */
-static bool dax_fault_is_synchronous(const struct iomap_iter *iter,
-		struct vm_area_struct *vma)
-{
-	return (iter->flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC) &&
-		(iter->iomap.flags & IOMAP_F_DIRTY);
-}
-
-static bool dax_fault_is_cow(const struct iomap_iter *iter)
-{
-	return (iter->flags & IOMAP_WRITE) &&
-		(iter->iomap.flags & IOMAP_F_SHARED);
-}
-
-static unsigned long dax_iter_flags(const struct iomap_iter *iter,
-				    struct vm_fault *vmf)
-{
-	unsigned long flags = 0;
-
-	if (!dax_fault_is_synchronous(iter, vmf->vma))
-		flags |= DAX_DIRTY;
-
-	if (dax_fault_is_cow(iter))
-		flags |= DAX_COW;
-
-	return flags;
-}
-
-/*
- * By this point grab_mapping_entry() has ensured that we have a locked entry
- * of the appropriate size so we don't have to worry about downgrading PMDs to
- * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
- * already in the tree, we will skip the insertion and just dirty the PMD as
- * appropriate.
- */
-static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
-				   void **pentry, pfn_t pfn,
-				   unsigned long flags)
-{
-	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
-	void *new_entry = dax_make_entry(pfn, flags);
-	bool dirty = flags & DAX_DIRTY;
-	bool cow = flags & DAX_COW;
-	void *entry = *pentry;
-
-	if (dirty)
-		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
-	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
-		unsigned long index = xas->xa_index;
-		/* we are replacing a zero page with block mapping */
-		if (dax_is_pmd_entry(entry))
-			unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
-					PG_PMD_NR, false);
-		else /* pte entry */
-			unmap_mapping_pages(mapping, index, 1, false);
-	}
-
-	xas_reset(xas);
-	xas_lock_irq(xas);
-	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
-		void *old;
-
-		dax_disassociate_entry(entry, mapping, false);
-		dax_associate_entry(new_entry, mapping, vmf, flags);
-		/*
-		 * Only swap our new entry into the page cache if the current
-		 * entry is a zero page or an empty entry.  If a normal PTE or
-		 * PMD entry is already in the cache, we leave it alone.  This
-		 * means that if we are trying to insert a PTE and the
-		 * existing entry is a PMD, we will just leave the PMD in the
-		 * tree and dirty it if necessary.
-		 */
-		old = dax_lock_entry(xas, new_entry);
-		WARN_ON_ONCE(old != xa_mk_value(xa_to_value(entry) |
-					DAX_LOCKED));
-		entry = new_entry;
-	} else {
-		xas_load(xas);	/* Walk the xa_state */
-	}
-
-	if (dirty)
-		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
-
-	if (cow)
-		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
-
-	xas_unlock_irq(xas);
-	*pentry = entry;
-	return 0;
-}
-
-static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
-		struct address_space *mapping, void *entry)
-{
-	unsigned long pfn, index, count, end;
-	long ret = 0;
-	struct vm_area_struct *vma;
-
-	/*
-	 * A page got tagged dirty in DAX mapping? Something is seriously
-	 * wrong.
-	 */
-	if (WARN_ON(!xa_is_value(entry)))
-		return -EIO;
-
-	if (unlikely(dax_is_locked(entry))) {
-		void *old_entry = entry;
-
-		entry = get_unlocked_entry(xas, 0);
-
-		/* Entry got punched out / reallocated? */
-		if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
-			goto put_unlocked;
-		/*
-		 * Entry got reallocated elsewhere? No need to writeback.
-		 * We have to compare pfns as we must not bail out due to
-		 * difference in lockbit or entry type.
-		 */
-		if (dax_to_pfn(old_entry) != dax_to_pfn(entry))
-			goto put_unlocked;
-		if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
-					dax_is_zero_entry(entry))) {
-			ret = -EIO;
-			goto put_unlocked;
-		}
-
-		/* Another fsync thread may have already done this entry */
-		if (!xas_get_mark(xas, PAGECACHE_TAG_TOWRITE))
-			goto put_unlocked;
-	}
-
-	/* Lock the entry to serialize with page faults */
-	dax_lock_entry(xas, entry);
-
-	/*
-	 * We can clear the tag now but we have to be careful so that concurrent
-	 * dax_writeback_one() calls for the same index cannot finish before we
-	 * actually flush the caches. This is achieved as the calls will look
-	 * at the entry only under the i_pages lock and once they do that
-	 * they will see the entry locked and wait for it to unlock.
-	 */
-	xas_clear_mark(xas, PAGECACHE_TAG_TOWRITE);
-	xas_unlock_irq(xas);
-
-	/*
-	 * If dax_writeback_mapping_range() was given a wbc->range_start
-	 * in the middle of a PMD, the 'index' we use needs to be
-	 * aligned to the start of the PMD.
-	 * This allows us to flush for PMD_SIZE and not have to worry about
-	 * partial PMD writebacks.
-	 */
-	pfn = dax_to_pfn(entry);
-	count = 1UL << dax_entry_order(entry);
-	index = xas->xa_index & ~(count - 1);
-	end = index + count - 1;
-
-	/* Walk all mappings of a given index of a file and writeprotect them */
-	i_mmap_lock_read(mapping);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, end) {
-		pfn_mkclean_range(pfn, count, index, vma);
-		cond_resched();
-	}
-	i_mmap_unlock_read(mapping);
-
-	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
-	/*
-	 * After we have flushed the cache, we can clear the dirty tag. There
-	 * cannot be new dirty data in the pfn after the flush has completed as
-	 * the pfn mappings are writeprotected and fault waits for mapping
-	 * entry lock.
-	 */
-	xas_reset(xas);
-	xas_lock_irq(xas);
-	xas_store(xas, entry);
-	xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
-	dax_wake_entry(xas, entry, WAKE_NEXT);
-
-	trace_dax_writeback_one(mapping->host, index, count);
-	return ret;
-
- put_unlocked:
-	put_unlocked_entry(xas, entry, WAKE_NEXT);
-	return ret;
-}
-
 /*
  * Flush the mapping to the persistent domain within the byte range of [start,
  * end]. This is required by data integrity operations to ensure file data is
@@ -1219,6 +191,37 @@ static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
 	return 0;
 }
 
+/*
+ * MAP_SYNC on a dax mapping guarantees dirty metadata is
+ * flushed on write-faults (non-cow), but not read-faults.
+ */
+static bool dax_fault_is_synchronous(const struct iomap_iter *iter,
+				     struct vm_area_struct *vma)
+{
+	return (iter->flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC) &&
+	       (iter->iomap.flags & IOMAP_F_DIRTY);
+}
+
+static bool dax_fault_is_cow(const struct iomap_iter *iter)
+{
+	return (iter->flags & IOMAP_WRITE) &&
+	       (iter->iomap.flags & IOMAP_F_SHARED);
+}
+
+static unsigned long dax_iter_flags(const struct iomap_iter *iter,
+				    struct vm_fault *vmf)
+{
+	unsigned long flags = 0;
+
+	if (!dax_fault_is_synchronous(iter, vmf->vma))
+		flags |= DAX_DIRTY;
+
+	if (dax_fault_is_cow(iter))
+		flags |= DAX_COW;
+
+	return flags;
+}
+
 /*
  * The user has performed a load from a hole in the file.  Allocating a new
  * page in the file would cause excessive storage usage for workloads with
@@ -1701,7 +704,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
 		iter.flags |= IOMAP_WRITE;
 
-	entry = grab_mapping_entry(&xas, mapping, 0);
+	entry = dax_grab_mapping_entry(&xas, mapping, 0);
 	if (xa_is_internal(entry)) {
 		ret = xa_to_internal(entry);
 		goto out;
@@ -1818,12 +821,12 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 		goto fallback;
 
 	/*
-	 * grab_mapping_entry() will make sure we get an empty PMD entry,
+	 * dax_grab_mapping_entry() will make sure we get an empty PMD entry,
 	 * a zero PMD entry or a DAX PMD.  If it can't (because a PTE
 	 * entry is already in the array, for instance), it will return
 	 * VM_FAULT_FALLBACK.
 	 */
-	entry = grab_mapping_entry(&xas, mapping, PMD_ORDER);
+	entry = dax_grab_mapping_entry(&xas, mapping, PMD_ORDER);
 	if (xa_is_internal(entry)) {
 		ret = xa_to_internal(entry);
 		goto fallback;
@@ -1897,50 +900,6 @@ vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);
 
-/*
- * dax_insert_pfn_mkwrite - insert PTE or PMD entry into page tables
- * @vmf: The description of the fault
- * @pfn: PFN to insert
- * @order: Order of entry to insert.
- *
- * This function inserts a writeable PTE or PMD entry into the page tables
- * for an mmaped DAX file.  It also marks the page cache entry as dirty.
- */
-static vm_fault_t
-dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
-{
-	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
-	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, order);
-	void *entry;
-	vm_fault_t ret;
-
-	xas_lock_irq(&xas);
-	entry = get_unlocked_entry(&xas, order);
-	/* Did we race with someone splitting entry or so? */
-	if (!entry || dax_is_conflict(entry) ||
-	    (order == 0 && !dax_is_pte_entry(entry))) {
-		put_unlocked_entry(&xas, entry, WAKE_NEXT);
-		xas_unlock_irq(&xas);
-		trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
-						      VM_FAULT_NOPAGE);
-		return VM_FAULT_NOPAGE;
-	}
-	xas_set_mark(&xas, PAGECACHE_TAG_DIRTY);
-	dax_lock_entry(&xas, entry);
-	xas_unlock_irq(&xas);
-	if (order == 0)
-		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
-#ifdef CONFIG_FS_DAX_PMD
-	else if (order == PMD_ORDER)
-		ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
-#endif
-	else
-		ret = VM_FAULT_FALLBACK;
-	dax_unlock_entry(&xas, entry);
-	trace_dax_insert_pfn_mkwrite(mapping->host, vmf, ret);
-	return ret;
-}
-
 /**
  * dax_finish_sync_fault - finish synchronous page fault
  * @vmf: The description of the fault
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f6acb4ed73cb..de60a34088bb 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -157,15 +157,33 @@ static inline void fs_put_dax(struct dax_device *dax_dev, void *holder)
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct dax_device *dax_dev, struct writeback_control *wbc);
 
-struct page *dax_zap_mappings(struct address_space *mapping);
-struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
-				    loff_t end);
+#else
+static inline int dax_writeback_mapping_range(struct address_space *mapping,
+		struct dax_device *dax_dev, struct writeback_control *wbc)
+{
+	return -EOPNOTSUPP;
+}
+
+#endif
+
+int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
+		const struct iomap_ops *ops);
+int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
+		const struct iomap_ops *ops);
+
+#if IS_ENABLED(CONFIG_DAX)
+int dax_read_lock(void);
+void dax_read_unlock(int id);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
+void run_dax(struct dax_device *dax_dev);
 dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
 		unsigned long index, struct page **page);
 void dax_unlock_mapping_entry(struct address_space *mapping,
 		unsigned long index, dax_entry_t cookie);
+struct page *dax_zap_mappings(struct address_space *mapping);
+struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
+				    loff_t end);
 #else
 static inline struct page *dax_zap_mappings(struct address_space *mapping)
 {
@@ -179,12 +197,6 @@ static inline struct page *dax_zap_mappings_range(struct address_space *mapping,
 	return NULL;
 }
 
-static inline int dax_writeback_mapping_range(struct address_space *mapping,
-		struct dax_device *dax_dev, struct writeback_control *wbc)
-{
-	return -EOPNOTSUPP;
-}
-
 static inline dax_entry_t dax_lock_page(struct page *page)
 {
 	if (IS_DAX(page->mapping->host))
@@ -196,6 +208,15 @@ static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
 {
 }
 
+static inline int dax_read_lock(void)
+{
+	return 0;
+}
+
+static inline void dax_read_unlock(int id)
+{
+}
+
 static inline dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
 		unsigned long index, struct page **page)
 {
@@ -208,11 +229,6 @@ static inline void dax_unlock_mapping_entry(struct address_space *mapping,
 }
 #endif
 
-int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
-		const struct iomap_ops *ops);
-int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops);
-
 /*
  * Document all the code locations that want know when a dax page is
  * unreferenced.
@@ -222,19 +238,6 @@ static inline bool dax_page_idle(struct page *page)
 	return page_ref_count(page) == 1;
 }
 
-#if IS_ENABLED(CONFIG_DAX)
-int dax_read_lock(void);
-void dax_read_unlock(int id);
-#else
-static inline int dax_read_lock(void)
-{
-	return 0;
-}
-
-static inline void dax_read_unlock(int id)
-{
-}
-#endif /* CONFIG_DAX */
 bool dax_alive(struct dax_device *dax_dev);
 void *dax_get_private(struct dax_device *dax_dev);
 long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
@@ -255,6 +258,9 @@ vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 		    pfn_t *pfnp, int *errp, const struct iomap_ops *ops);
 vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 		enum page_entry_size pe_size, pfn_t pfn);
+void *dax_grab_mapping_entry(struct xa_state *xas,
+			     struct address_space *mapping, unsigned int order);
+void dax_unlock_entry(struct xa_state *xas, void *entry);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
@@ -271,6 +277,56 @@ static inline bool dax_mapping(struct address_space *mapping)
 	return mapping->host && IS_DAX(mapping->host);
 }
 
+/*
+ * DAX pagecache entries use XArray value entries so they can't be mistaken
+ * for pages.  We use one bit for locking, one bit for the entry size (PMD)
+ * and two more to tell us if the entry is a zero page or an empty entry that
+ * is just used for locking.  In total four special bits.
+ *
+ * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE
+ * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
+ * block allocation.
+ */
+#define DAX_SHIFT	(5)
+#define DAX_MASK	((1UL << DAX_SHIFT) - 1)
+#define DAX_LOCKED	(1UL << 0)
+#define DAX_PMD		(1UL << 1)
+#define DAX_ZERO_PAGE	(1UL << 2)
+#define DAX_EMPTY	(1UL << 3)
+#define DAX_ZAP		(1UL << 4)
+
+/*
+ * These flags are not conveyed in Xarray value entries, they are just
+ * modifiers to dax_insert_entry().
+ */
+#define DAX_DIRTY (1UL << (DAX_SHIFT + 0))
+#define DAX_COW   (1UL << (DAX_SHIFT + 1))
+
+vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
+			    void **pentry, pfn_t pfn, unsigned long flags);
+vm_fault_t dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn,
+				  unsigned int order);
+int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
+		      struct address_space *mapping, void *entry);
+
+/* The 'colour' (ie low bits) within a PMD of a page offset.  */
+#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1)
+#define PG_PMD_NR (PMD_SIZE >> PAGE_SHIFT)
+
+/* The order of a PMD entry */
+#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
+
+static inline unsigned int pe_order(enum page_entry_size pe_size)
+{
+	if (pe_size == PE_SIZE_PTE)
+		return PAGE_SHIFT - PAGE_SHIFT;
+	if (pe_size == PE_SIZE_PMD)
+		return PMD_SHIFT - PAGE_SHIFT;
+	if (pe_size == PE_SIZE_PUD)
+		return PUD_SHIFT - PAGE_SHIFT;
+	return ~0;
+}
+
 #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
 void hmem_register_device(int target_nid, struct resource *r);
 #else
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index fd57407e7f3d..e5d30eec3bf1 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -221,6 +221,12 @@ static inline void devm_memunmap_pages(struct device *dev,
 {
 }
 
+static inline struct dev_pagemap *
+get_dev_pagemap_many(unsigned long pfn, struct dev_pagemap *pgmap, int refs)
+{
+	return NULL;
+}
+
 static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 		struct dev_pagemap *pgmap)
 {


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 13/18] dax: Prep mapping helpers for compound pages
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (11 preceding siblings ...)
  2022-09-16  3:36 ` [PATCH v2 12/18] devdax: Move address_space helpers to the DAX core Dan Williams
@ 2022-09-16  3:36 ` Dan Williams
  2022-09-21 14:06   ` Jason Gunthorpe
  2022-09-16  3:36 ` [PATCH v2 14/18] devdax: add PUD support to the DAX mapping infrastructure Dan Williams
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:36 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

In preparation for device-dax to use the same mapping machinery as
fsdax, add support for device-dax compound pages.

Presently, compound pages are handled by dax_set_mapping(), which is
careful to update page->mapping only for head pages. However, it does
that by looking at properties of the 'struct dev_dax' instance
associated with the page. Switch to checking PageHead() directly in the
functions that iterate over the pages of a large mapping.
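
For reference, the iteration pattern that dax_associate_entry(),
dax_disassociate_entry() and dax_zap_pages() end up with looks roughly
like the sketch below (the wrapper function is illustrative only and is
not part of the patch):

    static void dax_walk_mapped_pages(void *entry)
    {
            unsigned long pfn;

            for_each_mapped_pfn(entry, pfn) {
                    struct page *page = compound_head(pfn_to_page(pfn));

                    /* ... update page->mapping, page->index, _refcount ... */

                    /* a compound head covers the whole mapping, stop early */
                    if (PageHead(page))
                            break;
            }
    }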

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/Kconfig   |    1 +
 drivers/dax/mapping.c |   16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 205e9dda8928..2eddd32c51f4 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -9,6 +9,7 @@ if DAX
 config DEV_DAX
 	tristate "Device DAX: direct access mapping device"
 	depends on TRANSPARENT_HUGEPAGE
+	depends on !FS_DAX_LIMITED
 	help
 	  Support raw access to differentiated (persistence, bandwidth,
 	  latency...) memory via an mmap(2) capable character
diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
index 70576aa02148..5d4b9601f183 100644
--- a/drivers/dax/mapping.c
+++ b/drivers/dax/mapping.c
@@ -345,6 +345,8 @@ static vm_fault_t dax_associate_entry(void *entry,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
+		page = compound_head(page);
+
 		if (flags & DAX_COW) {
 			dax_mapping_set_cow(page);
 		} else {
@@ -353,6 +355,9 @@ static vm_fault_t dax_associate_entry(void *entry,
 			page->index = index + i++;
 			page_ref_inc(page);
 		}
+
+		if (PageHead(page))
+			break;
 	}
 
 	return 0;
@@ -372,6 +377,9 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 
 	for_each_mapped_pfn(entry, pfn) {
 		page = pfn_to_page(pfn);
+
+		page = compound_head(page);
+
 		if (dax_mapping_is_cow(page->mapping)) {
 			/* keep the CoW flag if this page is still shared */
 			if (page->index-- > 0)
@@ -383,6 +391,9 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		}
 		page->mapping = NULL;
 		page->index = 0;
+
+		if (PageHead(page))
+			break;
 	}
 
 	if (trunc && !dax_mapping_is_cow(page->mapping)) {
@@ -660,11 +671,16 @@ static struct page *dax_zap_pages(struct xa_state *xas, void *entry)
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
+		page = compound_head(page);
+
 		if (zap)
 			page_ref_dec(page);
 
 		if (!ret && !dax_page_idle(page))
 			ret = page;
+
+		if (PageHead(page))
+			break;
 	}
 
 	if (zap)


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 14/18] devdax: add PUD support to the DAX mapping infrastructure
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (12 preceding siblings ...)
  2022-09-16  3:36 ` [PATCH v2 13/18] dax: Prep mapping helpers for compound pages Dan Williams
@ 2022-09-16  3:36 ` Dan Williams
  2022-09-16  3:36 ` [PATCH v2 15/18] devdax: Use dax_insert_entry() + dax_delete_mapping_entry() Dan Williams
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:36 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

In preparation for using the DAX mapping infrastructure for device-dax,
update the helpers to handle PUD entries.

In practice, the code related to @size_downgrade will go unused for PUD
entries, since only devdax creates DAX PUD entries and devdax enforces
aligned mappings. The conversion is included for completeness.

The addition of PUD support to dax_insert_pfn_mkwrite() requires a new
stub for vmf_insert_pfn_pud() in the
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=n case.
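
The mapping from fault order to Xarray entry-size flag that
dax_grab_mapping_entry() now performs can be summarized as follows
(illustrative helper, not introduced by this patch):

    static unsigned long dax_entry_flags_for_order(unsigned int order)
    {
            if (order == PUD_SHIFT - PAGE_SHIFT)
                    return DAX_PUD;
            if (order == PMD_SHIFT - PAGE_SHIFT)
                    return DAX_PMD;
            return 0; /* PTE-sized entry */
    }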

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/mapping.c   |   50 ++++++++++++++++++++++++++++++++++++-----------
 include/linux/dax.h     |   32 ++++++++++++++++++++----------
 include/linux/huge_mm.h |   11 ++++++++--
 3 files changed, 68 insertions(+), 25 deletions(-)

diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
index 5d4b9601f183..b5a5196f8831 100644
--- a/drivers/dax/mapping.c
+++ b/drivers/dax/mapping.c
@@ -13,6 +13,7 @@
 #include <linux/pfn_t.h>
 #include <linux/sizes.h>
 #include <linux/pagemap.h>
+#include <linux/huge_mm.h>
 
 #include "dax-private.h"
 
@@ -56,6 +57,8 @@ static bool dax_is_zapped(void *entry)
 
 static unsigned int dax_entry_order(void *entry)
 {
+	if (xa_to_value(entry) & DAX_PUD)
+		return PUD_ORDER;
 	if (xa_to_value(entry) & DAX_PMD)
 		return PMD_ORDER;
 	return 0;
@@ -66,9 +69,14 @@ static unsigned long dax_is_pmd_entry(void *entry)
 	return xa_to_value(entry) & DAX_PMD;
 }
 
+static unsigned long dax_is_pud_entry(void *entry)
+{
+	return xa_to_value(entry) & DAX_PUD;
+}
+
 static bool dax_is_pte_entry(void *entry)
 {
-	return !(xa_to_value(entry) & DAX_PMD);
+	return !(xa_to_value(entry) & (DAX_PMD|DAX_PUD));
 }
 
 static int dax_is_zero_entry(void *entry)
@@ -277,6 +285,8 @@ static unsigned long dax_entry_size(void *entry)
 		return 0;
 	else if (dax_is_pmd_entry(entry))
 		return PMD_SIZE;
+	else if (dax_is_pud_entry(entry))
+		return PUD_SIZE;
 	else
 		return PAGE_SIZE;
 }
@@ -564,11 +574,11 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 			     struct address_space *mapping, unsigned int order)
 {
 	unsigned long index = xas->xa_index;
-	bool pmd_downgrade; /* splitting PMD entry into PTE entries? */
+	bool size_downgrade; /* splitting entry into PTE entries? */
 	void *entry;
 
 retry:
-	pmd_downgrade = false;
+	size_downgrade = false;
 	xas_lock_irq(xas);
 	entry = get_unlocked_entry(xas, order);
 
@@ -581,15 +591,25 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 		}
 
 		if (order == 0) {
-			if (dax_is_pmd_entry(entry) &&
+			if (!dax_is_pte_entry(entry) &&
 			    (dax_is_zero_entry(entry) ||
 			     dax_is_empty_entry(entry))) {
-				pmd_downgrade = true;
+				size_downgrade = true;
 			}
 		}
 	}
 
-	if (pmd_downgrade) {
+	if (size_downgrade) {
+		unsigned long colour, nr;
+
+		if (dax_is_pmd_entry(entry)) {
+			colour = PG_PMD_COLOUR;
+			nr = PG_PMD_NR;
+		} else {
+			colour = PG_PUD_COLOUR;
+			nr = PG_PUD_NR;
+		}
+
 		/*
 		 * Make sure 'entry' remains valid while we drop
 		 * the i_pages lock.
@@ -603,9 +623,8 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 		 */
 		if (dax_is_zero_entry(entry)) {
 			xas_unlock_irq(xas);
-			unmap_mapping_pages(mapping,
-					    xas->xa_index & ~PG_PMD_COLOUR,
-					    PG_PMD_NR, false);
+			unmap_mapping_pages(mapping, xas->xa_index & ~colour,
+					    nr, false);
 			xas_reset(xas);
 			xas_lock_irq(xas);
 		}
@@ -613,7 +632,7 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 		dax_disassociate_entry(entry, mapping, false);
 		xas_store(xas, NULL); /* undo the PMD join */
 		dax_wake_entry(xas, entry, WAKE_ALL);
-		mapping->nrpages -= PG_PMD_NR;
+		mapping->nrpages -= nr;
 		entry = NULL;
 		xas_set(xas, index);
 	}
@@ -623,7 +642,9 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 	} else {
 		unsigned long flags = DAX_EMPTY;
 
-		if (order > 0)
+		if (order == PUD_SHIFT - PAGE_SHIFT)
+			flags |= DAX_PUD;
+		else if (order == PMD_SHIFT - PAGE_SHIFT)
 			flags |= DAX_PMD;
 		entry = dax_make_entry(pfn_to_pfn_t(0), flags);
 		dax_lock_entry(xas, entry);
@@ -846,7 +867,10 @@ vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
 		unsigned long index = xas->xa_index;
 		/* we are replacing a zero page with block mapping */
-		if (dax_is_pmd_entry(entry))
+		if (dax_is_pud_entry(entry))
+			unmap_mapping_pages(mapping, index & ~PG_PUD_COLOUR,
+					    PG_PUD_NR, false);
+		else if (dax_is_pmd_entry(entry))
 			unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
 					    PG_PMD_NR, false);
 		else /* pte entry */
@@ -1018,6 +1042,8 @@ vm_fault_t dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn,
 	else if (order == PMD_ORDER)
 		ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
 #endif
+	else if (order == PUD_ORDER)
+		ret = vmf_insert_pfn_pud(vmf, pfn, FAULT_FLAG_WRITE);
 	else
 		ret = VM_FAULT_FALLBACK;
 	dax_unlock_entry(&xas, entry);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index de60a34088bb..3a27fecf072a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -278,22 +278,25 @@ static inline bool dax_mapping(struct address_space *mapping)
 }
 
 /*
- * DAX pagecache entries use XArray value entries so they can't be mistaken
- * for pages.  We use one bit for locking, one bit for the entry size (PMD)
- * and two more to tell us if the entry is a zero page or an empty entry that
- * is just used for locking.  In total four special bits.
+ * DAX pagecache entries use XArray value entries so they can't be
+ * mistaken for pages.  We use one bit for locking, two bits for the
+ * entry size (PMD, PUD) and two more to tell us if the entry is a zero
+ * page or an empty entry that is just used for locking.  In total 5
+ * special bits which limits the max pfn that can be stored as:
+ * (1UL << 57 - PAGE_SHIFT). 63 - DAX_SHIFT - 1 (for xa_mk_value()).
  *
- * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE
- * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
- * block allocation.
+ * If the P{M,U}D bits are not set the entry has size PAGE_SIZE, and if
+ * the ZERO_PAGE and EMPTY bits aren't set the entry is a normal DAX
+ * entry with a filesystem block allocation.
  */
-#define DAX_SHIFT	(5)
+#define DAX_SHIFT	(6)
 #define DAX_MASK	((1UL << DAX_SHIFT) - 1)
 #define DAX_LOCKED	(1UL << 0)
 #define DAX_PMD		(1UL << 1)
-#define DAX_ZERO_PAGE	(1UL << 2)
-#define DAX_EMPTY	(1UL << 3)
-#define DAX_ZAP		(1UL << 4)
+#define DAX_PUD		(1UL << 2)
+#define DAX_ZERO_PAGE	(1UL << 3)
+#define DAX_EMPTY	(1UL << 4)
+#define DAX_ZAP		(1UL << 5)
 
 /*
  * These flags are not conveyed in Xarray value entries, they are just
@@ -316,6 +319,13 @@ int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 /* The order of a PMD entry */
 #define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
 
+/* The 'colour' (ie low bits) within a PUD of a page offset.  */
+#define PG_PUD_COLOUR ((PUD_SIZE >> PAGE_SHIFT) - 1)
+#define PG_PUD_NR (PUD_SIZE >> PAGE_SHIFT)
+
+/* The order of a PUD entry */
+#define PUD_ORDER (PUD_SHIFT - PAGE_SHIFT)
+
 static inline unsigned int pe_order(enum page_entry_size pe_size)
 {
 	if (pe_size == PE_SIZE_PTE)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 768e5261fdae..de73f5a16252 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -18,10 +18,19 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
+vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf, pfn_t pfn,
+				   pgprot_t pgprot, bool write);
 #else
 static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
 {
 }
+
+static inline vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf,
+						 pfn_t pfn, pgprot_t pgprot,
+						 bool write)
+{
+	return VM_FAULT_SIGBUS;
+}
 #endif
 
 vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf);
@@ -58,8 +67,6 @@ static inline vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn,
 {
 	return vmf_insert_pfn_pmd_prot(vmf, pfn, vmf->vma->vm_page_prot, write);
 }
-vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf, pfn_t pfn,
-				   pgprot_t pgprot, bool write);
 
 /**
  * vmf_insert_pfn_pud - insert a pud size pfn


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 15/18] devdax: Use dax_insert_entry() + dax_delete_mapping_entry()
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (13 preceding siblings ...)
  2022-09-16  3:36 ` [PATCH v2 14/18] devdax: add PUD support to the DAX mapping infrastructure Dan Williams
@ 2022-09-16  3:36 ` Dan Williams
  2022-09-21 14:10   ` Jason Gunthorpe
  2022-09-16  3:36 ` [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count Dan Williams
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:36 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Track entries and take pgmap references at mapping insertion time.
Revoke mappings (dax_zap_mappings()) and drop the associated pgmap
references at device destruction or inode eviction time. With this in
place, alongside the equivalent fsdax conversion, the gup code no longer
needs to consider PTE_DEVMAP as an indicator to get a pgmap reference
before taking a page reference.

In other words, GUP takes additional references on mapped pages. Until
now, DAX in all its forms has failed to take references at mapping
time. With that fixed, there is no longer a requirement for gup to
manage @pgmap references. However, that cleanup is saved for a follow-on
patch.
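
Condensed from the __dev_dax_p{te,md,ud}_fault() changes below, each
fault path now follows roughly this pattern (locals and error details
elided):

    entry = dax_grab_mapping_entry(&xas, mapping, order);
    if (xa_is_internal(entry))
            return xa_to_internal(entry);

    /* takes the page and pgmap references for this mapping */
    ret = dax_insert_entry(&xas, vmf, &entry, pfn, flags);
    dax_unlock_entry(&xas, entry);
    if (ret)
            return ret;

    /* finally install the pte/pmd/pud, e.g. vmf_insert_mixed() */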

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/bus.c     |   15 +++++++++-
 drivers/dax/device.c  |   73 +++++++++++++++++++++++++++++--------------------
 drivers/dax/mapping.c |    3 ++
 3 files changed, 60 insertions(+), 31 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1dad813ee4a6..35a319a76c82 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -382,9 +382,22 @@ void kill_dev_dax(struct dev_dax *dev_dax)
 {
 	struct dax_device *dax_dev = dev_dax->dax_dev;
 	struct inode *inode = dax_inode(dax_dev);
+	struct page *page;
 
 	kill_dax(dax_dev);
-	unmap_mapping_range(inode->i_mapping, 0, 0, 1);
+
+	/*
+	 * New mappings are blocked. Wait for all GUP users to release
+	 * their pins.
+	 */
+	do {
+		page = dax_zap_mappings(inode->i_mapping);
+		if (!page)
+			break;
+		__wait_var_event(page, dax_page_idle(page));
+	} while (true);
+
+	truncate_inode_pages(inode->i_mapping, 0);
 
 	/*
 	 * Dynamic dax region have the pgmap allocated via dev_kzalloc()
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 5494d745ced5..7f306939807e 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -73,38 +73,15 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 	return -1;
 }
 
-static void dax_set_mapping(struct vm_fault *vmf, pfn_t pfn,
-			      unsigned long fault_size)
-{
-	unsigned long i, nr_pages = fault_size / PAGE_SIZE;
-	struct file *filp = vmf->vma->vm_file;
-	struct dev_dax *dev_dax = filp->private_data;
-	pgoff_t pgoff;
-
-	/* mapping is only set on the head */
-	if (dev_dax->pgmap->vmemmap_shift)
-		nr_pages = 1;
-
-	pgoff = linear_page_index(vmf->vma,
-			ALIGN(vmf->address, fault_size));
-
-	for (i = 0; i < nr_pages; i++) {
-		struct page *page = pfn_to_page(pfn_t_to_pfn(pfn) + i);
-
-		page = compound_head(page);
-		if (page->mapping)
-			continue;
-
-		page->mapping = filp->f_mapping;
-		page->index = pgoff + i;
-	}
-}
-
 static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
 	struct device *dev = &dev_dax->dev;
 	phys_addr_t phys;
+	vm_fault_t ret;
+	void *entry;
 	pfn_t pfn;
 	unsigned int fault_size = PAGE_SIZE;
 
@@ -128,7 +105,16 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
 
-	dax_set_mapping(vmf, pfn, fault_size);
+	entry = dax_grab_mapping_entry(&xas, mapping, 0);
+	if (xa_is_internal(entry))
+		return xa_to_internal(entry);
+
+	ret = dax_insert_entry(&xas, vmf, &entry, pfn, 0);
+
+	dax_unlock_entry(&xas, entry);
+
+	if (ret)
+		return ret;
 
 	return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
 }
@@ -136,10 +122,14 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
+	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
 	struct device *dev = &dev_dax->dev;
 	phys_addr_t phys;
+	vm_fault_t ret;
 	pgoff_t pgoff;
+	void *entry;
 	pfn_t pfn;
 	unsigned int fault_size = PMD_SIZE;
 
@@ -171,7 +161,16 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
 
-	dax_set_mapping(vmf, pfn, fault_size);
+	entry = dax_grab_mapping_entry(&xas, mapping, PMD_ORDER);
+	if (xa_is_internal(entry))
+		return xa_to_internal(entry);
+
+	ret = dax_insert_entry(&xas, vmf, &entry, pfn, DAX_PMD);
+
+	dax_unlock_entry(&xas, entry);
+
+	if (ret)
+		return ret;
 
 	return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
 }
@@ -180,10 +179,14 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 				struct vm_fault *vmf)
 {
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	unsigned long pud_addr = vmf->address & PUD_MASK;
+	XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
 	struct device *dev = &dev_dax->dev;
 	phys_addr_t phys;
+	vm_fault_t ret;
 	pgoff_t pgoff;
+	void *entry;
 	pfn_t pfn;
 	unsigned int fault_size = PUD_SIZE;
 
@@ -216,7 +219,16 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 
 	pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
 
-	dax_set_mapping(vmf, pfn, fault_size);
+	entry = dax_grab_mapping_entry(&xas, mapping, PUD_ORDER);
+	if (xa_is_internal(entry))
+		return xa_to_internal(entry);
+
+	ret = dax_insert_entry(&xas, vmf, &entry, pfn, DAX_PUD);
+
+	dax_unlock_entry(&xas, entry);
+
+	if (ret)
+		return ret;
 
 	return vmf_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
 }
@@ -494,3 +506,4 @@ MODULE_LICENSE("GPL v2");
 module_init(dax_init);
 module_exit(dax_exit);
 MODULE_ALIAS_DAX_DEVICE(0);
+MODULE_IMPORT_NS(DAX);
diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
index b5a5196f8831..9981eebb2dc5 100644
--- a/drivers/dax/mapping.c
+++ b/drivers/dax/mapping.c
@@ -266,6 +266,7 @@ void dax_unlock_entry(struct xa_state *xas, void *entry)
 	WARN_ON(!dax_is_locked(old));
 	dax_wake_entry(xas, entry, WAKE_NEXT);
 }
+EXPORT_SYMBOL_NS_GPL(dax_unlock_entry, DAX);
 
 /*
  * Return: The entry stored at this location before it was locked.
@@ -666,6 +667,7 @@ void *dax_grab_mapping_entry(struct xa_state *xas,
 	xas_unlock_irq(xas);
 	return xa_mk_internal(VM_FAULT_FALLBACK);
 }
+EXPORT_SYMBOL_NS_GPL(dax_grab_mapping_entry, DAX);
 
 static void *dax_zap_entry(struct xa_state *xas, void *entry)
 {
@@ -910,6 +912,7 @@ vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
 	*pentry = entry;
 	return 0;
 }
+EXPORT_SYMBOL_NS_GPL(dax_insert_entry, DAX);
 
 int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 		      struct address_space *mapping, void *entry)


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (14 preceding siblings ...)
  2022-09-16  3:36 ` [PATCH v2 15/18] devdax: Use dax_insert_entry() + dax_delete_mapping_entry() Dan Williams
@ 2022-09-16  3:36 ` Dan Williams
  2022-09-21 15:24   ` Jason Gunthorpe
  2022-09-16  3:36 ` [PATCH v2 17/18] fsdax: Delete put_devmap_managed_page_refs() Dan Williams
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:36 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

The initial memremap_pages() implementation inherited the
__init_single_page() default of pages starting life with an elevated
reference count. This originally allowed for the page->pgmap pointer to
alias with the storage for page->lru since a page was only allowed to be
on an lru list when its reference count was zero.

Since then, 'struct page' definition cleanups have made dedicated space
for the ZONE_DEVICE page metadata, and the
MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
page->_refcount transition to route the page to free_zone_device_page()
rather than the core-mm page-free path. With those cleanups in place,
and with filesystem-dax and device-dax now converted to take and drop
references at map and truncate time, it is possible to start
MEMORY_DEVICE_FS_DAX and MEMORY_DEVICE_GENERIC reference counts at 0.

MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE
pages start life at _refcount 1, so make that the default if
pgmap->init_mode is left at zero.
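
A pagemap user opts in the same way the pmem and device-dax conversions
below do, sketched here for an fsdax-style caller:

    pgmap->type = MEMORY_DEVICE_FS_DAX;
    pgmap->init_mode = INIT_PAGEMAP_IDLE; /* pages start at _refcount 0 */
    pgmap->ops = &fsdax_pagemap_ops;
    addr = devm_memremap_pages(dev, pgmap);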

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/device.c     |    1 +
 drivers/nvdimm/pmem.c    |    2 ++
 include/linux/dax.h      |    2 +-
 include/linux/memremap.h |    5 +++++
 mm/memremap.c            |   15 ++++++++++-----
 mm/page_alloc.c          |    2 ++
 6 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 7f306939807e..8a7281d16c99 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -460,6 +460,7 @@ int dev_dax_probe(struct dev_dax *dev_dax)
 	}
 
 	pgmap->type = MEMORY_DEVICE_GENERIC;
+	pgmap->init_mode = INIT_PAGEMAP_IDLE;
 	if (dev_dax->align > PAGE_SIZE)
 		pgmap->vmemmap_shift =
 			order_base_2(dev_dax->align >> PAGE_SHIFT);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 7e88cd242380..9c98dcb9f33d 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -529,6 +529,7 @@ static int pmem_attach_disk(struct device *dev,
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
 		pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+		pmem->pgmap.init_mode = INIT_PAGEMAP_IDLE;
 		pmem->pgmap.ops = &fsdax_pagemap_ops;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
 		pfn_sb = nd_pfn->pfn_sb;
@@ -543,6 +544,7 @@ static int pmem_attach_disk(struct device *dev,
 		pmem->pgmap.range.end = res->end;
 		pmem->pgmap.nr_range = 1;
 		pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+		pmem->pgmap.init_mode = INIT_PAGEMAP_IDLE;
 		pmem->pgmap.ops = &fsdax_pagemap_ops;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
 		pmem->pfn_flags |= PFN_MAP;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3a27fecf072a..b9fdd8951e06 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -235,7 +235,7 @@ static inline void dax_unlock_mapping_entry(struct address_space *mapping,
  */
 static inline bool dax_page_idle(struct page *page)
 {
-	return page_ref_count(page) == 1;
+	return page_ref_count(page) == 0;
 }
 
 bool dax_alive(struct dax_device *dax_dev);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index e5d30eec3bf1..9f1a57efd371 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -116,6 +116,7 @@ struct dev_pagemap_ops {
  *	representation. A bigger value will set up compound struct pages
  *	of the requested order value.
  * @ops: method table
+ * @init_mode: initial reference count mode
  * @owner: an opaque pointer identifying the entity that manages this
  *	instance.  Used by various helpers to make sure that no
  *	foreign ZONE_DEVICE memory is accessed.
@@ -131,6 +132,10 @@ struct dev_pagemap {
 	unsigned int flags;
 	unsigned long vmemmap_shift;
 	const struct dev_pagemap_ops *ops;
+	enum {
+		INIT_PAGEMAP_BUSY = 0, /* default / historical */
+		INIT_PAGEMAP_IDLE,
+	} init_mode;
 	void *owner;
 	int nr_range;
 	union {
diff --git a/mm/memremap.c b/mm/memremap.c
index 83c5e6fafd84..b6a7a95339b3 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -467,8 +467,10 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap_many);
 
 void free_zone_device_page(struct page *page)
 {
-	if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free))
-		return;
+	struct dev_pagemap *pgmap = page->pgmap;
+
+	/* wake filesystem 'break dax layouts' waiters */
+	wake_up_var(page);
 
 	mem_cgroup_uncharge(page_folio(page));
 
@@ -503,12 +505,15 @@ void free_zone_device_page(struct page *page)
 	 * to clear page->mapping.
 	 */
 	page->mapping = NULL;
-	page->pgmap->ops->page_free(page);
+	if (pgmap->ops && pgmap->ops->page_free)
+		pgmap->ops->page_free(page);
 
 	/*
-	 * Reset the page count to 1 to prepare for handing out the page again.
+	 * Reset the page count to the @init_mode value to prepare for
+	 * handing out the page again.
 	 */
-	set_page_count(page, 1);
+	if (pgmap->init_mode == INIT_PAGEMAP_BUSY)
+		set_page_count(page, 1);
 }
 
 #ifdef CONFIG_FS_DAX
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..8ee52992055b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6719,6 +6719,8 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 {
 
 	__init_single_page(page, pfn, zone_idx, nid);
+	if (pgmap->init_mode == INIT_PAGEMAP_IDLE)
+		set_page_count(page, 0);
 
 	/*
 	 * Mark page reserved as it will need to wait for onlining


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 17/18] fsdax: Delete put_devmap_managed_page_refs()
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (15 preceding siblings ...)
  2022-09-16  3:36 ` [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count Dan Williams
@ 2022-09-16  3:36 ` Dan Williams
  2022-09-16  3:36 ` [PATCH v2 18/18] mm/gup: Drop DAX pgmap accounting Dan Williams
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:36 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Now that fsdax DMA-idle detection no longer depends on catching
transitions of page->_refcount to 1, remove
put_devmap_managed_page_refs() and associated infrastructure.
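
For context, the DMA-idle check now keys off a zero reference count (as
updated in the previous patch), so catching the 2 -> 1 transition in
put_devmap_managed_page_refs() no longer serves a purpose:

    static inline bool dax_page_idle(struct page *page)
    {
            return page_ref_count(page) == 0; /* previously == 1 */
    }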

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/mm.h |   30 ------------------------------
 mm/gup.c           |    6 ++----
 mm/memremap.c      |   18 ------------------
 mm/swap.c          |    2 --
 4 files changed, 2 insertions(+), 54 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3bedc449c14d..182fe336a268 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1048,30 +1048,6 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
  *   back into memory.
  */
 
-#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX)
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-bool __put_devmap_managed_page_refs(struct page *page, int refs);
-static inline bool put_devmap_managed_page_refs(struct page *page, int refs)
-{
-	if (!static_branch_unlikely(&devmap_managed_key))
-		return false;
-	if (!is_zone_device_page(page))
-		return false;
-	return __put_devmap_managed_page_refs(page, refs);
-}
-#else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
-static inline bool put_devmap_managed_page_refs(struct page *page, int refs)
-{
-	return false;
-}
-#endif /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
-
-static inline bool put_devmap_managed_page(struct page *page)
-{
-	return put_devmap_managed_page_refs(page, 1);
-}
-
 /* 127: arbitrary random number, small enough to assemble well */
 #define folio_ref_zero_or_close_to_overflow(folio) \
 	((unsigned int) folio_ref_count(folio) + 127u <= 127u)
@@ -1168,12 +1144,6 @@ static inline void put_page(struct page *page)
 {
 	struct folio *folio = page_folio(page);
 
-	/*
-	 * For some devmap managed pages we need to catch refcount transition
-	 * from 2 to 1:
-	 */
-	if (put_devmap_managed_page(&folio->page))
-		return;
 	folio_put(folio);
 }
 
diff --git a/mm/gup.c b/mm/gup.c
index 732825157430..c6d060dee9e0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -87,8 +87,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
 	 * belongs to this folio.
 	 */
 	if (unlikely(page_folio(page) != folio)) {
-		if (!put_devmap_managed_page_refs(&folio->page, refs))
-			folio_put_refs(folio, refs);
+		folio_put_refs(folio, refs);
 		goto retry;
 	}
 
@@ -177,8 +176,7 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 			refs *= GUP_PIN_COUNTING_BIAS;
 	}
 
-	if (!put_devmap_managed_page_refs(&folio->page, refs))
-		folio_put_refs(folio, refs);
+	folio_put_refs(folio, refs);
 }
 
 /**
diff --git a/mm/memremap.c b/mm/memremap.c
index b6a7a95339b3..0f4a2e20c159 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -515,21 +515,3 @@ void free_zone_device_page(struct page *page)
 	if (pgmap->init_mode == INIT_PAGEMAP_BUSY)
 		set_page_count(page, 1);
 }
-
-#ifdef CONFIG_FS_DAX
-bool __put_devmap_managed_page_refs(struct page *page, int refs)
-{
-	if (page->pgmap->type != MEMORY_DEVICE_FS_DAX)
-		return false;
-
-	/*
-	 * fsdax page refcounts are 1-based, rather than 0-based: if
-	 * refcount is 1, then the page is free and the refcount is
-	 * stable because nobody holds a reference on the page.
-	 */
-	if (page_ref_sub_return(page, refs) == 1)
-		wake_up_var(page);
-	return true;
-}
-EXPORT_SYMBOL(__put_devmap_managed_page_refs);
-#endif /* CONFIG_FS_DAX */
diff --git a/mm/swap.c b/mm/swap.c
index 9cee7f6a3809..b346dd24cde8 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -960,8 +960,6 @@ void release_pages(struct page **pages, int nr)
 				unlock_page_lruvec_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
-			if (put_devmap_managed_page(&folio->page))
-				continue;
 			if (folio_put_testzero(folio))
 				free_zone_device_page(&folio->page);
 			continue;


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH v2 18/18] mm/gup: Drop DAX pgmap accounting
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (16 preceding siblings ...)
  2022-09-16  3:36 ` [PATCH v2 17/18] fsdax: Delete put_devmap_managed_page_refs() Dan Williams
@ 2022-09-16  3:36 ` Dan Williams
  2022-09-20 14:29 ` [PATCH v2 00/18] Fix the DAX-gup mistake Jason Gunthorpe
  2022-11-09  0:20 ` Andrew Morton
  19 siblings, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-16  3:36 UTC (permalink / raw)
  To: akpm
  Cc: Matthew Wilcox, Jan Kara, Darrick J. Wong, Christoph Hellwig,
	John Hubbard, Jason Gunthorpe, linux-fsdevel, nvdimm, linux-xfs,
	linux-mm, linux-ext4

Now that pgmap accounting is handled at map time, it can be dropped from
gup time.

One hurdle remains: filesystem-DAX huge pages are not compound pages,
so infrastructure like __gup_device_huge_p{m,u}d() still needs to stick
around.

Additionally, ZONE_DEVICE pages with this change are still not suitable
to be returned from vm_normal_page(), so this cleanup is limited to
deleting pgmap reference manipulation. This is an incremental step on
the path to removing pte_devmap() altogether.

Note that follow_devmap_pmd() can be deleted entirely since a few added
pmd_devmap() checks allow the transparent huge page path to be reused.
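
After this change the only devmap-specific handling left in
gup_pte_range() is the FOLL_LONGTERM check, roughly:

    if (pte_devmap(pte)) {
            if (unlikely(flags & FOLL_LONGTERM))
                    goto pte_unmap;
            /* no get_dev_pagemap(); the mapping itself holds the pgmap pin */
    } else if (pte_special(pte))
            goto pte_unmap;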

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Reported-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/huge_mm.h |   12 +------
 mm/gup.c                |   83 +++++++++++------------------------------------
 mm/huge_memory.c        |   54 +------------------------------
 3 files changed, 22 insertions(+), 127 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index de73f5a16252..b8ed373c6090 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -263,10 +263,8 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
 	return folio_order(folio) >= HPAGE_PMD_ORDER;
 }
 
-struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
-		pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
-		pud_t *pud, int flags, struct dev_pagemap **pgmap);
+		pud_t *pud, int flags);
 
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
@@ -418,14 +416,8 @@ static inline void mm_put_huge_zero_page(struct mm_struct *mm)
 	return;
 }
 
-static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
-	unsigned long addr, pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
-{
-	return NULL;
-}
-
 static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
-	unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap)
+	unsigned long addr, pud_t *pud, int flags)
 {
 	return NULL;
 }
diff --git a/mm/gup.c b/mm/gup.c
index c6d060dee9e0..8e6dd4308e19 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -25,7 +25,6 @@
 #include "internal.h"
 
 struct follow_page_context {
-	struct dev_pagemap *pgmap;
 	unsigned int page_mask;
 };
 
@@ -487,8 +486,7 @@ static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd, unsigned int flags,
-		struct dev_pagemap **pgmap)
+		unsigned long address, pmd_t *pmd, unsigned int flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page;
@@ -532,17 +530,13 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 
 	page = vm_normal_page(vma, address, pte);
-	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
+	if (!page && pte_devmap(pte)) {
 		/*
-		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
-		 * case since they are only valid while holding the pgmap
-		 * reference.
+		 * ZONE_DEVICE pages are not yet treated as vm_normal_page()
+		 * instances, with respect to mapcount and compound-page
+		 * metadata
 		 */
-		*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
-		if (*pgmap)
-			page = pte_page(pte);
-		else
-			goto no_page;
+		page = pte_page(pte);
 	} else if (unlikely(!page)) {
 		if (flags & FOLL_DUMP) {
 			/* Avoid special (like zero) pages in core dumps */
@@ -660,15 +654,8 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			return no_page_table(vma, flags);
 		goto retry;
 	}
-	if (pmd_devmap(pmdval)) {
-		ptl = pmd_lock(mm, pmd);
-		page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
-		spin_unlock(ptl);
-		if (page)
-			return page;
-	}
-	if (likely(!pmd_trans_huge(pmdval)))
-		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+	if (likely(!(pmd_trans_huge(pmdval) || pmd_devmap(pmdval))))
+		return follow_page_pte(vma, address, pmd, flags);
 
 	if ((flags & FOLL_NUMA) && pmd_protnone(pmdval))
 		return no_page_table(vma, flags);
@@ -686,9 +673,9 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		pmd_migration_entry_wait(mm, pmd);
 		goto retry_locked;
 	}
-	if (unlikely(!pmd_trans_huge(*pmd))) {
+	if (unlikely(!(pmd_trans_huge(*pmd) || pmd_devmap(pmdval)))) {
 		spin_unlock(ptl);
-		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+		return follow_page_pte(vma, address, pmd, flags);
 	}
 	if (flags & FOLL_SPLIT_PMD) {
 		int ret;
@@ -706,7 +693,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		}
 
 		return ret ? ERR_PTR(ret) :
-			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+			follow_page_pte(vma, address, pmd, flags);
 	}
 	page = follow_trans_huge_pmd(vma, address, pmd, flags);
 	spin_unlock(ptl);
@@ -743,7 +730,7 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
 	}
 	if (pud_devmap(*pud)) {
 		ptl = pud_lock(mm, pud);
-		page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap);
+		page = follow_devmap_pud(vma, address, pud, flags);
 		spin_unlock(ptl);
 		if (page)
 			return page;
@@ -790,9 +777,6 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
  *
  * @flags can have FOLL_ flags set, defined in <linux/mm.h>
  *
- * When getting pages from ZONE_DEVICE memory, the @ctx->pgmap caches
- * the device's dev_pagemap metadata to avoid repeating expensive lookups.
- *
  * When getting an anonymous page and the caller has to trigger unsharing
  * of a shared anonymous page first, -EMLINK is returned. The caller should
  * trigger a fault with FAULT_FLAG_UNSHARE set. Note that unsharing is only
@@ -847,7 +831,7 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 			 unsigned int foll_flags)
 {
-	struct follow_page_context ctx = { NULL };
+	struct follow_page_context ctx = { 0 };
 	struct page *page;
 
 	if (vma_is_secretmem(vma))
@@ -857,8 +841,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 		return NULL;
 
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
-	if (ctx.pgmap)
-		put_dev_pagemap(ctx.pgmap);
 	return page;
 }
 
@@ -1118,7 +1100,7 @@ static long __get_user_pages(struct mm_struct *mm,
 {
 	long ret = 0, i = 0;
 	struct vm_area_struct *vma = NULL;
-	struct follow_page_context ctx = { NULL };
+	struct follow_page_context ctx = { 0 };
 
 	if (!nr_pages)
 		return 0;
@@ -1241,8 +1223,6 @@ static long __get_user_pages(struct mm_struct *mm,
 		nr_pages -= page_increm;
 	} while (nr_pages);
 out:
-	if (ctx.pgmap)
-		put_dev_pagemap(ctx.pgmap);
 	return i ? i : ret;
 }
 
@@ -2322,9 +2302,8 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
 {
-	struct dev_pagemap *pgmap = NULL;
-	int nr_start = *nr, ret = 0;
 	pte_t *ptep, *ptem;
+	int ret = 0;
 
 	ptem = ptep = pte_offset_map(&pmd, addr);
 	do {
@@ -2345,12 +2324,6 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		if (pte_devmap(pte)) {
 			if (unlikely(flags & FOLL_LONGTERM))
 				goto pte_unmap;
-
-			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
-			if (unlikely(!pgmap)) {
-				undo_dev_pagemap(nr, nr_start, flags, pages);
-				goto pte_unmap;
-			}
 		} else if (pte_special(pte))
 			goto pte_unmap;
 
@@ -2397,8 +2370,6 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 	ret = 1;
 
 pte_unmap:
-	if (pgmap)
-		put_dev_pagemap(pgmap);
 	pte_unmap(ptem);
 	return ret;
 }
@@ -2425,28 +2396,17 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 			     unsigned long end, unsigned int flags,
 			     struct page **pages, int *nr)
 {
-	int nr_start = *nr;
-	struct dev_pagemap *pgmap = NULL;
-
 	do {
 		struct page *page = pfn_to_page(pfn);
 
-		pgmap = get_dev_pagemap(pfn, pgmap);
-		if (unlikely(!pgmap)) {
-			undo_dev_pagemap(nr, nr_start, flags, pages);
-			break;
-		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
-		if (unlikely(!try_grab_page(page, flags))) {
-			undo_dev_pagemap(nr, nr_start, flags, pages);
+		if (unlikely(!try_grab_page(page, flags)))
 			break;
-		}
 		(*nr)++;
 		pfn++;
 	} while (addr += PAGE_SIZE, addr != end);
 
-	put_dev_pagemap(pgmap);
 	return addr == end;
 }
 
@@ -2455,16 +2415,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 				 struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
-	int nr_start = *nr;
 
 	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
 		return 0;
 
-	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		undo_dev_pagemap(nr, nr_start, flags, pages);
+	if (unlikely(pmd_val(orig) != pmd_val(*pmdp)))
 		return 0;
-	}
+
 	return 1;
 }
 
@@ -2473,16 +2431,13 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 				 struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
-	int nr_start = *nr;
 
 	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
 		return 0;
 
-	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-		undo_dev_pagemap(nr, nr_start, flags, pages);
+	if (unlikely(pud_val(orig) != pud_val(*pudp)))
 		return 0;
-	}
 	return 1;
 }
 #else
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8a7c1b344abe..ef68296f2158 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1031,55 +1031,6 @@ static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
 		update_mmu_cache_pmd(vma, addr, pmd);
 }
 
-struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
-		pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
-{
-	unsigned long pfn = pmd_pfn(*pmd);
-	struct mm_struct *mm = vma->vm_mm;
-	struct page *page;
-
-	assert_spin_locked(pmd_lockptr(mm, pmd));
-
-	/*
-	 * When we COW a devmap PMD entry, we split it into PTEs, so we should
-	 * not be in this function with `flags & FOLL_COW` set.
-	 */
-	WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
-
-	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
-	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
-			 (FOLL_PIN | FOLL_GET)))
-		return NULL;
-
-	if (flags & FOLL_WRITE && !pmd_write(*pmd))
-		return NULL;
-
-	if (pmd_present(*pmd) && pmd_devmap(*pmd))
-		/* pass */;
-	else
-		return NULL;
-
-	if (flags & FOLL_TOUCH)
-		touch_pmd(vma, addr, pmd, flags & FOLL_WRITE);
-
-	/*
-	 * device mapped pages can only be returned if the
-	 * caller will manage the page reference count.
-	 */
-	if (!(flags & (FOLL_GET | FOLL_PIN)))
-		return ERR_PTR(-EEXIST);
-
-	pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
-	*pgmap = get_dev_pagemap(pfn, *pgmap);
-	if (!*pgmap)
-		return ERR_PTR(-EFAULT);
-	page = pfn_to_page(pfn);
-	if (!try_grab_page(page, flags))
-		page = ERR_PTR(-ENOMEM);
-
-	return page;
-}
-
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 		  struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
@@ -1196,7 +1147,7 @@ static void touch_pud(struct vm_area_struct *vma, unsigned long addr,
 }
 
 struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
-		pud_t *pud, int flags, struct dev_pagemap **pgmap)
+			       pud_t *pud, int flags)
 {
 	unsigned long pfn = pud_pfn(*pud);
 	struct mm_struct *mm = vma->vm_mm;
@@ -1230,9 +1181,6 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
-	*pgmap = get_dev_pagemap(pfn, *pgmap);
-	if (!*pgmap)
-		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
 	if (!try_grab_page(page, flags))
 		page = ERR_PTR(-ENOMEM);


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-16  3:35 ` [PATCH v2 05/18] xfs: Add xfs_break_layouts() " Dan Williams
@ 2022-09-18 22:57   ` Dave Chinner
  2022-09-19 16:11     ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Chinner @ 2022-09-18 22:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Thu, Sep 15, 2022 at 08:35:38PM -0700, Dan Williams wrote:
> In preparation for moving DAX pages to be 0-based rather than 1-based
> for the idle refcount, the fsdax core wants to have all mappings in a
> "zapped" state before truncate. For typical pages this happens naturally
> via unmap_mapping_range(), for DAX pages some help is needed to record
> this state in the 'struct address_space' of the inode(s) where the page
> is mapped.
> 
> That "zapped" state is recorded in DAX entries as a side effect of
> xfs_break_layouts(). Arrange for it to be called before all truncation
> events which already happens for truncate() and PUNCH_HOLE, but not
> truncate_inode_pages_final(). Arrange for xfs_break_layouts() before
> truncate_inode_pages_final().

Ugh. That's nasty and awful.



> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: "Darrick J. Wong" <djwong@kernel.org>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/xfs/xfs_file.c  |   13 +++++++++----
>  fs/xfs/xfs_inode.c |    3 ++-
>  fs/xfs/xfs_inode.h |    6 ++++--
>  fs/xfs/xfs_super.c |   22 ++++++++++++++++++++++
>  4 files changed, 37 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 556e28d06788..d3ff692d5546 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -816,7 +816,8 @@ xfs_wait_dax_page(
>  int
>  xfs_break_dax_layouts(
>  	struct inode		*inode,
> -	bool			*retry)
> +	bool			*retry,
> +	int			state)
>  {
>  	struct page		*page;
>  
> @@ -827,8 +828,8 @@ xfs_break_dax_layouts(
>  		return 0;
>  
>  	*retry = true;
> -	return ___wait_var_event(page, dax_page_idle(page), TASK_INTERRUPTIBLE,
> -				 0, 0, xfs_wait_dax_page(inode));
> +	return ___wait_var_event(page, dax_page_idle(page), state, 0, 0,
> +				 xfs_wait_dax_page(inode));
>  }
>  
>  int
> @@ -839,14 +840,18 @@ xfs_break_layouts(
>  {
>  	bool			retry;
>  	int			error;
> +	int			state = TASK_INTERRUPTIBLE;
>  
>  	ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
>  
>  	do {
>  		retry = false;
>  		switch (reason) {
> +		case BREAK_UNMAP_FINAL:
> +			state = TASK_UNINTERRUPTIBLE;
> +			fallthrough;
>  		case BREAK_UNMAP:
> -			error = xfs_break_dax_layouts(inode, &retry);
> +			error = xfs_break_dax_layouts(inode, &retry, state);
>  			if (error || retry)
>  				break;
>  			fallthrough;
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 28493c8e9bb2..72ce1cb72736 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3452,6 +3452,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
>  	struct xfs_inode	*ip1,
>  	struct xfs_inode	*ip2)
>  {
> +	int			state = TASK_INTERRUPTIBLE;
>  	int			error;
>  	bool			retry;
>  	struct page		*page;
> @@ -3463,7 +3464,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
>  	retry = false;
>  	/* Lock the first inode */
>  	xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
> -	error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
> +	error = xfs_break_dax_layouts(VFS_I(ip1), &retry, state);
>  	if (error || retry) {
>  		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
>  		if (error == 0 && retry)
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index fa780f08dc89..e4994eb6e521 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -454,11 +454,13 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
>   * layout-holder has a consistent view of the file's extent map. While
>   * BREAK_WRITE breaks can be satisfied by recalling FL_LAYOUT leases,
>   * BREAK_UNMAP breaks additionally require waiting for busy dax-pages to
> - * go idle.
> + * go idle. BREAK_UNMAP_FINAL is an uninterruptible version of
> + * BREAK_UNMAP.
>   */
>  enum layout_break_reason {
>          BREAK_WRITE,
>          BREAK_UNMAP,
> +        BREAK_UNMAP_FINAL,
>  };
>  
>  /*
> @@ -531,7 +533,7 @@ xfs_itruncate_extents(
>  }
>  
>  /* from xfs_file.c */
> -int	xfs_break_dax_layouts(struct inode *inode, bool *retry);
> +int	xfs_break_dax_layouts(struct inode *inode, bool *retry, int state);
>  int	xfs_break_layouts(struct inode *inode, uint *iolock,
>  		enum layout_break_reason reason);
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 9ac59814bbb6..ebb4a6eba3fc 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -725,6 +725,27 @@ xfs_fs_drop_inode(
>  	return generic_drop_inode(inode);
>  }
>  
> +STATIC void
> +xfs_fs_evict_inode(
> +	struct inode		*inode)
> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> +	long			error;
> +
> +	xfs_ilock(ip, iolock);

I'm guessing you never ran this through lockdep.

The general rule is that XFS should not take inode locks directly in
the inode eviction path because lockdep tends to throw all manner of
memory reclaim related false positives when we do this. We most
definitely don't want to be doing this for anything other than
regular files that are DAX enabled, yes?

We also don't want to arbitrarily block memory reclaim for long
periods of time waiting on an inode lock.  People seem to get very
upset when we introduce unbound latencies into the memory reclaim
path...

Indeed, what are you actually trying to serialise against here?
Nothing should have a reference to the inode, nor should anything be
able to find and take a new reference to the inode while it is being
evicted....

> +	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP_FINAL);
> +
> +	/* The final layout break is uninterruptible */
> +	ASSERT_ALWAYS(!error);

We don't do error handling with BUG(). If xfs_break_layouts() truly
can't fail (what happens if the fs is shut down and some internal
call path now detects that and returns -EFSCORRUPTED?), then
WARN_ON_ONCE() and continuing to tear down the inode so the system
is not immediately compromised is the appropriate action here.

> +
> +	truncate_inode_pages_final(&inode->i_data);
> +	clear_inode(inode);
> +
> +	xfs_iunlock(ip, iolock);
> +}

That all said, this really looks like a bit of a band-aid.

I can't work out why we would ever have an actual layout lease
here that needs breaking given they are file based and active files
hold a reference to the inode. If we ever break that, then I suspect
this change will cause major problems for anyone using pNFS with XFS
as xfs_break_layouts() can end up waiting for NFS delegation
revocation. This is something we should never be doing in inode
eviction/memory reclaim.

Hence I have to ask why this lease break is being done
unconditionally for all inodes, instead of only calling
xfs_break_dax_layouts() directly on DAX enabled regular files?  I
also wonder what exciting new system deadlocks this will create
because BREAK_UNMAP_FINAL can essentially block forever waiting on
dax mappings going away. If that DAX mapping reclaim requires memory
allocations.....

/me looks deeper into the dax_layout_busy_page() stuff and realises
that both ext4 and XFS implementations of ext4_break_layouts() and
xfs_break_dax_layouts() are actually identical.

That is, filemap_invalidate_unlock() and xfs_iunlock(ip,
XFS_MMAPLOCK_EXCL) operate on exactly the same
inode->i_mapping->invalidate_lock. Hence the implementations in ext4
and XFS are both functionally identical. Further, when the inode is
in the eviction path there is no reason for needing to take that
mapping->invalidate_lock to invalidate remaining stale DAX
mappings before truncate blasts them away.

IOWs, I don't see why fixing this problem needs to add new code to
XFS or ext4 at all. The DAX mapping invalidation and waiting can be
done entirely within truncate_inode_pages_final() (conditional on
IS_DAX()) after mapping_set_exiting() has been set with generic code
and it should not require locking at all. I also think that
ext4_break_layouts() and xfs_break_dax_layouts() should be merged
into a generic dax infrastructure function so the filesystems don't
need to care about the internal details of DAX mappings at all...
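
Roughly, an untested sketch of what I mean, with dax_break_layouts()
being a stand-in name for a new generic helper that would absorb the
common ext4/XFS zap-and-wait loop:

	/* in truncate_inode_pages_final(), after mapping_set_exiting() */
	if (IS_DAX(mapping->host))
		dax_break_layouts(mapping->host);
	truncate_inode_pages(mapping, 0);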

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-18 22:57   ` Dave Chinner
@ 2022-09-19 16:11     ` Dan Williams
  2022-09-19 21:29       ` Dave Chinner
  0 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-19 16:11 UTC (permalink / raw)
  To: Dave Chinner, Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Dave Chinner wrote:
> On Thu, Sep 15, 2022 at 08:35:38PM -0700, Dan Williams wrote:
> > In preparation for moving DAX pages to be 0-based rather than 1-based
> > for the idle refcount, the fsdax core wants to have all mappings in a
> > "zapped" state before truncate. For typical pages this happens naturally
> > via unmap_mapping_range(), for DAX pages some help is needed to record
> > this state in the 'struct address_space' of the inode(s) where the page
> > is mapped.
> > 
> > That "zapped" state is recorded in DAX entries as a side effect of
> > xfs_break_layouts(). Arrange for it to be called before all truncation
> > events which already happens for truncate() and PUNCH_HOLE, but not
> > truncate_inode_pages_final(). Arrange for xfs_break_layouts() before
> > truncate_inode_pages_final().
> 
> Ugh. That's nasty and awful.
> 
> 
> 
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: "Darrick J. Wong" <djwong@kernel.org>
> > Cc: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  fs/xfs/xfs_file.c  |   13 +++++++++----
> >  fs/xfs/xfs_inode.c |    3 ++-
> >  fs/xfs/xfs_inode.h |    6 ++++--
> >  fs/xfs/xfs_super.c |   22 ++++++++++++++++++++++
> >  4 files changed, 37 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 556e28d06788..d3ff692d5546 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -816,7 +816,8 @@ xfs_wait_dax_page(
> >  int
> >  xfs_break_dax_layouts(
> >  	struct inode		*inode,
> > -	bool			*retry)
> > +	bool			*retry,
> > +	int			state)
> >  {
> >  	struct page		*page;
> >  
> > @@ -827,8 +828,8 @@ xfs_break_dax_layouts(
> >  		return 0;
> >  
> >  	*retry = true;
> > -	return ___wait_var_event(page, dax_page_idle(page), TASK_INTERRUPTIBLE,
> > -				 0, 0, xfs_wait_dax_page(inode));
> > +	return ___wait_var_event(page, dax_page_idle(page), state, 0, 0,
> > +				 xfs_wait_dax_page(inode));
> >  }
> >  
> >  int
> > @@ -839,14 +840,18 @@ xfs_break_layouts(
> >  {
> >  	bool			retry;
> >  	int			error;
> > +	int			state = TASK_INTERRUPTIBLE;
> >  
> >  	ASSERT(xfs_isilocked(XFS_I(inode), XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
> >  
> >  	do {
> >  		retry = false;
> >  		switch (reason) {
> > +		case BREAK_UNMAP_FINAL:
> > +			state = TASK_UNINTERRUPTIBLE;
> > +			fallthrough;
> >  		case BREAK_UNMAP:
> > -			error = xfs_break_dax_layouts(inode, &retry);
> > +			error = xfs_break_dax_layouts(inode, &retry, state);
> >  			if (error || retry)
> >  				break;
> >  			fallthrough;
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 28493c8e9bb2..72ce1cb72736 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -3452,6 +3452,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
> >  	struct xfs_inode	*ip1,
> >  	struct xfs_inode	*ip2)
> >  {
> > +	int			state = TASK_INTERRUPTIBLE;
> >  	int			error;
> >  	bool			retry;
> >  	struct page		*page;
> > @@ -3463,7 +3464,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
> >  	retry = false;
> >  	/* Lock the first inode */
> >  	xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
> > -	error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
> > +	error = xfs_break_dax_layouts(VFS_I(ip1), &retry, state);
> >  	if (error || retry) {
> >  		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> >  		if (error == 0 && retry)
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index fa780f08dc89..e4994eb6e521 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -454,11 +454,13 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
> >   * layout-holder has a consistent view of the file's extent map. While
> >   * BREAK_WRITE breaks can be satisfied by recalling FL_LAYOUT leases,
> >   * BREAK_UNMAP breaks additionally require waiting for busy dax-pages to
> > - * go idle.
> > + * go idle. BREAK_UNMAP_FINAL is an uninterruptible version of
> > + * BREAK_UNMAP.
> >   */
> >  enum layout_break_reason {
> >          BREAK_WRITE,
> >          BREAK_UNMAP,
> > +        BREAK_UNMAP_FINAL,
> >  };
> >  
> >  /*
> > @@ -531,7 +533,7 @@ xfs_itruncate_extents(
> >  }
> >  
> >  /* from xfs_file.c */
> > -int	xfs_break_dax_layouts(struct inode *inode, bool *retry);
> > +int	xfs_break_dax_layouts(struct inode *inode, bool *retry, int state);
> >  int	xfs_break_layouts(struct inode *inode, uint *iolock,
> >  		enum layout_break_reason reason);
> >  
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 9ac59814bbb6..ebb4a6eba3fc 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -725,6 +725,27 @@ xfs_fs_drop_inode(
> >  	return generic_drop_inode(inode);
> >  }
> >  
> > +STATIC void
> > +xfs_fs_evict_inode(
> > +	struct inode		*inode)
> > +{
> > +	struct xfs_inode	*ip = XFS_I(inode);
> > +	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > +	long			error;
> > +
> > +	xfs_ilock(ip, iolock);
> 
> I'm guessing you never ran this through lockdep.

I always run with lockdep enabled in my development kernels, but maybe my
testing was insufficient? Somewhat moot with your concerns below...

> The general rule is that XFS should not take inode locks directly in
> the inode eviction path because lockdep tends to throw all manner of
> memory reclaim related false positives when we do this. We most
> definitely don't want to be doing this for anything other than
> regular files that are DAX enabled, yes?

Guilty. I sought to satisfy the locking expectations of the
break_layouts internals rather than drop the unnecessary locking.

> 
> We also don't want to arbitrarily block memory reclaim for long
> periods of time waiting on an inode lock.  People seem to get very
> upset when we introduce unbound latencies into the memory reclaim
> path...
> 
> Indeed, what are you actually trying to serialise against here?
> Nothing should have a reference to the inode, nor should anything be
> able to find and take a new reference to the inode while it is being
> evicted....

Ok.

> 
> > +	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP_FINAL);
> > +
> > +	/* The final layout break is uninterruptible */
> > +	ASSERT_ALWAYS(!error);
> 
> We don't do error handling with BUG(). If xfs_break_layouts() truly
> can't fail (what happens if the fs is shut down and some internal
> call path now detects that and returns -EFSCORRUPTED?), then
> WARN_ON_ONCE() and continuing to tear down the inode so the system
> is not immediately compromised is the appropriate action here.
> 
> > +
> > +	truncate_inode_pages_final(&inode->i_data);
> > +	clear_inode(inode);
> > +
> > +	xfs_iunlock(ip, iolock);
> > +}
> 
> That all said, this really looks like a bit of a band-aid.

It definitely is since DAX is in this transitory state between doing
some activities page-less and others with page metadata. If DAX was
fully committed to behaving like a typical page then
unmap_mapping_range() would have already satisfied this reference
counting situation.

> I can't work out why we would ever have an actual layout lease
> here that needs breaking given they are file based and active files
> hold a reference to the inode. If we ever break that, then I suspect
> this change will cause major problems for anyone using pNFS with XFS
> as xfs_break_layouts() can end up waiting for NFS delegation
> revocation. This is something we should never be doing in inode
> eviction/memory reclaim.
> 
> Hence I have to ask why this lease break is being done
> unconditionally for all inodes, instead of only calling
> xfs_break_dax_layouts() directly on DAX enabled regular files?  I
> also wonder what exciting new system deadlocks this will create
> because BREAK_UNMAP_FINAL can essentially block forever waiting on
> dax mappings going away. If that DAX mapping reclaim requires memory
> allocations.....

There should be no memory allocations in the DAX mapping reclaim path.
Also, the page pins it waits for are precluded from being GUP_LONGTERM.

> 
> /me looks deeper into the dax_layout_busy_page() stuff and realises
> that both ext4 and XFS implementations of ext4_break_layouts() and
> xfs_break_dax_layouts() are actually identical.
> 
> That is, filemap_invalidate_unlock() and xfs_iunlock(ip,
> XFS_MMAPLOCK_EXCL) operate on exactly the same
> inode->i_mapping->invalidate_lock. Hence the implementations in ext4
> and XFS are both functionally identical.

I assume you mean for the purposes of this "final" break since
xfs_file_allocate() holds XFS_IOLOCK_EXCL over xfs_break_layouts().

> Further, when the inode is
> in the eviction path there is no reason for needing to take that
> mapping->invalidate_lock to invalidate remaining stale DAX
> mappings before truncate blasts them away.
> 
> IOWs, I don't see why fixing this problem needs to add new code to
> XFS or ext4 at all. The DAX mapping invalidation and waiting can be
> done entirely within truncate_inode_pages_final() (conditional on
> IS_DAX()) after mapping_set_exiting() has been set with generic code
> and it should not require locking at all. I also think that
> ext4_break_layouts() and xfs_break_dax_layouts() should be merged
> into a generic dax infrastructure function so the filesystems don't
> need to care about the internal details of DAX mappings at all...

Yes, I think I can make that happen. Thanks Dave.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-19 16:11     ` Dan Williams
@ 2022-09-19 21:29       ` Dave Chinner
  2022-09-20 16:44         ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Chinner @ 2022-09-19 21:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Mon, Sep 19, 2022 at 09:11:48AM -0700, Dan Williams wrote:
> Dave Chinner wrote:
> > On Thu, Sep 15, 2022 at 08:35:38PM -0700, Dan Williams wrote:
> > > In preparation for moving DAX pages to be 0-based rather than 1-based
> > > for the idle refcount, the fsdax core wants to have all mappings in a
> > > "zapped" state before truncate. For typical pages this happens naturally
> > > via unmap_mapping_range(), for DAX pages some help is needed to record
> > > this state in the 'struct address_space' of the inode(s) where the page
> > > is mapped.
> > > 
> > > That "zapped" state is recorded in DAX entries as a side effect of
> > > xfs_break_layouts(). Arrange for it to be called before all truncation
> > > events which already happens for truncate() and PUNCH_HOLE, but not
> > > truncate_inode_pages_final(). Arrange for xfs_break_layouts() before
> > > truncate_inode_pages_final().
....
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index 9ac59814bbb6..ebb4a6eba3fc 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -725,6 +725,27 @@ xfs_fs_drop_inode(
> > >  	return generic_drop_inode(inode);
> > >  }
> > >  
> > > +STATIC void
> > > +xfs_fs_evict_inode(
> > > +	struct inode		*inode)
> > > +{
> > > +	struct xfs_inode	*ip = XFS_I(inode);
> > > +	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > > +	long			error;
> > > +
> > > +	xfs_ilock(ip, iolock);
> > 
> > I'm guessing you never ran this through lockdep.
> 
> I always run with lockdep enabled in my development kernels, but maybe my
> testing was insufficient? Somewhat moot with your concerns below...

I'm guessing your testing doesn't generate inode cache pressure and
then have direct memory reclaim reclaiming inodes. e.g. on a directory inode
this will trigger lockdep immediately because readdir locks with
XFS_IOLOCK_SHARED and then does GFP_KERNEL memory reclaim. If we try
to take XFS_IOLOCK_EXCL from memory reclaim of directory inodes,
lockdep will then shout from the rooftops...

> > > +
> > > +	truncate_inode_pages_final(&inode->i_data);
> > > +	clear_inode(inode);
> > > +
> > > +	xfs_iunlock(ip, iolock);
> > > +}
> > 
> > That all said, this really looks like a bit of a band-aid.
> 
> It definitely is since DAX is in this transitory state between doing
> some activities page-less and others with page metadata. If DAX was
> fully committed to behaving like a typical page then
> unmap_mapping_range() would have already satisfied this reference
> counting situation.
> 
> > I can't work out why we would ever have an actual layout lease
> > here that needs breaking given they are file based and active files
> > hold a reference to the inode. If we ever break that, then I suspect
> > this change will cause major problems for anyone using pNFS with XFS
> > as xfs_break_layouts() can end up waiting for NFS delegation
> > revocation. This is something we should never be doing in inode
> > eviction/memory reclaim.
> > 
> > Hence I have to ask why this lease break is being done
> > unconditionally for all inodes, instead of only calling
> > xfs_break_dax_layouts() directly on DAX enabled regular files?  I
> > also wonder what exciting new system deadlocks this will create
> > because BREAK_UNMAP_FINAL can essentially block forever waiting on
> > dax mappings going away. If that DAX mapping reclaim requires memory
> > allocations.....
> 
> There should be no memory allocations in the DAX mapping reclaim path.
> Also, the page pins it waits for are precluded from being GUP_LONGTERM.

So if the task that holds the pin needs memory allocation before it
can unpin the page to allow direct inode reclaim to make progress?

> > /me looks deeper into the dax_layout_busy_page() stuff and realises
> > that both ext4 and XFS implementations of ext4_break_layouts() and
> > xfs_break_dax_layouts() are actually identical.
> > 
> > That is, filemap_invalidate_unlock() and xfs_iunlock(ip,
> > XFS_MMAPLOCK_EXCL) operate on exactly the same
> > inode->i_mapping->invalidate_lock. Hence the implementations in ext4
> > and XFS are both functionally identical.
> 
> I assume you mean for the purposes of this "final" break since
> xfs_file_allocate() holds XFS_IOLOCK_EXCL over xfs_break_layouts().

No, I'm just looking at the two *dax* functions - we don't care what
locks xfs_break_layouts() requires - dax mapping manipulation is
covered by the mapping->invalidate_lock and not the inode->i_rwsem.
This is explicitly documented in the code by the asserts in both
ext4_break_layouts() and xfs_break_dax_layouts().
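
For reference, the asserts I mean are roughly these (quoting from
memory, so check the exact form in the tree):

	/* ext4_break_layouts() */
	if (WARN_ON_ONCE(!rwsem_is_locked(&inode->i_mapping->invalidate_lock)))
		return -EINVAL;

	/* xfs_break_dax_layouts() */
	ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));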

XFS holds the inode->i_rwsem over xfs_break_layouts() because we
have to break *file layout leases* from there, too. These are
serialised by the inode->i_rwsem, not the mapping->invalidate_lock.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/18] Fix the DAX-gup mistake
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (17 preceding siblings ...)
  2022-09-16  3:36 ` [PATCH v2 18/18] mm/gup: Drop DAX pgmap accounting Dan Williams
@ 2022-09-20 14:29 ` Jason Gunthorpe
  2022-09-20 16:50   ` Dan Williams
  2022-11-09  0:20 ` Andrew Morton
  19 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-20 14:29 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Jan Kara, Christoph Hellwig, Darrick J. Wong,
	Matthew Wilcox, John Hubbard, linux-fsdevel, nvdimm, linux-xfs,
	linux-mm, linux-ext4

On Thu, Sep 15, 2022 at 08:35:08PM -0700, Dan Williams wrote:

> This hackery continues the status of DAX pages as special cases in the
> VM. The thought being carrying the Xarray / mapping infrastructure
> forward still allows for the continuation of the page-less DAX effort.
> Otherwise, the work to convert DAX pages to behave like typical
> vm_normal_page() needs more investigation to untangle transparent huge
> page assumptions.

I see it differently, ZONE_DEVICE by definition is page-based. As long
as DAX is using ZONE_DEVICE it should follow the normal struct page
rules, including proper reference counting everywhere.

By not doing this DAX is causing all ZONE_DEVICE users to suffer
because we haven't really special cased just DAX out of all the other
users.

If there is some kind of non-struct page future, then it will not be
ZONE_DEVICE and it will have its own mechanisms, somehow.

So, we should be systematically stripping away all the half-baked
non-struct page stuff from ZONE_DEVICE as a matter of principle. DAX
included, whatever DAX's future may hold.

The pte bit and the missing refcounting in the page table paths are the
remaining big issues, and I hope we fix them. The main problem is that
FS-DAX must create compound pages for the 2M page size.

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 01/18] fsdax: Wait on @page not @page->_refcount
  2022-09-16  3:35 ` [PATCH v2 01/18] fsdax: Wait on @page not @page->_refcount Dan Williams
@ 2022-09-20 14:30   ` Jason Gunthorpe
  0 siblings, 0 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-20 14:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Thu, Sep 15, 2022 at 08:35:15PM -0700, Dan Williams wrote:
> The __wait_var_event facility calculates a wait queue from a hash of the
> address of the variable being passed. Use the @page argument directly as
> it is less to type and is the object that is being waited upon.
> 
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: "Darrick J. Wong" <djwong@kernel.org>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/ext4/inode.c   |    8 ++++----
>  fs/fuse/dax.c     |    6 +++---
>  fs/xfs/xfs_file.c |    6 +++---
>  mm/memremap.c     |    2 +-
>  4 files changed, 11 insertions(+), 11 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 02/18] fsdax: Use dax_page_idle() to document DAX busy page checking
  2022-09-16  3:35 ` [PATCH v2 02/18] fsdax: Use dax_page_idle() to document DAX busy page checking Dan Williams
@ 2022-09-20 14:31   ` Jason Gunthorpe
  0 siblings, 0 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-20 14:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Thu, Sep 15, 2022 at 08:35:21PM -0700, Dan Williams wrote:
> In advance of converting DAX pages to be 0-based, use a new
> dax_page_idle() helper to both simplify that future conversion, but also
> document all the kernel locations that are watching for DAX page idle
> events.
> 
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: "Darrick J. Wong" <djwong@kernel.org>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/dax.c            |    4 ++--
>  fs/ext4/inode.c     |    3 +--
>  fs/fuse/dax.c       |    5 ++---
>  fs/xfs/xfs_file.c   |    5 ++---
>  include/linux/dax.h |    9 +++++++++
>  5 files changed, 16 insertions(+), 10 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-19 21:29       ` Dave Chinner
@ 2022-09-20 16:44         ` Dan Williams
  2022-09-21 22:14           ` Dave Chinner
  0 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-20 16:44 UTC (permalink / raw)
  To: Dave Chinner, Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Dave Chinner wrote:
> On Mon, Sep 19, 2022 at 09:11:48AM -0700, Dan Williams wrote:
> > Dave Chinner wrote:
> > > On Thu, Sep 15, 2022 at 08:35:38PM -0700, Dan Williams wrote:
> > > > In preparation for moving DAX pages to be 0-based rather than 1-based
> > > > for the idle refcount, the fsdax core wants to have all mappings in a
> > > > "zapped" state before truncate. For typical pages this happens naturally
> > > > via unmap_mapping_range(), for DAX pages some help is needed to record
> > > > this state in the 'struct address_space' of the inode(s) where the page
> > > > is mapped.
> > > > 
> > > > That "zapped" state is recorded in DAX entries as a side effect of
> > > > xfs_break_layouts(). Arrange for it to be called before all truncation
> > > > events which already happens for truncate() and PUNCH_HOLE, but not
> > > > truncate_inode_pages_final(). Arrange for xfs_break_layouts() before
> > > > truncate_inode_pages_final().
> ....
> > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > index 9ac59814bbb6..ebb4a6eba3fc 100644
> > > > --- a/fs/xfs/xfs_super.c
> > > > +++ b/fs/xfs/xfs_super.c
> > > > @@ -725,6 +725,27 @@ xfs_fs_drop_inode(
> > > >  	return generic_drop_inode(inode);
> > > >  }
> > > >  
> > > > +STATIC void
> > > > +xfs_fs_evict_inode(
> > > > +	struct inode		*inode)
> > > > +{
> > > > +	struct xfs_inode	*ip = XFS_I(inode);
> > > > +	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > > > +	long			error;
> > > > +
> > > > +	xfs_ilock(ip, iolock);
> > > 
> > > I'm guessing you never ran this through lockdep.
> > 
> > I always run with lockdep enabled in my development kernels, but maybe my
> > testing was insufficient? Somewhat moot with your concerns below...
> 
> I'm guessing your testing doesn't generate inode cache pressure and
> then have direct memory reclaim reclaiming inodes. e.g. on a directory inode
> this will trigger lockdep immediately because readdir locks with
> XFS_IOLOCK_SHARED and then does GFP_KERNEL memory reclaim. If we try
> to take XFS_IOLOCK_EXCL from memory reclaim of directory inodes,
> lockdep will then shout from the rooftops...

Got it.

> 
> > > > +
> > > > +	truncate_inode_pages_final(&inode->i_data);
> > > > +	clear_inode(inode);
> > > > +
> > > > +	xfs_iunlock(ip, iolock);
> > > > +}
> > > 
> > > That all said, this really looks like a bit of a band-aid.
> > 
> > It definitely is since DAX is in this transitory state between doing
> > some activities page-less and others with page metadata. If DAX was
> > fully committed to behaving like a typical page then
> > unmap_mapping_range() would have already satisfied this reference
> > counting situation.
> > 
> > > I can't work out why we would ever have an actual layout lease
> > > here that needs breaking given they are file based and active files
> > > hold a reference to the inode. If we ever break that, then I suspect
> > > this change will cause major problems for anyone using pNFS with XFS
> > > as xfs_break_layouts() can end up waiting for NFS delegation
> > > revocation. This is something we should never be doing in inode
> > > eviction/memory reclaim.
> > > 
> > > Hence I have to ask why this lease break is being done
> > > unconditionally for all inodes, instead of only calling
> > > xfs_break_dax_layouts() directly on DAX enabled regular files?  I
> > > also wonder what exciting new system deadlocks this will create
> > > because BREAK_UNMAP_FINAL can essentially block forever waiting on
> > > dax mappings going away. If that DAX mapping reclaim requires memory
> > > allocations.....
> > 
> > There should be no memory allocations in the DAX mapping reclaim path.
> > Also, the page pins it waits for are precluded from being GUP_LONGTERM.
> 
> So if the task that holds the pin needs memory allocation before it
> can unpin the page to allow direct inode reclaim to make progress?

No, it couldn't, and I realize now that GUP_LONGTERM has nothing to do
with this hang, since any GFP_KERNEL allocation in a path that took a
DAX page pin could run afoul of this need to wait.

So, this has me looking at invalidate_inodes() and iput_final(), where I
did not see the reclaim entanglement, and thinking DAX has the unique
requirement to make sure that no access to a page outlives the hosting
inode.

Not that I need to tell you, but to get my own thinking straight,
compare that to the typical page cache, where the pinner can keep a
pinned page-cache page as long as it wants even after it has been
truncated. DAX needs to make sure that truncate_inode_pages() ceases all
access to the page synchronously with the truncate. The typical page
cache will ensure that the next mapping of the file gets a new page if
the page previously pinned for that offset is still in use; DAX cannot
offer that, as the same page that was previously pinned is always
reused.

So I think this means something like this:

diff --git a/fs/inode.c b/fs/inode.c
index 6462276dfdf0..ab16772b9a8d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -784,6 +784,11 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
                        continue;
                }
 
+               if (dax_inode_busy(inode)) {
+                       busy = 1;
+                       continue;
+               }
+
                inode->i_state |= I_FREEING;
                inode_lru_list_del(inode);
                spin_unlock(&inode->i_lock);
@@ -1733,6 +1738,8 @@ static void iput_final(struct inode *inode)
                spin_unlock(&inode->i_lock);
 
                write_inode_now(inode, 1);
+               if (IS_DAX(inode))
+                       dax_break_layouts(inode);
 
                spin_lock(&inode->i_lock);
                state = inode->i_state;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9eced4cc286e..e4a74ab310b5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3028,8 +3028,20 @@ extern struct inode * igrab(struct inode *);
 extern ino_t iunique(struct super_block *, ino_t);
 extern int inode_needs_sync(struct inode *inode);
 extern int generic_delete_inode(struct inode *inode);
+
+static inline bool dax_inode_busy(struct inode *inode)
+{
+       if (!IS_DAX(inode))
+               return false;
+
+       return dax_zap_pages(inode) != NULL;
+}
+
 static inline int generic_drop_inode(struct inode *inode)
 {
+       if (dax_inode_busy(inode))
+               return 0;
+
        return !inode->i_nlink || inode_unhashed(inode);
 }
 extern void d_mark_dontcache(struct inode *inode);

...where generic code skips over dax-inodes with pinned pages.

> 
> > > /me looks deeper into the dax_layout_busy_page() stuff and realises
> > > that both ext4 and XFS implementations of ext4_break_layouts() and
> > > xfs_break_dax_layouts() are actually identical.
> > > 
> > > That is, filemap_invalidate_unlock() and xfs_iunlock(ip,
> > > XFS_MMAPLOCK_EXCL) operate on exactly the same
> > > inode->i_mapping->invalidate_lock. Hence the implementations in ext4
> > > and XFS are both functionally identical.
> > 
> > I assume you mean for the purposes of this "final" break since
> > xfs_file_allocate() holds XFS_IOLOCK_EXCL over xfs_break_layouts().
> 
> No, I'm just looking at the two *dax* functions - we don't care what
> locks xfs_break_layouts() requires - dax mapping manipulation is
> covered by the mapping->invalidate_lock and not the inode->i_rwsem.
> This is explicitly documented in the code by the asserts in both
> ext4_break_layouts() and xfs_break_dax_layouts().
> 
> XFS holds the inode->i_rwsem over xfs_break_layouts() because we
> have to break *file layout leases* from there, too. These are
> serialised by the inode->i_rwsem, not the mapping->invalidate_lock.

Got it, will make generic helpers for the scenario where only
dax-break-layouts needs to be performed.

^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/18] Fix the DAX-gup mistake
  2022-09-20 14:29 ` [PATCH v2 00/18] Fix the DAX-gup mistake Jason Gunthorpe
@ 2022-09-20 16:50   ` Dan Williams
  0 siblings, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-20 16:50 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: akpm, Jan Kara, Christoph Hellwig, Darrick J. Wong,
	Matthew Wilcox, John Hubbard, linux-fsdevel, nvdimm, linux-xfs,
	linux-mm, linux-ext4

Jason Gunthorpe wrote:
> On Thu, Sep 15, 2022 at 08:35:08PM -0700, Dan Williams wrote:
> 
> > This hackery continues the status of DAX pages as special cases in the
> > VM. The thought being carrying the Xarray / mapping infrastructure
> > forward still allows for the continuation of the page-less DAX effort.
> > Otherwise, the work to convert DAX pages to behave like typical
> > vm_normal_page() needs more investigation to untangle transparent huge
> > page assumptions.
> 
> I see it differently, ZONE_DEVICE by definition is page-based. As long
> as DAX is using ZONE_DEVICE it should follow the normal struct page
> rules, including proper reference counting everywhere.
> 
> By not doing this DAX is causing all ZONE_DEVICE users to suffer
> because we haven't really special cased just DAX out of all the other
> users.
> 
> If there is some kind of non-struct page future, then it will not be
> ZONE_DEVICE and it will have its own mechanisms, somehow.
> 
> So, we should be systematically stripping away all the half-backed
> non-struct page stuff from ZONE_DEVICE as a matter of principle. DAX
> included, whatever DAX's future may hold.
> 
> The pte bit and the missing refcounting in the page table paths is the
> remaining big issue and I hope we fix it. The main problem is that
> FS-DAX must create compound pages for the 2M page size.

Yes, this is how I see it too. Without serious help from folks that want
to kill struct-page usage with DAX the next step will be dynamic
compound page metadata initialization whenever a PMD entry is installed.
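
i.e. something along the lines of this hand-wavy sketch, where the
helper choice and the call site are illustrative only:

	/*
	 * At PMD-install time, initialize the 2M worth of pages as a
	 * compound page so head/tail accounting behaves as usual.
	 */
	struct page *head = pfn_to_page(pfn);

	prep_compound_page(head, HPAGE_PMD_ORDER);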

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-16  3:36 ` [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion Dan Williams
@ 2022-09-21 14:03   ` Jason Gunthorpe
  2022-09-21 15:18     ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-21 14:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Thu, Sep 15, 2022 at 08:36:07PM -0700, Dan Williams wrote:
> The percpu_ref in 'struct dev_pagemap' is used to coordinate active
> mappings of device-memory with the device-removal / unbind path. It
> enables the semantic that initiating device-removal (or
> device-driver-unbind) blocks new mapping and DMA attempts, and waits for
> mapping revocation or inflight DMA to complete.

This seems strange to me

The pagemap should be ref'd as long as the filesystem is mounted over
the dax. The ref should be incrd when the filesystem is mounted and
decrd when it is unmounted.

When the filesystem unmounts it should zap all the mappings (actually
I don't think you can even unmount a filesystem while mappings are
open) and wait for all page references to go to zero, then put the
final pagemap back.

The rule is nothing can touch page->pgmap while page->refcount == 0,
and if page->refcount != 0 then page->pgmap must be valid, without any
refcounting on the page map itself.

So, why do we need pgmap refcounting all over the place? It seems like
it only existed before because of the abuse of the page->refcount?
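
To state that invariant as code (an illustrative sketch, not a proposed
helper):

	/*
	 * A positive page refcount is what keeps page->pgmap valid, so a
	 * caller that already holds a page reference never needs to take
	 * a separate reference on the pgmap itself.
	 */
	static struct dev_pagemap *pgmap_of_pinned_page(struct page *page)
	{
		VM_WARN_ON_ONCE(page_ref_count(page) == 0);
		return page->pgmap;
	}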

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 13/18] dax: Prep mapping helpers for compound pages
  2022-09-16  3:36 ` [PATCH v2 13/18] dax: Prep mapping helpers for compound pages Dan Williams
@ 2022-09-21 14:06   ` Jason Gunthorpe
  2022-09-21 15:19     ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-21 14:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Thu, Sep 15, 2022 at 08:36:25PM -0700, Dan Williams wrote:
> In preparation for device-dax to use the same mapping machinery as
> fsdax, add support for device-dax compound pages.
> 
> Presently this is handled by dax_set_mapping() which is careful to only
> update page->mapping for head pages. However, it does that by looking at
> properties in the 'struct dev_dax' instance associated with the page.
> Switch to just checking PageHead() directly in the functions that
> iterate over pages in a large mapping.
> 
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: "Darrick J. Wong" <djwong@kernel.org>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  drivers/dax/Kconfig   |    1 +
>  drivers/dax/mapping.c |   16 ++++++++++++++++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> index 205e9dda8928..2eddd32c51f4 100644
> --- a/drivers/dax/Kconfig
> +++ b/drivers/dax/Kconfig
> @@ -9,6 +9,7 @@ if DAX
>  config DEV_DAX
>  	tristate "Device DAX: direct access mapping device"
>  	depends on TRANSPARENT_HUGEPAGE
> +	depends on !FS_DAX_LIMITED
>  	help
>  	  Support raw access to differentiated (persistence, bandwidth,
>  	  latency...) memory via an mmap(2) capable character
> diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
> index 70576aa02148..5d4b9601f183 100644
> --- a/drivers/dax/mapping.c
> +++ b/drivers/dax/mapping.c
> @@ -345,6 +345,8 @@ static vm_fault_t dax_associate_entry(void *entry,
>  	for_each_mapped_pfn(entry, pfn) {
>  		struct page *page = pfn_to_page(pfn);
>  
> +		page = compound_head(page);

I feel like the word folio is need here.. pfn_to_folio() or something?

At the very least we should have a struct folio after doing the
compound_head, right?

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 15/18] devdax: Use dax_insert_entry() + dax_delete_mapping_entry()
  2022-09-16  3:36 ` [PATCH v2 15/18] devdax: Use dax_insert_entry() + dax_delete_mapping_entry() Dan Williams
@ 2022-09-21 14:10   ` Jason Gunthorpe
  2022-09-21 15:48     ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-21 14:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Thu, Sep 15, 2022 at 08:36:37PM -0700, Dan Williams wrote:
> Track entries and take pgmap references at mapping insertion time.
> Revoke mappings (dax_zap_mappings()) and drop the associated pgmap
> references at device destruction or inode eviction time. With this in
> place, and the fsdax equivalent already in place, the gup code no longer
> needs to consider PTE_DEVMAP as an indicator to get a pgmap reference
> before taking a page reference.
> 
> In other words, GUP takes additional references on mapped pages. Until
> now, DAX in all its forms was failing to take references at mapping
> time. With that fixed there is no longer a requirement for gup to manage
> @pgmap references. However, that cleanup is saved for a follow-on patch.

A page->pgmap must be valid and stable so long as a page has a
positive refcount. Once we fixed the refcount GUP is automatically
fine. So this explanation seems confusing.

If dax code needs other changes to maintain that invariant it should
be spelled out what those are and why, but the instant we fix the
refcount we can delete the stuff in gup.c and everywhere else.

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-21 14:03   ` Jason Gunthorpe
@ 2022-09-21 15:18     ` Dan Williams
  2022-09-21 21:38       ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-21 15:18 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Jason Gunthorpe wrote:
> On Thu, Sep 15, 2022 at 08:36:07PM -0700, Dan Williams wrote:
> > The percpu_ref in 'struct dev_pagemap' is used to coordinate active
> > mappings of device-memory with the device-removal / unbind path. It
> > enables the semantic that initiating device-removal (or
> > device-driver-unbind) blocks new mapping and DMA attempts, and waits for
> > mapping revocation or inflight DMA to complete.
> 
> This seems strange to me
> 
> The pagemap should be ref'd as long as the filesystem is mounted over
> the dax. The ref should be incrd when the filesystem is mounted and
> decrd when it is unmounted.
> 
> When the filesystem unmounts it should zap all the mappings (actually
> I don't think you can even unmount a filesystem while mappings are
> open) and wait for all page references to go to zero, then put the
> final pagemap back.
> 
> The rule is nothing can touch page->pgmap while page->refcount == 0,
> and if page->refcount != 0 then page->pgmap must be valid, without any
> refcounting on the page map itself.
> 
> So, why do we need pgmap refcounting all over the place? It seems like
> it only existed before because of the abuse of the page->refcount?

Recall that this percpu_ref mirrors the function of blk_queue_enter(),
whereby every new request checks whether the device is still alive or
has already started exiting.

So pgmap 'live' reference taking in fs/dax.c allows the core to start
failing fault requests once device teardown has started. It is a 'block
new, and drain old' semantic.
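
i.e. conceptually the insertion path does something like the following,
a simplified sketch of the idea rather than the literal code in this
series:

	pgmap = get_dev_pagemap(pfn, NULL);
	if (!pgmap)
		/* device teardown has already started, fail the fault */
		return VM_FAULT_SIGBUS;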

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 13/18] dax: Prep mapping helpers for compound pages
  2022-09-21 14:06   ` Jason Gunthorpe
@ 2022-09-21 15:19     ` Dan Williams
  0 siblings, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-21 15:19 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Jason Gunthorpe wrote:
> On Thu, Sep 15, 2022 at 08:36:25PM -0700, Dan Williams wrote:
> > In preparation for device-dax to use the same mapping machinery as
> > fsdax, add support for device-dax compound pages.
> > 
> > Presently this is handled by dax_set_mapping() which is careful to only
> > update page->mapping for head pages. However, it does that by looking at
> > properties in the 'struct dev_dax' instance associated with the page.
> > Switch to just checking PageHead() directly in the functions that
> > iterate over pages in a large mapping.
> > 
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: "Darrick J. Wong" <djwong@kernel.org>
> > Cc: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  drivers/dax/Kconfig   |    1 +
> >  drivers/dax/mapping.c |   16 ++++++++++++++++
> >  2 files changed, 17 insertions(+)
> > 
> > diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
> > index 205e9dda8928..2eddd32c51f4 100644
> > --- a/drivers/dax/Kconfig
> > +++ b/drivers/dax/Kconfig
> > @@ -9,6 +9,7 @@ if DAX
> >  config DEV_DAX
> >  	tristate "Device DAX: direct access mapping device"
> >  	depends on TRANSPARENT_HUGEPAGE
> > +	depends on !FS_DAX_LIMITED
> >  	help
> >  	  Support raw access to differentiated (persistence, bandwidth,
> >  	  latency...) memory via an mmap(2) capable character
> > diff --git a/drivers/dax/mapping.c b/drivers/dax/mapping.c
> > index 70576aa02148..5d4b9601f183 100644
> > --- a/drivers/dax/mapping.c
> > +++ b/drivers/dax/mapping.c
> > @@ -345,6 +345,8 @@ static vm_fault_t dax_associate_entry(void *entry,
> >  	for_each_mapped_pfn(entry, pfn) {
> >  		struct page *page = pfn_to_page(pfn);
> >  
> > +		page = compound_head(page);
> 
> I feel like the word folio is need here.. pfn_to_folio() or something?
> 
> At the very least we should have a struct folio after doing the
> compound_head, right?

True, I can move this to the folio helpers.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count
  2022-09-16  3:36 ` [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count Dan Williams
@ 2022-09-21 15:24   ` Jason Gunthorpe
  2022-09-21 23:45     ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-21 15:24 UTC (permalink / raw)
  To: Dan Williams, Alistair Popple
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Thu, Sep 15, 2022 at 08:36:43PM -0700, Dan Williams wrote:
> The initial memremap_pages() implementation inherited the
> __init_single_page() default of pages starting life with an elevated
> reference count. This originally allowed for the page->pgmap pointer to
> alias with the storage for page->lru since a page was only allowed to be
> on an lru list when its reference count was zero.
> 
> Since then, 'struct page' definition cleanups have arranged for
> dedicated space for the ZONE_DEVICE page metadata, and the
> MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
> page->_refcount transition to route the page to free_zone_device_page()
> and not the core-mm page-free. With those cleanups in place and with
> filesystem-dax and device-dax now converted to take and drop references
> at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX
> and MEMORY_DEVICE_GENERIC reference counts at 0.
> 
> MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE
> pages start life at _refcount 1, so make that the default if
> pgmap->init_mode is left at zero.

I'm shocked to read this - how does it make any sense?

dev_pagemap_ops->page_free() is only called on the 1->0 transition, so
any driver which implements it must be expecting pages to have a 0
refcount.

Looking around everything but only fsdax_pagemap_ops implements
page_free()

So, how does it work? Surely the instant the page map is created all
the pages must be considered 'free', and after page_free() is called I
would also expect the page to be considered free.

How on earth can a free'd page have both a 0 and 1 refcount??

e.g. look at the simple hmm_test: it threads pages onto the
mdevice->free_pages list immediately after memremap_pages() and then
again inside page_free() - it is completely wrong that they would have
different refcounts while on the free_pages list.

I would expect that after the page is removed from the free_pages list
it will have its refcount set to 1 to make it non-free, then it will go
through the migration.
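
i.e. roughly this, going from memory of lib/test_hmm.c, with the
set_page_count() being the part I would expect to see:

	spin_lock(&mdevice->lock);
	dpage = mdevice->free_pages;
	mdevice->free_pages = dpage->zone_device_data;
	spin_unlock(&mdevice->lock);
	set_page_count(dpage, 1);	/* page becomes non-free here */
	lock_page(dpage);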

Alistair how should the refcounting be working here in hmm_test?

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 15/18] devdax: Use dax_insert_entry() + dax_delete_mapping_entry()
  2022-09-21 14:10   ` Jason Gunthorpe
@ 2022-09-21 15:48     ` Dan Williams
  2022-09-21 22:23       ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-21 15:48 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Jason Gunthorpe wrote:
> On Thu, Sep 15, 2022 at 08:36:37PM -0700, Dan Williams wrote:
> > Track entries and take pgmap references at mapping insertion time.
> > Revoke mappings (dax_zap_mappings()) and drop the associated pgmap
> > references at device destruction or inode eviction time. With this in
> > place, and the fsdax equivalent already in place, the gup code no longer
> > needs to consider PTE_DEVMAP as an indicator to get a pgmap reference
> > before taking a page reference.
> > 
> > In other words, GUP takes additional references on mapped pages. Until
> > now, DAX in all its forms was failing to take references at mapping
> > time. With that fixed there is no longer a requirement for gup to manage
> > @pgmap references. However, that cleanup is saved for a follow-on patch.
> 
> A page->pgmap must be valid and stable so long as a page has a
> positive refcount. Once we fixed the refcount GUP is automatically
> fine. So this explanation seems confusing.

I think while trying to describe the history I made this patch
description confusing.

> If dax code needs other changes to maintain that invarient it should
> be spelled out what those are and why, but the instant we fix the
> refcount we can delete the stuff in gup.c and everywhere else.

How about the following, note that this incorporates new changes I have
in flight after Dave pointed out the problem DAX has with page pins
versus inode lifetime:

---

The fsdax core now manages pgmap references when servicing faults that
install new mappings, and elevates the page reference until it is
zapped. It coordinates with the VFS to make sure that all page
references are dropped before the hosting inode goes out of scope
(iput_final()).

In order to delete the unnecessary pgmap reference taking in mm/gup.c
devdax needs to move to the same model.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-21 15:18     ` Dan Williams
@ 2022-09-21 21:38       ` Dan Williams
  2022-09-21 22:07         ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-21 21:38 UTC (permalink / raw)
  To: Dan Williams, Jason Gunthorpe
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Thu, Sep 15, 2022 at 08:36:07PM -0700, Dan Williams wrote:
> > > The percpu_ref in 'struct dev_pagemap' is used to coordinate active
> > > mappings of device-memory with the device-removal / unbind path. It
> > > enables the semantic that initiating device-removal (or
> > > device-driver-unbind) blocks new mapping and DMA attempts, and waits for
> > > mapping revocation or inflight DMA to complete.
> > 
> > This seems strange to me
> > 
> > The pagemap should be ref'd as long as the filesystem is mounted over
> > the dax. The ref should be incrd when the filesystem is mounted and
> > decrd when it is unmounted.
> > 
> > When the filesystem unmounts it should zap all the mappings (actually
> > I don't think you can even unmount a filesystem while mappings are
> > open) and wait for all page references to go to zero, then put the
> > final pagemap back.
> > 
> > The rule is nothing can touch page->pgmap while page->refcount == 0,
> > and if page->refcount != 0 then page->pgmap must be valid, without any
> > refcounting on the page map itself.
> > 
> > So, why do we need pgmap refcounting all over the place? It seems like
> > it only existed before because of the abuse of the page->refcount?
> 
> Recall that this percpu_ref is mirroring the same function as
> blk_queue_enter() whereby every new request is checking to make sure the
> device is still alive, or whether it has started exiting.
> 
> So pgmap 'live' reference taking in fs/dax.c allows the core to start
> failing fault requests once device teardown has started. It is a 'block
> new, and drain old' semantic.

However this line of questioning has me realizing that I have the
put_dev_pagemap() in the wrong place. It needs to go in
free_zone_device_page(), so that gup extends the lifetime of the device.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-21 21:38       ` Dan Williams
@ 2022-09-21 22:07         ` Jason Gunthorpe
  2022-09-22  0:14           ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-21 22:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Wed, Sep 21, 2022 at 02:38:56PM -0700, Dan Williams wrote:
> Dan Williams wrote:
> > Jason Gunthorpe wrote:
> > > On Thu, Sep 15, 2022 at 08:36:07PM -0700, Dan Williams wrote:
> > > > The percpu_ref in 'struct dev_pagemap' is used to coordinate active
> > > > mappings of device-memory with the device-removal / unbind path. It
> > > > enables the semantic that initiating device-removal (or
> > > > device-driver-unbind) blocks new mapping and DMA attempts, and waits for
> > > > mapping revocation or inflight DMA to complete.
> > > 
> > > This seems strange to me
> > > 
> > > The pagemap should be ref'd as long as the filesystem is mounted over
> > > the dax. The ref should be incrd when the filesystem is mounted and
> > > decrd when it is unmounted.
> > > 
> > > When the filesystem unmounts it should zap all the mappings (actually
> > > I don't think you can even unmount a filesystem while mappings are
> > > open) and wait for all page references to go to zero, then put the
> > > final pagemap back.
> > > 
> > > The rule is nothing can touch page->pgmap while page->refcount == 0,
> > > and if page->refcount != 0 then page->pgmap must be valid, without any
> > > refcounting on the page map itself.
> > > 
> > > So, why do we need pgmap refcounting all over the place? It seems like
> > > it only existed before because of the abuse of the page->refcount?
> > 
> > Recall that this percpu_ref is mirroring the same function as
> > blk_queue_enter() whereby every new request is checking to make sure the
> > device is still alive, or whether it has started exiting.
> > 
> > So pgmap 'live' reference taking in fs/dax.c allows the core to start
> > failing fault requests once device teardown has started. It is a 'block
> > new, and drain old' semantic.

It is weird this email never arrived for me..

I think that is all fine, but it would be much more logically
expressed as a simple 'is pgmap alive' call before doing a new mapping
than mucking with the refcount logic. Such a test could simply
READ_ONCE a bool value in the pgmap struct.

Indeed, you could reasonably put such a liveness test at the moment
every driver takes a 0 refcount struct page and turns it into a 1
refcount struct page.
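
Something along these lines, purely as a sketch (the 'live' flag and the
helper are hypothetical, not an existing API):

	/* hypothetical liveness flag in struct dev_pagemap */
	static inline bool pgmap_alive(struct dev_pagemap *pgmap)
	{
		return READ_ONCE(pgmap->live);
	}

	/* before installing a new mapping / doing the 0 -> 1 transition: */
	if (!pgmap_alive(pgmap))
		return VM_FAULT_SIGBUS;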

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-20 16:44         ` Dan Williams
@ 2022-09-21 22:14           ` Dave Chinner
  2022-09-21 22:28             ` Jason Gunthorpe
  2022-09-22  0:02             ` Dan Williams
  0 siblings, 2 replies; 84+ messages in thread
From: Dave Chinner @ 2022-09-21 22:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Tue, Sep 20, 2022 at 09:44:52AM -0700, Dan Williams wrote:
> Dave Chinner wrote:
> > On Mon, Sep 19, 2022 at 09:11:48AM -0700, Dan Williams wrote:
> > > Dave Chinner wrote:
> > > > That all said, this really looks like a bit of a band-aid.
> > > 
> > > It definitely is since DAX is in this transitory state between doing
> > > some activities page-less and others with page metadata. If DAX was
> > > fully committed to behaving like a typical page then
> > > unmap_mapping_range() would have already satisfied this reference
> > > counting situation.
> > > 
> > > > I can't work out why we would ever have an actual layout lease
> > > > here that needs breaking given they are file based and active files
> > > > hold a reference to the inode. If we ever break that, then I suspect
> > > > this change will cause major problems for anyone using pNFS with XFS
> > > > as xfs_break_layouts() can end up waiting for NFS delegation
> > > > revocation. This is something we should never be doing in inode
> > > > eviction/memory reclaim.
> > > > 
> > > > Hence I have to ask why this lease break is being done
> > > > unconditionally for all inodes, instead of only calling
> > > > xfs_break_dax_layouts() directly on DAX enabled regular files?  I
> > > > also wonder what exciting new system deadlocks this will create
> > > > because BREAK_UNMAP_FINAL can essentially block forever waiting on
> > > > dax mappings going away. If that DAX mapping reclaim requires memory
> > > > allocations.....
> > > 
> > > There should be no memory allocations in the DAX mapping reclaim path.
> > > Also, the page pins it waits for are precluded from being GUP_LONGTERM.
> > 
> > So if the task that holds the pin needs memory allocation before it
> > can unpin the page to allow direct inode reclaim to make progress?
> 
> No, it couldn't, and I realize now that GUP_LONGTERM has nothing to do
> with this hang since any GFP_KERNEL in a path that took a DAX page pin
> path could run afoul of this need to wait.
> 
> So, this has me looking at invalidate_inodes() and iput_final(), where I
> did not see the reclaim entanglement, and thinking DAX has the unique
> requirement to make sure that no access to a page outlives the hosting
> inode.
> 
> Not that I need to tell you, but to get my own thinking straight,
> compare that to typical page cache as the pinner can keep a pinned
> page-cache page as long as it wants even after it has been truncated.

Right, because the page pin prevents the page from being freed
after the page references the page cache keeps have been released.

But page cache page != DAX page. The DAX page is a direct reference
to the storage media, not a generic reference counted kernel page
that the kernel will keep alive as long as there is a reference to
it.

Hence for a DAX page, we have to revoke all access to the page
before the controlling owner context is torn down, otherwise we have
a use-after-free scenario at the storage media level. For a FSDAX
file data page, that owner context is the inode...

> DAX needs to make sure that truncate_inode_pages() ceases all access to
> the page synchronous with the truncate.

Yes, exactly.

>
> The typical page-cache will
> ensure that the next mapping of the file will get a new page if the page
> previously pinned for that offset is still in use, DAX can not offer
> that as the same page that was previously pinned is always used.

Yes, because the new DAX page lookup will return the original page
in the storage media, not a newly instantiated page cache page.

> So I think this means something like this:
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 6462276dfdf0..ab16772b9a8d 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -784,6 +784,11 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
>                         continue;
>                 }
>  
> +               if (dax_inode_busy(inode)) {
> +                       busy = 1;
> +                       continue;
> +               }

That this does more than a check (i.e. it runs whatever
dax_zap_pages() does) means it cannot be run under the inode
spinlock.

As this is called from the block device code when a bdev is being
removed (i.e. will only find a superblock and inodes to invalidate
on hot-unplug), shouldn't this DAX mapping invalidation actually be
handled by the pmem failure notification infrastructure we've just
added for reflink?

> +
>                 inode->i_state |= I_FREEING;
>                 inode_lru_list_del(inode);
>                 spin_unlock(&inode->i_lock);
> @@ -1733,6 +1738,8 @@ static void iput_final(struct inode *inode)
>                 spin_unlock(&inode->i_lock);
>  
>                 write_inode_now(inode, 1);
> +               if (IS_DAX(inode))
> +                       dax_break_layouts(inode);
>  
>                 spin_lock(&inode->i_lock);
>                 state = inode->i_state;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9eced4cc286e..e4a74ab310b5 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3028,8 +3028,20 @@ extern struct inode * igrab(struct inode *);
>  extern ino_t iunique(struct super_block *, ino_t);
>  extern int inode_needs_sync(struct inode *inode);
>  extern int generic_delete_inode(struct inode *inode);
> +
> +static inline bool dax_inode_busy(struct inode *inode)
> +{
> +       if (!IS_DAX(inode))
> +               return false;
> +
> +       return dax_zap_pages(inode) != NULL;
> +}
> +
>  static inline int generic_drop_inode(struct inode *inode)
>  {
> +       if (dax_inode_busy(inode))
> +               return 0;
> +
>         return !inode->i_nlink || inode_unhashed(inode);
>  }

I don't think that's valid. This can result in unreferenced unlinked
inodes that should be torn down immediately being placed in the LRU
and cached in memory and potentially not processed until there is
future memory pressure or an unmount....

i.e. dropping the final reference on an unlinked inode needs to
reclaim the inode immediately and allow the filesystem to free the
inode, regardless of any other factor. Nothing should have an active
reference to the inode or inode related data/metadata at this point
in time.

Honestly, this still seems like a band-aid because it doesn't appear
to address that something has pinned the storage media without
having an active reference to the object that arbitrates access to
that storage media (i.e. the inode and, by proxy, then filesystem).
Where are these DAX page pins that don't require the pin holder to
also hold active references to the filesystem objects coming from?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 15/18] devdax: Use dax_insert_entry() + dax_delete_mapping_entry()
  2022-09-21 15:48     ` Dan Williams
@ 2022-09-21 22:23       ` Jason Gunthorpe
  2022-09-22  0:15         ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-21 22:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Wed, Sep 21, 2022 at 08:48:22AM -0700, Dan Williams wrote:

> The fsdax core now manages pgmap references when servicing faults that
> install new mappings, and elevates the page reference until it is
> zapped. It coordinates with the VFS to make sure that all page
> references are dropped before the hosting inode goes out of scope
> (iput_final()).
>
> In order to delete the unnecessary pgmap reference taking in mm/gup.c
> devdax needs to move to the same model.

I think this patch is more about making devdax and fsdax use the same
set of functions and logic so that when it gets to patch 16/17 devdax
doesn't break. That understanding matches the first paragraph, at
least.

I would delete the remark about gup since it is really patch 17 that
allows gup to be fixed by making it so that refcount == 0 means not to
look at the pgmap (instead of refcount == 1 as is now) ?

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-21 22:14           ` Dave Chinner
@ 2022-09-21 22:28             ` Jason Gunthorpe
  2022-09-23  0:18               ` Dave Chinner
  2022-09-22  0:02             ` Dan Williams
  1 sibling, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-21 22:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dan Williams, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Thu, Sep 22, 2022 at 08:14:16AM +1000, Dave Chinner wrote:

> Where are these DAX page pins that don't require the pin holder to
> also hold active references to the filesystem objects coming from?

O_DIRECT and things like it.

The concept has been that what you called revoke is just generic code
to wait until all short-term users put back their pins, as short-term
users must do according to the !FOLL_LONGTERM contract.
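
Roughly the usual short-term pattern, for illustration:

	/* illustrative short-term pin, per the !FOLL_LONGTERM contract */
	pinned = pin_user_pages_fast(addr, nr_pages, FOLL_WRITE, pages);
	if (pinned > 0) {
		/* ... submit and complete the I/O against @pages ... */
		unpin_user_pages(pages, pinned);
	}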

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count
  2022-09-21 15:24   ` Jason Gunthorpe
@ 2022-09-21 23:45     ` Dan Williams
  2022-09-22  0:03       ` Alistair Popple
                         ` (2 more replies)
  0 siblings, 3 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-21 23:45 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams, Alistair Popple
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Jason Gunthorpe wrote:
> On Thu, Sep 15, 2022 at 08:36:43PM -0700, Dan Williams wrote:
> > The initial memremap_pages() implementation inherited the
> > __init_single_page() default of pages starting life with an elevated
> > reference count. This originally allowed for the page->pgmap pointer to
> > alias with the storage for page->lru since a page was only allowed to be
> > on an lru list when its reference count was zero.
> > 
> > Since then, 'struct page' definition cleanups have arranged for
> > dedicated space for the ZONE_DEVICE page metadata, and the
> > MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
> > page->_refcount transition to route the page to free_zone_device_page()
> > and not the core-mm page-free. With those cleanups in place and with
> > filesystem-dax and device-dax now converted to take and drop references
> > at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX
> > and MEMORY_DEVICE_GENERIC reference counts at 0.
> > 
> > MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE
> > pages start life at _refcount 1, so make that the default if
> > pgmap->init_mode is left at zero.
> 
> I'm shocked to read this - how does it make any sense?

I think what happened is that since memremap_pages() historically
produced pages with an elevated reference count that GPU drivers skipped
taking a reference on first allocation and just passed along an elevated
reference count page to the first user.

So either we keep that assumption or update all users to be prepared for
idle pages coming out of memremap_pages().

This is all in reaction to the "set_page_count(page, 1);" in
free_zone_device_page(). Which I am happy to get rid of but need help
from MEMORY_DEVICE_{PRIVATE,COHERENT} folks to react to
memremap_pages() starting all pages at reference count 0.

> dev_pagemap_ops->page_free() is only called on the 1->0 transition, so
> any driver which implements it must be expecting pages to have a 0
> refcount.
> 
> Looking around, only fsdax_pagemap_ops implements page_free()

Right.

> So, how does it work? Surely the instant the page map is created all
> the pages must be considered 'free', and after page_free() is called I
> would also expect the page to be considered free.

The GPU drivers need to increment reference counts when they hand out
the page rather than reuse the reference count that they get by default.

> How on earth can a free'd page have both a 0 and 1 refcount??

This is residual wonkiness from memremap_pages() handing out pages with
elevated reference counts at the outset.

> eg look at the simple hmm_test, it threads pages on to the
> mdevice->free_pages list immediately after memremap_pages and then
> again inside page_free() - it is completely wrong that they would have
> different refcounts while on the free_pages list.

I do not see any page_ref_inc() in that test, only put_page() so it is
assuming non-idle pages at the outset.

> I would expect that after the page is removed from the free_pages list
> it will have its refcount set to 1 to make it non-free, then it will go
> through the migration.
> 
> Alistair how should the refcounting be working here in hmm_test?
> 
> Jason



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-21 22:14           ` Dave Chinner
  2022-09-21 22:28             ` Jason Gunthorpe
@ 2022-09-22  0:02             ` Dan Williams
  2022-09-22  0:10               ` Jason Gunthorpe
  1 sibling, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-22  0:02 UTC (permalink / raw)
  To: Dave Chinner, Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Dave Chinner wrote:
> On Tue, Sep 20, 2022 at 09:44:52AM -0700, Dan Williams wrote:
> > Dave Chinner wrote:
> > > On Mon, Sep 19, 2022 at 09:11:48AM -0700, Dan Williams wrote:
> > > > Dave Chinner wrote:
> > > > > That all said, this really looks like a bit of a band-aid.
> > > > 
> > > > It definitely is since DAX is in this transitory state between doing
> > > > some activities page-less and others with page metadata. If DAX was
> > > > fully committed to behaving like a typical page then
> > > > unmap_mapping_range() would have already satisfied this reference
> > > > counting situation.
> > > > 
> > > > > I can't work out why we would ever have an actual layout lease
> > > > > here that needs breaking given they are file based and active files
> > > > > hold a reference to the inode. If we ever break that, then I suspect
> > > > > this change will cause major problems for anyone using pNFS with XFS
> > > > > as xfs_break_layouts() can end up waiting for NFS delegation
> > > > > revocation. This is something we should never be doing in inode
> > > > > eviction/memory reclaim.
> > > > > 
> > > > > Hence I have to ask why this lease break is being done
> > > > > unconditionally for all inodes, instead of only calling
> > > > > xfs_break_dax_layouts() directly on DAX enabled regular files?  I
> > > > > also wonder what exciting new system deadlocks this will create
> > > > > because BREAK_UNMAP_FINAL can essentially block forever waiting on
> > > > > dax mappings going away. If that DAX mapping reclaim requires memory
> > > > > allocations.....
> > > > 
> > > > There should be no memory allocations in the DAX mapping reclaim path.
> > > > Also, the page pins it waits for are precluded from being GUP_LONGTERM.
> > > 
> > > So if the task that holds the pin needs memory allocation before it
> > > can unpin the page to allow direct inode reclaim to make progress?
> > 
> > No, it couldn't, and I realize now that GUP_LONGTERM has nothing to do
> > with this hang since any GFP_KERNEL in a path that took a DAX page pin
> > path could run afoul of this need to wait.
> > 
> > So, this has me looking at invalidate_inodes() and iput_final(), where I
> > did not see the reclaim entanglement, and thinking DAX has the unique
> > requirement to make sure that no access to a page outlives the hosting
> > inode.
> > 
> > Not that I need to tell you, but to get my own thinking straight,
> > compare that to typical page cache as the pinner can keep a pinned
> > page-cache page as long as it wants even after it has been truncated.
> 
> Right, because the page pin prevents the page from being freed
> after the page references the page cache keeps have been released.
> 
> But page cache page != DAX page. The DAX page is a direct reference
> to the storage media, not a generic reference counted kernel page
> that the kernel will keep alive as long as there is a reference to
> it.
> 
> Hence for a DAX page, we have to revoke all access to the page
> before the controlling owner context is torn down, otherwise we have
> a use-after-free scenario at the storage media level. For a FSDAX
> file data page, that owner context is the inode...
> 
> > DAX needs to make sure that truncate_inode_pages() ceases all access to
> > the page synchronous with the truncate.
> 
> Yes, exactly.
> 
> >
> > The typical page-cache will
> > ensure that the next mapping of the file will get a new page if the page
> > previously pinned for that offset is still in use, DAX can not offer
> > that as the same page that was previously pinned is always used.
> 
> Yes, because the new DAX page lookup will return the original page
> in the storage media, not a newly instantiated page cache page.
> 
> > So I think this means something like this:
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 6462276dfdf0..ab16772b9a8d 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -784,6 +784,11 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
> >                         continue;
> >                 }
> >  
> > +               if (dax_inode_busy(inode)) {
> > +                       busy = 1;
> > +                       continue;
> > +               }
> 
> That this does more than a check (i.e. it runs whatever
> dax_zap_pages() does) means it cannot be run under the inode
> spinlock.

Here lockdep immediately screamed at me about what can be done under the
inode lock.

> As this is called from the block device code when a bdev is being
> removed (i.e. will only find a superblock and inodes to invalidate
> on hot-unplug), shouldn't this DAX mapping invalidation actually be
> handled by the pmem failure notification infrastructure we've just
> added for reflink?

Perhaps. I think the patch I have in the works now is simpler without
having to require ext4 to add notify_failure infrastructure, but that
may be where this ends up.

> 
> > +
> >                 inode->i_state |= I_FREEING;
> >                 inode_lru_list_del(inode);
> >                 spin_unlock(&inode->i_lock);
> > @@ -1733,6 +1738,8 @@ static void iput_final(struct inode *inode)
> >                 spin_unlock(&inode->i_lock);
> >  
> >                 write_inode_now(inode, 1);
> > +               if (IS_DAX(inode))
> > +                       dax_break_layouts(inode);
> >  
> >                 spin_lock(&inode->i_lock);
> >                 state = inode->i_state;
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 9eced4cc286e..e4a74ab310b5 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -3028,8 +3028,20 @@ extern struct inode * igrab(struct inode *);
> >  extern ino_t iunique(struct super_block *, ino_t);
> >  extern int inode_needs_sync(struct inode *inode);
> >  extern int generic_delete_inode(struct inode *inode);
> > +
> > +static inline bool dax_inode_busy(struct inode *inode)
> > +{
> > +       if (!IS_DAX(inode))
> > +               return false;
> > +
> > +       return dax_zap_pages(inode) != NULL;
> > +}
> > +
> >  static inline int generic_drop_inode(struct inode *inode)
> >  {
> > +       if (dax_inode_busy(inode))
> > +               return 0;
> > +
> >         return !inode->i_nlink || inode_unhashed(inode);
> >  }
> 
> I don't think that's valid. This can result in unreferenced unlinked
> inodes that should be torn down immediately being placed in the LRU
> and cached in memory and potentially not processed until there is
> future memory pressure or an unmount....
> 
> i.e. dropping the final reference on an unlinked inode needs to
> reclaim the inode immediately and allow the filesystem to free the
> inode, regardless of any other factor. Nothing should have an active
> reference to the inode or inode related data/metadata at this point
> in time.
> 
> Honestly, this still seems like a band-aid because it doesn't appear
> to address that something has pinned the storage media without
> having an active reference to the object that arbitrates access to
> that storage media (i.e. the inode and, by proxy, then filesystem).
> Where are these DAX page pins that don't require the pin holder to
> also hold active references to the filesystem objects coming from?

I do not have a practical exploit for this, only the observation that
iput_final() triggers truncate_inode_pages(), and the follow-on
assumption that *if* pages are still recorded in inode->i_pages then
those pages could have an elevated reference count.

Certainly GUP_LONGTERM can set up these "memory registration" scenarios,
but those are forbidden.

The scenario I cannot convince myself is impossible is a driver that
goes into interruptible sleep while operating on a page it got from
get_user_pages(). Where the eventual driver completion path will clean
up the pinned page, but the process that launched the I/O has already
exited and dropped all the inode references it was holding. That's not
buggy on its face since the driver still cleans up everything it was
handed, but if this type of disconnect happens (closing mappings and
files while I/O is in-flight) then iput_final() needs to check.

The block-I/O submission path seems to be uninterruptible to prevent
this type of disconnect, but who knows what other drivers do with their
pages.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count
  2022-09-21 23:45     ` Dan Williams
@ 2022-09-22  0:03       ` Alistair Popple
  2022-09-22  0:04       ` Jason Gunthorpe
  2022-09-22  0:13       ` John Hubbard
  2 siblings, 0 replies; 84+ messages in thread
From: Alistair Popple @ 2022-09-22  0:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4


Dan Williams <dan.j.williams@intel.com> writes:

> Jason Gunthorpe wrote:
>> On Thu, Sep 15, 2022 at 08:36:43PM -0700, Dan Williams wrote:
>> > The initial memremap_pages() implementation inherited the
>> > __init_single_page() default of pages starting life with an elevated
>> > reference count. This originally allowed for the page->pgmap pointer to
>> > alias with the storage for page->lru since a page was only allowed to be
>> > on an lru list when its reference count was zero.
>> >
>> > Since then, 'struct page' definition cleanups have arranged for
>> > dedicated space for the ZONE_DEVICE page metadata, and the
>> > MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
>> > page->_refcount transition to route the page to free_zone_device_page()
>> > and not the core-mm page-free. With those cleanups in place and with
>> > filesystem-dax and device-dax now converted to take and drop references
>> > at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX
>> > and MEMORY_DEVICE_GENERIC reference counts at 0.
>> >
>> > MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE
>> > pages start life at _refcount 1, so make that the default if
>> > pgmap->init_mode is left at zero.
>>
>> I'm shocked to read this - how does it make any sense?
>
> I think what happened is that since memremap_pages() historically
> produced pages with an elevated reference count that GPU drivers skipped
> taking a reference on first allocation and just passed along an elevated
> reference count page to the first user.
>
> So either we keep that assumption or update all users to be prepared for
> idle pages coming out of memremap_pages().
>
> This is all in reaction to the "set_page_count(page, 1);" in
> free_zone_device_page(). Which I am happy to get rid of but need help
> from MEMORY_DEVICE_{PRIVATE,COHERENT} folks to react to
> memremap_pages() starting all pages at reference count 0.

This is all rather good timing - this week I've been in the middle of
getting a series together which fixes this, among other things. So I'm
all for fixing it and can help with that - my motivation was that I
needed to be able to tell if a page is free or not with
get_page_unless_zero() etc., which doesn't work at the moment because
free device private/coherent pages have an elevated refcount.

 - Alistair

>> dev_pagemap_ops->page_free() is only called on the 1->0 transition, so
>> any driver which implements it must be expecting pages to have a 0
>> refcount.
>>
>> Looking around, only fsdax_pagemap_ops implements page_free()
>
> Right.
>
>> So, how does it work? Surely the instant the page map is created all
>> the pages must be considered 'free', and after page_free() is called I
>> would also expect the page to be considered free.
>
> The GPU drivers need to increment reference counts when they hand out
> the page rather than reuse the reference count that they get by default.
>
>> How on earth can a free'd page have both a 0 and 1 refcount??
>
> This is residual wonkiness from memremap_pages() handing out pages with
> elevated reference counts at the outset.
>
>> eg look at the simple hmm_test, it threads pages on to the
>> mdevice->free_pages list immediately after memremap_pages and then
>> again inside page_free() - it is completely wrong that they would have
>> different refcounts while on the free_pages list.
>
> I do not see any page_ref_inc() in that test, only put_page() so it is
> assuming non-idle pages at the outset.
>
>> I would expect that after the page is removed from the free_pages list
>> it will have its refcount set to 1 to make it non-free, then it will go
>> through the migration.
>>
>> Alistair how should the refcounting be working here in hmm_test?
>>
>> Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count
  2022-09-21 23:45     ` Dan Williams
  2022-09-22  0:03       ` Alistair Popple
@ 2022-09-22  0:04       ` Jason Gunthorpe
  2022-09-22  0:34         ` Dan Williams
  2022-09-22  0:13       ` John Hubbard
  2 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-22  0:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: Alistair Popple, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Wed, Sep 21, 2022 at 04:45:22PM -0700, Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Thu, Sep 15, 2022 at 08:36:43PM -0700, Dan Williams wrote:
> > > The initial memremap_pages() implementation inherited the
> > > __init_single_page() default of pages starting life with an elevated
> > > reference count. This originally allowed for the page->pgmap pointer to
> > > alias with the storage for page->lru since a page was only allowed to be
> > > on an lru list when its reference count was zero.
> > > 
> > > Since then, 'struct page' definition cleanups have arranged for
> > > dedicated space for the ZONE_DEVICE page metadata, and the
> > > MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
> > > page->_refcount transition to route the page to free_zone_device_page()
> > > and not the core-mm page-free. With those cleanups in place and with
> > > filesystem-dax and device-dax now converted to take and drop references
> > > at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX
> > > and MEMORY_DEVICE_GENERIC reference counts at 0.
> > > 
> > > MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE
> > > pages start life at _refcount 1, so make that the default if
> > > pgmap->init_mode is left at zero.
> > 
> > I'm shocked to read this - how does it make any sense?
> 
> I think what happened is that since memremap_pages() historically
> produced pages with an elevated reference count that GPU drivers skipped
> taking a reference on first allocation and just passed along an elevated
> reference count page to the first user.
> 
> So either we keep that assumption or update all users to be prepared for
> idle pages coming out of memremap_pages().
> 
> This is all in reaction to the "set_page_count(page, 1);" in
> free_zone_device_page(). Which I am happy to get rid of but need help
> from MEMORY_DEVICE_{PRIVATE,COHERENT} folks to react to
> memremap_pages() starting all pages at reference count 0.

But, but this is all racy, it can't do this:

+	if (pgmap->ops && pgmap->ops->page_free)
+		pgmap->ops->page_free(page);
 
 	/*
+	 * Reset the page count to the @init_mode value to prepare for
+	 * handing out the page again.
 	 */
+	if (pgmap->init_mode == INIT_PAGEMAP_BUSY)
+		set_page_count(page, 1);

after the fact! Something like that hmm_test has already threaded the
"freed" page into the free list via ops->page_free(), it can't have a
0 ref count and be on the free list, even temporarily :(

Maybe it needs to be re-ordered?
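
e.g. a sketch of the reordering, keeping the INIT_PAGEMAP_BUSY behavior
from the patch:

	/*
	 * Make the page look allocated again *before* the driver's
	 * page_free() callback can thread it onto a free list.
	 */
	if (pgmap->init_mode == INIT_PAGEMAP_BUSY)
		set_page_count(page, 1);

	if (pgmap->ops && pgmap->ops->page_free)
		pgmap->ops->page_free(page);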

> > How on earth can a free'd page have both a 0 and 1 refcount??
> 
> This is residual wonkiness from memremap_pages() handing out pages with
> elevated reference counts at the outset.

I think the answer to my question is the above troubled code where we
still set the page refcount back to 1 even in the page_free path, so
there is some consistency "a freed page may have a refcount of 1" for
the driver.

So, I guess this patch makes sense but I would put more noise around
INIT_PAGEMAP_BUSY (eg annotate every driver that is using it with the
explicit constant) and alert people that they need to fix their stuff
to get rid of it.
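
e.g. something like this in each affected driver (hypothetical snippet,
assuming the field added by this patch):

	/* TODO: convert to refcount-0 pages and drop INIT_PAGEMAP_BUSY */
	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
	devmem->pagemap.init_mode = INIT_PAGEMAP_BUSY;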

We should definitely try to fix hmm_test as well so people have a good
reference code to follow in fixing the other drivers :(

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-22  0:02             ` Dan Williams
@ 2022-09-22  0:10               ` Jason Gunthorpe
  0 siblings, 0 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-22  0:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Wed, Sep 21, 2022 at 05:02:37PM -0700, Dan Williams wrote:

> The scenario I cannot convince myself is impossible is a driver that
> goes into interruptible sleep while operating on a page it got from
> get_user_pages(). Where the eventual driver completion path will clean
> up the pinned page, but the process that launched the I/O has already
> exited and dropped all the inode references it was holding. That's not
> buggy on its face since the driver still cleans up everything it was
> handed, but if this type of disconnect happens (closing mappings and
> files while I/O is in-flight) then iput_final() needs to check.

I don't think you can make this argument. The inode you are talking
about is held in the vma of the mm_struct; it is not just a process
exit or interrupted sleep that could cause the vma to drop the inode
reference - any concurrent thread doing memunmap/close can destroy
the VMA, close the FD and release the inode.

So userspace can certainly create races where something has safely
done GUP/PUP !FOLL_LONGTERM but the VMA that sourced the page is
destroyed while the thread is still processing the post-GUP work.

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count
  2022-09-21 23:45     ` Dan Williams
  2022-09-22  0:03       ` Alistair Popple
  2022-09-22  0:04       ` Jason Gunthorpe
@ 2022-09-22  0:13       ` John Hubbard
  2 siblings, 0 replies; 84+ messages in thread
From: John Hubbard @ 2022-09-22  0:13 UTC (permalink / raw)
  To: Dan Williams, Jason Gunthorpe, Alistair Popple
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, linux-fsdevel, nvdimm, linux-xfs, linux-mm,
	linux-ext4

On 9/21/22 16:45, Dan Williams wrote:
>> I'm shocked to read this - how does it make any sense?
> 
> I think what happened is that since memremap_pages() historically
> produced pages with an elevated reference count that GPU drivers skipped
> taking a reference on first allocation and just passed along an elevated
> reference count page to the first user.
> 
> So either we keep that assumption or update all users to be prepared for
> idle pages coming out of memremap_pages().
> 
> This is all in reaction to the "set_page_count(page, 1);" in
> free_zone_device_page(). Which I am happy to get rid of but need from
> help from MEMORY_DEVICE_{PRIVATE,COHERENT} folks to react to
> memremap_pages() starting all pages at reference count 0.
> 

Just one tiny thing to contribute to this difficult story: I think that
we can make this slightly clearer by saying things like this:

"The device driver is the allocator for device pages. And allocators
keep pages at a refcount == 0, until they hand out the pages in response
to allocation requests."

To me at least, this makes it easier to see why pages have refcounts of
0 or > 0. In case that helps at all.


thanks,

-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-21 22:07         ` Jason Gunthorpe
@ 2022-09-22  0:14           ` Dan Williams
  2022-09-22  0:25             ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-22  0:14 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 02:38:56PM -0700, Dan Williams wrote:
> > Dan Williams wrote:
> > > Jason Gunthorpe wrote:
> > > > On Thu, Sep 15, 2022 at 08:36:07PM -0700, Dan Williams wrote:
> > > > > The percpu_ref in 'struct dev_pagemap' is used to coordinate active
> > > > > mappings of device-memory with the device-removal / unbind path. It
> > > > > enables the semantic that initiating device-removal (or
> > > > > device-driver-unbind) blocks new mapping and DMA attempts, and waits for
> > > > > mapping revocation or inflight DMA to complete.
> > > > 
> > > > This seems strange to me
> > > > 
> > > > The pagemap should be ref'd as long as the filesystem is mounted over
> > > > the dax. The ref should be incrd when the filesystem is mounted and
> > > > decrd when it is unmounted.
> > > > 
> > > > When the filesystem unmounts it should zap all the mappings (actually
> > > > I don't think you can even unmount a filesystem while mappings are
> > > > open) and wait for all page references to go to zero, then put the
> > > > final pagemap back.
> > > > 
> > > > The rule is nothing can touch page->pgmap while page->refcount == 0,
> > > > and if page->refcount != 0 then page->pgmap must be valid, without any
> > > > refcounting on the page map itself.
> > > > 
> > > > So, why do we need pgmap refcounting all over the place? It seems like
> > > > it only existed before because of the abuse of the page->refcount?
> > > 
> > > Recall that this percpu_ref is mirroring the same function as
> > > blk_queue_enter() whereby every new request is checking to make sure the
> > > device is still alive, or whether it has started exiting.
> > > 
> > > So pgmap 'live' reference taking in fs/dax.c allows the core to start
> > > failing fault requests once device teardown has started. It is a 'block
> > > new, and drain old' semantic.
> 
> It is weird this email never arrived for me..
> 
> I think that is all fine, but it would be much more logically
> expressed as a simple 'is pgmap alive' call before doing a new mapping
> than mucking with the refcount logic. Such a test could simply
> READ_ONCE a bool value in the pgmap struct.
> 
> Indeed, you could reasonably put such a liveness test at the moment
> every driver takes a 0 refcount struct page and turns it into a 1
> refcount struct page.

I could do it with a flag, but the reason to have pgmap->ref managed at
the page->_refcount 0 -> 1 and 1 -> 0 transitions is so at the end of
time memunmap_pages() can look at the one counter rather than scanning
and rescanning all the pages to see when they go to final idle.
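
Roughly, the accounting looks like this (helper names are made up for
illustration, and the final put actually belongs with
free_zone_device_page() as noted earlier):

	/* page->_refcount 0 -> 1, e.g. at mapping insertion */
	static void dax_page_pin(struct page *page)
	{
		if (page_ref_inc_return(page) == 1)
			percpu_ref_get(&page->pgmap->ref);
	}

	/* page->_refcount 1 -> 0: drop the pgmap reference, so that
	 * memunmap_pages() only needs to watch pgmap->ref go idle */
	static void dax_page_unpin(struct page *page)
	{
		if (page_ref_dec_and_test(page))
			percpu_ref_put(&page->pgmap->ref);
	}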

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 15/18] devdax: Use dax_insert_entry() + dax_delete_mapping_entry()
  2022-09-21 22:23       ` Jason Gunthorpe
@ 2022-09-22  0:15         ` Dan Williams
  0 siblings, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-22  0:15 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 08:48:22AM -0700, Dan Williams wrote:
> 
> > The fsdax core now manages pgmap references when servicing faults that
> > install new mappings, and elevates the page reference until it is
> > zapped. It coordinates with the VFS to make sure that all page
> > references are dropped before the hosting inode goes out of scope
> > (iput_final()).
> >
> > In order to delete the unnecessary pgmap reference taking in mm/gup.c
> > devdax needs to move to the same model.
> 
> I think this patch is more about making devdax and fsdax use the same
> set of functions and logic so that when it gets to patch 16/17 devdax
> doesn't break. That understanding matches the first paragraph, at
> least.
> 
> I would delete the remark about gup since it is really patch 17 that
> allows gup to be fixed by making it so that refcount == 0 means not to
> look at the pgmap (instead of refcount == 1 as is now) ?

Yeah, makes sense.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-22  0:14           ` Dan Williams
@ 2022-09-22  0:25             ` Jason Gunthorpe
  2022-09-22  2:17               ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-22  0:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Wed, Sep 21, 2022 at 05:14:34PM -0700, Dan Williams wrote:

> > Indeed, you could reasonably put such a liveness test at the moment
> > every driver takes a 0 refcount struct page and turns it into a 1
> > refcount struct page.
> 
> I could do it with a flag, but the reason to have pgmap->ref managed at
> the page->_refcount 0 -> 1 and 1 -> 0 transitions is so at the end of
> time memunmap_pages() can look at the one counter rather than scanning
> and rescanning all the pages to see when they go to final idle.

That makes some sense too, but the logical way to do that is to put some
counter along the page_free() path, and establish a 'make a page not
free' path that does the other side.

ie it should not be in DAX code, it should be all in common pgmap
code. The pgmap should never be freed while any page->refcount != 0
and that should be an intrinsic property of pgmap, not relying on
external parties.

Though I suspect if we were to look at performance it is probably
better to scan the memory on the unlikely case of pgmap removal than
to put more code in hot paths to keep track of refcounts.. It doesn't
need rescanning, just one sweep where it waits on every non-zero page
to become zero.
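
i.e. conceptually a single pass like this (pure sketch; the pfn iteration
helpers are the internal ones in mm/memremap.c and the wait mechanism is
hand-waved):

	for (pfn = pfn_first(pgmap, range_id); pfn < pfn_end(pgmap, range_id); pfn++) {
		struct page *page = pfn_to_page(pfn);

		/* wait for the last reference on this page to drain */
		while (page_ref_count(page))
			cond_resched();
	}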

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count
  2022-09-22  0:04       ` Jason Gunthorpe
@ 2022-09-22  0:34         ` Dan Williams
  2022-09-22  1:36           ` Alistair Popple
  0 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-22  0:34 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Alistair Popple, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 04:45:22PM -0700, Dan Williams wrote:
> > Jason Gunthorpe wrote:
> > > On Thu, Sep 15, 2022 at 08:36:43PM -0700, Dan Williams wrote:
> > > > The initial memremap_pages() implementation inherited the
> > > > __init_single_page() default of pages starting life with an elevated
> > > > reference count. This originally allowed for the page->pgmap pointer to
> > > > alias with the storage for page->lru since a page was only allowed to be
> > > > on an lru list when its reference count was zero.
> > > > 
> > > > Since then, 'struct page' definition cleanups have arranged for
> > > > dedicated space for the ZONE_DEVICE page metadata, and the
> > > > MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
> > > > page->_refcount transition to route the page to free_zone_device_page()
> > > > and not the core-mm page-free. With those cleanups in place and with
> > > > filesystem-dax and device-dax now converted to take and drop references
> > > > at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX
> > > > and MEMORY_DEVICE_GENERIC reference counts at 0.
> > > > 
> > > > MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE
> > > > pages start life at _refcount 1, so make that the default if
> > > > pgmap->init_mode is left at zero.
> > > 
> > > I'm shocked to read this - how does it make any sense?
> > 
> > I think what happened is that since memremap_pages() historically
> > produced pages with an elevated reference count that GPU drivers skipped
> > taking a reference on first allocation and just passed along an elevated
> > reference count page to the first user.
> > 
> > So either we keep that assumption or update all users to be prepared for
> > idle pages coming out of memremap_pages().
> > 
> > This is all in reaction to the "set_page_count(page, 1);" in
> > free_zone_device_page(). Which I am happy to get rid of but need help
> > from MEMORY_DEVICE_{PRIVATE,COHERENT} folks to react to
> > memremap_pages() starting all pages at reference count 0.
> 
> But, but this is all racy, it can't do this:
> 
> +	if (pgmap->ops && pgmap->ops->page_free)
> +		pgmap->ops->page_free(page);
>  
>  	/*
> +	 * Reset the page count to the @init_mode value to prepare for
> +	 * handing out the page again.
>  	 */
> +	if (pgmap->init_mode == INIT_PAGEMAP_BUSY)
> +		set_page_count(page, 1);
> 
> after the fact! Something like that hmm_test has already threaded the
> "freed" page into the free list via ops->page_free(), it can't have a
> 0 ref count and be on the free list, even temporarily :(
> 
> Maybe it needs to be re-ordered?
> 
> > > How on earth can a free'd page have both a 0 and 1 refcount??
> > 
> > This is residual wonkiness from memremap_pages() handing out pages with
> > elevated reference counts at the outset.
> 
> I think the answer to my question is the above troubled code where we
> still set the page refcount back to 1 even in the page_free path, so
> there is some consistency "a freed page may have a refcount of 1" for
> the driver.
> 
> So, I guess this patch makes sense but I would put more noise around
> INIT_PAGEMAP_BUSY (eg annotate every driver that is using it with the
> explicit constant) and alert people that they need to fix their stuff
> to get rid of it.

Sounds reasonable.

> We should definitely try to fix hmm_test as well so people have a good
> reference code to follow in fixing the other drivers :(

Oh, that's a good idea. I can probably fix that up and leave it to the
GPU driver folks to catch up with that example so we can kill off
INIT_PAGEMAP_BUSY.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count
  2022-09-22  0:34         ` Dan Williams
@ 2022-09-22  1:36           ` Alistair Popple
  2022-09-22  2:34             ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Alistair Popple @ 2022-09-22  1:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4


Dan Williams <dan.j.williams@intel.com> writes:

> Jason Gunthorpe wrote:
>> On Wed, Sep 21, 2022 at 04:45:22PM -0700, Dan Williams wrote:
>> > Jason Gunthorpe wrote:
>> > > On Thu, Sep 15, 2022 at 08:36:43PM -0700, Dan Williams wrote:
>> > > > The initial memremap_pages() implementation inherited the
>> > > > __init_single_page() default of pages starting life with an elevated
>> > > > reference count. This originally allowed for the page->pgmap pointer to
>> > > > alias with the storage for page->lru since a page was only allowed to be
>> > > > on an lru list when its reference count was zero.
>> > > >
>> > > > Since then, 'struct page' definition cleanups have arranged for
>> > > > dedicated space for the ZONE_DEVICE page metadata, and the
>> > > > MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
>> > > > page->_refcount transition to route the page to free_zone_device_page()
>> > > > and not the core-mm page-free. With those cleanups in place and with
>> > > > filesystem-dax and device-dax now converted to take and drop references
>> > > > at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX
>> > > > and MEMORY_DEVICE_GENERIC reference counts at 0.
>> > > >
>> > > > MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE
>> > > > pages start life at _refcount 1, so make that the default if
>> > > > pgmap->init_mode is left at zero.
>> > >
>> > > I'm shocked to read this - how does it make any sense?
>> >
>> > I think what happened is that since memremap_pages() historically
>> > produced pages with an elevated reference count that GPU drivers skipped
>> > taking a reference on first allocation and just passed along an elevated
>> > reference count page to the first user.
>> >
>> > So either we keep that assumption or update all users to be prepared for
>> > idle pages coming out of memremap_pages().
>> >
>> > This is all in reaction to the "set_page_count(page, 1);" in
>> > free_zone_device_page(). Which I am happy to get rid of but need help
>> > from MEMORY_DEVICE_{PRIVATE,COHERENT} folks to react to
>> > memremap_pages() starting all pages at reference count 0.
>>
>> But, but this is all racy, it can't do this:
>>
>> +	if (pgmap->ops && pgmap->ops->page_free)
>> +		pgmap->ops->page_free(page);
>>
>>  	/*
>> +	 * Reset the page count to the @init_mode value to prepare for
>> +	 * handing out the page again.
>>  	 */
>> +	if (pgmap->init_mode == INIT_PAGEMAP_BUSY)
>> +		set_page_count(page, 1);
>>
>> after the fact! Something like that hmm_test has already threaded the
>> "freed" page into the free list via ops->page_free(), it can't have a
>> 0 ref count and be on the free list, even temporarily :(
>>
>> Maybe it needs to be re-ordered?
>>
>> > > How on earth can a free'd page have both a 0 and 1 refcount??
>> >
>> > This is residual wonkiness from memremap_pages() handing out pages with
>> > elevated reference counts at the outset.
>>
>> I think the answer to my question is the above troubled code where we
>> still set the page refcount back to 1 even in the page_free path, so
>> there is some consistency "a freed page may have a refcount of 1" for
>> the driver.
>>
>> So, I guess this patch makes sense but I would put more noise around
>> INIT_PAGEMAP_BUSY (eg annotate every driver that is using it with the
>> explicit constant) and alert people that they need to fix their stuff
>> to get rid of it.
>
> Sounds reasonable.
>
>> We should definitely try to fix hmm_test as well so people have a good
>> reference code to follow in fixing the other drivers :(
>
> Oh, that's a good idea. I can probably fix that up and leave it to the
> GPU driver folks to catch up with that example so we can kill off
> INIT_PAGEMAP_BUSY.

I'm hoping to send my series that fixes up all drivers using device
coherent/private later this week or early next. So you could also just
wait for that and remove INIT_PAGEMAP_BUSY entirely.

 - Alistair

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-22  0:25             ` Jason Gunthorpe
@ 2022-09-22  2:17               ` Dan Williams
  2022-09-22 17:55                 ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-22  2:17 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 05:14:34PM -0700, Dan Williams wrote:
> 
> > > Indeed, you could reasonably put such a liveness test at the moment
> > > every driver takes a 0 refcount struct page and turns it into a 1
> > > refcount struct page.
> > 
> > I could do it with a flag, but the reason to have pgmap->ref managed at
> > the page->_refcount 0 -> 1 and 1 -> 0 transitions is so at the end of
> > time memunmap_pages() can look at the one counter rather than scanning
> > and rescanning all the pages to see when they go to final idle.
> 
> That makes some sense too, but the logical way to do that is to put some
> counter along the page_free() path, and establish a 'make a page not
> free' path that does the other side.
> 
> ie it should not be in DAX code, it should be all in common pgmap
> code. The pgmap should never be freed while any page->refcount != 0
> and that should be an intrinsic property of pgmap, not relying on
> external parties.

I just do not know where to put such intrinsics since there is nothing
today that requires going through the pgmap object to discover the pfn
and 'allocate' the page.

I think you may be asking to unify dax_direct_access() with pgmap
management where all dax_direct_access() users are required to take a
page reference if the pfn it returns is going to be used outside of
dax_read_lock().

In other words make dax_direct_access() the 'allocation' event that pins
the pgmap? I might be speaking a foreign language if you're not familiar
with the relationship of 'struct dax_device' to 'struct dev_pagemap'
instances. This is not the first time I have considered making them one
and the same.

> Though I suspect if we were to look at performance it is probably
> better to scan the memory on the unlikely case of pgmap removal than
> to put more code in hot paths to keep track of refcounts.. It doesn't
> need rescanning, just one sweep where it waits on every non-zero page
> to become zero.

True, on the way down nothing should be elevating page references, just
waiting for the last one to drain. I am just not sure that pgmap removal
is that unlikely going forward with things like the dax_kmem driver and
CXL Dynamic Capacity Devices where tearing down DAX devices happens.
Perhaps something to revisit if the pgmap percpu_ref ever shows up in
profiles.
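
If we went that way, the teardown-time sweep could be as simple as the
sketch below (illustrative only; the open-coded wait loop stands in for
whatever wait primitive we would actually use, and the helper name is
made up):

/*
 * Hypothetical single-sweep wait for all pages in a pgmap range to go
 * idle. Nothing should be taking new references once teardown starts,
 * so each page only needs to be waited on once.
 */
static void pgmap_wait_for_idle(unsigned long start_pfn, unsigned long nr_pages)
{
        unsigned long pfn;

        for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
                struct page *page = pfn_to_page(pfn);

                while (page_ref_count(page))
                        cond_resched();
        }
}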

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count
  2022-09-22  1:36           ` Alistair Popple
@ 2022-09-22  2:34             ` Dan Williams
  2022-09-26  6:17               ` Alistair Popple
  0 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-22  2:34 UTC (permalink / raw)
  To: Alistair Popple, Dan Williams
  Cc: Jason Gunthorpe, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Alistair Popple wrote:
> 
> Dan Williams <dan.j.williams@intel.com> writes:
> 
> > Jason Gunthorpe wrote:
> >> On Wed, Sep 21, 2022 at 04:45:22PM -0700, Dan Williams wrote:
> >> > Jason Gunthorpe wrote:
> >> > > On Thu, Sep 15, 2022 at 08:36:43PM -0700, Dan Williams wrote:
> >> > > > The initial memremap_pages() implementation inherited the
> >> > > > __init_single_page() default of pages starting life with an elevated
> >> > > > reference count. This originally allowed for the page->pgmap pointer to
> >> > > > alias with the storage for page->lru since a page was only allowed to be
> >> > > > on an lru list when its reference count was zero.
> >> > > >
> >> > > > Since then, 'struct page' definition cleanups have arranged for
> >> > > > dedicated space for the ZONE_DEVICE page metadata, and the
> >> > > > MEMORY_DEVICE_{PRIVATE,COHERENT} work has arranged for the 1 -> 0
> >> > > > page->_refcount transition to route the page to free_zone_device_page()
> >> > > > and not the core-mm page-free. With those cleanups in place and with
> >> > > > filesystem-dax and device-dax now converted to take and drop references
> >> > > > at map and truncate time, it is possible to start MEMORY_DEVICE_FS_DAX
> >> > > > and MEMORY_DEVICE_GENERIC reference counts at 0.
> >> > > >
> >> > > > MEMORY_DEVICE_{PRIVATE,COHERENT} still expect that their ZONE_DEVICE
> >> > > > pages start life at _refcount 1, so make that the default if
> >> > > > pgmap->init_mode is left at zero.
> >> > >
> >> > > I'm shocked to read this - how does it make any sense?
> >> >
> >> > I think what happened is that since memremap_pages() historically
> >> > produced pages with an elevated reference count, GPU drivers skipped
> >> > taking a reference on first allocation and just passed along an elevated
> >> > reference count page to the first user.
> >> >
> >> > So either we keep that assumption or update all users to be prepared for
> >> > idle pages coming out of memremap_pages().
> >> >
> >> > This is all in reaction to the "set_page_count(page, 1);" in
> >> > free_zone_device_page(). Which I am happy to get rid of but need
> >> > help from MEMORY_DEVICE_{PRIVATE,COHERENT} folks to react to
> >> > memremap_pages() starting all pages at reference count 0.
> >>
> >> But, but this is all racy, it can't do this:
> >>
> >> +	if (pgmap->ops && pgmap->ops->page_free)
> >> +		pgmap->ops->page_free(page);
> >>
> >>  	/*
> >> +	 * Reset the page count to the @init_mode value to prepare for
> >> +	 * handing out the page again.
> >>  	 */
> >> +	if (pgmap->init_mode == INIT_PAGEMAP_BUSY)
> >> +		set_page_count(page, 1);
> >>
> >> after the fact! Something like that hmm_test has already threaded the
> >> "freed" page into the free list via ops->page_free(), it can't have a
> >> 0 ref count and be on the free list, even temporarily :(
> >>
> >> Maybe it needs to be re-ordered?
> >>
> >> > > How on earth can a free'd page have both a 0 and 1 refcount??
> >> >
> >> > This is residual wonkiness from memremap_pages() handing out pages with
> >> > elevated reference counts at the outset.
> >>
> >> I think the answer to my question is the above troubled code where we
> >> still set the page refcount back to 1 even in the page_free path, so
> >> there is some consistency "a freed page may have a refcount of 1" for
> >> the driver.
> >>
> >> So, I guess this patch makes sense but I would put more noise around
> >> INIT_PAGEMAP_BUSY (eg annotate every driver that is using it with the
> >> explicit constant) and alert people that they need to fix their stuff
> >> to get rid of it.
> >
> > Sounds reasonable.
> >
> >> We should definitely try to fix hmm_test as well so people have a good
> >> reference code to follow in fixing the other drivers :(
> >
> > Oh, that's a good idea. I can probably fix that up and leave it to the
> > GPU driver folks to catch up with that example so we can kill off
> > INIT_PAGEMAP_BUSY.
> 
> I'm hoping to send my series that fixes up all drivers using device
> coherent/private later this week or early next. So you could also just
> wait for that and remove INIT_PAGEMAP_BUSY entirely.

Oh, perfect, thanks!

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-22  2:17               ` Dan Williams
@ 2022-09-22 17:55                 ` Jason Gunthorpe
  2022-09-22 21:54                   ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-22 17:55 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Wed, Sep 21, 2022 at 07:17:40PM -0700, Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Wed, Sep 21, 2022 at 05:14:34PM -0700, Dan Williams wrote:
> > 
> > > > Indeed, you could reasonably put such a liveness test at the moment
> > > > every driver takes a 0 refcount struct page and turns it into a 1
> > > > refcount struct page.
> > > 
> > > I could do it with a flag, but the reason to have pgmap->ref managed at
> > > the page->_refcount 0 -> 1 and 1 -> 0 transitions is so at the end of
> > > time memunmap_pages() can look at the one counter rather than scanning
> > > and rescanning all the pages to see when they go to final idle.
> > 
> > That makes some sense too, but the logical way to do that is to put some
> > counter along the page_free() path, and establish a 'make a page not
> > free' path that does the other side.
> > 
> > ie it should not be in DAX code, it should be all in common pgmap
> > code. The pgmap should never be freed while any page->refcount != 0
> > and that should be an intrinsic property of pgmap, not relying on
> > external parties.
> 
> I just do not know where to put such intrinsics since there is nothing
> today that requires going through the pgmap object to discover the pfn
> and 'allocate' the page.

I think that is just a new API that wrappers the set refcount = 1,
percpu refcount and maybe building appropriate compound pages too.

Eg maybe something like:

  struct folio *pgmap_alloc_folios(pgmap, start, length)

And you get back maximally sized allocated folios with refcount = 1
that span the requested range.
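
Roughly, I would expect the single-folio flavour of that wrapper to look
something like the sketch below (the name and details are made up, it is
only meant to show the pairing with ops->page_free()):

/*
 * Hypothetical sketch: hand out one idle folio from a live pgmap with
 * refcount = 1. The matching ops->page_free() happens when the final
 * reference is dropped.
 */
struct folio *pgmap_alloc_folio(struct dev_pagemap *pgmap, unsigned long pfn,
				unsigned int order)
{
        struct page *page = pfn_to_page(pfn);

        if (!percpu_ref_tryget_live(&pgmap->ref))
                return NULL;            /* pgmap is being torn down */

        if (order)
                prep_compound_page(page, order);
        set_page_count(page, 1);        /* the page leaves the "free" state */
        return page_folio(page);
}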

> In other words make dax_direct_access() the 'allocation' event that pins
> the pgmap? I might be speaking a foreign language if you're not familiar
> with the relationship of 'struct dax_device' to 'struct dev_pagemap'
> instances. This is not the first time I have considered making them one
> and the same.

I don't know enough about dax, so yes very foreign :)

I'm thinking broadly about how to make pgmap usable to all the other
drivers in a safe and robust way that makes some kind of logical sense.

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-22 17:55                 ` Jason Gunthorpe
@ 2022-09-22 21:54                   ` Dan Williams
  2022-09-23  1:36                     ` Dave Chinner
  2022-09-23 13:24                     ` Jason Gunthorpe
  0 siblings, 2 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-22 21:54 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 07:17:40PM -0700, Dan Williams wrote:
> > Jason Gunthorpe wrote:
> > > On Wed, Sep 21, 2022 at 05:14:34PM -0700, Dan Williams wrote:
> > > 
> > > > > Indeed, you could reasonably put such a liveness test at the moment
> > > > > every driver takes a 0 refcount struct page and turns it into a 1
> > > > > refcount struct page.
> > > > 
> > > > I could do it with a flag, but the reason to have pgmap->ref managed at
> > > > the page->_refcount 0 -> 1 and 1 -> 0 transitions is so at the end of
> > > > time memunmap_pages() can look at the one counter rather than scanning
> > > > and rescanning all the pages to see when they go to final idle.
> > > 
> > > That makes some sense too, but the logical way to do that is to put some
> > > counter along the page_free() path, and establish a 'make a page not
> > > free' path that does the other side.
> > > 
> > > ie it should not be in DAX code, it should be all in common pgmap
> > > code. The pgmap should never be freed while any page->refcount != 0
> > > and that should be an intrinsic property of pgmap, not relying on
> > > external parties.
> > 
> > I just do not know where to put such intrinsics since there is nothing
> > today that requires going through the pgmap object to discover the pfn
> > and 'allocate' the page.
> 
> I think that is just a new API that wrappers the set refcount = 1,
> percpu refcount and maybe building appropriate compound pages too.
> 
> Eg maybe something like:
> 
>   struct folio *pgmap_alloc_folios(pgmap, start, length)
> 
> And you get back maximally sized allocated folios with refcount = 1
> that span the requested range.
> 
> > In other words make dax_direct_access() the 'allocation' event that pins
> > the pgmap? I might be speaking a foreign language if you're not familiar
> > with the relationship of 'struct dax_device' to 'struct dev_pagemap'
> > instances. This is not the first time I have considered making them one
> > and the same.
> 
> I don't know enough about dax, so yes very foreign :)
> 
> I'm thinking broadly about how to make pgmap usable to all the other
> drivers in a safe and robust way that makes some kind of logical sense.

I think the API should be pgmap_folio_get() because, at least for DAX,
the memory is already allocated. The 'allocator' for fsdax is the
filesystem block allocator, and pgmap_folio_get() grants access to a
folio in the pgmap by a pfn that the block allocator knows about. If the
GPU use case wants to wrap an allocator around that they can, but the
fundamental requirement is to check if the pgmap is dead and, if not, elevate
the page reference.

So something like:

/**
 * pgmap_get_folio() - reference a folio in a live @pgmap by @pfn
 * @pgmap: live pgmap instance, caller ensures this does not race @pgmap death
 * @pfn: page frame number covered by @pgmap
 */
struct folio *pgmap_get_folio(struct dev_pagemap *pgmap, unsigned long pfn)
{
        struct page *page;
        
        VM_WARN_ONCE(pgmap != xa_load(&pgmap_array, pfn), "pfn not covered by @pgmap");
        
        if (WARN_ONCE(percpu_ref_is_dying(&pgmap->ref), "pgmap is dying"))
                return NULL;
        page = pfn_to_page(pfn);
        return page_folio(page);
}

This does not create compound folios, that needs to be coordinated with
the caller and likely needs an explicit

    pgmap_construct_folio(pgmap, pfn, order)

...call that can be done while holding locks against operations that
will cause the folio to be broken down.
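
For illustration, a caller in the fault path might pair it with
dax_read_lock() roughly like this (dax_grab_folio() is a made-up name,
not something in the series):

/*
 * Illustrative only: take the 0 -> 1 _refcount transition on a folio
 * from a live pgmap, with dax_read_lock() keeping the pgmap from dying
 * underneath us.
 */
static struct folio *dax_grab_folio(struct dev_pagemap *pgmap, unsigned long pfn)
{
        struct folio *folio;
        int id;

        id = dax_read_lock();
        folio = pgmap_get_folio(pgmap, pfn);
        if (folio)
                folio_ref_inc(folio);   /* 0 -> 1: page is now in use */
        dax_read_unlock(id);

        return folio;
}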

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-21 22:28             ` Jason Gunthorpe
@ 2022-09-23  0:18               ` Dave Chinner
  2022-09-23  0:41                 ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Chinner @ 2022-09-23  0:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Wed, Sep 21, 2022 at 07:28:51PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 22, 2022 at 08:14:16AM +1000, Dave Chinner wrote:
> 
> > Where are these DAX page pins that don't require the pin holder to
> > also hold active references to the filesystem objects coming from?
> 
> O_DIRECT and things like it.

O_DIRECT IO to a file holds a reference to a struct file which holds
an active reference to the struct inode. Hence you can't reclaim an
inode while an O_DIRECT IO is in progress to it. 

Similarly, file-backed pages pinned from user vmas have the inode
pinned by the VMA having a reference to the struct file passed to
them when they are instantiated. Hence anything using mmap() to pin
file-backed pages (i.e. applications using FSDAX access from
userspace) should also have a reference to the inode that prevents
the inode from being reclaimed.

So I'm at a loss to understand what "things like it" might actually
mean. Can you actually describe a situation where we actually permit
(even temporarily) these use-after-free scenarios?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-23  0:18               ` Dave Chinner
@ 2022-09-23  0:41                 ` Dan Williams
  2022-09-23  2:10                   ` Dave Chinner
  0 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-23  0:41 UTC (permalink / raw)
  To: Dave Chinner, Jason Gunthorpe
  Cc: Dan Williams, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Dave Chinner wrote:
> On Wed, Sep 21, 2022 at 07:28:51PM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 22, 2022 at 08:14:16AM +1000, Dave Chinner wrote:
> > 
> > > Where are these DAX page pins that don't require the pin holder to
> > > also hold active references to the filesystem objects coming from?
> > 
> > O_DIRECT and things like it.
> 
> O_DIRECT IO to a file holds a reference to a struct file which holds
> an active reference to the struct inode. Hence you can't reclaim an
> inode while an O_DIRECT IO is in progress to it. 
> 
> Similarly, file-backed pages pinned from user vmas have the inode
> pinned by the VMA having a reference to the struct file passed to
> them when they are instantiated. Hence anything using mmap() to pin
> file-backed pages (i.e. applications using FSDAX access from
> userspace) should also have a reference to the inode that prevents
> the inode from being reclaimed.
> 
> So I'm at a loss to understand what "things like it" might actually
> mean. Can you actually describe a situation where we actually permit
> (even temporarily) these use-after-free scenarios?

Jason mentioned a scenario here:

https://lore.kernel.org/all/YyuoE8BgImRXVkkO@nvidia.com/

Multi-thread process where thread1 does open(O_DIRECT)+mmap()+read() and
thread2 does memunmap()+close() while the read() is inflight.

Sounds plausible to me, but I have not tried to trigger it with a focus
test.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-22 21:54                   ` Dan Williams
@ 2022-09-23  1:36                     ` Dave Chinner
  2022-09-23  2:01                       ` Dan Williams
  2022-09-23 13:24                     ` Jason Gunthorpe
  1 sibling, 1 reply; 84+ messages in thread
From: Dave Chinner @ 2022-09-23  1:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Thu, Sep 22, 2022 at 02:54:42PM -0700, Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Wed, Sep 21, 2022 at 07:17:40PM -0700, Dan Williams wrote:
> > > Jason Gunthorpe wrote:
> > > > On Wed, Sep 21, 2022 at 05:14:34PM -0700, Dan Williams wrote:
> > > > 
> > > > > > Indeed, you could reasonably put such a liveness test at the moment
> > > > > > every driver takes a 0 refcount struct page and turns it into a 1
> > > > > > refcount struct page.
> > > > > 
> > > > > I could do it with a flag, but the reason to have pgmap->ref managed at
> > > > > the page->_refcount 0 -> 1 and 1 -> 0 transitions is so at the end of
> > > > > time memunmap_pages() can look at the one counter rather than scanning
> > > > > and rescanning all the pages to see when they go to final idle.
> > > > 
> > > > That makes some sense too, but the logical way to do that is to put some
> > > > counter along the page_free() path, and establish a 'make a page not
> > > > free' path that does the other side.
> > > > 
> > > > ie it should not be in DAX code, it should be all in common pgmap
> > > > code. The pgmap should never be freed while any page->refcount != 0
> > > > and that should be an intrinsic property of pgmap, not relying on
> > > > external parties.
> > > 
> > > I just do not know where to put such intrinsics since there is nothing
> > > today that requires going through the pgmap object to discover the pfn
> > > and 'allocate' the page.
> > 
> > I think that is just a new API that wrappers the set refcount = 1,
> > percpu refcount and maybe building appropriate compound pages too.
> > 
> > Eg maybe something like:
> > 
> >   struct folio *pgmap_alloc_folios(pgmap, start, length)
> > 
> > And you get back maximally sized allocated folios with refcount = 1
> > that span the requested range.
> > 
> > > In other words make dax_direct_access() the 'allocation' event that pins
> > > the pgmap? I might be speaking a foreign language if you're not familiar
> > > with the relationship of 'struct dax_device' to 'struct dev_pagemap'
> > > instances. This is not the first time I have considered making them one
> > > and the same.
> > 
> > I don't know enough about dax, so yes very foreign :)
> > 
> > I'm thinking broadly about how to make pgmap usable to all the other
> > drivers in a safe and robust way that makes some kind of logical sense.
> 
> I think the API should be pgmap_folio_get() because, at least for DAX,
> the memory is already allocated. The 'allocator' for fsdax is the
> filesystem block allocator, and pgmap_folio_get() grants access to a

No, the "allocator" for fsdax is the inode iomap interface, not the
filesystem block allocator. The filesystem block allocator is only
involved in iomapping if we have to allocate a new mapping for a
given file offset.

A better name for this is "arbiter", not allocator.  To get an
active mapping of the DAX pages backing a file, we need to ask the
inode iomap subsystem to *map a file offset* and it will return
kaddr and/or pfns for the backing store the file offset maps to.

IOWs, for FSDAX, access to the backing store (i.e. the physical pages) is
arbitrated by the *inode*, not the filesystem allocator or the dax
device. Hence if a subsystem needs to pin the backing store for some
use, it must first ensure that it holds an inode reference (direct
or indirect) for that range of the backing store that spans the
life of the pin. When the pin is done, it can tear down the mappings
it was using and then the inode reference can be released.

This ensures that any racing unlink of the inode will not result in
the backing store being freed from under the application that has a
pin. It will prevent the inode from being reclaimed and so
potentially accessing stale or freed in-memory structures. And it
will prevent the filesystem from being unmounted while the
application using FSDAX access is still actively using that
functionality even if it's already closed all its fds....
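
In pseudo-C, the ordering I'm describing is roughly the following; only
igrab()/iput() are real interfaces here, the numbered steps are
placeholders for whatever the pinning subsystem actually does:

/* sketch only */
static int pin_fsdax_backing(struct inode *inode, loff_t pos, size_t len)
{
        if (!igrab(inode))
                return -ESTALE;         /* inode is already being torn down */

        /* 1. ask the inode's iomap interface to map [pos, pos + len) */
        /* 2. elevate the refcount on each backing page - the pin itself */
        return 0;
}

static void unpin_fsdax_backing(struct inode *inode)
{
        /* drop the page refcounts taken above */
        iput(inode);                    /* only now can the inode be reclaimed */
}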

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-23  1:36                     ` Dave Chinner
@ 2022-09-23  2:01                       ` Dan Williams
  0 siblings, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-23  2:01 UTC (permalink / raw)
  To: Dave Chinner, Dan Williams
  Cc: Jason Gunthorpe, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Dave Chinner wrote:
> On Thu, Sep 22, 2022 at 02:54:42PM -0700, Dan Williams wrote:
> > Jason Gunthorpe wrote:
> > > On Wed, Sep 21, 2022 at 07:17:40PM -0700, Dan Williams wrote:
> > > > Jason Gunthorpe wrote:
> > > > > On Wed, Sep 21, 2022 at 05:14:34PM -0700, Dan Williams wrote:
> > > > > 
> > > > > > > Indeed, you could reasonably put such a liveness test at the moment
> > > > > > > every driver takes a 0 refcount struct page and turns it into a 1
> > > > > > > refcount struct page.
> > > > > > 
> > > > > > I could do it with a flag, but the reason to have pgmap->ref managed at
> > > > > > the page->_refcount 0 -> 1 and 1 -> 0 transitions is so at the end of
> > > > > > time memunmap_pages() can look at the one counter rather than scanning
> > > > > > and rescanning all the pages to see when they go to final idle.
> > > > > 
> > > > > That makes some sense too, but the logical way to do that is to put some
> > > > > counter along the page_free() path, and establish a 'make a page not
> > > > > free' path that does the other side.
> > > > > 
> > > > > ie it should not be in DAX code, it should be all in common pgmap
> > > > > code. The pgmap should never be freed while any page->refcount != 0
> > > > > and that should be an intrinsic property of pgmap, not relying on
> > > > > external parties.
> > > > 
> > > > I just do not know where to put such intrinsics since there is nothing
> > > > today that requires going through the pgmap object to discover the pfn
> > > > and 'allocate' the page.
> > > 
> > > I think that is just a new API that wrappers the set refcount = 1,
> > > percpu refcount and maybe building appropriate compound pages too.
> > > 
> > > Eg maybe something like:
> > > 
> > >   struct folio *pgmap_alloc_folios(pgmap, start, length)
> > > 
> > > And you get back maximally sized allocated folios with refcount = 1
> > > that span the requested range.
> > > 
> > > > In other words make dax_direct_access() the 'allocation' event that pins
> > > > the pgmap? I might be speaking a foreign language if you're not familiar
> > > > with the relationship of 'struct dax_device' to 'struct dev_pagemap'
> > > > instances. This is not the first time I have considered making them one
> > > > and the same.
> > > 
> > > I don't know enough about dax, so yes very foreign :)
> > > 
> > > I'm thinking broadly about how to make pgmap usable to all the other
> > > drivers in a safe and robust way that makes some kind of logical sense.
> > 
> > I think the API should be pgmap_folio_get() because, at least for DAX,
> > the memory is already allocated. The 'allocator' for fsdax is the
> > filesystem block allocator, and pgmap_folio_get() grants access to a
> 
> No, the "allocator" for fsdax is the inode iomap interface, not the
> filesystem block allocator. The filesystem block allocator is only
> involved in iomapping if we have to allocate a new mapping for a
> given file offset.
> 
> A better name for this is "arbiter", not allocator.  To get an
> active mapping of the DAX pages backing a file, we need to ask the
> inode iomap subsystem to *map a file offset* and it will return
> kaddr and/or pfns for the backing store the file offset maps to.
> 
> IOWs, for FSDAX, access to the backing store (i.e. the physical pages) is
> arbitrated by the *inode*, not the filesystem allocator or the dax
> device. Hence if a subsystem needs to pin the backing store for some
> use, it must first ensure that it holds an inode reference (direct
> or indirect) for that range of the backing store that spans the
> life of the pin. When the pin is done, it can tear down the mappings
> it was using and then the inode reference can be released.
> 
> This ensures that any racing unlink of the inode will not result in
> the backing store being freed from under the application that has a
> pin. It will prevent the inode from being reclaimed and so
> potentially accessing stale or freed in-memory structures. And it
> will prevent the filesystem from being unmounted while the
> application using FSDAX access is still actively using that
> functionality even if it's already closed all its fds....

Sounds so simple when you put it that way. I'll give it a shot and stop
the gymnastics of trying to get in front of truncate_inode_pages_final()
with a 'dax break layouts', just hold it off until final unpin.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-23  0:41                 ` Dan Williams
@ 2022-09-23  2:10                   ` Dave Chinner
  2022-09-23  9:38                     ` Jan Kara
  2022-09-23 12:39                     ` Jason Gunthorpe
  0 siblings, 2 replies; 84+ messages in thread
From: Dave Chinner @ 2022-09-23  2:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Thu, Sep 22, 2022 at 05:41:08PM -0700, Dan Williams wrote:
> Dave Chinner wrote:
> > On Wed, Sep 21, 2022 at 07:28:51PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 22, 2022 at 08:14:16AM +1000, Dave Chinner wrote:
> > > 
> > > > Where are these DAX page pins that don't require the pin holder to
> > > > also hold active references to the filesystem objects coming from?
> > > 
> > > O_DIRECT and things like it.
> > 
> > O_DIRECT IO to a file holds a reference to a struct file which holds
> > an active reference to the struct inode. Hence you can't reclaim an
> > inode while an O_DIRECT IO is in progress to it. 
> > 
> > Similarly, file-backed pages pinned from user vmas have the inode
> > pinned by the VMA having a reference to the struct file passed to
> > them when they are instantiated. Hence anything using mmap() to pin
> > file-backed pages (i.e. applications using FSDAX access from
> > userspace) should also have a reference to the inode that prevents
> > the inode from being reclaimed.
> > 
> > So I'm at a loss to understand what "things like it" might actually
> > mean. Can you actually describe a situation where we actually permit
> > (even temporarily) these use-after-free scenarios?
> 
> Jason mentioned a scenario here:
> 
> https://lore.kernel.org/all/YyuoE8BgImRXVkkO@nvidia.com/
> 
> Multi-thread process where thread1 does open(O_DIRECT)+mmap()+read() and
> thread2 does memunmap()+close() while the read() is inflight.

And, ah, what production application does this and expects to be
able to process the result of the read() operation without getting a
SEGV?

There's a huge difference between an unlikely scenario which we need
to work (such as O_DIRECT IO to/from a mmap() buffer at a different
offset on the same file) and this sort of scenario where even if we
handle it correctly, the application can't do anything with the
result and will crash immediately....

> Sounds plausible to me, but I have not tried to trigger it with a focus
> test.

If there really are applications this .... broken, then it's not the
responsibility of the filesystem to paper over the low level page
reference tracking issues that cause it.

i.e. The underlying problem here is that memunmap() frees the VMA
while there are still active task-based references to the pages in
that VMA. IOWs, the VMA should not be torn down until the O_DIRECT
read has released all the references to the pages mapped into the
task address space.

This just doesn't seem like an issue that we should be trying to fix
by adding band-aids to the inode life-cycle management.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-23  2:10                   ` Dave Chinner
@ 2022-09-23  9:38                     ` Jan Kara
  2022-09-23 23:06                       ` Dan Williams
  2022-09-25 23:54                       ` Dave Chinner
  2022-09-23 12:39                     ` Jason Gunthorpe
  1 sibling, 2 replies; 84+ messages in thread
From: Jan Kara @ 2022-09-23  9:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dan Williams, Jason Gunthorpe, akpm, Matthew Wilcox, Jan Kara,
	Darrick J. Wong, Christoph Hellwig, John Hubbard, linux-fsdevel,
	nvdimm, linux-xfs, linux-mm, linux-ext4

On Fri 23-09-22 12:10:12, Dave Chinner wrote:
> On Thu, Sep 22, 2022 at 05:41:08PM -0700, Dan Williams wrote:
> > Dave Chinner wrote:
> > > On Wed, Sep 21, 2022 at 07:28:51PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Sep 22, 2022 at 08:14:16AM +1000, Dave Chinner wrote:
> > > > 
> > > > > Where are these DAX page pins that don't require the pin holder to
> > > > > also hold active references to the filesystem objects coming from?
> > > > 
> > > > O_DIRECT and things like it.
> > > 
> > > O_DIRECT IO to a file holds a reference to a struct file which holds
> > > an active reference to the struct inode. Hence you can't reclaim an
> > > inode while an O_DIRECT IO is in progress to it. 
> > > 
> > > Similarly, file-backed pages pinned from user vmas have the inode
> > > pinned by the VMA having a reference to the struct file passed to
> > > them when they are instantiated. Hence anything using mmap() to pin
> > > file-backed pages (i.e. applications using FSDAX access from
> > > userspace) should also have a reference to the inode that prevents
> > > the inode from being reclaimed.
> > > 
> > > So I'm at a loss to understand what "things like it" might actually
> > > mean. Can you actually describe a situation where we actually permit
> > > (even temporarily) these use-after-free scenarios?
> > 
> > Jason mentioned a scenario here:
> > 
> > https://lore.kernel.org/all/YyuoE8BgImRXVkkO@nvidia.com/
> > 
> > Multi-thread process where thread1 does open(O_DIRECT)+mmap()+read() and
> > thread2 does memunmap()+close() while the read() is inflight.
> 
> And, ah, what production application does this and expects to be
> able to process the result of the read() operation without getting a
> SEGV?
> 
> There's a huge difference between an unlikely scenario which we need
> to work (such as O_DIRECT IO to/from a mmap() buffer at a different
> offset on the same file) and this sort of scenario where even if we
> handle it correctly, the application can't do anything with the
> result and will crash immediately....

I'm not sure I fully follow what we are concerned about here. As you've
written above direct IO holds reference to the inode until it is completed
(through kiocb->file->inode chain). So direct IO should be safe?

I'd be more worried about stuff like vmsplice() that can add file pages
into pipe without holding inode alive in any way and keeping them there for
arbitrarily long time. Didn't we want to add FOLL_LONGTERM to gup executed
from vmsplice() to avoid issues like this?

> > Sounds plausible to me, but I have not tried to trigger it with a focus
> > test.
> 
> If there really are applications this .... broken, then it's not the
> responsibility of the filesystem to paper over the low level page
> reference tracking issues that cause it.
> 
> i.e. The underlying problem here is that memunmap() frees the VMA
> while there are still active task-based references to the pages in
> that VMA. IOWs, the VMA should not be torn down until the O_DIRECT
> read has released all the references to the pages mapped into the
> task address space.
> 
> This just doesn't seem like an issue that we should be trying to fix
> by adding band-aids to the inode life-cycle management.

I agree that freeing VMA while there are pinned pages is ... inconvenient.
But that is just how gup works since the beginning - the moment you have
struct page reference, you completely forget about the mapping you've used
to get to the page. So anything can happen with the mapping after that
moment. And in case of pages mapped by multiple processes I can easily see
that one of the processes decides to unmap the page (and it may well be
that it was the initial process that acquired the page references) while others
still keep accessing the page using page references stored in some internal
structure (RDMA anyone?). I think it will be rather difficult to come up
with some scheme keeping VMA alive while there are pages pinned without
regressing userspace which over the years became very much tailored to the
peculiar gup behavior.

I can imagine we would keep *inode* referenced while there are its pages
pinned. That should not be that difficult, but at least in a naive
implementation that would put rather heavy stress on the inode refcount under
some loads so I don't think that's useful either.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-23  2:10                   ` Dave Chinner
  2022-09-23  9:38                     ` Jan Kara
@ 2022-09-23 12:39                     ` Jason Gunthorpe
  2022-09-26  0:34                       ` Dave Chinner
  1 sibling, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-23 12:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dan Williams, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Fri, Sep 23, 2022 at 12:10:12PM +1000, Dave Chinner wrote:

> > Jason mentioned a scenario here:
> > 
> > https://lore.kernel.org/all/YyuoE8BgImRXVkkO@nvidia.com/
> > 
> > Multi-thread process where thread1 does open(O_DIRECT)+mmap()+read() and
> > thread2 does memunmap()+close() while the read() is inflight.
> 
> And, ah, what production application does this and expects to be
> able to process the result of the read() operation without getting a
> SEGV?

The read() will do GUP and get a pined page, next the memunmap()/close
will release the inode the VMA was holding open. The read() FD is NOT
a DAX FD.

We are now UAFing the DAX storage. There is no SEGV.

It is not about sane applications, it is about kernel security against
hostile userspace.

> i.e. The underlying problem here is that memunmap() frees the VMA
> while there are still active task-based references to the pages in
> that VMA. IOWs, the VMA should not be torn down until the O_DIRECT
> read has released all the references to the pages mapped into the
> task address space.

This is Jan's suggestion, I think we are still far from being able to
do that for O_DIRECT paths.

Even if you fix the close() this way, doesn't truncate still have the
same problem?

At the end of the day the rule is that a DAX page must not be re-used until
its refcount is 0. At some point the FS has to wait for that.

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-22 21:54                   ` Dan Williams
  2022-09-23  1:36                     ` Dave Chinner
@ 2022-09-23 13:24                     ` Jason Gunthorpe
  2022-09-23 16:29                       ` Dan Williams
  1 sibling, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-23 13:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Thu, Sep 22, 2022 at 02:54:42PM -0700, Dan Williams wrote:

> > I'm thinking broadly about how to make pgmap usable to all the other
> > drivers in a safe and robust way that makes some kind of logical sense.
> 
> I think the API should be pgmap_folio_get() because, at least for DAX,
> the memory is already allocated. 

I would pick a name that has some logical connection to
ops->page_free()

This function is starting a pairing where once it completes page_free
will eventually be called.

> /**
>  * pgmap_get_folio() - reference a folio in a live @pgmap by @pfn
>  * @pgmap: live pgmap instance, caller ensures this does not race @pgmap death
>  * @pfn: page frame number covered by @pgmap
>  */
> struct folio *pgmap_get_folio(struct dev_pagemap *pgmap, unsigned long pfn)
> {
>         struct page *page;
>         
>         VM_WARN_ONCE(pgmap != xa_load(&pgmap_array, PHYS_PFN(phys)));
>
>         if (WARN_ONCE(percpu_ref_is_dying(&pgmap->ref)))
>                 return NULL;

This shouldn't be a WARN?

>         page = pfn_to_page(pfn);
>         return page_folio(page);
> }

Yeah, makes sense to me, but I would do a len as well to amortize the
cost of all these checks..

> This does not create compound folios, that needs to be coordinated with
> the caller and likely needs an explicit

Does it? What situations do you think the caller needs to coordinate
the folio size? Caller should call the function for each logical unit
of storage it wants to allocate from the pgmap..

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-23 13:24                     ` Jason Gunthorpe
@ 2022-09-23 16:29                       ` Dan Williams
  2022-09-23 17:42                         ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-23 16:29 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Jason Gunthorpe wrote:
> On Thu, Sep 22, 2022 at 02:54:42PM -0700, Dan Williams wrote:
> 
> > > I'm thinking broadly about how to make pgmap usable to all the other
> > > drivers in a safe and robust way that makes some kind of logical sense.
> > 
> > I think the API should be pgmap_folio_get() because, at least for DAX,
> > the memory is already allocated. 
> 
> I would pick a name that has some logical connection to
> ops->page_free()
> 
> This function is starting a pairing where once it completes page_free
> will eventually be called.

Following Dave's note that this is an 'arbitration' mechanism I think
request/release is more appropriate than alloc/free for what this is doing.

> 
> > /**
> >  * pgmap_get_folio() - reference a folio in a live @pgmap by @pfn
> >  * @pgmap: live pgmap instance, caller ensures this does not race @pgmap death
> >  * @pfn: page frame number covered by @pgmap
> >  */
> > struct folio *pgmap_get_folio(struct dev_pagemap *pgmap, unsigned long pfn)
> > {
> >         struct page *page;
> >         
> >         VM_WARN_ONCE(pgmap != xa_load(&pgmap_array, PHYS_PFN(phys)));
> >
> >         if (WARN_ONCE(percpu_ref_is_dying(&pgmap->ref)))
> >                 return NULL;
> 
> This shouldn't be a WARN?

It's a bug if someone calls this after killing the pgmap. I.e.  the
expectation is that the caller is synchronizing this. The only reason
this isn't a VM_WARN_ONCE is because the sanity check is cheap, but I do
not expect it to fire on anything but a development kernel.

> 
> >         page = pfn_to_page(pfn);
> >         return page_folio(page);
> > }
> 
> Yeah, makes sense to me, but I would do a len as well to amortize the
> cost of all these checks..
> 
> > This does not create compound folios, that needs to be coordinated with
> > the caller and likely needs an explicit
> 
> Does it? What situations do you think the caller needs to coordinate
> the folio size? Caller should call the function for each logical unit
> of storage it wants to allocate from the pgmap..

The problem for fsdax is that it needs to gather all the PTEs, hold a
lock to synchronize against events that would shatter a huge page, and
then build up the compound folio metadata before inserting the PMD. So I
think that flow is request all pfns, lock, fixup refcounts, build up
compound folio, insert huge i_pages entry, unlock and install the pmd.
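
As a strawman, that ordering looks something like the snippet below
(pgmap_construct_folio() is the hypothetical call from earlier in the
thread, the rest loosely names existing fsdax steps, none of it is
working code):

        xas_lock_irq(&xas);     /* hold off anything that would shatter the entry */
        /* take the 0 -> 1 reference on each pfn backing the PMD */
        pgmap_construct_folio(pgmap, pfn, PMD_ORDER);
        /* dax_insert_entry(): store the huge entry in i_pages */
        xas_unlock_irq(&xas);
        /* vmf_insert_pfn_pmd(): finally install the PMD */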

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-23 16:29                       ` Dan Williams
@ 2022-09-23 17:42                         ` Jason Gunthorpe
  2022-09-23 19:03                           ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-23 17:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Fri, Sep 23, 2022 at 09:29:51AM -0700, Dan Williams wrote:
> > > /**
> > >  * pgmap_get_folio() - reference a folio in a live @pgmap by @pfn
> > >  * @pgmap: live pgmap instance, caller ensures this does not race @pgmap death
> > >  * @pfn: page frame number covered by @pgmap
> > >  */
> > > struct folio *pgmap_get_folio(struct dev_pagemap *pgmap,
> > > unsigned long pfn)

Maybe it should not be pfn but 'offset from the first page of the
pgmap'? Then we don't need the xa_load stuff, since it can't be
wrong by definition.

> > > {
> > >         struct page *page;
> > >         
> > >         VM_WARN_ONCE(pgmap != xa_load(&pgmap_array, PHYS_PFN(phys)));
> > >
> > >         if (WARN_ONCE(percpu_ref_is_dying(&pgmap->ref)))
> > >                 return NULL;
> > 
> > This shouldn't be a WARN?
> 
> It's a bug if someone calls this after killing the pgmap. I.e.  the
> > expectation is that the caller is synchronizing this. The only reason
> this isn't a VM_WARN_ONCE is because the sanity check is cheap, but I do
> not expect it to fire on anything but a development kernel.

OK, that makes sense

But shouldn't this get the pgmap refcount here? The reason we started
talking about this was to make all the pgmap logic self contained so
that the pgmap doesn't pass its own destroy until all the
page_free()'s have been done.

> > > This does not create compound folios, that needs to be coordinated with
> > > the caller and likely needs an explicit
> > 
> > Does it? What situations do you think the caller needs to coordinate
> > the folio size? Caller should call the function for each logical unit
> > of storage it wants to allocate from the pgmap..
> 
> The problem for fsdax is that it needs to gather all the PTEs, hold a
> lock to synchronize against events that would shatter a huge page, and
> then build up the compound folio metadata before inserting the PMD. 

Er, at this point we are just talking about acquiring virgin pages
nobody else is using, not inserting things. There is no possibility of
concurrent shattering because, by definition, nothing else can
reference these struct pages at this instant.

Also, the caller must already be serializing pgmap_get_folio()
against concurrent calls on the same pfn (since it is an error to call
pgmap_get_folio() on a non-free pfn)

So, I would expect the caller must already have all the necessary
locking to accept maximally sized folios.

eg if it has some reason to punch a hole in the contiguous range
(shatter the folio) it must *already* serialize against
pgmap_get_folio(), since something like punching a hole must know with
certainty if any struct pages are refcount != 0 or not, and must not
race with something trying to set their refcount to 1.

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-23 17:42                         ` Jason Gunthorpe
@ 2022-09-23 19:03                           ` Dan Williams
  2022-09-23 19:23                             ` Jason Gunthorpe
  2022-09-27  6:07                             ` Alistair Popple
  0 siblings, 2 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-23 19:03 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Jason Gunthorpe wrote:
> On Fri, Sep 23, 2022 at 09:29:51AM -0700, Dan Williams wrote:
> > > > /**
> > > >  * pgmap_get_folio() - reference a folio in a live @pgmap by @pfn
> > > >  * @pgmap: live pgmap instance, caller ensures this does not race @pgmap death
> > > >  * @pfn: page frame number covered by @pgmap
> > > >  */
> > > > struct folio *pgmap_get_folio(struct dev_pagemap *pgmap,
> > > > unsigned long pfn)
> 
> Maybe it should not be pfn but 'offset from the first page of the
> pgmap'? Then we don't need the xa_load stuff, since it can't be
> wrong by definition.
> 
> > > > {
> > > >         struct page *page;
> > > >         
> > > >         VM_WARN_ONCE(pgmap != xa_load(&pgmap_array, PHYS_PFN(phys)));
> > > >
> > > >         if (WARN_ONCE(percpu_ref_is_dying(&pgmap->ref)))
> > > >                 return NULL;
> > > 
> > > This shouldn't be a WARN?
> > 
> > It's a bug if someone calls this after killing the pgmap. I.e.  the
> > expectation is that the caller is synchronizing this. The only reason
> > this isn't a VM_WARN_ONCE is because the sanity check is cheap, but I do
> > not expect it to fire on anything but a development kernel.
> 
> OK, that makes sense
> 
> But shouldn't this get the pgmap refcount here? The reason we started
> talking about this was to make all the pgmap logic self contained so
> that the pgmap doesn't pass its own destroy until all the
> page_free()'s have been done.
> 
> > > > This does not create compound folios, that needs to be coordinated with
> > > > the caller and likely needs an explicit
> > > 
> > > Does it? What situations do you think the caller needs to coordinate
> > > the folio size? Caller should call the function for each logical unit
> > > of storage it wants to allocate from the pgmap..
> > 
> > The problem for fsdax is that it needs to gather all the PTEs, hold a
> > lock to synchronize against events that would shatter a huge page, and
> > then build up the compound folio metadata before inserting the PMD. 
> 
> Er, at this point we are just talking about acquiring virgin pages
> nobody else is using, not inserting things. There is no possibility of
> concurrent shattering because, by definition, nothing else can
> reference these struct pages at this instant.
> 
> Also, the caller must already be serializing pgmap_get_folio()
> against concurrent calls on the same pfn (since it is an error to call
> pgmap_get_folio() on a non-free pfn)
> 
> So, I would expect the caller must already have all the necessary
> locking to accept maximally sized folios.
> 
> eg if it has some reason to punch a hole in the contiguous range
> (shatter the folio) it must *already* serialize against
> pgmap_get_folio(), since something like punching a hole must know with
> certainty if any struct pages are refcount != 0 or not, and must not
> race with something trying to set their refcount to 1.

Perhaps, I'll take a look. The scenario I am more concerned about is
processA sets up a VMA of PAGE_SIZE and races processB to fault in the
same filesystem block with a VMA of PMD_SIZE. Right now processA gets a
PTE mapping and processB gets a PMD mapping, but the refcounting is all
handled in small pages. I need to investigate more what is needed for
fsdax to support folio_size() > mapping entry size.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-23 19:03                           ` Dan Williams
@ 2022-09-23 19:23                             ` Jason Gunthorpe
  2022-09-27  6:07                             ` Alistair Popple
  1 sibling, 0 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-23 19:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Fri, Sep 23, 2022 at 12:03:53PM -0700, Dan Williams wrote:

> Perhaps, I'll take a look. The scenario I am more concerned about is
> processA sets up a VMA of PAGE_SIZE and races processB to fault in the
> same filesystem block with a VMA of PMD_SIZE. Right now processA gets a
> PTE mapping and processB gets a PMD mapping, but the refcounting is all
> handled in small pages. I need to investigate more what is needed for
> fsdax to support folio_size() > mapping entry size.

This is fine actually.

The PMD/PTE can hold a tail page. So the page cache will hold a PMD
sized folio, processA will have a PTE pointing to a tail page and
processB will have a PMD pointing at the head page.
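
Loosely, in fault-path terms (not real code, only to show that both
mappings share the one PMD-sized folio; idx is the 4k offset within it):

        /* processB: the PMD maps the head page */
        vmf_insert_pfn_pmd(vmf, pfn_to_pfn_t(folio_pfn(folio)), write);

        /* processA: the PTE maps the tail page at its offset */
        vmf_insert_mixed(vma, addr, pfn_to_pfn_t(folio_pfn(folio) + idx));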

For the immediate instant you can keep accounting for each tail page
as you do now, just with folio wrappers. Once you have proper folios
you shift the accounting responsibility to the core code and the core
will be faster with one ref per PMD/PTE.

The trick with folios is probably going to be breaking up a folio. THP
has some nasty stuff for that, but I think a FS would be better to
just revoke the entire folio, bring the refcount to 0, change the
underling physical mapping, and then fault will naturally restore a
properly sized folio to accomodate the new physical layout.

ie you never break up a folio once it is created from the pgmap.

What you want is to have the largest possible folios because it optimizes
all the handling logic.

.. and then you are well positioned to do some kind of trick where the
FS asserts at mount time that it never needs a folio less than order X
and you can then trigger the devdax optimization of folding struct
page memory and significantly reducing the wastage for struct page..

Jason


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-23  9:38                     ` Jan Kara
@ 2022-09-23 23:06                       ` Dan Williams
  2022-09-25 23:54                       ` Dave Chinner
  1 sibling, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-23 23:06 UTC (permalink / raw)
  To: Jan Kara, Dave Chinner
  Cc: Dan Williams, Jason Gunthorpe, akpm, Matthew Wilcox, Jan Kara,
	Darrick J. Wong, Christoph Hellwig, John Hubbard, linux-fsdevel,
	nvdimm, linux-xfs, linux-mm, linux-ext4

Jan Kara wrote:
> On Fri 23-09-22 12:10:12, Dave Chinner wrote:
> > On Thu, Sep 22, 2022 at 05:41:08PM -0700, Dan Williams wrote:
> > > Dave Chinner wrote:
> > > > On Wed, Sep 21, 2022 at 07:28:51PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Sep 22, 2022 at 08:14:16AM +1000, Dave Chinner wrote:
> > > > > 
> > > > > > Where are these DAX page pins that don't require the pin holder to
> > > > > > also hold active references to the filesystem objects coming from?
> > > > > 
> > > > > O_DIRECT and things like it.
> > > > 
> > > > O_DIRECT IO to a file holds a reference to a struct file which holds
> > > > an active reference to the struct inode. Hence you can't reclaim an
> > > > inode while an O_DIRECT IO is in progress to it. 
> > > > 
> > > > Similarly, file-backed pages pinned from user vmas have the inode
> > > > pinned by the VMA having a reference to the struct file passed to
> > > > them when they are instantiated. Hence anything using mmap() to pin
> > > > file-backed pages (i.e. applications using FSDAX access from
> > > > userspace) should also have a reference to the inode that prevents
> > > > the inode from being reclaimed.
> > > > 
> > > > So I'm at a loss to understand what "things like it" might actually
> > > > mean. Can you actually describe a situation where we actually permit
> > > > (even temporarily) these use-after-free scenarios?
> > > 
> > > Jason mentioned a scenario here:
> > > 
> > > https://lore.kernel.org/all/YyuoE8BgImRXVkkO@nvidia.com/
> > > 
> > > Multi-thread process where thread1 does open(O_DIRECT)+mmap()+read() and
> > > thread2 does memunmap()+close() while the read() is inflight.
> > 
> > And, ah, what production application does this and expects to be
> > able to process the result of the read() operation without getting a
> > SEGV?
> > 
> > There's a huge difference between an unlikely scenario which we need
> > to work (such as O_DIRECT IO to/from a mmap() buffer at a different
> > offset on the same file) and this sort of scenario where even if we
> > handle it correctly, the application can't do anything with the
> > result and will crash immediately....
> 
> I'm not sure I fully follow what we are concerned about here. As you've
> written above direct IO holds reference to the inode until it is completed
> (through kiocb->file->inode chain). So direct IO should be safe?
> 
> I'd be more worried about stuff like vmsplice() that can add file pages
> into pipe without holding inode alive in any way and keeping them there for
> arbitrarily long time. Didn't we want to add FOLL_LONGTERM to gup executed
> from vmsplice() to avoid issues like this?
> 
> > > Sounds plausible to me, but I have not tried to trigger it with a focus
> > > test.
> > 
> > If there really are applications this .... broken, then it's not the
> > responsibility of the filesystem to paper over the low level page
> > reference tracking issues that cause it.
> > 
> > i.e. The underlying problem here is that memunmap() frees the VMA
> > while there are still active task-based references to the pages in
> > that VMA. IOWs, the VMA should not be torn down until the O_DIRECT
> > read has released all the references to the pages mapped into the
> > task address space.
> > 
> > This just doesn't seem like an issue that we should be trying to fix
> > by adding band-aids to the inode life-cycle management.
> 
> I agree that freeing VMA while there are pinned pages is ... inconvenient.
> But that is just how gup works since the beginning - the moment you have
> struct page reference, you completely forget about the mapping you've used
> to get to the page. So anything can happen with the mapping after that
> moment. And in case of pages mapped by multiple processes I can easily see
> that one of the processes decides to unmap the page (and it may well be
> that was the initial process that acquired page references) while others
> still keep accessing the page using page references stored in some internal
> structure (RDMA anyone?). I think it will be rather difficult to come up
> with some scheme keeping VMA alive while there are pages pinned without
> regressing userspace which over the years became very much tailored to the
> peculiar gup behavior.
> 
> I can imagine we would keep *inode* referenced while there are its pages
> pinned. That should not be that difficult but at least in naive
> implementation that would put rather heavy stress on inode refcount under
> some loads so I don't think that's useful either.

What about instead of keeping the inode *referenced* while there might
be pinned pages, keep the inode *dirty*? Then
dax_writeback_mapping_range() can watch for inodes in the I_WILL_FREE
state and know this is the last chance to break layouts and truncate
mappings before the inode goes out of scope.

That feels clean and unburdensome to me, but this would not be the first
time I have overlooked a filesystem constraint.
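
To make that concrete, here is a minimal sketch of the idea (the hook
point and the dax_zap_mappings() call are assumptions for illustration,
not something this series implements):

/*
 * Sketch only: let the DAX writeback path notice an inode that is about
 * to be freed and treat that as the last chance to break layouts and
 * zap any remaining DAX mappings.
 */
static void dax_writeback_check_eviction(struct address_space *mapping)
{
	struct inode *inode = mapping->host;

	/* inode is on its way out of scope */
	if (inode->i_state & I_WILL_FREE)
		dax_zap_mappings(mapping);
}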

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-23  9:38                     ` Jan Kara
  2022-09-23 23:06                       ` Dan Williams
@ 2022-09-25 23:54                       ` Dave Chinner
  2022-09-26 14:10                         ` Jan Kara
  1 sibling, 1 reply; 84+ messages in thread
From: Dave Chinner @ 2022-09-25 23:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dan Williams, Jason Gunthorpe, akpm, Matthew Wilcox,
	Darrick J. Wong, Christoph Hellwig, John Hubbard, linux-fsdevel,
	nvdimm, linux-xfs, linux-mm, linux-ext4

On Fri, Sep 23, 2022 at 11:38:03AM +0200, Jan Kara wrote:
> On Fri 23-09-22 12:10:12, Dave Chinner wrote:
> > On Thu, Sep 22, 2022 at 05:41:08PM -0700, Dan Williams wrote:
> > > Dave Chinner wrote:
> > > > On Wed, Sep 21, 2022 at 07:28:51PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Sep 22, 2022 at 08:14:16AM +1000, Dave Chinner wrote:
> > > > > 
> > > > > > Where are these DAX page pins that don't require the pin holder to
> > > > > > also hold active references to the filesystem objects coming from?
> > > > > 
> > > > > O_DIRECT and things like it.
> > > > 
> > > > O_DIRECT IO to a file holds a reference to a struct file which holds
> > > > an active reference to the struct inode. Hence you can't reclaim an
> > > > inode while an O_DIRECT IO is in progress to it. 
> > > > 
> > > > Similarly, file-backed pages pinned from user vmas have the inode
> > > > pinned by the VMA having a reference to the struct file passed to
> > > > them when they are instantiated. Hence anything using mmap() to pin
> > > > file-backed pages (i.e. applications using FSDAX access from
> > > > userspace) should also have a reference to the inode that prevents
> > > > the inode from being reclaimed.
> > > > 
> > > > So I'm at a loss to understand what "things like it" might actually
> > > > mean. Can you actually describe a situation where we actually permit
> > > > (even temporarily) these use-after-free scenarios?
> > > 
> > > Jason mentioned a scenario here:
> > > 
> > > https://lore.kernel.org/all/YyuoE8BgImRXVkkO@nvidia.com/
> > > 
> > > Multi-thread process where thread1 does open(O_DIRECT)+mmap()+read() and
> > > thread2 does memunmap()+close() while the read() is inflight.
> > 
> > And, ah, what production application does this and expects to be
> > able to process the result of the read() operation without getting a
> > SEGV?
> > 
> > There's a huge difference between an unlikely scenario which we need
> > to work (such as O_DIRECT IO to/from a mmap() buffer at a different
> > offset on the same file) and this sort of scenario where even if we
> > handle it correctly, the application can't do anything with the
> > result and will crash immediately....
> 
> I'm not sure I fully follow what we are concerned about here. As you've
> written above direct IO holds reference to the inode until it is completed
> (through kiocb->file->inode chain). So direct IO should be safe?

AFAICT, it's the user buffer allocated by mmap() that the direct IO
is DMAing into/out of that is the issue here. i.e. mmap() a file
that is DAX enabled, pass the mmap region to DIO on a non-dax file,
GUP in the DIO path takes a page pin on user pages that are DAX
mapped, the userspace application then unmaps the file pages and
unlinks the FSDAX file.
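
In userspace terms the sequence is roughly this (paths hypothetical,
error handling and sizes omitted):

	/* thread 1 */
	int dax_fd = open("/mnt/pmem/victim", O_RDWR);	/* FSDAX file */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED, dax_fd, 0);	/* DAX mapped buffer */
	int dio_fd = open("/other/fs/file", O_RDONLY | O_DIRECT);
	read(dio_fd, buf, len);	/* GUP pins the DAX backed pages of @buf */

	/* thread 2, racing with the read() above */
	munmap(buf, len);
	close(dax_fd);
	unlink("/mnt/pmem/victim");	/* last fs reference to the storage */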

At this point the FSDAX mapped inode has no active references, so
the filesystem frees the inode and its allocated storage space, and
whatever is holding the GUP reference is now a moving storage UAF
violation. Whatever is holding the GUP reference doesn't even have a
reference to the FSDAX filesystem - the DIO fd could point to a file
in a different filesystem altogether - and so the fsdax filesystem
could be unmounted at this point whilst the application is still
actively using the storage underlying the filesystem.

That's just .... broken.

> I'd be more worried about stuff like vmsplice() that can add file pages
> into pipe without holding inode alive in any way and keeping them there for
> arbitrarily long time. Didn't we want to add FOLL_LONGTERM to gup executed
> from vmsplice() to avoid issues like this?

Yes, ISTR that was part of the plan - use FOLL_LONGTERM to ensure
FSDAX can't run operations that pin pages but don't take fs
references. I think that's how we prevented RDMA users from pinning
FSDAX direct mapped storage media in this way. It does not, however,
prevent the above "short term" GUP UAF situation from occurring.

> > > Sounds plausible to me, but I have not tried to trigger it with a focus
> > > test.
> > 
> > If there really are applications this .... broken, then it's not the
> > responsibility of the filesystem to paper over the low level page
> > reference tracking issues that cause it.
> > 
> > i.e. The underlying problem here is that memunmap() frees the VMA
> > while there are still active task-based references to the pages in
> > that VMA. IOWs, the VMA should not be torn down until the O_DIRECT
> > read has released all the references to the pages mapped into the
> > task address space.
> > 
> > This just doesn't seem like an issue that we should be trying to fix
> > by adding band-aids to the inode life-cycle management.
> 
> I agree that freeing VMA while there are pinned pages is ... inconvenient.
> But that is just how gup works since the beginning - the moment you have
> struct page reference, you completely forget about the mapping you've used
> to get to the page. So anything can happen with the mapping after that
> moment. And in case of pages mapped by multiple processes I can easily see
> that one of the processes decides to unmap the page (and it may well be
> that was the initial process that acquired page references) while others
> still keep accessing the page using page references stored in some internal
> structure (RDMA anyone?).

Yup, and this is why RDMA on FSDAX using this method of pinning pages
will end up corrupting data and filesystems, hence FOLL_LONGTERM
preventing most of these situations from even arising. But that's a
workaround, not a long term solution that allows RDMA to be run on
FSDAX managed storage media.

I said on #xfs a few days ago:

[23/9/22 10:23] * dchinner is getting deja vu over this latest round
of "dax mappings don't pin the filesystem objects that own the
storage media being mapped"

And I'm getting that feeling again right now...

> I think it will be rather difficult to come up
> with some scheme keeping VMA alive while there are pages pinned without
> regressing userspace which over the years became very much tailored to the
> peculiar gup behavior.

Perhaps all we should do is add a page flag for fsdax mapped pages
that says GUP must pin the VMA, so only mapped pages that fall into
this category take the perf penalty of VMA management.
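
As a sketch of the gate only (vma_is_fsdax() stands in for the proposed
page flag, and holding the backing file is one way to approximate
"pin the VMA"; the matching unpin side and where to store the reference
are the unsolved parts):

	/* in the gup path, when pinning pages out of an FSDAX vma */
	if (vma_is_fsdax(vma))
		get_file(vma->vm_file);	/* keep the inode alive until unpin */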

> I can imagine we would keep *inode* referenced while there are its pages
> pinned.

We can do that by pinning the VMA, yes?

> That should not be that difficult but at least in naive
> implementation that would put rather heavy stress on inode refcount under
> some loads so I don't think that's useful either.

Having the workaround be sub-optimal for high performance workloads
is a good way of discouraging applications from doing fundamentally
broken crap without actually breaking anything....

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-23 12:39                     ` Jason Gunthorpe
@ 2022-09-26  0:34                       ` Dave Chinner
  2022-09-26 13:04                         ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: Dave Chinner @ 2022-09-26  0:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Fri, Sep 23, 2022 at 09:39:39AM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 23, 2022 at 12:10:12PM +1000, Dave Chinner wrote:
> 
> > > Jason mentioned a scenario here:
> > > 
> > > https://lore.kernel.org/all/YyuoE8BgImRXVkkO@nvidia.com/
> > > 
> > > Multi-thread process where thread1 does open(O_DIRECT)+mmap()+read() and
> > > thread2 does memunmap()+close() while the read() is inflight.
> > 
> > And, ah, what production application does this and expects to be
> > able to process the result of the read() operation without getting a
> > SEGV?
> 
> The read() will do GUP and get a pinned page; next the memunmap()/close
> will release the inode the VMA was holding open. The read() FD is NOT
> a DAX FD.
>
> We are now UAFing the DAX storage. There is no SEGV.

Yes, but what happens *after* the read().

The userspace application now tries to access the mmap() region to
access the data that was read, only to find that it's been unmapped.
That triggers a SEGV, yes?

IOWs, there's nothing *useful* a user application can do with a
pattern like this. All it provides is a vector for UAF of the DAX
storage. Now replace the read() with write(), and tell me why this
can't cause data corruption and/or fatal filesystem corruption that
can take the entire system down.....

> It is not about sane applications, it is about kernel security against
> hostile userspace.

Turning this into a UAF doesn't provide any security at all. It
makes this measurably worse from a security POV as it provides a
channel for data leakage (the read() case) or system instability or
compromise (the write() case).

> > i.e. The underlying problem here is that memunmap() frees the VMA
> > while there are still active task-based references to the pages in
> > that VMA. IOWs, the VMA should not be torn down until the O_DIRECT
> > read has released all the references to the pages mapped into the
> > task address space.
> 
> This is Jan's suggestion, I think we are still far from being able to
> do that for O_DIRECT paths.
> 
> Even if you fix the close() this way, doesn't truncate still have the
> same problem?

It sure does. Also fallocate().

The deja vu is strong right now.

If something truncate()s a file, the only safe thing for an
application that is using fsdax to directly access the underlying
storage is to unmap the file and remap it once the layout change
operation has completed.

We've been doing this safely with pNFS for remote RDMA-based
direct access to the storage hardware for years. We have the
layout lease infrastructure already there for it...

I've pointed this out every time this conversation comes up. We have
a solution for this problem pretty much ready to go - it just needs
a UAPI to be defined for it. i.e. nothing has changed in the past 5
years - we have the same problems, we have the same solutions ready
to be hooked up and used....

> At the end of the day the rule is a DAX page must not be re-used until
> its refcount is 0. At some point the FS should wait for this.

The page is *not owned by DAX*. How many times do I have to say that
FSDAX != DAX.

The *storage media* must not be reused until the filesystem says it
can be reused. And for that to work, nothing is allowed to keep an
anonymous, non-filesystem reference to the storage media. It has
nothing to do with struct page reference counts, and everything to
do with ensuring that filesystem objects are correctly referenced
while the storage media is in direct use by an application.

I gave up on FSDAX years ago because nobody was listening to me.
Here we are again, years down the track, with exactly the same
issues as we had years ago, with exactly the same people repeating
the same arguments for and against fixing the page reference
problems. I don't have time to repeat history all over again, so
I'm going to walk away from this train-wreck again so I can maintain
some semblance of my remaining sanity....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count
  2022-09-22  2:34             ` Dan Williams
@ 2022-09-26  6:17               ` Alistair Popple
  0 siblings, 0 replies; 84+ messages in thread
From: Alistair Popple @ 2022-09-26  6:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4


Dan Williams <dan.j.williams@intel.com> writes:

[...]

>> >> > > How on earth can a free'd page have both a 0 and 1 refcount??
>> >> >
>> >> > This is residual wonkiness from memremap_pages() handing out pages with
>> >> > elevated reference counts at the outset.
>> >>
>> >> I think the answer to my question is the above troubled code where we
>> >> still set the page refcount back to 1 even in the page_free path, so
>> >> there is some consistency "a freed page may have a refcount of 1" for
>> >> the driver.
>> >>
>> >> So, I guess this patch makes sense but I would put more noise around
>> >> INIT_PAGEMAP_BUSY (eg annotate every driver that is using it with the
>> >> explicit constant) and alert people that they need to fix their stuff
>> >> to get rid of it.
>> >
>> > Sounds reasonable.
>> >
>> >> We should definitely try to fix hmm_test as well so people have a good
>> >> reference code to follow in fixing the other drivers :(
>> >
>> > Oh, that's a good idea. I can probably fix that up and leave it to the
>> > GPU driver folks to catch up with that example so we can kill off
>> > INIT_PAGEMAP_BUSY.
>>
>> I'm hoping to send my series that fixes up all drivers using device
>> coherent/private later this week or early next. So you could also just
>> wait for that and remove INIT_PAGEMAP_BUSY entirely.
>
> Oh, perfect, thanks!

See
https://lore.kernel.org/linux-mm/3d74bb439723c7e46cbe47d1711795308aee4ae3.1664171943.git-series.apopple@nvidia.com/

I already had this in a series because the change was motivated by a
later patch there, but it's a standalone change and there's no reason it
couldn't be split out into its own patch if that's better for you.

 - Alistair

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-26  0:34                       ` Dave Chinner
@ 2022-09-26 13:04                         ` Jason Gunthorpe
  0 siblings, 0 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-26 13:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dan Williams, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Mon, Sep 26, 2022 at 10:34:30AM +1000, Dave Chinner wrote:

> > It is not about sane applications, it is about kernel security against
> > hostile userspace.
> 
> Turning this into a UAF doesn't provide any security at all. It
> makes this measurably worse from a security POV as it provides a
> channel for data leakage (the read() case) or system instability or
> compromise (the write() case).

You asked what the concern is; I think you get it, since you explained
it to Jan in another email.

We have this issue where if we are not careful we can create a UAF bug
through GUP. It is not something a real application will hit; this is
kernel self-protection against hostile user space trying to trigger a
UAF. The issue arises from both the FS and the MM having their own
lifecycle models for the same memory page.

I'm still not clear on exactly what the current state of affairs is,
Dan?

The DAX/FSDAX stuff currently has a wait on the struct page - does
that wait protect against these UAFs? It looks to me like that is what
it is supposed to do?

If so, that wait simply needs to be transformed into a wait for the
refcount to be 0 when you rework the refcounting. 
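
i.e. something along the lines of (sketch only; dax_page_idle() is the
helper from earlier in the series, and the wake side is assumed to live
in the page-free path):

	/* truncate/evict side: block until the last page reference is gone */
	wait_var_event(page, dax_page_idle(page));
	/* the free side would pair this with wake_up_var(page) */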

This is not the same FOLL_LONGTERM discussion rehashed, all the
FOLL_LONGTERM discussions were predicated on the idea that GUP
actually worked and doesn't have UAF bugs.

> The *storage media* must not be reused until the filesystem says it
> can be reused. And for that to work, nothing is allowed to keep an
> anonymous, non-filesystem reference to the storage media. It has
> nothing to do with struct page reference counts, and everything to
> do with ensuring that filesystem objects are correctly referenced
> while the storage media is in direct use by an application.

The trouble is we have *two* things that think they own the media -
the mm through pgmap clearly is the owner of the struct page and has
its own well defined lifecycle model for it.

And the FS has its model. We have to ensure the two models are tied
together, a page in the media cannot be considered free until both
lifecycle models agree it is free.

This is a side effect of using struct pages in the first place: the FS
can't use struct page and yet opt out of the mm's lifecycle model for
struct page!

If we want the FS to own everything exclusively we should purge the
struct pages completely and give up all the features that come with
them (like GUP)

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-25 23:54                       ` Dave Chinner
@ 2022-09-26 14:10                         ` Jan Kara
  2022-09-29 23:33                           ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Jan Kara @ 2022-09-26 14:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Dan Williams, Jason Gunthorpe, akpm, Matthew Wilcox,
	Darrick J. Wong, Christoph Hellwig, John Hubbard, linux-fsdevel,
	nvdimm, linux-xfs, linux-mm, linux-ext4

On Mon 26-09-22 09:54:07, Dave Chinner wrote:
> On Fri, Sep 23, 2022 at 11:38:03AM +0200, Jan Kara wrote:
> > On Fri 23-09-22 12:10:12, Dave Chinner wrote:
> > > On Thu, Sep 22, 2022 at 05:41:08PM -0700, Dan Williams wrote:
> > > > Dave Chinner wrote:
> > > > > On Wed, Sep 21, 2022 at 07:28:51PM -0300, Jason Gunthorpe wrote:
> > > > > > On Thu, Sep 22, 2022 at 08:14:16AM +1000, Dave Chinner wrote:
> > > > > > 
> > > > > > > Where are these DAX page pins that don't require the pin holder to
> > > > > > > also hold active references to the filesystem objects coming from?
> > > > > > 
> > > > > > O_DIRECT and things like it.
> > > > > 
> > > > > O_DIRECT IO to a file holds a reference to a struct file which holds
> > > > > an active reference to the struct inode. Hence you can't reclaim an
> > > > > inode while an O_DIRECT IO is in progress to it. 
> > > > > 
> > > > > Similarly, file-backed pages pinned from user vmas have the inode
> > > > > pinned by the VMA having a reference to the struct file passed to
> > > > > them when they are instantiated. Hence anything using mmap() to pin
> > > > > file-backed pages (i.e. applications using FSDAX access from
> > > > > userspace) should also have a reference to the inode that prevents
> > > > > the inode from being reclaimed.
> > > > > 
> > > > > So I'm at a loss to understand what "things like it" might actually
> > > > > mean. Can you actually describe a situation where we actually permit
> > > > > (even temporarily) these use-after-free scenarios?
> > > > 
> > > > Jason mentioned a scenario here:
> > > > 
> > > > https://lore.kernel.org/all/YyuoE8BgImRXVkkO@nvidia.com/
> > > > 
> > > > Multi-thread process where thread1 does open(O_DIRECT)+mmap()+read() and
> > > > thread2 does memunmap()+close() while the read() is inflight.
> > > 
> > > And, ah, what production application does this and expects to be
> > > able to process the result of the read() operation without getting a
> > > SEGV?
> > > 
> > > There's a huge difference between an unlikely scenario which we need
> > > to work (such as O_DIRECT IO to/from a mmap() buffer at a different
> > > offset on the same file) and this sort of scenario where even if we
> > > handle it correctly, the application can't do anything with the
> > > result and will crash immediately....
> > 
> > I'm not sure I fully follow what we are concerned about here. As you've
> > written above direct IO holds reference to the inode until it is completed
> > (through kiocb->file->inode chain). So direct IO should be safe?
> 
> AFAICT, it's the user buffer allocated by mmap() that the direct IO
> is DMAing into/out of that is the issue here. i.e. mmap() a file
> that is DAX enabled, pass the mmap region to DIO on a non-dax file,
> GUP in the DIO path takes a page pin on user pages that are DAX
> mapped, the userspace application then unmaps the file pages and
> unlinks the FSDAX file.
> 
> At this point the FSDAX mapped inode has no active references, so
> the filesystem frees the inode and its allocated storage space, and
> whatever is holding the GUP reference is now a moving storage UAF
> violation. Whatever is holding the GUP reference doesn't even have a
> reference to the FSDAX filesystem - the DIO fd could point to a file
> in a different filesystem altogether - and so the fsdax filesystem
> could be unmounted at this point whilst the application is still
> actively using the storage underlying the filesystem.
> 
> That's just .... broken.

Hum, so I'm confused (and my last email probably was as well). So let me
spell out the details here so that I can get on the same page about what we
are trying to solve:

For FSDAX, backing storage for a page must not be freed (i.e., the
filesystem must not free the corresponding block) while there are some
references to the page. This is achieved by calls to
dax_layout_busy_page() from the filesystem before truncating a file /
punching a hole into a file. So AFAICT this is working correctly and I
don't think the patch series under discussion aims to change this
besides changing how a page without references is detected.
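
The pattern in question, roughly (simplified; the filesystems wrap this
in their own locking, and this series renames the scan side to
dax_zap_mappings()):

	/* truncate / hole punch side, before freeing blocks */
	for (;;) {
		struct page *page = dax_layout_busy_page(inode->i_mapping);

		if (!page)
			break;	/* no busy DAX pages left */
		/* drop fs locks as needed, then wait for the page to go idle */
		wait_var_event(page, dax_page_idle(page));
	}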

Now there is a separate question: while someone holds a reference to an
FSDAX page, the inode this page belongs to can get evicted from memory. For
FSDAX nothing prevents that AFAICT. If this happens, we lose track of the
page<->inode association, so if somebody later comes and truncates the
inode, we will not detect that a page belonging to the inode is still in use
(dax_layout_busy_page() does not find the page) and we have a problem.
Correct?

> > I'd be more worried about stuff like vmsplice() that can add file pages
> > into pipe without holding inode alive in any way and keeping them there for
> > arbitrarily long time. Didn't we want to add FOLL_LONGTERM to gup executed
> > from vmsplice() to avoid issues like this?
> 
> Yes, ISTR that was part of the plan - use FOLL_LONGTERM to ensure
> FSDAX can't run operations that pin pages but don't take fs
> references. I think that's how we prevented RDMA users from pinning
> FSDAX direct mapped storage media in this way. It does not, however,
> prevent the above "short term" GUP UAF situation from occurring.

If what I wrote above is correct, then I understand and agree.

> > I agree that freeing VMA while there are pinned pages is ... inconvenient.
> > But that is just how gup works since the beginning - the moment you have
> > struct page reference, you completely forget about the mapping you've used
> > to get to the page. So anything can happen with the mapping after that
> > moment. And in case of pages mapped by multiple processes I can easily see
> > that one of the processes decides to unmap the page (and it may well be
> > that was the initial process that acquired page references) while others
> > still keep accessing the page using page references stored in some internal
> > structure (RDMA anyone?).
> 
> Yup, and this is why RDMA on FSDAX using this method of pinning pages
> will end up corrupting data and filesystems, hence FOLL_LONGTERM
> preventing most of these situations from even arising. But that's a
> workaround, not a long term solution that allows RDMA to be run on
> FSDAX managed storage media.
> 
> I said on #xfs a few days ago:
> 
> [23/9/22 10:23] * dchinner is getting deja vu over this latest round
> of "dax mappings don't pin the filesystem objects that own the
> storage media being mapped"
> 
> And I'm getting that feeling again right now...
> 
> > I think it will be rather difficult to come up
> > with some scheme keeping VMA alive while there are pages pinned without
> > regressing userspace which over the years became very much tailored to the
> > peculiar gup behavior.
> 
> Perhaps all we should do is add a page flag for fsdax mapped pages
> that says GUP must pin the VMA, so only mapped pages that fall into
> this category take the perf penalty of VMA management.

Possibly. But my concern with VMA pinning was not only about performance
but also about applications relying on being able to unmap pages that are
currently pinned - at least from some processes, one of which may be the
one that did the original pinning. But yeah, the fact that FOLL_LONGTERM is
forbidden with DAX somewhat restricts the insanity we have to deal with. So
maybe pinning the VMA for DAX mappings might actually be a workable
solution.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-23 19:03                           ` Dan Williams
  2022-09-23 19:23                             ` Jason Gunthorpe
@ 2022-09-27  6:07                             ` Alistair Popple
  2022-09-27 12:56                               ` Jason Gunthorpe
  1 sibling, 1 reply; 84+ messages in thread
From: Alistair Popple @ 2022-09-27  6:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4


Dan Williams <dan.j.williams@intel.com> writes:

> Jason Gunthorpe wrote:
>> On Fri, Sep 23, 2022 at 09:29:51AM -0700, Dan Williams wrote:
>> > > > /**
>> > > >  * pgmap_get_folio() - reference a folio in a live @pgmap by @pfn
>> > > >  * @pgmap: live pgmap instance, caller ensures this does not race @pgmap death
>> > > >  * @pfn: page frame number covered by @pgmap
>> > > >  */
>> > > > struct folio *pgmap_get_folio(struct dev_pagemap *pgmap,
>> > > > unsigned long pfn)
>>
>> Maybe should be not be pfn but be 'offset from the first page of the
>> pgmap' ? Then we don't need the xa_load stuff, since it cann't be
>> wrong by definition.
>>
>> > > > {
>> > > >         struct page *page;
>> > > >
>> > > >         VM_WARN_ONCE(pgmap != xa_load(&pgmap_array, PHYS_PFN(phys)));
>> > > >
>> > > >         if (WARN_ONCE(percpu_ref_is_dying(&pgmap->ref)))
>> > > >                 return NULL;
>> > >
>> > > This shouldn't be a WARN?
>> >
>> > It's a bug if someone calls this after killing the pgmap. I.e.  the
>> > expectation is that the caller is synchronzing this. The only reason
>> > this isn't a VM_WARN_ONCE is because the sanity check is cheap, but I do
>> > not expect it to fire on anything but a development kernel.
>>
>> OK, that makes sense
>>
>> But shouldn't this get the pgmap refcount here? The reason we started
>> talking about this was to make all the pgmap logic self contained so
>> that the pgmap doesn't pass its own destroy until all the all the
>> page_free()'s have been done.

That sounds good to me at least. I just noticed we introduced this exact
bug for device private/coherent pages when making their refcounts zero
based. Nothing currently takes pgmap->ref when a private/coherent page
is mapped. Therefore memunmap_pages() will complete and the pgmap will be
destroyed while pgmap pages are still mapped.

So I think we need to call put_dev_pagemap() as part of
free_zone_device_page().
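
Roughly (abridged sketch of free_zone_device_page(), existing tear-down
steps elided; the put_dev_pagemap() call is the proposed addition):

	void free_zone_device_page(struct page *page)
	{
		page->mapping = NULL;
		page->pgmap->ops->page_free(page);

		/* pair with the pgmap reference taken when the page was mapped */
		put_dev_pagemap(page->pgmap);
	}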

 - Alistair

>> > > > This does not create compound folios, that needs to be coordinated with
>> > > > the caller and likely needs an explicit
>> > >
>> > > Does it? What situations do you think the caller needs to coordinate
>> > > the folio size? Caller should call the function for each logical unit
>> > > of storage it wants to allocate from the pgmap..
>> >
>> > The problem for fsdax is that it needs to gather all the PTEs, hold a
>> > lock to synchronize against events that would shatter a huge page, and
>> > then build up the compound folio metadata before inserting the PMD.
>>
>> Er, at this point we are just talking about acquiring virgin pages
>> nobody else is using, not inserting things. There is no possibility of
>> concurrent shattering because, by definition, nothing else can
>> reference these struct pages at this instant.
>>
>> Also, the caller must already be serializing pgmap_get_folio()
>> against concurrent calls on the same pfn (since it is an error to call
>> pgmap_get_folio() on a non-free pfn)
>>
>> So, I would expect the caller must already have all the necessary
>> locking to accept maximally sized folios.
>>
>> eg if it has some reason to punch a hole in the contiguous range
>> (shatter the folio) it must *already* serialize against
>> pgmap_get_folio(), since something like punching a hole must know with
>> certainty if any struct pages are refcount != 0 or not, and must not
>> race with something trying to set their refcount to 1.
>
> Perhaps, I'll take a look. The scenario I am more concerned about is
> processA sets up a VMA of PAGE_SIZE and races processB to fault in the
> same filesystem block with a VMA of PMD_SIZE. Right now processA gets a
> PTE mapping and processB gets a PMD mapping, but the refcounting is all
> handled in small pages. I need to investigate more what is needed for
> fsdax to support folio_size() > mapping entry size.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 12/18] devdax: Move address_space helpers to the DAX core
  2022-09-16  3:36 ` [PATCH v2 12/18] devdax: Move address_space helpers to the DAX core Dan Williams
@ 2022-09-27  6:20   ` Alistair Popple
  2022-09-29 22:38     ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Alistair Popple @ 2022-09-27  6:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4


Dan Williams <dan.j.williams@intel.com> writes:

[...]

> +/**
> + * dax_zap_mappings_range - find first pinned page in @mapping
> + * @mapping: address space to scan for a page with ref count > 1
> + * @start: Starting offset. Page containing 'start' is included.
> + * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
> + *       pages from 'start' till the end of file are included.
> + *
> + * DAX requires ZONE_DEVICE mapped pages. These pages are never
> + * 'onlined' to the page allocator so they are considered idle when
> + * page->count == 1. A filesystem uses this interface to determine if

Minor nit-pick I noticed while reading this, but shouldn't that be
"page->count == 0" now?

> + * any page in the mapping is busy, i.e. for DMA, or other
> + * get_user_pages() usages.
> + *
> + * It is expected that the filesystem is holding locks to block the
> + * establishment of new mappings in this address_space. I.e. it expects
> + * to be able to run unmap_mapping_range() and subsequently not race
> + * mapping_mapped() becoming true.
> + */
> +struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
> +				    loff_t end)
> +{
> +	void *entry;
> +	unsigned int scanned = 0;
> +	struct page *page = NULL;
> +	pgoff_t start_idx = start >> PAGE_SHIFT;
> +	pgoff_t end_idx;
> +	XA_STATE(xas, &mapping->i_pages, start_idx);
> +
> +	/*
> +	 * In the 'limited' case get_user_pages() for dax is disabled.
> +	 */
> +	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> +		return NULL;
> +
> +	if (!dax_mapping(mapping))
> +		return NULL;
> +
> +	/* If end == LLONG_MAX, all pages from start to till end of file */
> +	if (end == LLONG_MAX)
> +		end_idx = ULONG_MAX;
> +	else
> +		end_idx = end >> PAGE_SHIFT;
> +	/*
> +	 * If we race get_user_pages_fast() here either we'll see the
> +	 * elevated page count in the iteration and wait, or
> +	 * get_user_pages_fast() will see that the page it took a reference
> +	 * against is no longer mapped in the page tables and bail to the
> +	 * get_user_pages() slow path.  The slow path is protected by
> +	 * pte_lock() and pmd_lock(). New references are not taken without
> +	 * holding those locks, and unmap_mapping_pages() will not zero the
> +	 * pte or pmd without holding the respective lock, so we are
> +	 * guaranteed to either see new references or prevent new
> +	 * references from being established.
> +	 */
> +	unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
> +
> +	xas_lock_irq(&xas);
> +	xas_for_each(&xas, entry, end_idx) {
> +		if (WARN_ON_ONCE(!xa_is_value(entry)))
> +			continue;
> +		if (unlikely(dax_is_locked(entry)))
> +			entry = get_unlocked_entry(&xas, 0);
> +		if (entry)
> +			page = dax_zap_pages(&xas, entry);
> +		put_unlocked_entry(&xas, entry, WAKE_NEXT);
> +		if (page)
> +			break;
> +		if (++scanned % XA_CHECK_SCHED)
> +			continue;
> +
> +		xas_pause(&xas);
> +		xas_unlock_irq(&xas);
> +		cond_resched();
> +		xas_lock_irq(&xas);
> +	}
> +	xas_unlock_irq(&xas);
> +	return page;
> +}
> +EXPORT_SYMBOL_GPL(dax_zap_mappings_range);
> +
> +struct page *dax_zap_mappings(struct address_space *mapping)
> +{
> +	return dax_zap_mappings_range(mapping, 0, LLONG_MAX);
> +}
> +EXPORT_SYMBOL_GPL(dax_zap_mappings);
> +
> +static int __dax_invalidate_entry(struct address_space *mapping, pgoff_t index,
> +				  bool trunc)
> +{
> +	XA_STATE(xas, &mapping->i_pages, index);
> +	int ret = 0;
> +	void *entry;
> +
> +	xas_lock_irq(&xas);
> +	entry = get_unlocked_entry(&xas, 0);
> +	if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
> +		goto out;
> +	if (!trunc && (xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) ||
> +		       xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE)))
> +		goto out;
> +	dax_disassociate_entry(entry, mapping, trunc);
> +	xas_store(&xas, NULL);
> +	mapping->nrpages -= 1UL << dax_entry_order(entry);
> +	ret = 1;
> +out:
> +	put_unlocked_entry(&xas, entry, WAKE_ALL);
> +	xas_unlock_irq(&xas);
> +	return ret;
> +}
> +
> +int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
> +				      pgoff_t index)
> +{
> +	return __dax_invalidate_entry(mapping, index, false);
> +}
> +
> +/*
> + * Delete DAX entry at @index from @mapping.  Wait for it
> + * to be unlocked before deleting it.
> + */
> +int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
> +{
> +	int ret = __dax_invalidate_entry(mapping, index, true);
> +
> +	/*
> +	 * This gets called from truncate / punch_hole path. As such, the caller
> +	 * must hold locks protecting against concurrent modifications of the
> +	 * page cache (usually fs-private i_mmap_sem for writing). Since the
> +	 * caller has seen a DAX entry for this index, we better find it
> +	 * at that index as well...
> +	 */
> +	WARN_ON_ONCE(!ret);
> +	return ret;
> +}
> +
> +/*
> + * By this point dax_grab_mapping_entry() has ensured that we have a locked entry
> + * of the appropriate size so we don't have to worry about downgrading PMDs to
> + * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
> + * already in the tree, we will skip the insertion and just dirty the PMD as
> + * appropriate.
> + */
> +vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
> +			    void **pentry, pfn_t pfn, unsigned long flags)
> +{
> +	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> +	void *new_entry = dax_make_entry(pfn, flags);
> +	bool dirty = flags & DAX_DIRTY;
> +	bool cow = flags & DAX_COW;
> +	void *entry = *pentry;
> +
> +	if (dirty)
> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +
> +	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
> +		unsigned long index = xas->xa_index;
> +		/* we are replacing a zero page with block mapping */
> +		if (dax_is_pmd_entry(entry))
> +			unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
> +					    PG_PMD_NR, false);
> +		else /* pte entry */
> +			unmap_mapping_pages(mapping, index, 1, false);
> +	}
> +
> +	xas_reset(xas);
> +	xas_lock_irq(xas);
> +	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> +		void *old;
> +
> +		dax_disassociate_entry(entry, mapping, false);
> +		dax_associate_entry(new_entry, mapping, vmf, flags);
> +		/*
> +		 * Only swap our new entry into the page cache if the current
> +		 * entry is a zero page or an empty entry.  If a normal PTE or
> +		 * PMD entry is already in the cache, we leave it alone.  This
> +		 * means that if we are trying to insert a PTE and the
> +		 * existing entry is a PMD, we will just leave the PMD in the
> +		 * tree and dirty it if necessary.
> +		 */
> +		old = dax_lock_entry(xas, new_entry);
> +		WARN_ON_ONCE(old !=
> +			     xa_mk_value(xa_to_value(entry) | DAX_LOCKED));
> +		entry = new_entry;
> +	} else {
> +		xas_load(xas); /* Walk the xa_state */
> +	}
> +
> +	if (dirty)
> +		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
> +
> +	if (cow)
> +		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
> +
> +	xas_unlock_irq(xas);
> +	*pentry = entry;
> +	return 0;
> +}
> +
> +int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
> +		      struct address_space *mapping, void *entry)
> +{
> +	unsigned long pfn, index, count, end;
> +	long ret = 0;
> +	struct vm_area_struct *vma;
> +
> +	/*
> +	 * A page got tagged dirty in DAX mapping? Something is seriously
> +	 * wrong.
> +	 */
> +	if (WARN_ON(!xa_is_value(entry)))
> +		return -EIO;
> +
> +	if (unlikely(dax_is_locked(entry))) {
> +		void *old_entry = entry;
> +
> +		entry = get_unlocked_entry(xas, 0);
> +
> +		/* Entry got punched out / reallocated? */
> +		if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
> +			goto put_unlocked;
> +		/*
> +		 * Entry got reallocated elsewhere? No need to writeback.
> +		 * We have to compare pfns as we must not bail out due to
> +		 * difference in lockbit or entry type.
> +		 */
> +		if (dax_to_pfn(old_entry) != dax_to_pfn(entry))
> +			goto put_unlocked;
> +		if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
> +					dax_is_zero_entry(entry))) {
> +			ret = -EIO;
> +			goto put_unlocked;
> +		}
> +
> +		/* Another fsync thread may have already done this entry */
> +		if (!xas_get_mark(xas, PAGECACHE_TAG_TOWRITE))
> +			goto put_unlocked;
> +	}
> +
> +	/* Lock the entry to serialize with page faults */
> +	dax_lock_entry(xas, entry);
> +
> +	/*
> +	 * We can clear the tag now but we have to be careful so that concurrent
> +	 * dax_writeback_one() calls for the same index cannot finish before we
> +	 * actually flush the caches. This is achieved as the calls will look
> +	 * at the entry only under the i_pages lock and once they do that
> +	 * they will see the entry locked and wait for it to unlock.
> +	 */
> +	xas_clear_mark(xas, PAGECACHE_TAG_TOWRITE);
> +	xas_unlock_irq(xas);
> +
> +	/*
> +	 * If dax_writeback_mapping_range() was given a wbc->range_start
> +	 * in the middle of a PMD, the 'index' we use needs to be
> +	 * aligned to the start of the PMD.
> +	 * This allows us to flush for PMD_SIZE and not have to worry about
> +	 * partial PMD writebacks.
> +	 */
> +	pfn = dax_to_pfn(entry);
> +	count = 1UL << dax_entry_order(entry);
> +	index = xas->xa_index & ~(count - 1);
> +	end = index + count - 1;
> +
> +	/* Walk all mappings of a given index of a file and writeprotect them */
> +	i_mmap_lock_read(mapping);
> +	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, end) {
> +		pfn_mkclean_range(pfn, count, index, vma);
> +		cond_resched();
> +	}
> +	i_mmap_unlock_read(mapping);
> +
> +	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
> +	/*
> +	 * After we have flushed the cache, we can clear the dirty tag. There
> +	 * cannot be new dirty data in the pfn after the flush has completed as
> +	 * the pfn mappings are writeprotected and fault waits for mapping
> +	 * entry lock.
> +	 */
> +	xas_reset(xas);
> +	xas_lock_irq(xas);
> +	xas_store(xas, entry);
> +	xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
> +	dax_wake_entry(xas, entry, WAKE_NEXT);
> +
> +	trace_dax_writeback_one(mapping->host, index, count);
> +	return ret;
> +
> + put_unlocked:
> +	put_unlocked_entry(xas, entry, WAKE_NEXT);
> +	return ret;
> +}
> +
> +/*
> + * dax_insert_pfn_mkwrite - insert PTE or PMD entry into page tables
> + * @vmf: The description of the fault
> + * @pfn: PFN to insert
> + * @order: Order of entry to insert.
> + *
> + * This function inserts a writeable PTE or PMD entry into the page tables
> + * for an mmaped DAX file.  It also marks the page cache entry as dirty.
> + */
> +vm_fault_t dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn,
> +				  unsigned int order)
> +{
> +	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> +	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, order);
> +	void *entry;
> +	vm_fault_t ret;
> +
> +	xas_lock_irq(&xas);
> +	entry = get_unlocked_entry(&xas, order);
> +	/* Did we race with someone splitting entry or so? */
> +	if (!entry || dax_is_conflict(entry) ||
> +	    (order == 0 && !dax_is_pte_entry(entry))) {
> +		put_unlocked_entry(&xas, entry, WAKE_NEXT);
> +		xas_unlock_irq(&xas);
> +		trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
> +						      VM_FAULT_NOPAGE);
> +		return VM_FAULT_NOPAGE;
> +	}
> +	xas_set_mark(&xas, PAGECACHE_TAG_DIRTY);
> +	dax_lock_entry(&xas, entry);
> +	xas_unlock_irq(&xas);
> +	if (order == 0)
> +		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
> +#ifdef CONFIG_FS_DAX_PMD
> +	else if (order == PMD_ORDER)
> +		ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
> +#endif
> +	else
> +		ret = VM_FAULT_FALLBACK;
> +	dax_unlock_entry(&xas, entry);
> +	trace_dax_insert_pfn_mkwrite(mapping->host, vmf, ret);
> +	return ret;
> +}
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 4909ad945a49..0976857ec7f2 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -564,6 +564,8 @@ static int __init dax_core_init(void)
>  	if (rc)
>  		return rc;
>
> +	dax_mapping_init();
> +
>  	rc = alloc_chrdev_region(&dax_devt, 0, MINORMASK+1, "dax");
>  	if (rc)
>  		goto err_chrdev;
> @@ -590,5 +592,5 @@ static void __exit dax_core_exit(void)
>
>  MODULE_AUTHOR("Intel Corporation");
>  MODULE_LICENSE("GPL v2");
> -subsys_initcall(dax_core_init);
> +fs_initcall(dax_core_init);
>  module_exit(dax_core_exit);
> diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
> index 5a29046e3319..3bb17448d1c8 100644
> --- a/drivers/nvdimm/Kconfig
> +++ b/drivers/nvdimm/Kconfig
> @@ -78,6 +78,7 @@ config NVDIMM_DAX
>  	bool "NVDIMM DAX: Raw access to persistent memory"
>  	default LIBNVDIMM
>  	depends on NVDIMM_PFN
> +	depends on DAX
>  	help
>  	  Support raw device dax access to a persistent memory
>  	  namespace.  For environments that want to hard partition
> diff --git a/fs/dax.c b/fs/dax.c
> index ee2568c8b135..79e49e718d33 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -27,847 +27,8 @@
>  #include <linux/rmap.h>
>  #include <asm/pgalloc.h>
>
> -#define CREATE_TRACE_POINTS
>  #include <trace/events/fs_dax.h>
>
> -static inline unsigned int pe_order(enum page_entry_size pe_size)
> -{
> -	if (pe_size == PE_SIZE_PTE)
> -		return PAGE_SHIFT - PAGE_SHIFT;
> -	if (pe_size == PE_SIZE_PMD)
> -		return PMD_SHIFT - PAGE_SHIFT;
> -	if (pe_size == PE_SIZE_PUD)
> -		return PUD_SHIFT - PAGE_SHIFT;
> -	return ~0;
> -}
> -
> -/* We choose 4096 entries - same as per-zone page wait tables */
> -#define DAX_WAIT_TABLE_BITS 12
> -#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
> -
> -/* The 'colour' (ie low bits) within a PMD of a page offset.  */
> -#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
> -#define PG_PMD_NR	(PMD_SIZE >> PAGE_SHIFT)
> -
> -/* The order of a PMD entry */
> -#define PMD_ORDER	(PMD_SHIFT - PAGE_SHIFT)
> -
> -static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
> -
> -static int __init init_dax_wait_table(void)
> -{
> -	int i;
> -
> -	for (i = 0; i < DAX_WAIT_TABLE_ENTRIES; i++)
> -		init_waitqueue_head(wait_table + i);
> -	return 0;
> -}
> -fs_initcall(init_dax_wait_table);
> -
> -/*
> - * DAX pagecache entries use XArray value entries so they can't be mistaken
> - * for pages.  We use one bit for locking, one bit for the entry size (PMD)
> - * and two more to tell us if the entry is a zero page or an empty entry that
> - * is just used for locking.  In total four special bits.
> - *
> - * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE
> - * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
> - * block allocation.
> - */
> -#define DAX_SHIFT	(5)
> -#define DAX_MASK	((1UL << DAX_SHIFT) - 1)
> -#define DAX_LOCKED	(1UL << 0)
> -#define DAX_PMD		(1UL << 1)
> -#define DAX_ZERO_PAGE	(1UL << 2)
> -#define DAX_EMPTY	(1UL << 3)
> -#define DAX_ZAP		(1UL << 4)
> -
> -/*
> - * These flags are not conveyed in Xarray value entries, they are just
> - * modifiers to dax_insert_entry().
> - */
> -#define DAX_DIRTY (1UL << (DAX_SHIFT + 0))
> -#define DAX_COW   (1UL << (DAX_SHIFT + 1))
> -
> -static unsigned long dax_to_pfn(void *entry)
> -{
> -	return xa_to_value(entry) >> DAX_SHIFT;
> -}
> -
> -static void *dax_make_entry(pfn_t pfn, unsigned long flags)
> -{
> -	return xa_mk_value((flags & DAX_MASK) |
> -			   (pfn_t_to_pfn(pfn) << DAX_SHIFT));
> -}
> -
> -static bool dax_is_locked(void *entry)
> -{
> -	return xa_to_value(entry) & DAX_LOCKED;
> -}
> -
> -static bool dax_is_zapped(void *entry)
> -{
> -	return xa_to_value(entry) & DAX_ZAP;
> -}
> -
> -static unsigned int dax_entry_order(void *entry)
> -{
> -	if (xa_to_value(entry) & DAX_PMD)
> -		return PMD_ORDER;
> -	return 0;
> -}
> -
> -static unsigned long dax_is_pmd_entry(void *entry)
> -{
> -	return xa_to_value(entry) & DAX_PMD;
> -}
> -
> -static bool dax_is_pte_entry(void *entry)
> -{
> -	return !(xa_to_value(entry) & DAX_PMD);
> -}
> -
> -static int dax_is_zero_entry(void *entry)
> -{
> -	return xa_to_value(entry) & DAX_ZERO_PAGE;
> -}
> -
> -static int dax_is_empty_entry(void *entry)
> -{
> -	return xa_to_value(entry) & DAX_EMPTY;
> -}
> -
> -/*
> - * true if the entry that was found is of a smaller order than the entry
> - * we were looking for
> - */
> -static bool dax_is_conflict(void *entry)
> -{
> -	return entry == XA_RETRY_ENTRY;
> -}
> -
> -/*
> - * DAX page cache entry locking
> - */
> -struct exceptional_entry_key {
> -	struct xarray *xa;
> -	pgoff_t entry_start;
> -};
> -
> -struct wait_exceptional_entry_queue {
> -	wait_queue_entry_t wait;
> -	struct exceptional_entry_key key;
> -};
> -
> -/**
> - * enum dax_wake_mode: waitqueue wakeup behaviour
> - * @WAKE_ALL: wake all waiters in the waitqueue
> - * @WAKE_NEXT: wake only the first waiter in the waitqueue
> - */
> -enum dax_wake_mode {
> -	WAKE_ALL,
> -	WAKE_NEXT,
> -};
> -
> -static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
> -		void *entry, struct exceptional_entry_key *key)
> -{
> -	unsigned long hash;
> -	unsigned long index = xas->xa_index;
> -
> -	/*
> -	 * If 'entry' is a PMD, align the 'index' that we use for the wait
> -	 * queue to the start of that PMD.  This ensures that all offsets in
> -	 * the range covered by the PMD map to the same bit lock.
> -	 */
> -	if (dax_is_pmd_entry(entry))
> -		index &= ~PG_PMD_COLOUR;
> -	key->xa = xas->xa;
> -	key->entry_start = index;
> -
> -	hash = hash_long((unsigned long)xas->xa ^ index, DAX_WAIT_TABLE_BITS);
> -	return wait_table + hash;
> -}
> -
> -static int wake_exceptional_entry_func(wait_queue_entry_t *wait,
> -		unsigned int mode, int sync, void *keyp)
> -{
> -	struct exceptional_entry_key *key = keyp;
> -	struct wait_exceptional_entry_queue *ewait =
> -		container_of(wait, struct wait_exceptional_entry_queue, wait);
> -
> -	if (key->xa != ewait->key.xa ||
> -	    key->entry_start != ewait->key.entry_start)
> -		return 0;
> -	return autoremove_wake_function(wait, mode, sync, NULL);
> -}
> -
> -/*
> - * @entry may no longer be the entry at the index in the mapping.
> - * The important information it's conveying is whether the entry at
> - * this index used to be a PMD entry.
> - */
> -static void dax_wake_entry(struct xa_state *xas, void *entry,
> -			   enum dax_wake_mode mode)
> -{
> -	struct exceptional_entry_key key;
> -	wait_queue_head_t *wq;
> -
> -	wq = dax_entry_waitqueue(xas, entry, &key);
> -
> -	/*
> -	 * Checking for locked entry and prepare_to_wait_exclusive() happens
> -	 * under the i_pages lock, ditto for entry handling in our callers.
> -	 * So at this point all tasks that could have seen our entry locked
> -	 * must be in the waitqueue and the following check will see them.
> -	 */
> -	if (waitqueue_active(wq))
> -		__wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key);
> -}
> -
> -/*
> - * Look up entry in page cache, wait for it to become unlocked if it
> - * is a DAX entry and return it.  The caller must subsequently call
> - * put_unlocked_entry() if it did not lock the entry or dax_unlock_entry()
> - * if it did.  The entry returned may have a larger order than @order.
> - * If @order is larger than the order of the entry found in i_pages, this
> - * function returns a dax_is_conflict entry.
> - *
> - * Must be called with the i_pages lock held.
> - */
> -static void *get_unlocked_entry(struct xa_state *xas, unsigned int order)
> -{
> -	void *entry;
> -	struct wait_exceptional_entry_queue ewait;
> -	wait_queue_head_t *wq;
> -
> -	init_wait(&ewait.wait);
> -	ewait.wait.func = wake_exceptional_entry_func;
> -
> -	for (;;) {
> -		entry = xas_find_conflict(xas);
> -		if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
> -			return entry;
> -		if (dax_entry_order(entry) < order)
> -			return XA_RETRY_ENTRY;
> -		if (!dax_is_locked(entry))
> -			return entry;
> -
> -		wq = dax_entry_waitqueue(xas, entry, &ewait.key);
> -		prepare_to_wait_exclusive(wq, &ewait.wait,
> -					  TASK_UNINTERRUPTIBLE);
> -		xas_unlock_irq(xas);
> -		xas_reset(xas);
> -		schedule();
> -		finish_wait(wq, &ewait.wait);
> -		xas_lock_irq(xas);
> -	}
> -}
> -
> -/*
> - * The only thing keeping the address space around is the i_pages lock
> - * (it's cycled in clear_inode() after removing the entries from i_pages)
> - * After we call xas_unlock_irq(), we cannot touch xas->xa.
> - */
> -static void wait_entry_unlocked(struct xa_state *xas, void *entry)
> -{
> -	struct wait_exceptional_entry_queue ewait;
> -	wait_queue_head_t *wq;
> -
> -	init_wait(&ewait.wait);
> -	ewait.wait.func = wake_exceptional_entry_func;
> -
> -	wq = dax_entry_waitqueue(xas, entry, &ewait.key);
> -	/*
> -	 * Unlike get_unlocked_entry() there is no guarantee that this
> -	 * path ever successfully retrieves an unlocked entry before an
> -	 * inode dies. Perform a non-exclusive wait in case this path
> -	 * never successfully performs its own wake up.
> -	 */
> -	prepare_to_wait(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
> -	xas_unlock_irq(xas);
> -	schedule();
> -	finish_wait(wq, &ewait.wait);
> -}
> -
> -static void put_unlocked_entry(struct xa_state *xas, void *entry,
> -			       enum dax_wake_mode mode)
> -{
> -	if (entry && !dax_is_conflict(entry))
> -		dax_wake_entry(xas, entry, mode);
> -}
> -
> -/*
> - * We used the xa_state to get the entry, but then we locked the entry and
> - * dropped the xa_lock, so we know the xa_state is stale and must be reset
> - * before use.
> - */
> -static void dax_unlock_entry(struct xa_state *xas, void *entry)
> -{
> -	void *old;
> -
> -	BUG_ON(dax_is_locked(entry));
> -	xas_reset(xas);
> -	xas_lock_irq(xas);
> -	old = xas_store(xas, entry);
> -	xas_unlock_irq(xas);
> -	BUG_ON(!dax_is_locked(old));
> -	dax_wake_entry(xas, entry, WAKE_NEXT);
> -}
> -
> -/*
> - * Return: The entry stored at this location before it was locked.
> - */
> -static void *dax_lock_entry(struct xa_state *xas, void *entry)
> -{
> -	unsigned long v = xa_to_value(entry);
> -	return xas_store(xas, xa_mk_value(v | DAX_LOCKED));
> -}
> -
> -static unsigned long dax_entry_size(void *entry)
> -{
> -	if (dax_is_zero_entry(entry))
> -		return 0;
> -	else if (dax_is_empty_entry(entry))
> -		return 0;
> -	else if (dax_is_pmd_entry(entry))
> -		return PMD_SIZE;
> -	else
> -		return PAGE_SIZE;
> -}
> -
> -static unsigned long dax_end_pfn(void *entry)
> -{
> -	return dax_to_pfn(entry) + dax_entry_size(entry) / PAGE_SIZE;
> -}
> -
> -/*
> - * Iterate through all mapped pfns represented by an entry, i.e. skip
> - * 'empty' and 'zero' entries.
> - */
> -#define for_each_mapped_pfn(entry, pfn) \
> -	for (pfn = dax_to_pfn(entry); \
> -			pfn < dax_end_pfn(entry); pfn++)
> -
> -static inline bool dax_mapping_is_cow(struct address_space *mapping)
> -{
> -	return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
> -}
> -
> -/*
> - * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
> - */
> -static inline void dax_mapping_set_cow(struct page *page)
> -{
> -	if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) {
> -		/*
> -		 * Reset the index if the page was already mapped
> -		 * regularly before.
> -		 */
> -		if (page->mapping)
> -			page->index = 1;
> -		page->mapping = (void *)PAGE_MAPPING_DAX_COW;
> -	}
> -	page->index++;
> -}
> -
> -/*
> - * When it is called in dax_insert_entry(), the cow flag will indicate that
> - * whether this entry is shared by multiple files.  If so, set the page->mapping
> - * FS_DAX_MAPPING_COW, and use page->index as refcount.
> - */
> -static vm_fault_t dax_associate_entry(void *entry,
> -				      struct address_space *mapping,
> -				      struct vm_fault *vmf, unsigned long flags)
> -{
> -	unsigned long size = dax_entry_size(entry), pfn, index;
> -	struct dev_pagemap *pgmap;
> -	int i = 0;
> -
> -	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> -		return 0;
> -
> -	if (!size)
> -		return 0;
> -
> -	if (!(flags & DAX_COW)) {
> -		pfn = dax_to_pfn(entry);
> -		pgmap = get_dev_pagemap_many(pfn, NULL, PHYS_PFN(size));
> -		if (!pgmap)
> -			return VM_FAULT_SIGBUS;
> -	}
> -
> -	index = linear_page_index(vmf->vma, ALIGN(vmf->address, size));
> -	for_each_mapped_pfn(entry, pfn) {
> -		struct page *page = pfn_to_page(pfn);
> -
> -		if (flags & DAX_COW) {
> -			dax_mapping_set_cow(page);
> -		} else {
> -			WARN_ON_ONCE(page->mapping);
> -			page->mapping = mapping;
> -			page->index = index + i++;
> -			page_ref_inc(page);
> -		}
> -	}
> -
> -	return 0;
> -}
> -
> -static void dax_disassociate_entry(void *entry, struct address_space *mapping,
> -		bool trunc)
> -{
> -	unsigned long size = dax_entry_size(entry), pfn;
> -	struct page *page;
> -
> -	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> -		return;
> -
> -	if (!size)
> -		return;
> -
> -	for_each_mapped_pfn(entry, pfn) {
> -		page = pfn_to_page(pfn);
> -		if (dax_mapping_is_cow(page->mapping)) {
> -			/* keep the CoW flag if this page is still shared */
> -			if (page->index-- > 0)
> -				continue;
> -		} else {
> -			WARN_ON_ONCE(trunc && !dax_is_zapped(entry));
> -			WARN_ON_ONCE(trunc && !dax_page_idle(page));
> -			WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> -		}
> -		page->mapping = NULL;
> -		page->index = 0;
> -	}
> -
> -	if (trunc && !dax_mapping_is_cow(page->mapping)) {
> -		page = pfn_to_page(dax_to_pfn(entry));
> -		put_dev_pagemap_many(page->pgmap, PHYS_PFN(size));
> -	}
> -}
> -
> -/*
> - * dax_lock_page - Lock the DAX entry corresponding to a page
> - * @page: The page whose entry we want to lock
> - *
> - * Context: Process context.
> - * Return: A cookie to pass to dax_unlock_page() or 0 if the entry could
> - * not be locked.
> - */
> -dax_entry_t dax_lock_page(struct page *page)
> -{
> -	XA_STATE(xas, NULL, 0);
> -	void *entry;
> -
> -	/* Ensure page->mapping isn't freed while we look at it */
> -	rcu_read_lock();
> -	for (;;) {
> -		struct address_space *mapping = READ_ONCE(page->mapping);
> -
> -		entry = NULL;
> -		if (!mapping || !dax_mapping(mapping))
> -			break;
> -
> -		/*
> -		 * In the device-dax case there's no need to lock, a
> -		 * struct dev_pagemap pin is sufficient to keep the
> -		 * inode alive, and we assume we have dev_pagemap pin
> -		 * otherwise we would not have a valid pfn_to_page()
> -		 * translation.
> -		 */
> -		entry = (void *)~0UL;
> -		if (S_ISCHR(mapping->host->i_mode))
> -			break;
> -
> -		xas.xa = &mapping->i_pages;
> -		xas_lock_irq(&xas);
> -		if (mapping != page->mapping) {
> -			xas_unlock_irq(&xas);
> -			continue;
> -		}
> -		xas_set(&xas, page->index);
> -		entry = xas_load(&xas);
> -		if (dax_is_locked(entry)) {
> -			rcu_read_unlock();
> -			wait_entry_unlocked(&xas, entry);
> -			rcu_read_lock();
> -			continue;
> -		}
> -		dax_lock_entry(&xas, entry);
> -		xas_unlock_irq(&xas);
> -		break;
> -	}
> -	rcu_read_unlock();
> -	return (dax_entry_t)entry;
> -}
> -
> -void dax_unlock_page(struct page *page, dax_entry_t cookie)
> -{
> -	struct address_space *mapping = page->mapping;
> -	XA_STATE(xas, &mapping->i_pages, page->index);
> -
> -	if (S_ISCHR(mapping->host->i_mode))
> -		return;
> -
> -	dax_unlock_entry(&xas, (void *)cookie);
> -}
> -
> -/*
> - * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping
> - * @mapping: the file's mapping whose entry we want to lock
> - * @index: the offset within this file
> - * @page: output the dax page corresponding to this dax entry
> - *
> - * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry
> - * could not be locked.
> - */
> -dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t index,
> -		struct page **page)
> -{
> -	XA_STATE(xas, NULL, 0);
> -	void *entry;
> -
> -	rcu_read_lock();
> -	for (;;) {
> -		entry = NULL;
> -		if (!dax_mapping(mapping))
> -			break;
> -
> -		xas.xa = &mapping->i_pages;
> -		xas_lock_irq(&xas);
> -		xas_set(&xas, index);
> -		entry = xas_load(&xas);
> -		if (dax_is_locked(entry)) {
> -			rcu_read_unlock();
> -			wait_entry_unlocked(&xas, entry);
> -			rcu_read_lock();
> -			continue;
> -		}
> -		if (!entry ||
> -		    dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> -			/*
> -			 * Because we are looking for entry from file's mapping
> -			 * and index, so the entry may not be inserted for now,
> -			 * or even a zero/empty entry.  We don't think this is
> -			 * an error case.  So, return a special value and do
> -			 * not output @page.
> -			 */
> -			entry = (void *)~0UL;
> -		} else {
> -			*page = pfn_to_page(dax_to_pfn(entry));
> -			dax_lock_entry(&xas, entry);
> -		}
> -		xas_unlock_irq(&xas);
> -		break;
> -	}
> -	rcu_read_unlock();
> -	return (dax_entry_t)entry;
> -}
> -
> -void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index,
> -		dax_entry_t cookie)
> -{
> -	XA_STATE(xas, &mapping->i_pages, index);
> -
> -	if (cookie == ~0UL)
> -		return;
> -
> -	dax_unlock_entry(&xas, (void *)cookie);
> -}
> -
> -/*
> - * Find page cache entry at given index. If it is a DAX entry, return it
> - * with the entry locked. If the page cache doesn't contain an entry at
> - * that index, add a locked empty entry.
> - *
> - * When requesting an entry with size DAX_PMD, grab_mapping_entry() will
> - * either return that locked entry or will return VM_FAULT_FALLBACK.
> - * This will happen if there are any PTE entries within the PMD range
> - * that we are requesting.
> - *
> - * We always favor PTE entries over PMD entries. There isn't a flow where we
> - * evict PTE entries in order to 'upgrade' them to a PMD entry.  A PMD
> - * insertion will fail if it finds any PTE entries already in the tree, and a
> - * PTE insertion will cause an existing PMD entry to be unmapped and
> - * downgraded to PTE entries.  This happens for both PMD zero pages as
> - * well as PMD empty entries.
> - *
> - * The exception to this downgrade path is for PMD entries that have
> - * real storage backing them.  We will leave these real PMD entries in
> - * the tree, and PTE writes will simply dirty the entire PMD entry.
> - *
> - * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
> - * persistent memory the benefit is doubtful. We can add that later if we can
> - * show it helps.
> - *
> - * On error, this function does not return an ERR_PTR.  Instead it returns
> - * a VM_FAULT code, encoded as an xarray internal entry.  The ERR_PTR values
> - * overlap with xarray value entries.
> - */
> -static void *grab_mapping_entry(struct xa_state *xas,
> -		struct address_space *mapping, unsigned int order)
> -{
> -	unsigned long index = xas->xa_index;
> -	bool pmd_downgrade;	/* splitting PMD entry into PTE entries? */
> -	void *entry;
> -
> -retry:
> -	pmd_downgrade = false;
> -	xas_lock_irq(xas);
> -	entry = get_unlocked_entry(xas, order);
> -
> -	if (entry) {
> -		if (dax_is_conflict(entry))
> -			goto fallback;
> -		if (!xa_is_value(entry)) {
> -			xas_set_err(xas, -EIO);
> -			goto out_unlock;
> -		}
> -
> -		if (order == 0) {
> -			if (dax_is_pmd_entry(entry) &&
> -			    (dax_is_zero_entry(entry) ||
> -			     dax_is_empty_entry(entry))) {
> -				pmd_downgrade = true;
> -			}
> -		}
> -	}
> -
> -	if (pmd_downgrade) {
> -		/*
> -		 * Make sure 'entry' remains valid while we drop
> -		 * the i_pages lock.
> -		 */
> -		dax_lock_entry(xas, entry);
> -
> -		/*
> -		 * Besides huge zero pages the only other thing that gets
> -		 * downgraded are empty entries which don't need to be
> -		 * unmapped.
> -		 */
> -		if (dax_is_zero_entry(entry)) {
> -			xas_unlock_irq(xas);
> -			unmap_mapping_pages(mapping,
> -					xas->xa_index & ~PG_PMD_COLOUR,
> -					PG_PMD_NR, false);
> -			xas_reset(xas);
> -			xas_lock_irq(xas);
> -		}
> -
> -		dax_disassociate_entry(entry, mapping, false);
> -		xas_store(xas, NULL);	/* undo the PMD join */
> -		dax_wake_entry(xas, entry, WAKE_ALL);
> -		mapping->nrpages -= PG_PMD_NR;
> -		entry = NULL;
> -		xas_set(xas, index);
> -	}
> -
> -	if (entry) {
> -		dax_lock_entry(xas, entry);
> -	} else {
> -		unsigned long flags = DAX_EMPTY;
> -
> -		if (order > 0)
> -			flags |= DAX_PMD;
> -		entry = dax_make_entry(pfn_to_pfn_t(0), flags);
> -		dax_lock_entry(xas, entry);
> -		if (xas_error(xas))
> -			goto out_unlock;
> -		mapping->nrpages += 1UL << order;
> -	}
> -
> -out_unlock:
> -	xas_unlock_irq(xas);
> -	if (xas_nomem(xas, mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM))
> -		goto retry;
> -	if (xas->xa_node == XA_ERROR(-ENOMEM))
> -		return xa_mk_internal(VM_FAULT_OOM);
> -	if (xas_error(xas))
> -		return xa_mk_internal(VM_FAULT_SIGBUS);
> -	return entry;
> -fallback:
> -	xas_unlock_irq(xas);
> -	return xa_mk_internal(VM_FAULT_FALLBACK);
> -}
> -
> -static void *dax_zap_entry(struct xa_state *xas, void *entry)
> -{
> -	unsigned long v = xa_to_value(entry);
> -
> -	return xas_store(xas, xa_mk_value(v | DAX_ZAP));
> -}
> -
> -/**
> - * Return NULL if the entry is zapped and all pages in the entry are
> - * idle, otherwise return the non-idle page in the entry
> - */
> -static struct page *dax_zap_pages(struct xa_state *xas, void *entry)
> -{
> -	struct page *ret = NULL;
> -	unsigned long pfn;
> -	bool zap;
> -
> -	if (!dax_entry_size(entry))
> -		return NULL;
> -
> -	zap = !dax_is_zapped(entry);
> -
> -	for_each_mapped_pfn(entry, pfn) {
> -		struct page *page = pfn_to_page(pfn);
> -
> -		if (zap)
> -			page_ref_dec(page);
> -
> -		if (!ret && !dax_page_idle(page))
> -			ret = page;
> -	}
> -
> -	if (zap)
> -		dax_zap_entry(xas, entry);
> -
> -	return ret;
> -}
> -
> -/**
> - * dax_zap_mappings_range - find first pinned page in @mapping
> - * @mapping: address space to scan for a page with ref count > 1
> - * @start: Starting offset. Page containing 'start' is included.
> - * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
> - *       pages from 'start' till the end of file are included.
> - *
> - * DAX requires ZONE_DEVICE mapped pages. These pages are never
> - * 'onlined' to the page allocator so they are considered idle when
> - * page->count == 1. A filesystem uses this interface to determine if
> - * any page in the mapping is busy, i.e. for DMA, or other
> - * get_user_pages() usages.
> - *
> - * It is expected that the filesystem is holding locks to block the
> - * establishment of new mappings in this address_space. I.e. it expects
> - * to be able to run unmap_mapping_range() and subsequently not race
> - * mapping_mapped() becoming true.
> - */
> -struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
> -				    loff_t end)
> -{
> -	void *entry;
> -	unsigned int scanned = 0;
> -	struct page *page = NULL;
> -	pgoff_t start_idx = start >> PAGE_SHIFT;
> -	pgoff_t end_idx;
> -	XA_STATE(xas, &mapping->i_pages, start_idx);
> -
> -	/*
> -	 * In the 'limited' case get_user_pages() for dax is disabled.
> -	 */
> -	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> -		return NULL;
> -
> -	if (!dax_mapping(mapping))
> -		return NULL;
> -
> -	/* If end == LLONG_MAX, all pages from start to till end of file */
> -	if (end == LLONG_MAX)
> -		end_idx = ULONG_MAX;
> -	else
> -		end_idx = end >> PAGE_SHIFT;
> -	/*
> -	 * If we race get_user_pages_fast() here either we'll see the
> -	 * elevated page count in the iteration and wait, or
> -	 * get_user_pages_fast() will see that the page it took a reference
> -	 * against is no longer mapped in the page tables and bail to the
> -	 * get_user_pages() slow path.  The slow path is protected by
> -	 * pte_lock() and pmd_lock(). New references are not taken without
> -	 * holding those locks, and unmap_mapping_pages() will not zero the
> -	 * pte or pmd without holding the respective lock, so we are
> -	 * guaranteed to either see new references or prevent new
> -	 * references from being established.
> -	 */
> -	unmap_mapping_pages(mapping, start_idx, end_idx - start_idx + 1, 0);
> -
> -	xas_lock_irq(&xas);
> -	xas_for_each(&xas, entry, end_idx) {
> -		if (WARN_ON_ONCE(!xa_is_value(entry)))
> -			continue;
> -		if (unlikely(dax_is_locked(entry)))
> -			entry = get_unlocked_entry(&xas, 0);
> -		if (entry)
> -			page = dax_zap_pages(&xas, entry);
> -		put_unlocked_entry(&xas, entry, WAKE_NEXT);
> -		if (page)
> -			break;
> -		if (++scanned % XA_CHECK_SCHED)
> -			continue;
> -
> -		xas_pause(&xas);
> -		xas_unlock_irq(&xas);
> -		cond_resched();
> -		xas_lock_irq(&xas);
> -	}
> -	xas_unlock_irq(&xas);
> -	return page;
> -}
> -EXPORT_SYMBOL_GPL(dax_zap_mappings_range);
> -
> -struct page *dax_zap_mappings(struct address_space *mapping)
> -{
> -	return dax_zap_mappings_range(mapping, 0, LLONG_MAX);
> -}
> -EXPORT_SYMBOL_GPL(dax_zap_mappings);
> -
> -static int __dax_invalidate_entry(struct address_space *mapping,
> -					  pgoff_t index, bool trunc)
> -{
> -	XA_STATE(xas, &mapping->i_pages, index);
> -	int ret = 0;
> -	void *entry;
> -
> -	xas_lock_irq(&xas);
> -	entry = get_unlocked_entry(&xas, 0);
> -	if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
> -		goto out;
> -	if (!trunc &&
> -	    (xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) ||
> -	     xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE)))
> -		goto out;
> -	dax_disassociate_entry(entry, mapping, trunc);
> -	xas_store(&xas, NULL);
> -	mapping->nrpages -= 1UL << dax_entry_order(entry);
> -	ret = 1;
> -out:
> -	put_unlocked_entry(&xas, entry, WAKE_ALL);
> -	xas_unlock_irq(&xas);
> -	return ret;
> -}
> -
> -/*
> - * Delete DAX entry at @index from @mapping.  Wait for it
> - * to be unlocked before deleting it.
> - */
> -int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
> -{
> -	int ret = __dax_invalidate_entry(mapping, index, true);
> -
> -	/*
> -	 * This gets called from truncate / punch_hole path. As such, the caller
> -	 * must hold locks protecting against concurrent modifications of the
> -	 * page cache (usually fs-private i_mmap_sem for writing). Since the
> -	 * caller has seen a DAX entry for this index, we better find it
> -	 * at that index as well...
> -	 */
> -	WARN_ON_ONCE(!ret);
> -	return ret;
> -}
> -
> -/*
> - * Invalidate DAX entry if it is clean.
> - */
> -int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
> -				      pgoff_t index)
> -{
> -	return __dax_invalidate_entry(mapping, index, false);
> -}
> -
>  static pgoff_t dax_iomap_pgoff(const struct iomap *iomap, loff_t pos)
>  {
>  	return PHYS_PFN(iomap->addr + (pos & PAGE_MASK) - iomap->offset);
> @@ -894,195 +55,6 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter *iter
>  	return 0;
>  }
>
> -/*
> - * MAP_SYNC on a dax mapping guarantees dirty metadata is
> - * flushed on write-faults (non-cow), but not read-faults.
> - */
> -static bool dax_fault_is_synchronous(const struct iomap_iter *iter,
> -		struct vm_area_struct *vma)
> -{
> -	return (iter->flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC) &&
> -		(iter->iomap.flags & IOMAP_F_DIRTY);
> -}
> -
> -static bool dax_fault_is_cow(const struct iomap_iter *iter)
> -{
> -	return (iter->flags & IOMAP_WRITE) &&
> -		(iter->iomap.flags & IOMAP_F_SHARED);
> -}
> -
> -static unsigned long dax_iter_flags(const struct iomap_iter *iter,
> -				    struct vm_fault *vmf)
> -{
> -	unsigned long flags = 0;
> -
> -	if (!dax_fault_is_synchronous(iter, vmf->vma))
> -		flags |= DAX_DIRTY;
> -
> -	if (dax_fault_is_cow(iter))
> -		flags |= DAX_COW;
> -
> -	return flags;
> -}
> -
> -/*
> - * By this point grab_mapping_entry() has ensured that we have a locked entry
> - * of the appropriate size so we don't have to worry about downgrading PMDs to
> - * PTEs.  If we happen to be trying to insert a PTE and there is a PMD
> - * already in the tree, we will skip the insertion and just dirty the PMD as
> - * appropriate.
> - */
> -static vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
> -				   void **pentry, pfn_t pfn,
> -				   unsigned long flags)
> -{
> -	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> -	void *new_entry = dax_make_entry(pfn, flags);
> -	bool dirty = flags & DAX_DIRTY;
> -	bool cow = flags & DAX_COW;
> -	void *entry = *pentry;
> -
> -	if (dirty)
> -		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> -
> -	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
> -		unsigned long index = xas->xa_index;
> -		/* we are replacing a zero page with block mapping */
> -		if (dax_is_pmd_entry(entry))
> -			unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
> -					PG_PMD_NR, false);
> -		else /* pte entry */
> -			unmap_mapping_pages(mapping, index, 1, false);
> -	}
> -
> -	xas_reset(xas);
> -	xas_lock_irq(xas);
> -	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> -		void *old;
> -
> -		dax_disassociate_entry(entry, mapping, false);
> -		dax_associate_entry(new_entry, mapping, vmf, flags);
> -		/*
> -		 * Only swap our new entry into the page cache if the current
> -		 * entry is a zero page or an empty entry.  If a normal PTE or
> -		 * PMD entry is already in the cache, we leave it alone.  This
> -		 * means that if we are trying to insert a PTE and the
> -		 * existing entry is a PMD, we will just leave the PMD in the
> -		 * tree and dirty it if necessary.
> -		 */
> -		old = dax_lock_entry(xas, new_entry);
> -		WARN_ON_ONCE(old != xa_mk_value(xa_to_value(entry) |
> -					DAX_LOCKED));
> -		entry = new_entry;
> -	} else {
> -		xas_load(xas);	/* Walk the xa_state */
> -	}
> -
> -	if (dirty)
> -		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
> -
> -	if (cow)
> -		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
> -
> -	xas_unlock_irq(xas);
> -	*pentry = entry;
> -	return 0;
> -}
> -
> -static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
> -		struct address_space *mapping, void *entry)
> -{
> -	unsigned long pfn, index, count, end;
> -	long ret = 0;
> -	struct vm_area_struct *vma;
> -
> -	/*
> -	 * A page got tagged dirty in DAX mapping? Something is seriously
> -	 * wrong.
> -	 */
> -	if (WARN_ON(!xa_is_value(entry)))
> -		return -EIO;
> -
> -	if (unlikely(dax_is_locked(entry))) {
> -		void *old_entry = entry;
> -
> -		entry = get_unlocked_entry(xas, 0);
> -
> -		/* Entry got punched out / reallocated? */
> -		if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
> -			goto put_unlocked;
> -		/*
> -		 * Entry got reallocated elsewhere? No need to writeback.
> -		 * We have to compare pfns as we must not bail out due to
> -		 * difference in lockbit or entry type.
> -		 */
> -		if (dax_to_pfn(old_entry) != dax_to_pfn(entry))
> -			goto put_unlocked;
> -		if (WARN_ON_ONCE(dax_is_empty_entry(entry) ||
> -					dax_is_zero_entry(entry))) {
> -			ret = -EIO;
> -			goto put_unlocked;
> -		}
> -
> -		/* Another fsync thread may have already done this entry */
> -		if (!xas_get_mark(xas, PAGECACHE_TAG_TOWRITE))
> -			goto put_unlocked;
> -	}
> -
> -	/* Lock the entry to serialize with page faults */
> -	dax_lock_entry(xas, entry);
> -
> -	/*
> -	 * We can clear the tag now but we have to be careful so that concurrent
> -	 * dax_writeback_one() calls for the same index cannot finish before we
> -	 * actually flush the caches. This is achieved as the calls will look
> -	 * at the entry only under the i_pages lock and once they do that
> -	 * they will see the entry locked and wait for it to unlock.
> -	 */
> -	xas_clear_mark(xas, PAGECACHE_TAG_TOWRITE);
> -	xas_unlock_irq(xas);
> -
> -	/*
> -	 * If dax_writeback_mapping_range() was given a wbc->range_start
> -	 * in the middle of a PMD, the 'index' we use needs to be
> -	 * aligned to the start of the PMD.
> -	 * This allows us to flush for PMD_SIZE and not have to worry about
> -	 * partial PMD writebacks.
> -	 */
> -	pfn = dax_to_pfn(entry);
> -	count = 1UL << dax_entry_order(entry);
> -	index = xas->xa_index & ~(count - 1);
> -	end = index + count - 1;
> -
> -	/* Walk all mappings of a given index of a file and writeprotect them */
> -	i_mmap_lock_read(mapping);
> -	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, end) {
> -		pfn_mkclean_range(pfn, count, index, vma);
> -		cond_resched();
> -	}
> -	i_mmap_unlock_read(mapping);
> -
> -	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
> -	/*
> -	 * After we have flushed the cache, we can clear the dirty tag. There
> -	 * cannot be new dirty data in the pfn after the flush has completed as
> -	 * the pfn mappings are writeprotected and fault waits for mapping
> -	 * entry lock.
> -	 */
> -	xas_reset(xas);
> -	xas_lock_irq(xas);
> -	xas_store(xas, entry);
> -	xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
> -	dax_wake_entry(xas, entry, WAKE_NEXT);
> -
> -	trace_dax_writeback_one(mapping->host, index, count);
> -	return ret;
> -
> - put_unlocked:
> -	put_unlocked_entry(xas, entry, WAKE_NEXT);
> -	return ret;
> -}
> -
>  /*
>   * Flush the mapping to the persistent domain within the byte range of [start,
>   * end]. This is required by data integrity operations to ensure file data is
> @@ -1219,6 +191,37 @@ static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
>  	return 0;
>  }
>
> +/*
> + * MAP_SYNC on a dax mapping guarantees dirty metadata is
> + * flushed on write-faults (non-cow), but not read-faults.
> + */
> +static bool dax_fault_is_synchronous(const struct iomap_iter *iter,
> +				     struct vm_area_struct *vma)
> +{
> +	return (iter->flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC) &&
> +	       (iter->iomap.flags & IOMAP_F_DIRTY);
> +}
> +
> +static bool dax_fault_is_cow(const struct iomap_iter *iter)
> +{
> +	return (iter->flags & IOMAP_WRITE) &&
> +	       (iter->iomap.flags & IOMAP_F_SHARED);
> +}
> +
> +static unsigned long dax_iter_flags(const struct iomap_iter *iter,
> +				    struct vm_fault *vmf)
> +{
> +	unsigned long flags = 0;
> +
> +	if (!dax_fault_is_synchronous(iter, vmf->vma))
> +		flags |= DAX_DIRTY;
> +
> +	if (dax_fault_is_cow(iter))
> +		flags |= DAX_COW;
> +
> +	return flags;
> +}
> +
>  /*
>   * The user has performed a load from a hole in the file.  Allocating a new
>   * page in the file would cause excessive storage usage for workloads with
> @@ -1701,7 +704,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  	if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
>  		iter.flags |= IOMAP_WRITE;
>
> -	entry = grab_mapping_entry(&xas, mapping, 0);
> +	entry = dax_grab_mapping_entry(&xas, mapping, 0);
>  	if (xa_is_internal(entry)) {
>  		ret = xa_to_internal(entry);
>  		goto out;
> @@ -1818,12 +821,12 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>  		goto fallback;
>
>  	/*
> -	 * grab_mapping_entry() will make sure we get an empty PMD entry,
> +	 * dax_grab_mapping_entry() will make sure we get an empty PMD entry,
>  	 * a zero PMD entry or a DAX PMD.  If it can't (because a PTE
>  	 * entry is already in the array, for instance), it will return
>  	 * VM_FAULT_FALLBACK.
>  	 */
> -	entry = grab_mapping_entry(&xas, mapping, PMD_ORDER);
> +	entry = dax_grab_mapping_entry(&xas, mapping, PMD_ORDER);
>  	if (xa_is_internal(entry)) {
>  		ret = xa_to_internal(entry);
>  		goto fallback;
> @@ -1897,50 +900,6 @@ vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
>  }
>  EXPORT_SYMBOL_GPL(dax_iomap_fault);
>
> -/*
> - * dax_insert_pfn_mkwrite - insert PTE or PMD entry into page tables
> - * @vmf: The description of the fault
> - * @pfn: PFN to insert
> - * @order: Order of entry to insert.
> - *
> - * This function inserts a writeable PTE or PMD entry into the page tables
> - * for an mmaped DAX file.  It also marks the page cache entry as dirty.
> - */
> -static vm_fault_t
> -dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
> -{
> -	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> -	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, order);
> -	void *entry;
> -	vm_fault_t ret;
> -
> -	xas_lock_irq(&xas);
> -	entry = get_unlocked_entry(&xas, order);
> -	/* Did we race with someone splitting entry or so? */
> -	if (!entry || dax_is_conflict(entry) ||
> -	    (order == 0 && !dax_is_pte_entry(entry))) {
> -		put_unlocked_entry(&xas, entry, WAKE_NEXT);
> -		xas_unlock_irq(&xas);
> -		trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
> -						      VM_FAULT_NOPAGE);
> -		return VM_FAULT_NOPAGE;
> -	}
> -	xas_set_mark(&xas, PAGECACHE_TAG_DIRTY);
> -	dax_lock_entry(&xas, entry);
> -	xas_unlock_irq(&xas);
> -	if (order == 0)
> -		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
> -#ifdef CONFIG_FS_DAX_PMD
> -	else if (order == PMD_ORDER)
> -		ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
> -#endif
> -	else
> -		ret = VM_FAULT_FALLBACK;
> -	dax_unlock_entry(&xas, entry);
> -	trace_dax_insert_pfn_mkwrite(mapping->host, vmf, ret);
> -	return ret;
> -}
> -
>  /**
>   * dax_finish_sync_fault - finish synchronous page fault
>   * @vmf: The description of the fault
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index f6acb4ed73cb..de60a34088bb 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -157,15 +157,33 @@ static inline void fs_put_dax(struct dax_device *dax_dev, void *holder)
>  int dax_writeback_mapping_range(struct address_space *mapping,
>  		struct dax_device *dax_dev, struct writeback_control *wbc);
>
> -struct page *dax_zap_mappings(struct address_space *mapping);
> -struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
> -				    loff_t end);
> +#else
> +static inline int dax_writeback_mapping_range(struct address_space *mapping,
> +		struct dax_device *dax_dev, struct writeback_control *wbc)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +#endif
> +
> +int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> +		const struct iomap_ops *ops);
> +int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> +		const struct iomap_ops *ops);
> +
> +#if IS_ENABLED(CONFIG_DAX)
> +int dax_read_lock(void);
> +void dax_read_unlock(int id);
>  dax_entry_t dax_lock_page(struct page *page);
>  void dax_unlock_page(struct page *page, dax_entry_t cookie);
> +void run_dax(struct dax_device *dax_dev);
>  dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
>  		unsigned long index, struct page **page);
>  void dax_unlock_mapping_entry(struct address_space *mapping,
>  		unsigned long index, dax_entry_t cookie);
> +struct page *dax_zap_mappings(struct address_space *mapping);
> +struct page *dax_zap_mappings_range(struct address_space *mapping, loff_t start,
> +				    loff_t end);
>  #else
>  static inline struct page *dax_zap_mappings(struct address_space *mapping)
>  {
> @@ -179,12 +197,6 @@ static inline struct page *dax_zap_mappings_range(struct address_space *mapping,
>  	return NULL;
>  }
>
> -static inline int dax_writeback_mapping_range(struct address_space *mapping,
> -		struct dax_device *dax_dev, struct writeback_control *wbc)
> -{
> -	return -EOPNOTSUPP;
> -}
> -
>  static inline dax_entry_t dax_lock_page(struct page *page)
>  {
>  	if (IS_DAX(page->mapping->host))
> @@ -196,6 +208,15 @@ static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
>  {
>  }
>
> +static inline int dax_read_lock(void)
> +{
> +	return 0;
> +}
> +
> +static inline void dax_read_unlock(int id)
> +{
> +}
> +
>  static inline dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
>  		unsigned long index, struct page **page)
>  {
> @@ -208,11 +229,6 @@ static inline void dax_unlock_mapping_entry(struct address_space *mapping,
>  }
>  #endif
>
> -int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> -		const struct iomap_ops *ops);
> -int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> -		const struct iomap_ops *ops);
> -
>  /*
>   * Document all the code locations that want know when a dax page is
>   * unreferenced.
> @@ -222,19 +238,6 @@ static inline bool dax_page_idle(struct page *page)
>  	return page_ref_count(page) == 1;
>  }
>
> -#if IS_ENABLED(CONFIG_DAX)
> -int dax_read_lock(void);
> -void dax_read_unlock(int id);
> -#else
> -static inline int dax_read_lock(void)
> -{
> -	return 0;
> -}
> -
> -static inline void dax_read_unlock(int id)
> -{
> -}
> -#endif /* CONFIG_DAX */
>  bool dax_alive(struct dax_device *dax_dev);
>  void *dax_get_private(struct dax_device *dax_dev);
>  long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
> @@ -255,6 +258,9 @@ vm_fault_t dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
>  		    pfn_t *pfnp, int *errp, const struct iomap_ops *ops);
>  vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>  		enum page_entry_size pe_size, pfn_t pfn);
> +void *dax_grab_mapping_entry(struct xa_state *xas,
> +			     struct address_space *mapping, unsigned int order);
> +void dax_unlock_entry(struct xa_state *xas, void *entry);
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
>  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
>  				      pgoff_t index);
> @@ -271,6 +277,56 @@ static inline bool dax_mapping(struct address_space *mapping)
>  	return mapping->host && IS_DAX(mapping->host);
>  }
>
> +/*
> + * DAX pagecache entries use XArray value entries so they can't be mistaken
> + * for pages.  We use one bit for locking, one bit for the entry size (PMD)
> + * and two more to tell us if the entry is a zero page or an empty entry that
> + * is just used for locking.  In total four special bits.
> + *
> + * If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE
> + * and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
> + * block allocation.
> + */
> +#define DAX_SHIFT	(5)
> +#define DAX_MASK	((1UL << DAX_SHIFT) - 1)
> +#define DAX_LOCKED	(1UL << 0)
> +#define DAX_PMD		(1UL << 1)
> +#define DAX_ZERO_PAGE	(1UL << 2)
> +#define DAX_EMPTY	(1UL << 3)
> +#define DAX_ZAP		(1UL << 4)
> +
> +/*
> + * These flags are not conveyed in Xarray value entries, they are just
> + * modifiers to dax_insert_entry().
> + */
> +#define DAX_DIRTY (1UL << (DAX_SHIFT + 0))
> +#define DAX_COW   (1UL << (DAX_SHIFT + 1))
> +
> +vm_fault_t dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
> +			    void **pentry, pfn_t pfn, unsigned long flags);
> +vm_fault_t dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn,
> +				  unsigned int order);
> +int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
> +		      struct address_space *mapping, void *entry);
> +
> +/* The 'colour' (ie low bits) within a PMD of a page offset.  */
> +#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1)
> +#define PG_PMD_NR (PMD_SIZE >> PAGE_SHIFT)
> +
> +/* The order of a PMD entry */
> +#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)
> +
> +static inline unsigned int pe_order(enum page_entry_size pe_size)
> +{
> +	if (pe_size == PE_SIZE_PTE)
> +		return PAGE_SHIFT - PAGE_SHIFT;
> +	if (pe_size == PE_SIZE_PMD)
> +		return PMD_SHIFT - PAGE_SHIFT;
> +	if (pe_size == PE_SIZE_PUD)
> +		return PUD_SHIFT - PAGE_SHIFT;
> +	return ~0;
> +}
> +
>  #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
>  void hmem_register_device(int target_nid, struct resource *r);
>  #else
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index fd57407e7f3d..e5d30eec3bf1 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -221,6 +221,12 @@ static inline void devm_memunmap_pages(struct device *dev,
>  {
>  }
>
> +static inline struct dev_pagemap *
> +get_dev_pagemap_many(unsigned long pfn, struct dev_pagemap *pgmap, int refs)
> +{
> +	return NULL;
> +}
> +
>  static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  		struct dev_pagemap *pgmap)
>  {

* Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion
  2022-09-27  6:07                             ` Alistair Popple
@ 2022-09-27 12:56                               ` Jason Gunthorpe
  0 siblings, 0 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-27 12:56 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Dan Williams, akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Tue, Sep 27, 2022 at 04:07:05PM +1000, Alistair Popple wrote:

> That sounds good to me at least. I just noticed we introduced this exact
> bug for device private/coherent pages when making their refcounts zero
> based. Nothing currently takes pgmap->ref when a private/coherent page
> is mapped. Therefore memunmap_pages() will complete and the pgmap will be
> destroyed while pgmap pages are still mapped.

To kind of summarize this thread

Either we should get the pgmap reference during the refcount = 1 flow,
and put it during page_free()

Or we should have the pgmap destroy sweep all the pages and wait for
them to become ref == 0

I don't think we should have pgmap references randomly strewn all over
the place. A positive refcount on the page alone must be enough to
prove that the struct page exists and the pgmap is not destroyed.
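
A minimal sketch of the first option, assuming hypothetical hook points
at the 0 -> 1 and 1 -> 0 page refcount transitions (the hook names below
are made up for illustration; percpu_ref and page->pgmap are the existing
primitives):

#include <linux/memremap.h>
#include <linux/mm.h>

/* hypothetical hook: a ZONE_DEVICE page's refcount went 0 -> 1 */
static void zone_device_page_first_get(struct page *page)
{
	/* keep the hosting pagemap alive while any page reference exists */
	percpu_ref_get(&page->pgmap->ref);
}

/* hypothetical hook: final put, e.g. from free_zone_device_page() */
static void zone_device_page_last_put(struct page *page)
{
	/* last page reference gone, the pagemap may now be torn down */
	percpu_ref_put(&page->pgmap->ref);
}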

Every driver using pgmap needs something like this, so I'd prefer it
be in the pgmap code.

Jason

* Re: [PATCH v2 12/18] devdax: Move address_space helpers to the DAX core
  2022-09-27  6:20   ` Alistair Popple
@ 2022-09-29 22:38     ` Dan Williams
  0 siblings, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-29 22:38 UTC (permalink / raw)
  To: Alistair Popple, Dan Williams
  Cc: akpm, Matthew Wilcox, Jan Kara, Darrick J. Wong, Jason Gunthorpe,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Alistair Popple wrote:
> 
> Dan Williams <dan.j.williams@intel.com> writes:
> 
> [...]
> 
> > +/**
> > + * dax_zap_mappings_range - find first pinned page in @mapping
> > + * @mapping: address space to scan for a page with ref count > 1
> > + * @start: Starting offset. Page containing 'start' is included.
> > + * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
> > + *       pages from 'start' till the end of file are included.
> > + *
> > + * DAX requires ZONE_DEVICE mapped pages. These pages are never
> > + * 'onlined' to the page allocator so they are considered idle when
> > + * page->count == 1. A filesystem uses this interface to determine if
> 
> Minor nit-pick I noticed while reading this but shouldn't that be
> "page->count == 0" now?

I put this patch set down for a couple days to attend a conference and
now I am warming up my cache again. I believe the patch to make this
zero based comes later in this series, but I definitely did not come
back and fix this up in that patch, so good catch!

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-26 14:10                         ` Jan Kara
@ 2022-09-29 23:33                           ` Dan Williams
  2022-09-30 13:41                             ` Jan Kara
  0 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-29 23:33 UTC (permalink / raw)
  To: Jan Kara, Dave Chinner
  Cc: Jan Kara, Dan Williams, Jason Gunthorpe, akpm, Matthew Wilcox,
	Darrick J. Wong, Christoph Hellwig, John Hubbard, linux-fsdevel,
	nvdimm, linux-xfs, linux-mm, linux-ext4

Jan Kara wrote:
> On Mon 26-09-22 09:54:07, Dave Chinner wrote:
> > On Fri, Sep 23, 2022 at 11:38:03AM +0200, Jan Kara wrote:
> > > On Fri 23-09-22 12:10:12, Dave Chinner wrote:
> > > > On Thu, Sep 22, 2022 at 05:41:08PM -0700, Dan Williams wrote:
> > > > > Dave Chinner wrote:
> > > > > > On Wed, Sep 21, 2022 at 07:28:51PM -0300, Jason Gunthorpe wrote:
> > > > > > > On Thu, Sep 22, 2022 at 08:14:16AM +1000, Dave Chinner wrote:
> > > > > > > 
> > > > > > > > Where are these DAX page pins that don't require the pin holder to
> > > > > > > > also hold active references to the filesystem objects coming from?
> > > > > > > 
> > > > > > > O_DIRECT and things like it.
> > > > > > 
> > > > > > O_DIRECT IO to a file holds a reference to a struct file which holds
> > > > > > an active reference to the struct inode. Hence you can't reclaim an
> > > > > > inode while an O_DIRECT IO is in progress to it. 
> > > > > > 
> > > > > > Similarly, file-backed pages pinned from user vmas have the inode
> > > > > > pinned by the VMA having a reference to the struct file passed to
> > > > > > them when they are instantiated. Hence anything using mmap() to pin
> > > > > > file-backed pages (i.e. applications using FSDAX access from
> > > > > > userspace) should also have a reference to the inode that prevents
> > > > > > the inode from being reclaimed.
> > > > > > 
> > > > > > So I'm at a loss to understand what "things like it" might actually
> > > > > > mean. Can you actually describe a situation where we actually permit
> > > > > > (even temporarily) these use-after-free scenarios?
> > > > > 
> > > > > Jason mentioned a scenario here:
> > > > > 
> > > > > https://lore.kernel.org/all/YyuoE8BgImRXVkkO@nvidia.com/
> > > > > 
> > > > > Multi-thread process where thread1 does open(O_DIRECT)+mmap()+read() and
> > > > > thread2 does memunmap()+close() while the read() is inflight.
> > > > 
> > > > And, ah, what production application does this and expects to be
> > > > able to process the result of the read() operation without getting a
> > > > SEGV?
> > > > 
> > > > There's a huge difference between an unlikely scenario which we need
> > > > to work (such as O_DIRECT IO to/from a mmap() buffer at a different
> > > > offset on the same file) and this sort of scenario where even if we
> > > > handle it correctly, the application can't do anything with the
> > > > result and will crash immediately....
> > > 
> > > I'm not sure I fully follow what we are concerned about here. As you've
> > > written above direct IO holds reference to the inode until it is completed
> > > (through kiocb->file->inode chain). So direct IO should be safe?
> > 
> > AFAICT, it's the user buffer allocated by mmap() that the direct IO
> > is DMAing into/out of that is the issue here. i.e. mmap() a file
> > that is DAX enabled, pass the mmap region to DIO on a non-dax file,
> > GUP in the DIO path takes a page pin on user pages that are DAX
> > mapped, the userspace application then unmaps the file pages and
> > unlinks the FSDAX file.
> > 
> > At this point the FSDAX mapped inode has no active references, so
> > the filesystem frees the inode and it's allocated storage space, and
> > now the DIO or whatever is holding the GUP reference is
> > now a moving storage UAF violation. What ever is holding the GUP
> > reference doesn't even have a reference to the FSDAX filesystem -
> > the DIO fd could point to a file in a different filesystem
> > altogether - and so the fsdax filesytem could be unmounted at this
> > point whilst the application is still actively using the storage
> > underlying the filesystem.
> > 
> > That's just .... broken.
> 
> Hum, so I'm confused (and my last email probably was as well). So let me
> spell out the details here so that I can get on the same page about what we
> are trying to solve:
> 
> For FSDAX, backing storage for a page must not be freed (i.e., the filesystem
> must not free the corresponding block) while there are still references to the
> page. This is achieved by calls to dax_layout_busy_page() from the
> filesystem before truncating a file / punching a hole into a file. So AFAICT
> this is working correctly and I don't think the patch series under
> discussion aims to change this, besides the change in how a page without
> references is detected.

Correct. All the nominal truncate paths via hole punch and
truncate_setsize() are already handled for a long time now. However,
what was not covered was the truncate that happens at iput_final() time.
In that case the code has just been getting lucky for all that time.
There is thankfully a WARN() that will trigger if the iput_final()
truncate happens while a page is referenced, so it is at least not
silent.

I know Dave is tired of this discussion, but every time he engages the
solution gets better, like finding this iput_final() bug, so I hope he
continues to engage here and I'm grateful for the help.

> Now there is a separate question: while someone holds a reference to an
> FSDAX page, the inode this page belongs to can get evicted from memory. For
> FSDAX nothing prevents that AFAICT. If this happens, we lose track of the
> page<->inode association so if somebody later comes and truncates the
> inode, we will not detect the page belonging to the inode is still in use
> (dax_layout_busy_page() does not find the page) and we have a problem.
> Correct?

The WARN would fire at iput_final(). Everything that happens after
that is in UAF territory. In my brief search I did not see reports of
this WARN firing, but it is past time to fix it.

> > > I'd be more worried about stuff like vmsplice() that can add file pages
> > > into pipe without holding inode alive in any way and keeping them there for
> > > arbitrarily long time. Didn't we want to add FOLL_LONGTERM to gup executed
> > > from vmsplice() to avoid issues like this?
> > 
> > Yes, ISTR that was part of the plan - use FOLL_LONGTERM to ensure
> > FSDAX can't run operations that pin pages but don't take fs
> > references. I think that's how we prevented RDMA users from pinning
> > FSDAX direct mapped storage media in this way. It does not, however,
> > prevent the above "short term" GUP UAF situation from occurring.
> 
> If what I wrote above is correct, then I understand and agree.
> 
> > > I agree that freeing VMA while there are pinned pages is ... inconvenient.
> > > But that is just how gup works since the beginning - the moment you have
> > > struct page reference, you completely forget about the mapping you've used
> > > to get to the page. So anything can happen with the mapping after that
> > > moment. And in case of pages mapped by multiple processes I can easily see
> > > that one of the processes decides to unmap the page (and it may well be
> > > that was the initial process that acquired page references) while others
> > > still keep accessing the page using page references stored in some internal
> > > structure (RDMA anyone?).
> > 
> > Yup, and this is why RDMA on FSDAX using this method of pinning pages
> > will end up corrupting data and filesystems, hence FOLL_LONGTERM
> > protecting against most of these situations from even arising. But
> > that's that workaround, not a long term solution that allows RDMA to
> > be run on FSDAX managed storage media.
> > 
> > I said on #xfs a few days ago:
> > 
> > [23/9/22 10:23] * dchinner is getting deja vu over this latest round
> > of "dax mappings don't pin the filesystem objects that own the
> > storage media being mapped"
> > 
> > And I'm getting that feeling again right now...
> > 
> > > I think it will be rather difficult to come up
> > > with some scheme keeping VMA alive while there are pages pinned without
> > > regressing userspace which over the years became very much tailored to the
> > > peculiar gup behavior.
> > 
> > Perhaps all we should do is add a page flag for fsdax mapped pages
> > that says GUP must pin the VMA, so only mapped pages that fall into
> > this category take the perf penalty of VMA management.
> 
> Possibly. But my concern with VMA pinning was not only about performance
> but also about applications relying on being able to unmap pages that are
> currently pinned. At least from some processes one of which may be the one
> doing the original pinning. But yeah, the fact that FOLL_LONGTERM is
> forbidden with DAX somewhat restricts the insanity we have to deal with. So
> maybe pinning the VMA for DAX mappings might actually be a workable
> solution.

As far as I can see, VMAs are not currently reference counted; they are
just added to / deleted from an mm_struct, and nothing guarantees
mapping_mapped() stays true while a page is pinned.

I like Dave's mental model that the inode is the arbiter for the page,
and the arbiter is not allowed to go out of scope before asserting that
everything it granted previously has been returned.

write_inode_now() unconditionally invokes dax_writeback_mapping_range()
when the inode is committed to going out of scope. write_inode_now() is
allowed to sleep until all dirty mapping entries are written back. I see
nothing wrong with additionally checking for entries with elevated page
reference counts and doing a:

    __wait_var_event(page, dax_page_idle(page));

Since the inode is out of scope there should be no concerns with racing
new 0 -> 1 page->_refcount transitions. Just wait for transient page
pins to finally drain to zero, which should already be on the order of
the wait time to complete synchronous writeback in the dirty inode
case.
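
Roughly, as a sketch only (assuming a helper living next to the other
entry helpers in fs/dax.c so it can use dax_to_pfn() / dax_entry_size();
the name dax_wait_mapping_idle() is made up here):

static void dax_wait_mapping_idle(struct address_space *mapping)
{
	unsigned long index;
	void *entry;

	/* xa_for_each() tolerates sleeping in the loop body */
	xa_for_each(&mapping->i_pages, index, entry) {
		struct page *page;

		if (!xa_is_value(entry) || !dax_entry_size(entry))
			continue;
		page = pfn_to_page(dax_to_pfn(entry));
		/* inode is going away, just drain the transient pins */
		__wait_var_event(page, dax_page_idle(page));
	}
}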

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-29 23:33                           ` Dan Williams
@ 2022-09-30 13:41                             ` Jan Kara
  2022-09-30 17:56                               ` Dan Williams
  0 siblings, 1 reply; 84+ messages in thread
From: Jan Kara @ 2022-09-30 13:41 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Dave Chinner, Jason Gunthorpe, akpm, Matthew Wilcox,
	Darrick J. Wong, Christoph Hellwig, John Hubbard, linux-fsdevel,
	nvdimm, linux-xfs, linux-mm, linux-ext4

On Thu 29-09-22 16:33:27, Dan Williams wrote:
> Jan Kara wrote:
> > On Mon 26-09-22 09:54:07, Dave Chinner wrote:
> > > > I'd be more worried about stuff like vmsplice() that can add file pages
> > > > into pipe without holding inode alive in any way and keeping them there for
> > > > arbitrarily long time. Didn't we want to add FOLL_LONGTERM to gup executed
> > > > from vmsplice() to avoid issues like this?
> > > 
> > > Yes, ISTR that was part of the plan - use FOLL_LONGTERM to ensure
> > > FSDAX can't run operations that pin pages but don't take fs
> > > references. I think that's how we prevented RDMA users from pinning
> > > FSDAX direct mapped storage media in this way. It does not, however,
> > > prevent the above "short term" GUP UAF situation from occurring.
> > 
> > If what I wrote above is correct, then I understand and agree.
> > 
> > > > I agree that freeing VMA while there are pinned pages is ... inconvenient.
> > > > But that is just how gup works since the beginning - the moment you have
> > > > struct page reference, you completely forget about the mapping you've used
> > > > to get to the page. So anything can happen with the mapping after that
> > > > moment. And in case of pages mapped by multiple processes I can easily see
> > > > that one of the processes decides to unmap the page (and it may well be
> > > > that was the initial process that acquired page references) while others
> > > > still keep accessing the page using page references stored in some internal
> > > > structure (RDMA anyone?).
> > > 
> > > Yup, and this is why RDMA on FSDAX using this method of pinning pages
> > > will end up corrupting data and filesystems, hence FOLL_LONGTERM
> > > protecting against most of these situations from even arising. But
> > > that's that workaround, not a long term solution that allows RDMA to
> > > be run on FSDAX managed storage media.
> > > 
> > > I said on #xfs a few days ago:
> > > 
> > > [23/9/22 10:23] * dchinner is getting deja vu over this latest round
> > > of "dax mappings don't pin the filesystem objects that own the
> > > storage media being mapped"
> > > 
> > > And I'm getting that feeling again right now...
> > > 
> > > > I think it will be rather difficult to come up
> > > > with some scheme keeping VMA alive while there are pages pinned without
> > > > regressing userspace which over the years became very much tailored to the
> > > > peculiar gup behavior.
> > > 
> > > Perhaps all we should do is add a page flag for fsdax mapped pages
> > > that says GUP must pin the VMA, so only mapped pages that fall into
> > > this category take the perf penalty of VMA management.
> > 
> > Possibly. But my concern with VMA pinning was not only about performance
> > but also about applications relying on being able to unmap pages that are
> > currently pinned. At least from some processes one of which may be the one
> > doing the original pinning. But yeah, the fact that FOLL_LONGTERM is
> > forbidden with DAX somewhat restricts the insanity we have to deal with. So
> > maybe pinning the VMA for DAX mappings might actually be a workable
> > solution.
> 
> As far as I can see, VMAs are not currently reference counted they are
> just added / deleted from an mm_struct, and nothing guarantees
> mapping_mapped() stays true while a page is pinned.

I agree this solution requires quite some work. But I wanted to say that
in principle it would be a logically consistent and technically not that
difficult solution.
 
> I like Dave's mental model that the inode is the arbiter for the page,
> and the arbiter is not allowed to go out of scope before asserting that
> everything it granted previously has been returned.
> 
> write_inode_now() unconditionally invokes dax_writeback_mapping_range()
> when the inode is committed to going out of scope. write_inode_now() is
> allowed to sleep until all dirty mapping entries are written back. I see
> nothing wrong with additionally checking for entries with elevated page
> reference counts and doing a:
> 
>     __wait_var_event(page, dax_page_idle(page));
> 
> Since the inode is out of scope there should be no concerns with racing
> new 0 -> 1 page->_refcount transitions. Just wait for transient page
> pins to finally drain to zero which should already be on the order of
> the wait time to complete synchrounous writeback in the dirty inode
> case.

I agree this is doable but there's the nasty side effect that inode reclaim
may block for an arbitrary time waiting for page pinning. If the application
that has pinned the page requires __GFP_FS memory allocation to get to a
point where it releases the page, we even have a deadlock possibility.
So it's better than the UAF issue but still not ideal.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-30 13:41                             ` Jan Kara
@ 2022-09-30 17:56                               ` Dan Williams
  2022-09-30 18:06                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: Dan Williams @ 2022-09-30 17:56 UTC (permalink / raw)
  To: Jan Kara, Dan Williams
  Cc: Jan Kara, Dave Chinner, Jason Gunthorpe, akpm, Matthew Wilcox,
	Darrick J. Wong, Christoph Hellwig, John Hubbard, linux-fsdevel,
	nvdimm, linux-xfs, linux-mm, linux-ext4

Jan Kara wrote:
[..]
> I agree this is doable but there's the nasty side effect that inode reclaim
> may block for an arbitrary time waiting for page pinning. If the application
> that has pinned the page requires __GFP_FS memory allocation to get to a
> point where it releases the page, we even have a deadlock possibility.
> So it's better than the UAF issue but still not ideal.

I expect VMA pinning would have similar deadlock exposure if pinning a
VMA keeps the inode allocated. Anything that puts a page-pin release
dependency in the inode freeing path can potentially deadlock a reclaim
event that depends on that inode being freed.

As you say the UAF is worse. I am not too worried about the deadlock
case for a couple reasons:

1/ There are no reports I can find of iput_final() triggering the WARN
that validates that truncate_inode_pages_final() is called while all
associated pages are unpinned. That WARN has been in place since 2017:

d2c997c0f145 fs, dax: use page->mapping to warn if truncate collides with a busy page

2/ It is bad form for I/O drivers to perform __GFP_FS and __GFP_IO
allocations in their fast paths. So while the deadlock is not impossible
it is unlikely with the major producers of transient page pin events.
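
A driver that does need to allocate while holding transient pins can
also keep __GFP_FS out of that window with the existing scoped-gfp
helpers. Sketch only; submit_io() is a stand-in for whatever the driver
does while the pages are pinned:

#include <linux/sched/mm.h>

static int driver_do_pinned_io(struct page **pages, long npages)
{
	unsigned int nofs_flags;
	int ret;

	/* pages[] are pinned: avoid fs-reclaim recursion until unpinned */
	nofs_flags = memalloc_nofs_save();
	ret = submit_io(pages, npages);	/* hypothetical driver helper */
	memalloc_nofs_restore(nofs_flags);
	return ret;
}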

My hope, famous last words, is that this is only a theoretical deadlock,
or we can handle this with targeted driver fixes. Any driver that thinks
it wants to pin pages and then do more allocations that recurse into the
FS likely wants to get that out of its fast path anyway. I will also
take a look at a lockdep annotation for the wait event to see if that
can give an early warning versus fs_reclaim_acquire().

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-30 17:56                               ` Dan Williams
@ 2022-09-30 18:06                                 ` Jason Gunthorpe
  2022-09-30 18:46                                   ` Dan Williams
  2022-10-03  7:55                                   ` Jan Kara
  0 siblings, 2 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2022-09-30 18:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Dave Chinner, akpm, Matthew Wilcox, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

On Fri, Sep 30, 2022 at 10:56:27AM -0700, Dan Williams wrote:
> Jan Kara wrote:
> [..]
> > I agree this is doable but there's the nasty side effect that inode reclaim
> > may block for an arbitrary time waiting for page pinning. If the application
> > that has pinned the page requires __GFP_FS memory allocation to get to a
> > point where it releases the page, we even have a deadlock possibility.
> > So it's better than the UAF issue but still not ideal.
> 
> I expect VMA pinning would have similar deadlock exposure if pinning a
> VMA keeps the inode allocated. Anything that puts a page-pin release
> dependency in the inode freeing path can potentially deadlock a reclaim
> event that depends on that inode being freed.

I think the desire would be to go from the VMA to an inode_get and
hold the inode reference from the pin_user_pages() to the
unpin_user_page(), ie prevent it from being freed in the first place.

It is a fine idea, the trouble is just the high complexity to get
there.

However, I wonder if the truncate/hole punch paths have the same
deadlock problem?

I agree with you though, given the limited options we should convert
the UAF into an unlikely deadlock.

Jason

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-30 18:06                                 ` Jason Gunthorpe
@ 2022-09-30 18:46                                   ` Dan Williams
  2022-10-03  7:55                                   ` Jan Kara
  1 sibling, 0 replies; 84+ messages in thread
From: Dan Williams @ 2022-09-30 18:46 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams
  Cc: Jan Kara, Dave Chinner, akpm, Matthew Wilcox, Darrick J. Wong,
	Christoph Hellwig, John Hubbard, linux-fsdevel, nvdimm,
	linux-xfs, linux-mm, linux-ext4

Jason Gunthorpe wrote:
> On Fri, Sep 30, 2022 at 10:56:27AM -0700, Dan Williams wrote:
> > Jan Kara wrote:
> > [..]
> > > I agree this is doable but there's the nasty side effect that inode reclaim
> > > may block for an arbitrary time waiting for page pinning. If the application
> > > that has pinned the page requires __GFP_FS memory allocation to get to a
> > > point where it releases the page, we even have a deadlock possibility.
> > > So it's better than the UAF issue but still not ideal.
> > 
> > I expect VMA pinning would have similar deadlock exposure if pinning a
> > VMA keeps the inode allocated. Anything that puts a page-pin release
> > dependency in the inode freeing path can potentially deadlock a reclaim
> > event that depends on that inode being freed.
> 
> I think the desire would be to go from the VMA to an inode_get and
> hold the inode reference from the pin_user_pages() to the
> unpin_user_page(), ie prevent it from being freed in the first place.
> 
> It is a fine idea, the trouble is just the high complexity to get
> there.
> 
> However, I wonder if the truncate/hole punch paths have the same
> deadlock problem?

If the deadlock is waiting for inode reclaim to complete then I can see
why the VMA pin proposal and the current truncate paths do not trigger
that deadlock because the inode is kept out of the reclaim path.

> I agree with you though, given the limited options we should convert
> the UAF into an unlikely deadlock.

I think this approach makes the implementation incrementally better, and
that the need to plumb VMA pinning can await evidence that a driver
actually does this *and* the driver can not be fixed.

* Re: [PATCH v2 05/18] xfs: Add xfs_break_layouts() to the inode eviction path
  2022-09-30 18:06                                 ` Jason Gunthorpe
  2022-09-30 18:46                                   ` Dan Williams
@ 2022-10-03  7:55                                   ` Jan Kara
  1 sibling, 0 replies; 84+ messages in thread
From: Jan Kara @ 2022-10-03  7:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dan Williams, Jan Kara, Dave Chinner, akpm, Matthew Wilcox,
	Darrick J. Wong, Christoph Hellwig, John Hubbard, linux-fsdevel,
	nvdimm, linux-xfs, linux-mm, linux-ext4

On Fri 30-09-22 15:06:47, Jason Gunthorpe wrote:
> On Fri, Sep 30, 2022 at 10:56:27AM -0700, Dan Williams wrote:
> > Jan Kara wrote:
> > [..]
> > > I agree this is doable but there's the nasty side effect that inode reclaim
> > > may block for an arbitrary time waiting for page pinning. If the application
> > > that has pinned the page requires __GFP_FS memory allocation to get to a
> > > point where it releases the page, we even have a deadlock possibility.
> > > So it's better than the UAF issue but still not ideal.
> > 
> > I expect VMA pinning would have similar deadlock exposure if pinning a
> > VMA keeps the inode allocated. Anything that puts a page-pin release
> > dependency in the inode freeing path can potentially deadlock a reclaim
> > event that depends on that inode being freed.
> 
> I think the desire would be to go from the VMA to an inode_get and
> hold the inode reference from the pin_user_pages() to the
> unpin_user_page(), i.e. prevent it from being freed in the first place.

Yes, that was the idea for how to avoid the UAF problems.

> It is a fine idea, the trouble is just the high complexity to get
> there.
> 
> However, I wonder if the truncate/hole punch paths have the same
> deadlock problem?

Do you mean someone requiring, say, truncate(2) to complete on file F in
order to unpin pages of F? That is certainly a deadlock, but it has always
worked this way for DAX, so at least applications knowingly targeted at DAX
will quickly notice and avoid such an unwise dependency ;).
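
A purely illustrative sketch of that unwise dependency, with a
hypothetical /dev/hypothetical-pinner ioctl standing in for whatever
interface actually ends up calling pin_user_pages() on the mapping
(nothing below corresponds to a real driver):

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

static int dax_fd;	/* file F on a DAX filesystem */
static int pin_fd;	/* hypothetical driver that pins the mapping */
static void *buf;

static void *truncater(void *arg)
{
	(void)arg;
	/* blocks in break-layouts until the pin on buf is dropped... */
	if (ftruncate(dax_fd, 0))
		perror("ftruncate");
	return NULL;
}

int main(void)
{
	pthread_t t;

	dax_fd = open("/mnt/dax/F", O_RDWR);
	pin_fd = open("/dev/hypothetical-pinner", O_RDWR);
	buf = mmap(NULL, 1 << 21, PROT_READ | PROT_WRITE, MAP_SHARED,
		   dax_fd, 0);

	/* driver does pin_user_pages() on the pages backing buf */
	ioctl(pin_fd, 0 /* HYPOTHETICAL_PIN */, buf);

	pthread_create(&t, NULL, truncater, NULL);
	pthread_join(&t, NULL);		/* ...waits for truncate(F)... */

	/* ...which cannot finish before this unpin: circular wait */
	ioctl(pin_fd, 1 /* HYPOTHETICAL_UNPIN */, buf);
	return 0;
}

The circular wait here (ftruncate() waits for the unpin, the unpin waits
for ftruncate()) is entirely of the application's own making, which is
the sense in which it has "always worked this way" for DAX.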

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/18] Fix the DAX-gup mistake
  2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
                   ` (18 preceding siblings ...)
  2022-09-20 14:29 ` [PATCH v2 00/18] Fix the DAX-gup mistake Jason Gunthorpe
@ 2022-11-09  0:20 ` Andrew Morton
  2022-11-09 11:38   ` Jan Kara
  19 siblings, 1 reply; 84+ messages in thread
From: Andrew Morton @ 2022-11-09  0:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, Jan Kara, Christoph Hellwig, Darrick J. Wong,
	Matthew Wilcox, John Hubbard, linux-fsdevel, nvdimm, linux-xfs,
	linux-mm, linux-ext4

All seems to be quiet on this front, so I plan to move this series into
mm-stable a few days from now.

We do have this report of dax_holder_notify_failure being unavailable
with CONFIG_DAX=n:
https://lkml.kernel.org/r/202210230716.tNv8A5mN-lkp@intel.com but that
appears to predate this series.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH v2 00/18] Fix the DAX-gup mistake
  2022-11-09  0:20 ` Andrew Morton
@ 2022-11-09 11:38   ` Jan Kara
  0 siblings, 0 replies; 84+ messages in thread
From: Jan Kara @ 2022-11-09 11:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dan Williams, Jason Gunthorpe, Jan Kara, Christoph Hellwig,
	Darrick J. Wong, Matthew Wilcox, John Hubbard, linux-fsdevel,
	nvdimm, linux-xfs, linux-mm, linux-ext4

On Tue 08-11-22 16:20:59, Andrew Morton wrote:
> All seems to be quiet on this front, so I plan to move this series into
> mm-stable a few days from now.
> 
> We do have this report of dax_holder_notify_failure being unavailable
> with CONFIG_DAX=n:
> https://lkml.kernel.org/r/202210230716.tNv8A5mN-lkp@intel.com but that
> appears to predate this series.

Andrew, there was a v3 some time ago [1] and even that gathered some
non-trivial feedback from Jason, so I don't think this is settled...

[1] https://lore.kernel.org/all/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2022-11-09 11:38 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-16  3:35 [PATCH v2 00/18] Fix the DAX-gup mistake Dan Williams
2022-09-16  3:35 ` [PATCH v2 01/18] fsdax: Wait on @page not @page->_refcount Dan Williams
2022-09-20 14:30   ` Jason Gunthorpe
2022-09-16  3:35 ` [PATCH v2 02/18] fsdax: Use dax_page_idle() to document DAX busy page checking Dan Williams
2022-09-20 14:31   ` Jason Gunthorpe
2022-09-16  3:35 ` [PATCH v2 03/18] fsdax: Include unmapped inodes for page-idle detection Dan Williams
2022-09-16  3:35 ` [PATCH v2 04/18] ext4: Add ext4_break_layouts() to the inode eviction path Dan Williams
2022-09-16  3:35 ` [PATCH v2 05/18] xfs: Add xfs_break_layouts() " Dan Williams
2022-09-18 22:57   ` Dave Chinner
2022-09-19 16:11     ` Dan Williams
2022-09-19 21:29       ` Dave Chinner
2022-09-20 16:44         ` Dan Williams
2022-09-21 22:14           ` Dave Chinner
2022-09-21 22:28             ` Jason Gunthorpe
2022-09-23  0:18               ` Dave Chinner
2022-09-23  0:41                 ` Dan Williams
2022-09-23  2:10                   ` Dave Chinner
2022-09-23  9:38                     ` Jan Kara
2022-09-23 23:06                       ` Dan Williams
2022-09-25 23:54                       ` Dave Chinner
2022-09-26 14:10                         ` Jan Kara
2022-09-29 23:33                           ` Dan Williams
2022-09-30 13:41                             ` Jan Kara
2022-09-30 17:56                               ` Dan Williams
2022-09-30 18:06                                 ` Jason Gunthorpe
2022-09-30 18:46                                   ` Dan Williams
2022-10-03  7:55                                   ` Jan Kara
2022-09-23 12:39                     ` Jason Gunthorpe
2022-09-26  0:34                       ` Dave Chinner
2022-09-26 13:04                         ` Jason Gunthorpe
2022-09-22  0:02             ` Dan Williams
2022-09-22  0:10               ` Jason Gunthorpe
2022-09-16  3:35 ` [PATCH v2 06/18] fsdax: Rework dax_layout_busy_page() to dax_zap_mappings() Dan Williams
2022-09-16  3:35 ` [PATCH v2 07/18] fsdax: Update dax_insert_entry() calling convention to return an error Dan Williams
2022-09-16  3:35 ` [PATCH v2 08/18] fsdax: Cleanup dax_associate_entry() Dan Williams
2022-09-16  3:36 ` [PATCH v2 09/18] fsdax: Rework dax_insert_entry() calling convention Dan Williams
2022-09-16  3:36 ` [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion Dan Williams
2022-09-21 14:03   ` Jason Gunthorpe
2022-09-21 15:18     ` Dan Williams
2022-09-21 21:38       ` Dan Williams
2022-09-21 22:07         ` Jason Gunthorpe
2022-09-22  0:14           ` Dan Williams
2022-09-22  0:25             ` Jason Gunthorpe
2022-09-22  2:17               ` Dan Williams
2022-09-22 17:55                 ` Jason Gunthorpe
2022-09-22 21:54                   ` Dan Williams
2022-09-23  1:36                     ` Dave Chinner
2022-09-23  2:01                       ` Dan Williams
2022-09-23 13:24                     ` Jason Gunthorpe
2022-09-23 16:29                       ` Dan Williams
2022-09-23 17:42                         ` Jason Gunthorpe
2022-09-23 19:03                           ` Dan Williams
2022-09-23 19:23                             ` Jason Gunthorpe
2022-09-27  6:07                             ` Alistair Popple
2022-09-27 12:56                               ` Jason Gunthorpe
2022-09-16  3:36 ` [PATCH v2 11/18] devdax: Minor warning fixups Dan Williams
2022-09-16  3:36 ` [PATCH v2 12/18] devdax: Move address_space helpers to the DAX core Dan Williams
2022-09-27  6:20   ` Alistair Popple
2022-09-29 22:38     ` Dan Williams
2022-09-16  3:36 ` [PATCH v2 13/18] dax: Prep mapping helpers for compound pages Dan Williams
2022-09-21 14:06   ` Jason Gunthorpe
2022-09-21 15:19     ` Dan Williams
2022-09-16  3:36 ` [PATCH v2 14/18] devdax: add PUD support to the DAX mapping infrastructure Dan Williams
2022-09-16  3:36 ` [PATCH v2 15/18] devdax: Use dax_insert_entry() + dax_delete_mapping_entry() Dan Williams
2022-09-21 14:10   ` Jason Gunthorpe
2022-09-21 15:48     ` Dan Williams
2022-09-21 22:23       ` Jason Gunthorpe
2022-09-22  0:15         ` Dan Williams
2022-09-16  3:36 ` [PATCH v2 16/18] mm/memremap_pages: Support initializing pages to a zero reference count Dan Williams
2022-09-21 15:24   ` Jason Gunthorpe
2022-09-21 23:45     ` Dan Williams
2022-09-22  0:03       ` Alistair Popple
2022-09-22  0:04       ` Jason Gunthorpe
2022-09-22  0:34         ` Dan Williams
2022-09-22  1:36           ` Alistair Popple
2022-09-22  2:34             ` Dan Williams
2022-09-26  6:17               ` Alistair Popple
2022-09-22  0:13       ` John Hubbard
2022-09-16  3:36 ` [PATCH v2 17/18] fsdax: Delete put_devmap_managed_page_refs() Dan Williams
2022-09-16  3:36 ` [PATCH v2 18/18] mm/gup: Drop DAX pgmap accounting Dan Williams
2022-09-20 14:29 ` [PATCH v2 00/18] Fix the DAX-gup mistake Jason Gunthorpe
2022-09-20 16:50   ` Dan Williams
2022-11-09  0:20 ` Andrew Morton
2022-11-09 11:38   ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).