linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
@ 2022-09-25 13:33 Shiyang Ruan
  2022-09-25 13:33 ` [PATCH 1/3] xfs: fix the calculation of length and end Shiyang Ruan
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Shiyang Ruan @ 2022-09-25 13:33 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, david, dan.j.williams, hch

Changes since v8:
  1. P2: rename drop_pagecache_sb() to super_drop_pagecache().
  2. P2: let super_drop_pagecache() accept invalidate method.
  3. P3: invalidate all dax mappings by invalidate_inode_pages2().
  4. P3: shutdown the filesystem when it is to be removed.
  5. Rebase on 6.0-rc6 + Darrick's patch[1] + Dan's patch[2].

[1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
[2]: https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.stgit@dwillia2-xfh.jf.intel.com/

Shiyang Ruan (3):
  xfs: fix the calculation of length and end
  fs: move drop_pagecache_sb() for others to use
  mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

 drivers/dax/super.c         |  3 ++-
 fs/drop_caches.c            | 35 ++----------------------------
 fs/super.c                  | 43 +++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_notify_failure.c | 36 ++++++++++++++++++++++++++-----
 include/linux/fs.h          |  1 +
 include/linux/mm.h          |  1 +
 include/linux/pagemap.h     |  1 +
 mm/truncate.c               | 20 +++++++++++++++--
 8 files changed, 99 insertions(+), 41 deletions(-)

-- 
2.37.3


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/3] xfs: fix the calculation of length and end
  2022-09-25 13:33 [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
@ 2022-09-25 13:33 ` Shiyang Ruan
  2022-09-25 13:33 ` [PATCH 2/3] fs: move drop_pagecache_sb() for others to use Shiyang Ruan
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Shiyang Ruan @ 2022-09-25 13:33 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, david, dan.j.williams, hch

The end should be start + length - 1.  Also fix the calculation of the
length when seeking for intersection of notify range and device.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_notify_failure.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index c4078d0ec108..3830f908e215 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -114,7 +114,7 @@ xfs_dax_notify_ddev_failure(
 	int			error = 0;
 	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
 	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
-	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen);
+	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen - 1);
 	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
 
 	error = xfs_trans_alloc_empty(mp, &tp);
@@ -210,7 +210,7 @@ xfs_dax_notify_failure(
 	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
 	/* Ignore the range out of filesystem area */
-	if (offset + len < ddev_start)
+	if (offset + len - 1 < ddev_start)
 		return -ENXIO;
 	if (offset > ddev_end)
 		return -ENXIO;
@@ -222,8 +222,8 @@ xfs_dax_notify_failure(
 		len -= ddev_start - offset;
 		offset = 0;
 	}
-	if (offset + len > ddev_end)
-		len -= ddev_end - offset;
+	if (offset + len - 1 > ddev_end)
+		len -= offset + len - 1 - ddev_end;
 
 	return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
 			mf_flags);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/3] fs: move drop_pagecache_sb() for others to use
  2022-09-25 13:33 [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
  2022-09-25 13:33 ` [PATCH 1/3] xfs: fix the calculation of length and end Shiyang Ruan
@ 2022-09-25 13:33 ` Shiyang Ruan
  2022-09-25 13:33 ` [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
  2022-09-30  3:28 ` [PATCH v9 0/3] " Shiyang Ruan
  3 siblings, 0 replies; 11+ messages in thread
From: Shiyang Ruan @ 2022-09-25 13:33 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, david, dan.j.williams, hch

xfs_notify_failure.c requires a method to invalidate all dax mappings.
drop_pagecache_sb() can do this but it is a static function and only
build with CONFIG_SYSCTL.  Now, move it to super.c and make it available
for others.  And use its second argument to choose which invalidate
method to use.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 fs/drop_caches.c        | 35 ++-------------------------------
 fs/super.c              | 43 +++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h      |  1 +
 include/linux/pagemap.h |  1 +
 mm/truncate.c           | 20 +++++++++++++++++--
 5 files changed, 65 insertions(+), 35 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index e619c31b6bd9..4c9281885077 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -15,38 +15,6 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
-static void drop_pagecache_sb(struct super_block *sb, void *unused)
-{
-	struct inode *inode, *toput_inode = NULL;
-
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		spin_lock(&inode->i_lock);
-		/*
-		 * We must skip inodes in unusual state. We may also skip
-		 * inodes without pages but we deliberately won't in case
-		 * we need to reschedule to avoid softlockups.
-		 */
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping_empty(inode->i_mapping) && !need_resched())) {
-			spin_unlock(&inode->i_lock);
-			continue;
-		}
-		__iget(inode);
-		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
-
-		invalidate_mapping_pages(inode->i_mapping, 0, -1);
-		iput(toput_inode);
-		toput_inode = inode;
-
-		cond_resched();
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(toput_inode);
-}
-
 int drop_caches_sysctl_handler(struct ctl_table *table, int write,
 		void *buffer, size_t *length, loff_t *ppos)
 {
@@ -59,7 +27,8 @@ int drop_caches_sysctl_handler(struct ctl_table *table, int write,
 		static int stfu;
 
 		if (sysctl_drop_caches & 1) {
-			iterate_supers(drop_pagecache_sb, NULL);
+			iterate_supers(super_drop_pagecache,
+				       invalidate_inode_pages);
 			count_vm_event(DROP_PAGECACHE);
 		}
 		if (sysctl_drop_caches & 2) {
diff --git a/fs/super.c b/fs/super.c
index 734ed584a946..7cdbf146bc31 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -36,6 +36,7 @@
 #include <linux/lockdep.h>
 #include <linux/user_namespace.h>
 #include <linux/fs_context.h>
+#include <linux/pagemap.h>
 #include <uapi/linux/mount.h>
 #include "internal.h"
 
@@ -677,6 +678,48 @@ void drop_super_exclusive(struct super_block *sb)
 }
 EXPORT_SYMBOL(drop_super_exclusive);
 
+/*
+ *	super_drop_pagecache - drop all page caches of a filesystem
+ *	@sb: superblock to invalidate
+ *	@arg: invalidate method, such as invalidate_inode_pages(),
+ *	        invalidate_inode_pages2()
+ *
+ *	Scans the inodes of a filesystem, drop all page caches.
+ */
+void super_drop_pagecache(struct super_block *sb, void *arg)
+{
+	struct inode *inode, *toput_inode = NULL;
+	int (*invalidator)(struct address_space *) = arg;
+
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		spin_lock(&inode->i_lock);
+		/*
+		 * We must skip inodes in unusual state. We may also skip
+		 * inodes without pages but we deliberately won't in case
+		 * we need to reschedule to avoid softlockups.
+		 */
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		    (mapping_empty(inode->i_mapping) && !need_resched())) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb->s_inode_list_lock);
+
+		invalidator(inode->i_mapping);
+		iput(toput_inode);
+		toput_inode = inode;
+
+		cond_resched();
+		spin_lock(&sb->s_inode_list_lock);
+	}
+	spin_unlock(&sb->s_inode_list_lock);
+	iput(toput_inode);
+}
+EXPORT_SYMBOL(super_drop_pagecache);
+
 static void __iterate_supers(void (*f)(struct super_block *))
 {
 	struct super_block *sb, *p = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9eced4cc286e..0e60c494688e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3292,6 +3292,7 @@ extern struct super_block *get_super(struct block_device *);
 extern struct super_block *get_active_super(struct block_device *bdev);
 extern void drop_super(struct super_block *sb);
 extern void drop_super_exclusive(struct super_block *sb);
+void super_drop_pagecache(struct super_block *sb, void *unused);
 extern void iterate_supers(void (*)(struct super_block *, void *), void *);
 extern void iterate_supers_type(struct file_system_type *,
 			        void (*)(struct super_block *, void *), void *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0178b2040ea3..8879c141b117 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -27,6 +27,7 @@ static inline void invalidate_remote_inode(struct inode *inode)
 	    S_ISLNK(inode->i_mode))
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 }
+int invalidate_inode_pages(struct address_space *mapping);
 int invalidate_inode_pages2(struct address_space *mapping);
 int invalidate_inode_pages2_range(struct address_space *mapping,
 		pgoff_t start, pgoff_t end);
diff --git a/mm/truncate.c b/mm/truncate.c
index 0b0708bf935f..3016258d41e7 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -548,12 +548,13 @@ unsigned long invalidate_mapping_pagevec(struct address_space *mapping,
 }
 
 /**
- * invalidate_mapping_pages - Invalidate all clean, unlocked cache of one inode
+ * invalidate_mapping_pages - Invalidate range of clean, unlocked cache of one
+ *			      inode
  * @mapping: the address_space which holds the cache to invalidate
  * @start: the offset 'from' which to invalidate
  * @end: the offset 'to' which to invalidate (inclusive)
  *
- * This function removes pages that are clean, unmapped and unlocked,
+ * This function removes range of pages that are clean, unmapped and unlocked,
  * as well as shadow entries. It will not block on IO activity.
  *
  * If you want to remove all the pages of one inode, regardless of
@@ -568,6 +569,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 }
 EXPORT_SYMBOL(invalidate_mapping_pages);
 
+/**
+ * invalidate_inode_pages - Invalidate all clean, unlocked cache of one inode
+ * @mapping: the address_space which holds the cache to invalidate
+ *
+ * This function removes all pages that are clean, unmapped and unlocked,
+ * as well as shadow entries. It will not block on IO activity.
+ */
+int invalidate_inode_pages(struct address_space *mapping)
+{
+	invalidate_mapping_pages(mapping, 0, -1);
+
+	return 0;
+}
+EXPORT_SYMBOL(invalidate_inode_pages);
+
 /*
  * This is like invalidate_inode_page(), except it ignores the page's
  * refcount.  We do this because invalidate_inode_pages2() needs stronger
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2022-09-25 13:33 [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
  2022-09-25 13:33 ` [PATCH 1/3] xfs: fix the calculation of length and end Shiyang Ruan
  2022-09-25 13:33 ` [PATCH 2/3] fs: move drop_pagecache_sb() for others to use Shiyang Ruan
@ 2022-09-25 13:33 ` Shiyang Ruan
  2022-09-30  3:28 ` [PATCH v9 0/3] " Shiyang Ruan
  3 siblings, 0 replies; 11+ messages in thread
From: Shiyang Ruan @ 2022-09-25 13:33 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, david, dan.j.williams, hch

This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask filesystem
(or mapped device) on it to unmap all files in use and notify processes
who are using those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()   # was pmem driver ->remove() in v1
    -> kill_dax()
     -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
      -> xfs_dax_notify_failure()

Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
event.  So do not shutdown filesystem directly if something not
supported, or if failure range includes metadata area.  Make sure all
files and processes are handled correctly.

[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 drivers/dax/super.c         |  3 ++-
 fs/xfs/xfs_notify_failure.c | 28 +++++++++++++++++++++++++++-
 include/linux/mm.h          |  1 +
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9b5e2a5eb0ae..cf9a64563fbe 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
 		return;
 
 	if (dax_dev->holder_data != NULL)
-		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+				MF_MEM_PRE_REMOVE);
 
 	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
 	synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 3830f908e215..5c1e678a1285 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
 
 #include <linux/mm.h>
 #include <linux/dax.h>
+#include <linux/fs.h>
 
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
@@ -77,6 +78,9 @@ xfs_dax_failure_fn(
 
 	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
 	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+		/* The device is about to be removed.  Not a really failure. */
+		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		notify->want_shutdown = true;
 		return 0;
 	}
@@ -168,7 +172,9 @@ xfs_dax_notify_ddev_failure(
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		if (!error)
 			error = -EFSCORRUPTED;
-	}
+	} else if (mf_flags & MF_MEM_PRE_REMOVE)
+		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+
 	return error;
 }
 
@@ -182,12 +188,24 @@ xfs_dax_notify_failure(
 	struct xfs_mount	*mp = dax_holder(dax_dev);
 	u64			ddev_start;
 	u64			ddev_end;
+	int			error;
 
 	if (!(mp->m_super->s_flags & SB_BORN)) {
 		xfs_warn(mp, "filesystem is not ready for notify_failure()!");
 		return -EIO;
 	}
 
+	if (mf_flags & MF_MEM_PRE_REMOVE) {
+		xfs_info(mp, "device is about to be removed!");
+		down_write(&mp->m_super->s_umount);
+		error = sync_filesystem(mp->m_super);
+		/* invalidate_inode_pages2() invalidates dax mapping */
+		super_drop_pagecache(mp->m_super, invalidate_inode_pages2);
+		up_write(&mp->m_super->s_umount);
+		if (error)
+			return error;
+	}
+
 	if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
 		xfs_debug(mp,
 			 "notify_failure() not supported on realtime device!");
@@ -196,6 +214,8 @@ xfs_dax_notify_failure(
 
 	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
 	    mp->m_logdev_targp != mp->m_ddev_targp) {
+		if (mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		return -EFSCORRUPTED;
@@ -209,6 +229,12 @@ xfs_dax_notify_failure(
 	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
 	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
+	/* Notify failure on the whole device */
+	if (offset == 0 && len == U64_MAX) {
+		offset = ddev_start;
+		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+	}
+
 	/* Ignore the range out of filesystem area */
 	if (offset + len - 1 < ddev_start)
 		return -ENXIO;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21f8b27bd9fd..9122a1c57dd2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3183,6 +3183,7 @@ enum mf_flags {
 	MF_UNPOISON = 1 << 4,
 	MF_SW_SIMULATED = 1 << 5,
 	MF_NO_RETRY = 1 << 6,
+	MF_MEM_PRE_REMOVE = 1 << 7,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		      unsigned long count, int mf_flags);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2022-09-25 13:33 [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
                   ` (2 preceding siblings ...)
  2022-09-25 13:33 ` [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
@ 2022-09-30  3:28 ` Shiyang Ruan
  2022-10-13 10:07   ` Shiyang Ruan
  3 siblings, 1 reply; 11+ messages in thread
From: Shiyang Ruan @ 2022-09-30  3:28 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, david, dan.j.williams, hch

Hi,

Ping

在 2022/9/25 21:33, Shiyang Ruan 写道:
> Changes since v8:
>    1. P2: rename drop_pagecache_sb() to super_drop_pagecache().
>    2. P2: let super_drop_pagecache() accept invalidate method.
>    3. P3: invalidate all dax mappings by invalidate_inode_pages2().
>    4. P3: shutdown the filesystem when it is to be removed.
>    5. Rebase on 6.0-rc6 + Darrick's patch[1] + Dan's patch[2].
> 
> [1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
> [2]: https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.stgit@dwillia2-xfh.jf.intel.com/
> 
> Shiyang Ruan (3):
>    xfs: fix the calculation of length and end
>    fs: move drop_pagecache_sb() for others to use
>    mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
> 
>   drivers/dax/super.c         |  3 ++-
>   fs/drop_caches.c            | 35 ++----------------------------
>   fs/super.c                  | 43 +++++++++++++++++++++++++++++++++++++
>   fs/xfs/xfs_notify_failure.c | 36 ++++++++++++++++++++++++++-----
>   include/linux/fs.h          |  1 +
>   include/linux/mm.h          |  1 +
>   include/linux/pagemap.h     |  1 +
>   mm/truncate.c               | 20 +++++++++++++++--
>   8 files changed, 99 insertions(+), 41 deletions(-)
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2022-09-30  3:28 ` [PATCH v9 0/3] " Shiyang Ruan
@ 2022-10-13 10:07   ` Shiyang Ruan
  0 siblings, 0 replies; 11+ messages in thread
From: Shiyang Ruan @ 2022-10-13 10:07 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, david, dan.j.williams, hch

Ping again~

在 2022/9/30 11:28, Shiyang Ruan 写道:
> Hi,
> 
> Ping
> 
> 在 2022/9/25 21:33, Shiyang Ruan 写道:
>> Changes since v8:
>>    1. P2: rename drop_pagecache_sb() to super_drop_pagecache().
>>    2. P2: let super_drop_pagecache() accept invalidate method.
>>    3. P3: invalidate all dax mappings by invalidate_inode_pages2().
>>    4. P3: shutdown the filesystem when it is to be removed.
>>    5. Rebase on 6.0-rc6 + Darrick's patch[1] + Dan's patch[2].
>>
>> [1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/
>> [2]: 
>> https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.stgit@dwillia2-xfh.jf.intel.com/
>>
>> Shiyang Ruan (3):
>>    xfs: fix the calculation of length and end
>>    fs: move drop_pagecache_sb() for others to use
>>    mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
>>
>>   drivers/dax/super.c         |  3 ++-
>>   fs/drop_caches.c            | 35 ++----------------------------
>>   fs/super.c                  | 43 +++++++++++++++++++++++++++++++++++++
>>   fs/xfs/xfs_notify_failure.c | 36 ++++++++++++++++++++++++++-----
>>   include/linux/fs.h          |  1 +
>>   include/linux/mm.h          |  1 +
>>   include/linux/pagemap.h     |  1 +
>>   mm/truncate.c               | 20 +++++++++++++++--
>>   8 files changed, 99 insertions(+), 41 deletions(-)
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2022-09-20  2:45     ` Dave Chinner
@ 2022-09-21  0:58       ` Darrick J. Wong
  0 siblings, 0 replies; 11+ messages in thread
From: Darrick J. Wong @ 2022-09-21  0:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Shiyang Ruan, linux-kernel, linux-xfs, nvdimm, linux-mm,
	linux-fsdevel, dan.j.williams, hch, jane.chu

On Tue, Sep 20, 2022 at 12:45:19PM +1000, Dave Chinner wrote:
> On Fri, Sep 02, 2022 at 10:36:01AM +0000, Shiyang Ruan wrote:
> > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > dev_pagemap_failure()"[1].  With the help of dax_holder and
> > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > (or mapped device) on it to unmap all files in use and notify processes
> > who are using those files.
> > 
> > Call trace:
> > trigger unbind
> >  -> unbind_store()
> >   -> ... (skip)
> >    -> devres_release_all()   # was pmem driver ->remove() in v1
> >     -> kill_dax()
> >      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> >       -> xfs_dax_notify_failure()
> > 
> > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> > event.  So do not shutdown filesystem directly if something not
> > supported, or if failure range includes metadata area.  Make sure all
> > files and processes are handled correctly.
> > 
> > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> > 
> > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > ---
> >  drivers/dax/super.c         |  3 ++-
> >  fs/xfs/xfs_notify_failure.c | 23 +++++++++++++++++++++++
> >  include/linux/mm.h          |  1 +
> >  3 files changed, 26 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index 9b5e2a5eb0ae..cf9a64563fbe 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> >  		return;
> >  
> >  	if (dax_dev->holder_data != NULL)
> > -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > +				MF_MEM_PRE_REMOVE);
> >  
> >  	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> >  	synchronize_srcu(&dax_srcu);
> > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > index 3830f908e215..5e04ba7fa403 100644
> > --- a/fs/xfs/xfs_notify_failure.c
> > +++ b/fs/xfs/xfs_notify_failure.c
> > @@ -22,6 +22,7 @@
> >  
> >  #include <linux/mm.h>
> >  #include <linux/dax.h>
> > +#include <linux/fs.h>
> >  
> >  struct xfs_failure_info {
> >  	xfs_agblock_t		startblock;
> > @@ -77,6 +78,9 @@ xfs_dax_failure_fn(
> >  
> >  	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> >  	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> > +		/* The device is about to be removed.  Not a really failure. */
> > +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > +			return 0;
> >  		notify->want_shutdown = true;
> >  		return 0;
> >  	}
> > @@ -182,12 +186,23 @@ xfs_dax_notify_failure(
> >  	struct xfs_mount	*mp = dax_holder(dax_dev);
> >  	u64			ddev_start;
> >  	u64			ddev_end;
> > +	int			error;
> >  
> >  	if (!(mp->m_super->s_flags & SB_BORN)) {
> >  		xfs_warn(mp, "filesystem is not ready for notify_failure()!");
> >  		return -EIO;
> >  	}
> >  
> > +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> > +		xfs_info(mp, "device is about to be removed!");
> > +		down_write(&mp->m_super->s_umount);
> > +		error = sync_filesystem(mp->m_super);
> > +		drop_pagecache_sb(mp->m_super, NULL);
> > +		up_write(&mp->m_super->s_umount);
> > +		if (error)
> > +			return error;
> 
> If the device is about to go away unexpectedly, shouldn't this shut
> down the filesystem after syncing it here?  If the filesystem has
> been shut down, then everything will fail before removal finally
> triggers, and the act of unmounting the filesystem post device
> removal will clean up the page cache and all the other caches.

IIRC they want to kill all the processes with MAP_SYNC mappings sooner
than whenever the admin gets around to unmounting the filesystem, which
is why PRE_REMOVE will then go walk the rmapbt to find processes to
shoot down.  I'm not sure, though, if drop_pagecache_sb only touches
DRAM page cache or if it'll shoot down fsdax mappings too?

> IOWs, I don't understand why the page cache is considered special
> here (as opposed to, say, the inode or dentry caches), nor why we
> aren't shutting down the filesystem directly after syncing it to
> disk to ensure that we don't end up with applications losing data as
> a result of racing with the removal....

But yeah, we might as well shut down the fs at the end of PRE_REMOVE
handling, if the rmap walk hasn't already done that.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2022-09-02 10:36   ` [PATCH 3/3] " Shiyang Ruan
  2022-09-14 18:15     ` Darrick J. Wong
@ 2022-09-20  2:45     ` Dave Chinner
  2022-09-21  0:58       ` Darrick J. Wong
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2022-09-20  2:45 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel, djwong,
	dan.j.williams, hch, jane.chu

On Fri, Sep 02, 2022 at 10:36:01AM +0000, Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>    -> devres_release_all()   # was pmem driver ->remove() in v1
>     -> kill_dax()
>      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>       -> xfs_dax_notify_failure()
> 
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  So do not shutdown filesystem directly if something not
> supported, or if failure range includes metadata area.  Make sure all
> files and processes are handled correctly.
> 
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  drivers/dax/super.c         |  3 ++-
>  fs/xfs/xfs_notify_failure.c | 23 +++++++++++++++++++++++
>  include/linux/mm.h          |  1 +
>  3 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 9b5e2a5eb0ae..cf9a64563fbe 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>  		return;
>  
>  	if (dax_dev->holder_data != NULL)
> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> +				MF_MEM_PRE_REMOVE);
>  
>  	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>  	synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 3830f908e215..5e04ba7fa403 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>  
>  #include <linux/mm.h>
>  #include <linux/dax.h>
> +#include <linux/fs.h>
>  
>  struct xfs_failure_info {
>  	xfs_agblock_t		startblock;
> @@ -77,6 +78,9 @@ xfs_dax_failure_fn(
>  
>  	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>  	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> +		/* The device is about to be removed.  Not a really failure. */
> +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> +			return 0;
>  		notify->want_shutdown = true;
>  		return 0;
>  	}
> @@ -182,12 +186,23 @@ xfs_dax_notify_failure(
>  	struct xfs_mount	*mp = dax_holder(dax_dev);
>  	u64			ddev_start;
>  	u64			ddev_end;
> +	int			error;
>  
>  	if (!(mp->m_super->s_flags & SB_BORN)) {
>  		xfs_warn(mp, "filesystem is not ready for notify_failure()!");
>  		return -EIO;
>  	}
>  
> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> +		xfs_info(mp, "device is about to be removed!");
> +		down_write(&mp->m_super->s_umount);
> +		error = sync_filesystem(mp->m_super);
> +		drop_pagecache_sb(mp->m_super, NULL);
> +		up_write(&mp->m_super->s_umount);
> +		if (error)
> +			return error;

If the device is about to go away unexpectedly, shouldn't this shut
down the filesystem after syncing it here?  If the filesystem has
been shut down, then everything will fail before removal finally
triggers, and the act of unmounting the filesystem post device
removal will clean up the page cache and all the other caches.

IOWs, I don't understand why the page cache is considered special
here (as opposed to, say, the inode or dentry caches), nor why we
aren't shutting down the filesystem directly after syncing it to
disk to ensure that we don't end up with applications losing data as
a result of racing with the removal....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2022-09-14 18:15     ` Darrick J. Wong
@ 2022-09-15  1:49       ` Shiyang Ruan
  0 siblings, 0 replies; 11+ messages in thread
From: Shiyang Ruan @ 2022-09-15  1:49 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, jane.chu



在 2022/9/15 2:15, Darrick J. Wong 写道:
> On Fri, Sep 02, 2022 at 10:36:01AM +0000, Shiyang Ruan wrote:
>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>> dev_pagemap_failure()"[1].  With the help of dax_holder and
>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>> (or mapped device) on it to unmap all files in use and notify processes
>> who are using those files.
>>
>> Call trace:
>> trigger unbind
>>   -> unbind_store()
>>    -> ... (skip)
>>     -> devres_release_all()   # was pmem driver ->remove() in v1
>>      -> kill_dax()
>>       -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>        -> xfs_dax_notify_failure()
>>
>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>> event.  So do not shutdown filesystem directly if something not
>> supported, or if failure range includes metadata area.  Make sure all
>> files and processes are handled correctly.
>>
>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> ---
>>   drivers/dax/super.c         |  3 ++-
>>   fs/xfs/xfs_notify_failure.c | 23 +++++++++++++++++++++++
>>   include/linux/mm.h          |  1 +
>>   3 files changed, 26 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index 9b5e2a5eb0ae..cf9a64563fbe 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>>   		return;
>>   
>>   	if (dax_dev->holder_data != NULL)
>> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>> +				MF_MEM_PRE_REMOVE);
>>   
>>   	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>   	synchronize_srcu(&dax_srcu);
>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>> index 3830f908e215..5e04ba7fa403 100644
>> --- a/fs/xfs/xfs_notify_failure.c
>> +++ b/fs/xfs/xfs_notify_failure.c
>> @@ -22,6 +22,7 @@
>>   
>>   #include <linux/mm.h>
>>   #include <linux/dax.h>
>> +#include <linux/fs.h>
>>   
>>   struct xfs_failure_info {
>>   	xfs_agblock_t		startblock;
>> @@ -77,6 +78,9 @@ xfs_dax_failure_fn(
>>   
>>   	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>>   	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>> +		/* The device is about to be removed.  Not a really failure. */
>> +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>> +			return 0;
>>   		notify->want_shutdown = true;
>>   		return 0;
>>   	}
>> @@ -182,12 +186,23 @@ xfs_dax_notify_failure(
>>   	struct xfs_mount	*mp = dax_holder(dax_dev);
>>   	u64			ddev_start;
>>   	u64			ddev_end;
>> +	int			error;
>>   
>>   	if (!(mp->m_super->s_flags & SB_BORN)) {
>>   		xfs_warn(mp, "filesystem is not ready for notify_failure()!");
>>   		return -EIO;
>>   	}
>>   
>> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
>> +		xfs_info(mp, "device is about to be removed!");
>> +		down_write(&mp->m_super->s_umount);
>> +		error = sync_filesystem(mp->m_super);
>> +		drop_pagecache_sb(mp->m_super, NULL);
>> +		up_write(&mp->m_super->s_umount);
>> +		if (error)
>> +			return error;
>> +	}
>> +
>>   	if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
>>   		xfs_debug(mp,
>>   			 "notify_failure() not supported on realtime device!");
>> @@ -196,6 +211,8 @@ xfs_dax_notify_failure(
>>   
>>   	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>>   	    mp->m_logdev_targp != mp->m_ddev_targp) {
>> +		if (mf_flags & MF_MEM_PRE_REMOVE)
>> +			return 0;
>>   		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>>   		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>   		return -EFSCORRUPTED;
>> @@ -209,6 +226,12 @@ xfs_dax_notify_failure(
>>   	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>>   	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>   
>> +	/* Notify failure on the whole device */
>> +	if (offset == 0 && len == U64_MAX) {
>> +		offset = ddev_start;
>> +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>> +	}
> 
> I wonder, won't the trimming code below take care of this?

The len is U64_MAX, so 'offset + len - 1' will overflow.  That can't be 
handled correctly by the trimming code below.


--
Thanks,
Ruan.

> 
> The rest of the patch looks ok to me.
> 
> --D
> 
>> +
>>   	/* Ignore the range out of filesystem area */
>>   	if (offset + len - 1 < ddev_start)
>>   		return -ENXIO;
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 21f8b27bd9fd..9122a1c57dd2 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3183,6 +3183,7 @@ enum mf_flags {
>>   	MF_UNPOISON = 1 << 4,
>>   	MF_SW_SIMULATED = 1 << 5,
>>   	MF_NO_RETRY = 1 << 6,
>> +	MF_MEM_PRE_REMOVE = 1 << 7,
>>   };
>>   int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>   		      unsigned long count, int mf_flags);
>> -- 
>> 2.37.2
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2022-09-02 10:36   ` [PATCH 3/3] " Shiyang Ruan
@ 2022-09-14 18:15     ` Darrick J. Wong
  2022-09-15  1:49       ` Shiyang Ruan
  2022-09-20  2:45     ` Dave Chinner
  1 sibling, 1 reply; 11+ messages in thread
From: Darrick J. Wong @ 2022-09-14 18:15 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, jane.chu

On Fri, Sep 02, 2022 at 10:36:01AM +0000, Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>    -> devres_release_all()   # was pmem driver ->remove() in v1
>     -> kill_dax()
>      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>       -> xfs_dax_notify_failure()
> 
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event.  So do not shutdown filesystem directly if something not
> supported, or if failure range includes metadata area.  Make sure all
> files and processes are handled correctly.
> 
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  drivers/dax/super.c         |  3 ++-
>  fs/xfs/xfs_notify_failure.c | 23 +++++++++++++++++++++++
>  include/linux/mm.h          |  1 +
>  3 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 9b5e2a5eb0ae..cf9a64563fbe 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>  		return;
>  
>  	if (dax_dev->holder_data != NULL)
> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> +				MF_MEM_PRE_REMOVE);
>  
>  	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>  	synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 3830f908e215..5e04ba7fa403 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>  
>  #include <linux/mm.h>
>  #include <linux/dax.h>
> +#include <linux/fs.h>
>  
>  struct xfs_failure_info {
>  	xfs_agblock_t		startblock;
> @@ -77,6 +78,9 @@ xfs_dax_failure_fn(
>  
>  	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>  	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> +		/* The device is about to be removed.  Not a really failure. */
> +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> +			return 0;
>  		notify->want_shutdown = true;
>  		return 0;
>  	}
> @@ -182,12 +186,23 @@ xfs_dax_notify_failure(
>  	struct xfs_mount	*mp = dax_holder(dax_dev);
>  	u64			ddev_start;
>  	u64			ddev_end;
> +	int			error;
>  
>  	if (!(mp->m_super->s_flags & SB_BORN)) {
>  		xfs_warn(mp, "filesystem is not ready for notify_failure()!");
>  		return -EIO;
>  	}
>  
> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> +		xfs_info(mp, "device is about to be removed!");
> +		down_write(&mp->m_super->s_umount);
> +		error = sync_filesystem(mp->m_super);
> +		drop_pagecache_sb(mp->m_super, NULL);
> +		up_write(&mp->m_super->s_umount);
> +		if (error)
> +			return error;
> +	}
> +
>  	if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
>  		xfs_debug(mp,
>  			 "notify_failure() not supported on realtime device!");
> @@ -196,6 +211,8 @@ xfs_dax_notify_failure(
>  
>  	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>  	    mp->m_logdev_targp != mp->m_ddev_targp) {
> +		if (mf_flags & MF_MEM_PRE_REMOVE)
> +			return 0;
>  		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>  		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>  		return -EFSCORRUPTED;
> @@ -209,6 +226,12 @@ xfs_dax_notify_failure(
>  	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>  	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>  
> +	/* Notify failure on the whole device */
> +	if (offset == 0 && len == U64_MAX) {
> +		offset = ddev_start;
> +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> +	}

I wonder, won't the trimming code below take care of this?

The rest of the patch looks ok to me.

--D

> +
>  	/* Ignore the range out of filesystem area */
>  	if (offset + len - 1 < ddev_start)
>  		return -ENXIO;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 21f8b27bd9fd..9122a1c57dd2 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3183,6 +3183,7 @@ enum mf_flags {
>  	MF_UNPOISON = 1 << 4,
>  	MF_SW_SIMULATED = 1 << 5,
>  	MF_NO_RETRY = 1 << 6,
> +	MF_MEM_PRE_REMOVE = 1 << 7,
>  };
>  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>  		      unsigned long count, int mf_flags);
> -- 
> 2.37.2
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
  2022-09-02 10:35 ` [PATCH v8 0/3] " Shiyang Ruan
@ 2022-09-02 10:36   ` Shiyang Ruan
  2022-09-14 18:15     ` Darrick J. Wong
  2022-09-20  2:45     ` Dave Chinner
  0 siblings, 2 replies; 11+ messages in thread
From: Shiyang Ruan @ 2022-09-02 10:36 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu

This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask filesystem
(or mapped device) on it to unmap all files in use and notify processes
who are using those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()   # was pmem driver ->remove() in v1
    -> kill_dax()
     -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
      -> xfs_dax_notify_failure()

Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
event.  So do not shutdown filesystem directly if something not
supported, or if failure range includes metadata area.  Make sure all
files and processes are handled correctly.

[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 drivers/dax/super.c         |  3 ++-
 fs/xfs/xfs_notify_failure.c | 23 +++++++++++++++++++++++
 include/linux/mm.h          |  1 +
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9b5e2a5eb0ae..cf9a64563fbe 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
 		return;
 
 	if (dax_dev->holder_data != NULL)
-		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+				MF_MEM_PRE_REMOVE);
 
 	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
 	synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 3830f908e215..5e04ba7fa403 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
 
 #include <linux/mm.h>
 #include <linux/dax.h>
+#include <linux/fs.h>
 
 struct xfs_failure_info {
 	xfs_agblock_t		startblock;
@@ -77,6 +78,9 @@ xfs_dax_failure_fn(
 
 	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
 	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+		/* The device is about to be removed.  Not a really failure. */
+		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		notify->want_shutdown = true;
 		return 0;
 	}
@@ -182,12 +186,23 @@ xfs_dax_notify_failure(
 	struct xfs_mount	*mp = dax_holder(dax_dev);
 	u64			ddev_start;
 	u64			ddev_end;
+	int			error;
 
 	if (!(mp->m_super->s_flags & SB_BORN)) {
 		xfs_warn(mp, "filesystem is not ready for notify_failure()!");
 		return -EIO;
 	}
 
+	if (mf_flags & MF_MEM_PRE_REMOVE) {
+		xfs_info(mp, "device is about to be removed!");
+		down_write(&mp->m_super->s_umount);
+		error = sync_filesystem(mp->m_super);
+		drop_pagecache_sb(mp->m_super, NULL);
+		up_write(&mp->m_super->s_umount);
+		if (error)
+			return error;
+	}
+
 	if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
 		xfs_debug(mp,
 			 "notify_failure() not supported on realtime device!");
@@ -196,6 +211,8 @@ xfs_dax_notify_failure(
 
 	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
 	    mp->m_logdev_targp != mp->m_ddev_targp) {
+		if (mf_flags & MF_MEM_PRE_REMOVE)
+			return 0;
 		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
 		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
 		return -EFSCORRUPTED;
@@ -209,6 +226,12 @@ xfs_dax_notify_failure(
 	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
 	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
 
+	/* Notify failure on the whole device */
+	if (offset == 0 && len == U64_MAX) {
+		offset = ddev_start;
+		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+	}
+
 	/* Ignore the range out of filesystem area */
 	if (offset + len - 1 < ddev_start)
 		return -ENXIO;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21f8b27bd9fd..9122a1c57dd2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3183,6 +3183,7 @@ enum mf_flags {
 	MF_UNPOISON = 1 << 4,
 	MF_SW_SIMULATED = 1 << 5,
 	MF_NO_RETRY = 1 << 6,
+	MF_MEM_PRE_REMOVE = 1 << 7,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 		      unsigned long count, int mf_flags);
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-10-13 10:08 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-25 13:33 [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2022-09-25 13:33 ` [PATCH 1/3] xfs: fix the calculation of length and end Shiyang Ruan
2022-09-25 13:33 ` [PATCH 2/3] fs: move drop_pagecache_sb() for others to use Shiyang Ruan
2022-09-25 13:33 ` [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2022-09-30  3:28 ` [PATCH v9 0/3] " Shiyang Ruan
2022-10-13 10:07   ` Shiyang Ruan
  -- strict thread matches above, loose matches on Subject: below --
2022-08-26 13:03 [PATCH v7] " Shiyang Ruan
2022-09-02 10:35 ` [PATCH v8 0/3] " Shiyang Ruan
2022-09-02 10:36   ` [PATCH 3/3] " Shiyang Ruan
2022-09-14 18:15     ` Darrick J. Wong
2022-09-15  1:49       ` Shiyang Ruan
2022-09-20  2:45     ` Dave Chinner
2022-09-21  0:58       ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).