* [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind @ 2022-09-25 13:33 Shiyang Ruan 2022-09-25 13:33 ` [PATCH 1/3] xfs: fix the calculation of length and end Shiyang Ruan ` (3 more replies) 0 siblings, 4 replies; 11+ messages in thread From: Shiyang Ruan @ 2022-09-25 13:33 UTC (permalink / raw) To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel Cc: djwong, david, dan.j.williams, hch Changes since v8: 1. P2: rename drop_pagecache_sb() to super_drop_pagecache(). 2. P2: let super_drop_pagecache() accept invalidate method. 3. P3: invalidate all dax mappings by invalidate_inode_pages2(). 4. P3: shutdown the filesystem when it is to be removed. 5. Rebase on 6.0-rc6 + Darrick's patch[1] + Dan's patch[2]. [1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/ [2]: https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.stgit@dwillia2-xfh.jf.intel.com/ Shiyang Ruan (3): xfs: fix the calculation of length and end fs: move drop_pagecache_sb() for others to use mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind drivers/dax/super.c | 3 ++- fs/drop_caches.c | 35 ++---------------------------- fs/super.c | 43 +++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_notify_failure.c | 36 ++++++++++++++++++++++++++----- include/linux/fs.h | 1 + include/linux/mm.h | 1 + include/linux/pagemap.h | 1 + mm/truncate.c | 20 +++++++++++++++-- 8 files changed, 99 insertions(+), 41 deletions(-) -- 2.37.3 ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 1/3] xfs: fix the calculation of length and end 2022-09-25 13:33 [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan @ 2022-09-25 13:33 ` Shiyang Ruan 2022-09-25 13:33 ` [PATCH 2/3] fs: move drop_pagecache_sb() for others to use Shiyang Ruan ` (2 subsequent siblings) 3 siblings, 0 replies; 11+ messages in thread From: Shiyang Ruan @ 2022-09-25 13:33 UTC (permalink / raw) To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel Cc: djwong, david, dan.j.williams, hch The end should be start + length - 1. Also fix the calculation of the length when seeking for intersection of notify range and device. Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> --- fs/xfs/xfs_notify_failure.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c index c4078d0ec108..3830f908e215 100644 --- a/fs/xfs/xfs_notify_failure.c +++ b/fs/xfs/xfs_notify_failure.c @@ -114,7 +114,7 @@ xfs_dax_notify_ddev_failure( int error = 0; xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, daddr); xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno); - xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen); + xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen - 1); xfs_agnumber_t end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno); error = xfs_trans_alloc_empty(mp, &tp); @@ -210,7 +210,7 @@ xfs_dax_notify_failure( ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1; /* Ignore the range out of filesystem area */ - if (offset + len < ddev_start) + if (offset + len - 1 < ddev_start) return -ENXIO; if (offset > ddev_end) return -ENXIO; @@ -222,8 +222,8 @@ xfs_dax_notify_failure( len -= ddev_start - offset; offset = 0; } - if (offset + len > ddev_end) - len -= ddev_end - offset; + if (offset + len - 1 > ddev_end) + len -= offset + len - 1 - ddev_end; return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len), mf_flags); -- 2.37.3 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 2/3] fs: move drop_pagecache_sb() for others to use 2022-09-25 13:33 [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan 2022-09-25 13:33 ` [PATCH 1/3] xfs: fix the calculation of length and end Shiyang Ruan @ 2022-09-25 13:33 ` Shiyang Ruan 2022-09-25 13:33 ` [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan 2022-09-30 3:28 ` [PATCH v9 0/3] " Shiyang Ruan 3 siblings, 0 replies; 11+ messages in thread From: Shiyang Ruan @ 2022-09-25 13:33 UTC (permalink / raw) To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel Cc: djwong, david, dan.j.williams, hch xfs_notify_failure.c requires a method to invalidate all dax mappings. drop_pagecache_sb() can do this but it is a static function and only build with CONFIG_SYSCTL. Now, move it to super.c and make it available for others. And use its second argument to choose which invalidate method to use. Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> --- fs/drop_caches.c | 35 ++------------------------------- fs/super.c | 43 +++++++++++++++++++++++++++++++++++++++++ include/linux/fs.h | 1 + include/linux/pagemap.h | 1 + mm/truncate.c | 20 +++++++++++++++++-- 5 files changed, 65 insertions(+), 35 deletions(-) diff --git a/fs/drop_caches.c b/fs/drop_caches.c index e619c31b6bd9..4c9281885077 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -15,38 +15,6 @@ /* A global variable is a bit ugly, but it keeps the code simple */ int sysctl_drop_caches; -static void drop_pagecache_sb(struct super_block *sb, void *unused) -{ - struct inode *inode, *toput_inode = NULL; - - spin_lock(&sb->s_inode_list_lock); - list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { - spin_lock(&inode->i_lock); - /* - * We must skip inodes in unusual state. We may also skip - * inodes without pages but we deliberately won't in case - * we need to reschedule to avoid softlockups. - */ - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) || - (mapping_empty(inode->i_mapping) && !need_resched())) { - spin_unlock(&inode->i_lock); - continue; - } - __iget(inode); - spin_unlock(&inode->i_lock); - spin_unlock(&sb->s_inode_list_lock); - - invalidate_mapping_pages(inode->i_mapping, 0, -1); - iput(toput_inode); - toput_inode = inode; - - cond_resched(); - spin_lock(&sb->s_inode_list_lock); - } - spin_unlock(&sb->s_inode_list_lock); - iput(toput_inode); -} - int drop_caches_sysctl_handler(struct ctl_table *table, int write, void *buffer, size_t *length, loff_t *ppos) { @@ -59,7 +27,8 @@ int drop_caches_sysctl_handler(struct ctl_table *table, int write, static int stfu; if (sysctl_drop_caches & 1) { - iterate_supers(drop_pagecache_sb, NULL); + iterate_supers(super_drop_pagecache, + invalidate_inode_pages); count_vm_event(DROP_PAGECACHE); } if (sysctl_drop_caches & 2) { diff --git a/fs/super.c b/fs/super.c index 734ed584a946..7cdbf146bc31 100644 --- a/fs/super.c +++ b/fs/super.c @@ -36,6 +36,7 @@ #include <linux/lockdep.h> #include <linux/user_namespace.h> #include <linux/fs_context.h> +#include <linux/pagemap.h> #include <uapi/linux/mount.h> #include "internal.h" @@ -677,6 +678,48 @@ void drop_super_exclusive(struct super_block *sb) } EXPORT_SYMBOL(drop_super_exclusive); +/* + * super_drop_pagecache - drop all page caches of a filesystem + * @sb: superblock to invalidate + * @arg: invalidate method, such as invalidate_inode_pages(), + * invalidate_inode_pages2() + * + * Scans the inodes of a filesystem, drop all page caches. + */ +void super_drop_pagecache(struct super_block *sb, void *arg) +{ + struct inode *inode, *toput_inode = NULL; + int (*invalidator)(struct address_space *) = arg; + + spin_lock(&sb->s_inode_list_lock); + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { + spin_lock(&inode->i_lock); + /* + * We must skip inodes in unusual state. We may also skip + * inodes without pages but we deliberately won't in case + * we need to reschedule to avoid softlockups. + */ + if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) || + (mapping_empty(inode->i_mapping) && !need_resched())) { + spin_unlock(&inode->i_lock); + continue; + } + __iget(inode); + spin_unlock(&inode->i_lock); + spin_unlock(&sb->s_inode_list_lock); + + invalidator(inode->i_mapping); + iput(toput_inode); + toput_inode = inode; + + cond_resched(); + spin_lock(&sb->s_inode_list_lock); + } + spin_unlock(&sb->s_inode_list_lock); + iput(toput_inode); +} +EXPORT_SYMBOL(super_drop_pagecache); + static void __iterate_supers(void (*f)(struct super_block *)) { struct super_block *sb, *p = NULL; diff --git a/include/linux/fs.h b/include/linux/fs.h index 9eced4cc286e..0e60c494688e 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3292,6 +3292,7 @@ extern struct super_block *get_super(struct block_device *); extern struct super_block *get_active_super(struct block_device *bdev); extern void drop_super(struct super_block *sb); extern void drop_super_exclusive(struct super_block *sb); +void super_drop_pagecache(struct super_block *sb, void *unused); extern void iterate_supers(void (*)(struct super_block *, void *), void *); extern void iterate_supers_type(struct file_system_type *, void (*)(struct super_block *, void *), void *); diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 0178b2040ea3..8879c141b117 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -27,6 +27,7 @@ static inline void invalidate_remote_inode(struct inode *inode) S_ISLNK(inode->i_mode)) invalidate_mapping_pages(inode->i_mapping, 0, -1); } +int invalidate_inode_pages(struct address_space *mapping); int invalidate_inode_pages2(struct address_space *mapping); int invalidate_inode_pages2_range(struct address_space *mapping, pgoff_t start, pgoff_t end); diff --git a/mm/truncate.c b/mm/truncate.c index 0b0708bf935f..3016258d41e7 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -548,12 +548,13 @@ unsigned long invalidate_mapping_pagevec(struct address_space *mapping, } /** - * invalidate_mapping_pages - Invalidate all clean, unlocked cache of one inode + * invalidate_mapping_pages - Invalidate range of clean, unlocked cache of one + * inode * @mapping: the address_space which holds the cache to invalidate * @start: the offset 'from' which to invalidate * @end: the offset 'to' which to invalidate (inclusive) * - * This function removes pages that are clean, unmapped and unlocked, + * This function removes range of pages that are clean, unmapped and unlocked, * as well as shadow entries. It will not block on IO activity. * * If you want to remove all the pages of one inode, regardless of @@ -568,6 +569,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping, } EXPORT_SYMBOL(invalidate_mapping_pages); +/** + * invalidate_inode_pages - Invalidate all clean, unlocked cache of one inode + * @mapping: the address_space which holds the cache to invalidate + * + * This function removes all pages that are clean, unmapped and unlocked, + * as well as shadow entries. It will not block on IO activity. + */ +int invalidate_inode_pages(struct address_space *mapping) +{ + invalidate_mapping_pages(mapping, 0, -1); + + return 0; +} +EXPORT_SYMBOL(invalidate_inode_pages); + /* * This is like invalidate_inode_page(), except it ignores the page's * refcount. We do this because invalidate_inode_pages2() needs stronger -- 2.37.3 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind 2022-09-25 13:33 [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan 2022-09-25 13:33 ` [PATCH 1/3] xfs: fix the calculation of length and end Shiyang Ruan 2022-09-25 13:33 ` [PATCH 2/3] fs: move drop_pagecache_sb() for others to use Shiyang Ruan @ 2022-09-25 13:33 ` Shiyang Ruan 2022-09-30 3:28 ` [PATCH v9 0/3] " Shiyang Ruan 3 siblings, 0 replies; 11+ messages in thread From: Shiyang Ruan @ 2022-09-25 13:33 UTC (permalink / raw) To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel Cc: djwong, david, dan.j.williams, hch This patch is inspired by Dan's "mm, dax, pmem: Introduce dev_pagemap_failure()"[1]. With the help of dax_holder and ->notify_failure() mechanism, the pmem driver is able to ask filesystem (or mapped device) on it to unmap all files in use and notify processes who are using those files. Call trace: trigger unbind -> unbind_store() -> ... (skip) -> devres_release_all() # was pmem driver ->remove() in v1 -> kill_dax() -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) -> xfs_dax_notify_failure() Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove event. So do not shutdown filesystem directly if something not supported, or if failure range includes metadata area. Make sure all files and processes are handled correctly. [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/ Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> --- drivers/dax/super.c | 3 ++- fs/xfs/xfs_notify_failure.c | 28 +++++++++++++++++++++++++++- include/linux/mm.h | 1 + 3 files changed, 30 insertions(+), 2 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 9b5e2a5eb0ae..cf9a64563fbe 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev) return; if (dax_dev->holder_data != NULL) - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0); + dax_holder_notify_failure(dax_dev, 0, U64_MAX, + MF_MEM_PRE_REMOVE); clear_bit(DAXDEV_ALIVE, &dax_dev->flags); synchronize_srcu(&dax_srcu); diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c index 3830f908e215..5c1e678a1285 100644 --- a/fs/xfs/xfs_notify_failure.c +++ b/fs/xfs/xfs_notify_failure.c @@ -22,6 +22,7 @@ #include <linux/mm.h> #include <linux/dax.h> +#include <linux/fs.h> struct xfs_failure_info { xfs_agblock_t startblock; @@ -77,6 +78,9 @@ xfs_dax_failure_fn( if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) || (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) { + /* The device is about to be removed. Not a really failure. */ + if (notify->mf_flags & MF_MEM_PRE_REMOVE) + return 0; notify->want_shutdown = true; return 0; } @@ -168,7 +172,9 @@ xfs_dax_notify_ddev_failure( xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK); if (!error) error = -EFSCORRUPTED; - } + } else if (mf_flags & MF_MEM_PRE_REMOVE) + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT); + return error; } @@ -182,12 +188,24 @@ xfs_dax_notify_failure( struct xfs_mount *mp = dax_holder(dax_dev); u64 ddev_start; u64 ddev_end; + int error; if (!(mp->m_super->s_flags & SB_BORN)) { xfs_warn(mp, "filesystem is not ready for notify_failure()!"); return -EIO; } + if (mf_flags & MF_MEM_PRE_REMOVE) { + xfs_info(mp, "device is about to be removed!"); + down_write(&mp->m_super->s_umount); + error = sync_filesystem(mp->m_super); + /* invalidate_inode_pages2() invalidates dax mapping */ + super_drop_pagecache(mp->m_super, invalidate_inode_pages2); + up_write(&mp->m_super->s_umount); + if (error) + return error; + } + if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) { xfs_debug(mp, "notify_failure() not supported on realtime device!"); @@ -196,6 +214,8 @@ xfs_dax_notify_failure( if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev && mp->m_logdev_targp != mp->m_ddev_targp) { + if (mf_flags & MF_MEM_PRE_REMOVE) + return 0; xfs_err(mp, "ondisk log corrupt, shutting down fs!"); xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK); return -EFSCORRUPTED; @@ -209,6 +229,12 @@ xfs_dax_notify_failure( ddev_start = mp->m_ddev_targp->bt_dax_part_off; ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1; + /* Notify failure on the whole device */ + if (offset == 0 && len == U64_MAX) { + offset = ddev_start; + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev); + } + /* Ignore the range out of filesystem area */ if (offset + len - 1 < ddev_start) return -ENXIO; diff --git a/include/linux/mm.h b/include/linux/mm.h index 21f8b27bd9fd..9122a1c57dd2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3183,6 +3183,7 @@ enum mf_flags { MF_UNPOISON = 1 << 4, MF_SW_SIMULATED = 1 << 5, MF_NO_RETRY = 1 << 6, + MF_MEM_PRE_REMOVE = 1 << 7, }; int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index, unsigned long count, int mf_flags); -- 2.37.3 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind 2022-09-25 13:33 [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan ` (2 preceding siblings ...) 2022-09-25 13:33 ` [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan @ 2022-09-30 3:28 ` Shiyang Ruan 2022-10-13 10:07 ` Shiyang Ruan 3 siblings, 1 reply; 11+ messages in thread From: Shiyang Ruan @ 2022-09-30 3:28 UTC (permalink / raw) To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel Cc: djwong, david, dan.j.williams, hch Hi, Ping 在 2022/9/25 21:33, Shiyang Ruan 写道: > Changes since v8: > 1. P2: rename drop_pagecache_sb() to super_drop_pagecache(). > 2. P2: let super_drop_pagecache() accept invalidate method. > 3. P3: invalidate all dax mappings by invalidate_inode_pages2(). > 4. P3: shutdown the filesystem when it is to be removed. > 5. Rebase on 6.0-rc6 + Darrick's patch[1] + Dan's patch[2]. > > [1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/ > [2]: https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.stgit@dwillia2-xfh.jf.intel.com/ > > Shiyang Ruan (3): > xfs: fix the calculation of length and end > fs: move drop_pagecache_sb() for others to use > mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind > > drivers/dax/super.c | 3 ++- > fs/drop_caches.c | 35 ++---------------------------- > fs/super.c | 43 +++++++++++++++++++++++++++++++++++++ > fs/xfs/xfs_notify_failure.c | 36 ++++++++++++++++++++++++++----- > include/linux/fs.h | 1 + > include/linux/mm.h | 1 + > include/linux/pagemap.h | 1 + > mm/truncate.c | 20 +++++++++++++++-- > 8 files changed, 99 insertions(+), 41 deletions(-) > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind 2022-09-30 3:28 ` [PATCH v9 0/3] " Shiyang Ruan @ 2022-10-13 10:07 ` Shiyang Ruan 0 siblings, 0 replies; 11+ messages in thread From: Shiyang Ruan @ 2022-10-13 10:07 UTC (permalink / raw) To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel Cc: djwong, david, dan.j.williams, hch Ping again~ 在 2022/9/30 11:28, Shiyang Ruan 写道: > Hi, > > Ping > > 在 2022/9/25 21:33, Shiyang Ruan 写道: >> Changes since v8: >> 1. P2: rename drop_pagecache_sb() to super_drop_pagecache(). >> 2. P2: let super_drop_pagecache() accept invalidate method. >> 3. P3: invalidate all dax mappings by invalidate_inode_pages2(). >> 4. P3: shutdown the filesystem when it is to be removed. >> 5. Rebase on 6.0-rc6 + Darrick's patch[1] + Dan's patch[2]. >> >> [1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/ >> [2]: >> https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.stgit@dwillia2-xfh.jf.intel.com/ >> >> Shiyang Ruan (3): >> xfs: fix the calculation of length and end >> fs: move drop_pagecache_sb() for others to use >> mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind >> >> drivers/dax/super.c | 3 ++- >> fs/drop_caches.c | 35 ++---------------------------- >> fs/super.c | 43 +++++++++++++++++++++++++++++++++++++ >> fs/xfs/xfs_notify_failure.c | 36 ++++++++++++++++++++++++++----- >> include/linux/fs.h | 1 + >> include/linux/mm.h | 1 + >> include/linux/pagemap.h | 1 + >> mm/truncate.c | 20 +++++++++++++++-- >> 8 files changed, 99 insertions(+), 41 deletions(-) >> ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v7] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind @ 2022-08-26 13:03 Shiyang Ruan 2022-09-02 10:35 ` [PATCH v8 0/3] " Shiyang Ruan 0 siblings, 1 reply; 11+ messages in thread From: Shiyang Ruan @ 2022-08-26 13:03 UTC (permalink / raw) To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel Cc: djwong, dan.j.williams, david, hch, jane.chu This patch is inspired by Dan's "mm, dax, pmem: Introduce dev_pagemap_failure()"[1]. With the help of dax_holder and ->notify_failure() mechanism, the pmem driver is able to ask filesystem (or mapped device) on it to unmap all files in use and notify processes who are using those files. Call trace: trigger unbind -> unbind_store() -> ... (skip) -> devres_release_all() -> kill_dax() -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) -> xfs_dax_notify_failure() Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove event. So do not shutdown filesystem directly if something not supported, or if failure range includes metadata area. Make sure all files and processes are handled correctly. == Changes since v6: 1. Rebase on 6.0-rc2 and Darrick's patch[2]. Changes since v5: 1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE 2. hold s_umount before sync_filesystem() 3. do sync_filesystem() after SB_BORN check 4. Rebased on next-20220714 [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/ [2]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/ Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> --- drivers/dax/super.c | 3 ++- fs/xfs/xfs_notify_failure.c | 15 +++++++++++++++ include/linux/mm.h | 1 + 3 files changed, 18 insertions(+), 1 deletion(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 9b5e2a5eb0ae..cf9a64563fbe 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev) return; if (dax_dev->holder_data != NULL) - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0); + dax_holder_notify_failure(dax_dev, 0, U64_MAX, + MF_MEM_PRE_REMOVE); clear_bit(DAXDEV_ALIVE, &dax_dev->flags); synchronize_srcu(&dax_srcu); diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c index 65d5eb20878e..a9769f17e998 100644 --- a/fs/xfs/xfs_notify_failure.c +++ b/fs/xfs/xfs_notify_failure.c @@ -77,6 +77,9 @@ xfs_dax_failure_fn( if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) || (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) { + /* Do not shutdown so early when device is to be removed */ + if (notify->mf_flags & MF_MEM_PRE_REMOVE) + return 0; notify->want_shutdown = true; return 0; } @@ -182,12 +185,22 @@ xfs_dax_notify_failure( struct xfs_mount *mp = dax_holder(dax_dev); u64 ddev_start; u64 ddev_end; + int error; if (!(mp->m_sb.sb_flags & SB_BORN)) { xfs_warn(mp, "filesystem is not ready for notify_failure()!"); return -EIO; } + if (mf_flags & MF_MEM_PRE_REMOVE) { + xfs_info(mp, "device is about to be removed!"); + down_write(&mp->m_super->s_umount); + error = sync_filesystem(mp->m_super); + up_write(&mp->m_super->s_umount); + if (error) + return error; + } + if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) { xfs_warn(mp, "notify_failure() not supported on realtime device!"); @@ -196,6 +209,8 @@ xfs_dax_notify_failure( if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev && mp->m_logdev_targp != mp->m_ddev_targp) { + if (mf_flags & MF_MEM_PRE_REMOVE) + return 0; xfs_err(mp, "ondisk log corrupt, shutting down fs!"); xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK); return -EFSCORRUPTED; diff --git a/include/linux/mm.h b/include/linux/mm.h index 982f2607180b..2c7c132e6512 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3176,6 +3176,7 @@ enum mf_flags { MF_UNPOISON = 1 << 4, MF_SW_SIMULATED = 1 << 5, MF_NO_RETRY = 1 << 6, + MF_MEM_PRE_REMOVE = 1 << 7, }; int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index, unsigned long count, int mf_flags); -- 2.37.2 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v8 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind 2022-08-26 13:03 [PATCH v7] " Shiyang Ruan @ 2022-09-02 10:35 ` Shiyang Ruan 2022-09-02 10:36 ` [PATCH 3/3] " Shiyang Ruan 0 siblings, 1 reply; 11+ messages in thread From: Shiyang Ruan @ 2022-09-02 10:35 UTC (permalink / raw) To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel Cc: djwong, dan.j.williams, david, hch, jane.chu Changes since v7: 1. Add P1 to fix calculation mistake 2. Add P2 to move drop_pagecache_sb() to super.c for xfs to use 3. P3: Add invalidate all mappings after sync. 4. P3: Set offset&len to be start&length of device when it is to be removed. 5. Rebase on 6.0-rc3 + Darrick's patch[1] + Dan's patch[2]. Changes since v6: 1. Rebase on 6.0-rc2 and Darrick's patch[1]. [1]: https://lore.kernel.org/linux-xfs/Yv5wIa2crHioYeRr@magnolia/ [2]: https://lore.kernel.org/linux-xfs/166153426798.2758201.15108211981034512993.stgit@dwillia2-xfh.jf.intel.com/ Shiyang Ruan (3): xfs: fix the calculation of length and end fs: move drop_pagecache_sb() for others to use mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind drivers/dax/super.c | 3 ++- fs/drop_caches.c | 33 --------------------------------- fs/super.c | 34 ++++++++++++++++++++++++++++++++++ fs/xfs/xfs_notify_failure.c | 31 +++++++++++++++++++++++++++---- include/linux/fs.h | 1 + include/linux/mm.h | 1 + 6 files changed, 65 insertions(+), 38 deletions(-) -- 2.37.2 ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind 2022-09-02 10:35 ` [PATCH v8 0/3] " Shiyang Ruan @ 2022-09-02 10:36 ` Shiyang Ruan 2022-09-14 18:15 ` Darrick J. Wong 2022-09-20 2:45 ` Dave Chinner 0 siblings, 2 replies; 11+ messages in thread From: Shiyang Ruan @ 2022-09-02 10:36 UTC (permalink / raw) To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel Cc: djwong, dan.j.williams, david, hch, jane.chu This patch is inspired by Dan's "mm, dax, pmem: Introduce dev_pagemap_failure()"[1]. With the help of dax_holder and ->notify_failure() mechanism, the pmem driver is able to ask filesystem (or mapped device) on it to unmap all files in use and notify processes who are using those files. Call trace: trigger unbind -> unbind_store() -> ... (skip) -> devres_release_all() # was pmem driver ->remove() in v1 -> kill_dax() -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) -> xfs_dax_notify_failure() Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove event. So do not shutdown filesystem directly if something not supported, or if failure range includes metadata area. Make sure all files and processes are handled correctly. [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/ Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> --- drivers/dax/super.c | 3 ++- fs/xfs/xfs_notify_failure.c | 23 +++++++++++++++++++++++ include/linux/mm.h | 1 + 3 files changed, 26 insertions(+), 1 deletion(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 9b5e2a5eb0ae..cf9a64563fbe 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev) return; if (dax_dev->holder_data != NULL) - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0); + dax_holder_notify_failure(dax_dev, 0, U64_MAX, + MF_MEM_PRE_REMOVE); clear_bit(DAXDEV_ALIVE, &dax_dev->flags); synchronize_srcu(&dax_srcu); diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c index 3830f908e215..5e04ba7fa403 100644 --- a/fs/xfs/xfs_notify_failure.c +++ b/fs/xfs/xfs_notify_failure.c @@ -22,6 +22,7 @@ #include <linux/mm.h> #include <linux/dax.h> +#include <linux/fs.h> struct xfs_failure_info { xfs_agblock_t startblock; @@ -77,6 +78,9 @@ xfs_dax_failure_fn( if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) || (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) { + /* The device is about to be removed. Not a really failure. */ + if (notify->mf_flags & MF_MEM_PRE_REMOVE) + return 0; notify->want_shutdown = true; return 0; } @@ -182,12 +186,23 @@ xfs_dax_notify_failure( struct xfs_mount *mp = dax_holder(dax_dev); u64 ddev_start; u64 ddev_end; + int error; if (!(mp->m_super->s_flags & SB_BORN)) { xfs_warn(mp, "filesystem is not ready for notify_failure()!"); return -EIO; } + if (mf_flags & MF_MEM_PRE_REMOVE) { + xfs_info(mp, "device is about to be removed!"); + down_write(&mp->m_super->s_umount); + error = sync_filesystem(mp->m_super); + drop_pagecache_sb(mp->m_super, NULL); + up_write(&mp->m_super->s_umount); + if (error) + return error; + } + if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) { xfs_debug(mp, "notify_failure() not supported on realtime device!"); @@ -196,6 +211,8 @@ xfs_dax_notify_failure( if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev && mp->m_logdev_targp != mp->m_ddev_targp) { + if (mf_flags & MF_MEM_PRE_REMOVE) + return 0; xfs_err(mp, "ondisk log corrupt, shutting down fs!"); xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK); return -EFSCORRUPTED; @@ -209,6 +226,12 @@ xfs_dax_notify_failure( ddev_start = mp->m_ddev_targp->bt_dax_part_off; ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1; + /* Notify failure on the whole device */ + if (offset == 0 && len == U64_MAX) { + offset = ddev_start; + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev); + } + /* Ignore the range out of filesystem area */ if (offset + len - 1 < ddev_start) return -ENXIO; diff --git a/include/linux/mm.h b/include/linux/mm.h index 21f8b27bd9fd..9122a1c57dd2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3183,6 +3183,7 @@ enum mf_flags { MF_UNPOISON = 1 << 4, MF_SW_SIMULATED = 1 << 5, MF_NO_RETRY = 1 << 6, + MF_MEM_PRE_REMOVE = 1 << 7, }; int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index, unsigned long count, int mf_flags); -- 2.37.2 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind 2022-09-02 10:36 ` [PATCH 3/3] " Shiyang Ruan @ 2022-09-14 18:15 ` Darrick J. Wong 2022-09-15 1:49 ` Shiyang Ruan 2022-09-20 2:45 ` Dave Chinner 1 sibling, 1 reply; 11+ messages in thread From: Darrick J. Wong @ 2022-09-14 18:15 UTC (permalink / raw) To: Shiyang Ruan Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel, dan.j.williams, david, hch, jane.chu On Fri, Sep 02, 2022 at 10:36:01AM +0000, Shiyang Ruan wrote: > This patch is inspired by Dan's "mm, dax, pmem: Introduce > dev_pagemap_failure()"[1]. With the help of dax_holder and > ->notify_failure() mechanism, the pmem driver is able to ask filesystem > (or mapped device) on it to unmap all files in use and notify processes > who are using those files. > > Call trace: > trigger unbind > -> unbind_store() > -> ... (skip) > -> devres_release_all() # was pmem driver ->remove() in v1 > -> kill_dax() > -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) > -> xfs_dax_notify_failure() > > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove > event. So do not shutdown filesystem directly if something not > supported, or if failure range includes metadata area. Make sure all > files and processes are handled correctly. > > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/ > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> > --- > drivers/dax/super.c | 3 ++- > fs/xfs/xfs_notify_failure.c | 23 +++++++++++++++++++++++ > include/linux/mm.h | 1 + > 3 files changed, 26 insertions(+), 1 deletion(-) > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c > index 9b5e2a5eb0ae..cf9a64563fbe 100644 > --- a/drivers/dax/super.c > +++ b/drivers/dax/super.c > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev) > return; > > if (dax_dev->holder_data != NULL) > - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0); > + dax_holder_notify_failure(dax_dev, 0, U64_MAX, > + MF_MEM_PRE_REMOVE); > > clear_bit(DAXDEV_ALIVE, &dax_dev->flags); > synchronize_srcu(&dax_srcu); > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c > index 3830f908e215..5e04ba7fa403 100644 > --- a/fs/xfs/xfs_notify_failure.c > +++ b/fs/xfs/xfs_notify_failure.c > @@ -22,6 +22,7 @@ > > #include <linux/mm.h> > #include <linux/dax.h> > +#include <linux/fs.h> > > struct xfs_failure_info { > xfs_agblock_t startblock; > @@ -77,6 +78,9 @@ xfs_dax_failure_fn( > > if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) || > (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) { > + /* The device is about to be removed. Not a really failure. */ > + if (notify->mf_flags & MF_MEM_PRE_REMOVE) > + return 0; > notify->want_shutdown = true; > return 0; > } > @@ -182,12 +186,23 @@ xfs_dax_notify_failure( > struct xfs_mount *mp = dax_holder(dax_dev); > u64 ddev_start; > u64 ddev_end; > + int error; > > if (!(mp->m_super->s_flags & SB_BORN)) { > xfs_warn(mp, "filesystem is not ready for notify_failure()!"); > return -EIO; > } > > + if (mf_flags & MF_MEM_PRE_REMOVE) { > + xfs_info(mp, "device is about to be removed!"); > + down_write(&mp->m_super->s_umount); > + error = sync_filesystem(mp->m_super); > + drop_pagecache_sb(mp->m_super, NULL); > + up_write(&mp->m_super->s_umount); > + if (error) > + return error; > + } > + > if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) { > xfs_debug(mp, > "notify_failure() not supported on realtime device!"); > @@ -196,6 +211,8 @@ xfs_dax_notify_failure( > > if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev && > mp->m_logdev_targp != mp->m_ddev_targp) { > + if (mf_flags & MF_MEM_PRE_REMOVE) > + return 0; > xfs_err(mp, "ondisk log corrupt, shutting down fs!"); > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK); > return -EFSCORRUPTED; > @@ -209,6 +226,12 @@ xfs_dax_notify_failure( > ddev_start = mp->m_ddev_targp->bt_dax_part_off; > ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1; > > + /* Notify failure on the whole device */ > + if (offset == 0 && len == U64_MAX) { > + offset = ddev_start; > + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev); > + } I wonder, won't the trimming code below take care of this? The rest of the patch looks ok to me. --D > + > /* Ignore the range out of filesystem area */ > if (offset + len - 1 < ddev_start) > return -ENXIO; > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 21f8b27bd9fd..9122a1c57dd2 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -3183,6 +3183,7 @@ enum mf_flags { > MF_UNPOISON = 1 << 4, > MF_SW_SIMULATED = 1 << 5, > MF_NO_RETRY = 1 << 6, > + MF_MEM_PRE_REMOVE = 1 << 7, > }; > int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index, > unsigned long count, int mf_flags); > -- > 2.37.2 > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind 2022-09-14 18:15 ` Darrick J. Wong @ 2022-09-15 1:49 ` Shiyang Ruan 0 siblings, 0 replies; 11+ messages in thread From: Shiyang Ruan @ 2022-09-15 1:49 UTC (permalink / raw) To: Darrick J. Wong Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel, dan.j.williams, david, hch, jane.chu 在 2022/9/15 2:15, Darrick J. Wong 写道: > On Fri, Sep 02, 2022 at 10:36:01AM +0000, Shiyang Ruan wrote: >> This patch is inspired by Dan's "mm, dax, pmem: Introduce >> dev_pagemap_failure()"[1]. With the help of dax_holder and >> ->notify_failure() mechanism, the pmem driver is able to ask filesystem >> (or mapped device) on it to unmap all files in use and notify processes >> who are using those files. >> >> Call trace: >> trigger unbind >> -> unbind_store() >> -> ... (skip) >> -> devres_release_all() # was pmem driver ->remove() in v1 >> -> kill_dax() >> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) >> -> xfs_dax_notify_failure() >> >> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove >> event. So do not shutdown filesystem directly if something not >> supported, or if failure range includes metadata area. Make sure all >> files and processes are handled correctly. >> >> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/ >> >> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> >> --- >> drivers/dax/super.c | 3 ++- >> fs/xfs/xfs_notify_failure.c | 23 +++++++++++++++++++++++ >> include/linux/mm.h | 1 + >> 3 files changed, 26 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/dax/super.c b/drivers/dax/super.c >> index 9b5e2a5eb0ae..cf9a64563fbe 100644 >> --- a/drivers/dax/super.c >> +++ b/drivers/dax/super.c >> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev) >> return; >> >> if (dax_dev->holder_data != NULL) >> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0); >> + dax_holder_notify_failure(dax_dev, 0, U64_MAX, >> + MF_MEM_PRE_REMOVE); >> >> clear_bit(DAXDEV_ALIVE, &dax_dev->flags); >> synchronize_srcu(&dax_srcu); >> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c >> index 3830f908e215..5e04ba7fa403 100644 >> --- a/fs/xfs/xfs_notify_failure.c >> +++ b/fs/xfs/xfs_notify_failure.c >> @@ -22,6 +22,7 @@ >> >> #include <linux/mm.h> >> #include <linux/dax.h> >> +#include <linux/fs.h> >> >> struct xfs_failure_info { >> xfs_agblock_t startblock; >> @@ -77,6 +78,9 @@ xfs_dax_failure_fn( >> >> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) || >> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) { >> + /* The device is about to be removed. Not a really failure. */ >> + if (notify->mf_flags & MF_MEM_PRE_REMOVE) >> + return 0; >> notify->want_shutdown = true; >> return 0; >> } >> @@ -182,12 +186,23 @@ xfs_dax_notify_failure( >> struct xfs_mount *mp = dax_holder(dax_dev); >> u64 ddev_start; >> u64 ddev_end; >> + int error; >> >> if (!(mp->m_super->s_flags & SB_BORN)) { >> xfs_warn(mp, "filesystem is not ready for notify_failure()!"); >> return -EIO; >> } >> >> + if (mf_flags & MF_MEM_PRE_REMOVE) { >> + xfs_info(mp, "device is about to be removed!"); >> + down_write(&mp->m_super->s_umount); >> + error = sync_filesystem(mp->m_super); >> + drop_pagecache_sb(mp->m_super, NULL); >> + up_write(&mp->m_super->s_umount); >> + if (error) >> + return error; >> + } >> + >> if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) { >> xfs_debug(mp, >> "notify_failure() not supported on realtime device!"); >> @@ -196,6 +211,8 @@ xfs_dax_notify_failure( >> >> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev && >> mp->m_logdev_targp != mp->m_ddev_targp) { >> + if (mf_flags & MF_MEM_PRE_REMOVE) >> + return 0; >> xfs_err(mp, "ondisk log corrupt, shutting down fs!"); >> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK); >> return -EFSCORRUPTED; >> @@ -209,6 +226,12 @@ xfs_dax_notify_failure( >> ddev_start = mp->m_ddev_targp->bt_dax_part_off; >> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1; >> >> + /* Notify failure on the whole device */ >> + if (offset == 0 && len == U64_MAX) { >> + offset = ddev_start; >> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev); >> + } > > I wonder, won't the trimming code below take care of this? The len is U64_MAX, so 'offset + len - 1' will overflow. That can't be handled correctly by the trimming code below. -- Thanks, Ruan. > > The rest of the patch looks ok to me. > > --D > >> + >> /* Ignore the range out of filesystem area */ >> if (offset + len - 1 < ddev_start) >> return -ENXIO; >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index 21f8b27bd9fd..9122a1c57dd2 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -3183,6 +3183,7 @@ enum mf_flags { >> MF_UNPOISON = 1 << 4, >> MF_SW_SIMULATED = 1 << 5, >> MF_NO_RETRY = 1 << 6, >> + MF_MEM_PRE_REMOVE = 1 << 7, >> }; >> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index, >> unsigned long count, int mf_flags); >> -- >> 2.37.2 >> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind 2022-09-02 10:36 ` [PATCH 3/3] " Shiyang Ruan 2022-09-14 18:15 ` Darrick J. Wong @ 2022-09-20 2:45 ` Dave Chinner 2022-09-21 0:58 ` Darrick J. Wong 1 sibling, 1 reply; 11+ messages in thread From: Dave Chinner @ 2022-09-20 2:45 UTC (permalink / raw) To: Shiyang Ruan Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel, djwong, dan.j.williams, hch, jane.chu On Fri, Sep 02, 2022 at 10:36:01AM +0000, Shiyang Ruan wrote: > This patch is inspired by Dan's "mm, dax, pmem: Introduce > dev_pagemap_failure()"[1]. With the help of dax_holder and > ->notify_failure() mechanism, the pmem driver is able to ask filesystem > (or mapped device) on it to unmap all files in use and notify processes > who are using those files. > > Call trace: > trigger unbind > -> unbind_store() > -> ... (skip) > -> devres_release_all() # was pmem driver ->remove() in v1 > -> kill_dax() > -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) > -> xfs_dax_notify_failure() > > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove > event. So do not shutdown filesystem directly if something not > supported, or if failure range includes metadata area. Make sure all > files and processes are handled correctly. > > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/ > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> > --- > drivers/dax/super.c | 3 ++- > fs/xfs/xfs_notify_failure.c | 23 +++++++++++++++++++++++ > include/linux/mm.h | 1 + > 3 files changed, 26 insertions(+), 1 deletion(-) > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c > index 9b5e2a5eb0ae..cf9a64563fbe 100644 > --- a/drivers/dax/super.c > +++ b/drivers/dax/super.c > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev) > return; > > if (dax_dev->holder_data != NULL) > - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0); > + dax_holder_notify_failure(dax_dev, 0, U64_MAX, > + MF_MEM_PRE_REMOVE); > > clear_bit(DAXDEV_ALIVE, &dax_dev->flags); > synchronize_srcu(&dax_srcu); > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c > index 3830f908e215..5e04ba7fa403 100644 > --- a/fs/xfs/xfs_notify_failure.c > +++ b/fs/xfs/xfs_notify_failure.c > @@ -22,6 +22,7 @@ > > #include <linux/mm.h> > #include <linux/dax.h> > +#include <linux/fs.h> > > struct xfs_failure_info { > xfs_agblock_t startblock; > @@ -77,6 +78,9 @@ xfs_dax_failure_fn( > > if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) || > (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) { > + /* The device is about to be removed. Not a really failure. */ > + if (notify->mf_flags & MF_MEM_PRE_REMOVE) > + return 0; > notify->want_shutdown = true; > return 0; > } > @@ -182,12 +186,23 @@ xfs_dax_notify_failure( > struct xfs_mount *mp = dax_holder(dax_dev); > u64 ddev_start; > u64 ddev_end; > + int error; > > if (!(mp->m_super->s_flags & SB_BORN)) { > xfs_warn(mp, "filesystem is not ready for notify_failure()!"); > return -EIO; > } > > + if (mf_flags & MF_MEM_PRE_REMOVE) { > + xfs_info(mp, "device is about to be removed!"); > + down_write(&mp->m_super->s_umount); > + error = sync_filesystem(mp->m_super); > + drop_pagecache_sb(mp->m_super, NULL); > + up_write(&mp->m_super->s_umount); > + if (error) > + return error; If the device is about to go away unexpectedly, shouldn't this shut down the filesystem after syncing it here? If the filesystem has been shut down, then everything will fail before removal finally triggers, and the act of unmounting the filesystem post device removal will clean up the page cache and all the other caches. IOWs, I don't understand why the page cache is considered special here (as opposed to, say, the inode or dentry caches), nor why we aren't shutting down the filesystem directly after syncing it to disk to ensure that we don't end up with applications losing data as a result of racing with the removal.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind 2022-09-20 2:45 ` Dave Chinner @ 2022-09-21 0:58 ` Darrick J. Wong 0 siblings, 0 replies; 11+ messages in thread From: Darrick J. Wong @ 2022-09-21 0:58 UTC (permalink / raw) To: Dave Chinner Cc: Shiyang Ruan, linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel, dan.j.williams, hch, jane.chu On Tue, Sep 20, 2022 at 12:45:19PM +1000, Dave Chinner wrote: > On Fri, Sep 02, 2022 at 10:36:01AM +0000, Shiyang Ruan wrote: > > This patch is inspired by Dan's "mm, dax, pmem: Introduce > > dev_pagemap_failure()"[1]. With the help of dax_holder and > > ->notify_failure() mechanism, the pmem driver is able to ask filesystem > > (or mapped device) on it to unmap all files in use and notify processes > > who are using those files. > > > > Call trace: > > trigger unbind > > -> unbind_store() > > -> ... (skip) > > -> devres_release_all() # was pmem driver ->remove() in v1 > > -> kill_dax() > > -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) > > -> xfs_dax_notify_failure() > > > > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove > > event. So do not shutdown filesystem directly if something not > > supported, or if failure range includes metadata area. Make sure all > > files and processes are handled correctly. > > > > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/ > > > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> > > --- > > drivers/dax/super.c | 3 ++- > > fs/xfs/xfs_notify_failure.c | 23 +++++++++++++++++++++++ > > include/linux/mm.h | 1 + > > 3 files changed, 26 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c > > index 9b5e2a5eb0ae..cf9a64563fbe 100644 > > --- a/drivers/dax/super.c > > +++ b/drivers/dax/super.c > > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev) > > return; > > > > if (dax_dev->holder_data != NULL) > > - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0); > > + dax_holder_notify_failure(dax_dev, 0, U64_MAX, > > + MF_MEM_PRE_REMOVE); > > > > clear_bit(DAXDEV_ALIVE, &dax_dev->flags); > > synchronize_srcu(&dax_srcu); > > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c > > index 3830f908e215..5e04ba7fa403 100644 > > --- a/fs/xfs/xfs_notify_failure.c > > +++ b/fs/xfs/xfs_notify_failure.c > > @@ -22,6 +22,7 @@ > > > > #include <linux/mm.h> > > #include <linux/dax.h> > > +#include <linux/fs.h> > > > > struct xfs_failure_info { > > xfs_agblock_t startblock; > > @@ -77,6 +78,9 @@ xfs_dax_failure_fn( > > > > if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) || > > (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) { > > + /* The device is about to be removed. Not a really failure. */ > > + if (notify->mf_flags & MF_MEM_PRE_REMOVE) > > + return 0; > > notify->want_shutdown = true; > > return 0; > > } > > @@ -182,12 +186,23 @@ xfs_dax_notify_failure( > > struct xfs_mount *mp = dax_holder(dax_dev); > > u64 ddev_start; > > u64 ddev_end; > > + int error; > > > > if (!(mp->m_super->s_flags & SB_BORN)) { > > xfs_warn(mp, "filesystem is not ready for notify_failure()!"); > > return -EIO; > > } > > > > + if (mf_flags & MF_MEM_PRE_REMOVE) { > > + xfs_info(mp, "device is about to be removed!"); > > + down_write(&mp->m_super->s_umount); > > + error = sync_filesystem(mp->m_super); > > + drop_pagecache_sb(mp->m_super, NULL); > > + up_write(&mp->m_super->s_umount); > > + if (error) > > + return error; > > If the device is about to go away unexpectedly, shouldn't this shut > down the filesystem after syncing it here? If the filesystem has > been shut down, then everything will fail before removal finally > triggers, and the act of unmounting the filesystem post device > removal will clean up the page cache and all the other caches. IIRC they want to kill all the processes with MAP_SYNC mappings sooner than whenever the admin gets around to unmounting the filesystem, which is why PRE_REMOVE will then go walk the rmapbt to find processes to shoot down. I'm not sure, though, if drop_pagecache_sb only touches DRAM page cache or if it'll shoot down fsdax mappings too? > IOWs, I don't understand why the page cache is considered special > here (as opposed to, say, the inode or dentry caches), nor why we > aren't shutting down the filesystem directly after syncing it to > disk to ensure that we don't end up with applications losing data as > a result of racing with the removal.... But yeah, we might as well shut down the fs at the end of PRE_REMOVE handling, if the rmap walk hasn't already done that. --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2022-10-13 10:08 UTC | newest] Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-09-25 13:33 [PATCH v9 0/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan 2022-09-25 13:33 ` [PATCH 1/3] xfs: fix the calculation of length and end Shiyang Ruan 2022-09-25 13:33 ` [PATCH 2/3] fs: move drop_pagecache_sb() for others to use Shiyang Ruan 2022-09-25 13:33 ` [PATCH 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan 2022-09-30 3:28 ` [PATCH v9 0/3] " Shiyang Ruan 2022-10-13 10:07 ` Shiyang Ruan -- strict thread matches above, loose matches on Subject: below -- 2022-08-26 13:03 [PATCH v7] " Shiyang Ruan 2022-09-02 10:35 ` [PATCH v8 0/3] " Shiyang Ruan 2022-09-02 10:36 ` [PATCH 3/3] " Shiyang Ruan 2022-09-14 18:15 ` Darrick J. Wong 2022-09-15 1:49 ` Shiyang Ruan 2022-09-20 2:45 ` Dave Chinner 2022-09-21 0:58 ` Darrick J. Wong
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).