linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
@ 2022-05-08 14:36 Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v14 01/07] dax: Introduce holder for dax_device Shiyang Ruan
                   ` (16 more replies)
  0 siblings, 17 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-08 14:36 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe

This is a combination of two patchsets:
 1.fsdax-rmap: https://lore.kernel.org/linux-xfs/20220419045045.1664996-1-ruansy.fnst@fujitsu.com/
 2.fsdax-reflink: https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.fnst@fujitsu.com/

 Changes since v13 of fsdax-rmap:
  1. Fixed mistakes during rebasing code to latest next-
  2. Rebased to next-20220504

 Changes since v10 of fsdax-reflink:
  1. Rebased to next-20220504 and fsdax-rmap
  2. Dropped a needless cleanup patch: 'fsdax: Convert dax_iomap_zero to
      iter model'
  3. Fixed many conflicts during rebasing
  4. Fixed a dedupe bug in Patch 05: the actuall length to compare could be
      shorter than smap->length or dmap->length.
  PS: There are many changes during rebasing.  I think it's better to
      review again.

==
Shiyang Ruan (14):
  fsdax-rmap:
    dax: Introduce holder for dax_device
    mm: factor helpers for memory_failure_dev_pagemap
    pagemap,pmem: Introduce ->memory_failure()
    fsdax: Introduce dax_lock_mapping_entry()
    mm: Introduce mf_dax_kill_procs() for fsdax case
    xfs: Implement ->notify_failure() for XFS
    fsdax: set a CoW flag when associate reflink mappings
  fsdax-reflink:
    fsdax: Output address in dax_iomap_pfn() and rename it
    fsdax: Introduce dax_iomap_cow_copy()
    fsdax: Replace mmap entry in case of CoW
    fsdax: Add dax_iomap_cow_copy() for dax zero
    fsdax: Dedup file range to use a compare function
    xfs: support CoW in fsdax mode
    xfs: Add dax dedupe support

 drivers/dax/super.c         |  67 +++++-
 drivers/md/dm.c             |   2 +-
 drivers/nvdimm/pmem.c       |  17 ++
 fs/dax.c                    | 398 ++++++++++++++++++++++++++++++------
 fs/erofs/super.c            |  13 +-
 fs/ext2/super.c             |   7 +-
 fs/ext4/super.c             |   9 +-
 fs/remap_range.c            |  31 ++-
 fs/xfs/Makefile             |   5 +
 fs/xfs/xfs_buf.c            |  10 +-
 fs/xfs/xfs_file.c           |   9 +-
 fs/xfs/xfs_fsops.c          |   3 +
 fs/xfs/xfs_inode.c          |  69 ++++++-
 fs/xfs/xfs_inode.h          |   1 +
 fs/xfs/xfs_iomap.c          |  46 ++++-
 fs/xfs/xfs_iomap.h          |   3 +
 fs/xfs/xfs_mount.h          |   1 +
 fs/xfs/xfs_notify_failure.c | 220 ++++++++++++++++++++
 fs/xfs/xfs_reflink.c        |  12 +-
 fs/xfs/xfs_super.h          |   1 +
 include/linux/dax.h         |  56 ++++-
 include/linux/fs.h          |  12 +-
 include/linux/memremap.h    |  12 ++
 include/linux/mm.h          |   2 +
 include/linux/page-flags.h  |   6 +
 mm/memory-failure.c         | 257 ++++++++++++++++-------
 26 files changed, 1087 insertions(+), 182 deletions(-)
 create mode 100644 fs/xfs/xfs_notify_failure.c

-- 
2.35.1




^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v14 01/07] dax: Introduce holder for dax_device
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
@ 2022-05-08 14:36 ` Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v14 02/07] mm: factor helpers for memory_failure_dev_pagemap Shiyang Ruan
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-08 14:36 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe, Christoph Hellwig,
	Dan Williams

To easily track filesystem from a pmem device, we introduce a holder for
dax_device structure, and also its operation.  This holder is used to
remember who is using this dax_device:
 - When it is the backend of a filesystem, the holder will be the
   instance of this filesystem.
 - When this pmem device is one of the targets in a mapped device, the
   holder will be this mapped device.  In this case, the mapped device
   has its own dax_device and it will follow the first rule.  So that we
   can finally track to the filesystem we needed.

The holder and holder_ops will be set when filesystem is being mounted,
or an target device is being activated.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.wiliams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 drivers/dax/super.c | 67 ++++++++++++++++++++++++++++++++++++++++++++-
 drivers/md/dm.c     |  2 +-
 fs/erofs/super.c    | 13 +++++----
 fs/ext2/super.c     |  7 +++--
 fs/ext4/super.c     |  9 +++---
 fs/xfs/xfs_buf.c    |  5 ++--
 include/linux/dax.h | 33 ++++++++++++++++------
 7 files changed, 112 insertions(+), 24 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0211e6f7b47a..5ddb159c4653 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -22,6 +22,8 @@
  * @private: dax driver private data
  * @flags: state and boolean properties
  * @ops: operations for this device
+ * @holder_data: holder of a dax_device: could be filesystem or mapped device
+ * @holder_ops: operations for the inner holder
  */
 struct dax_device {
 	struct inode inode;
@@ -29,6 +31,8 @@ struct dax_device {
 	void *private;
 	unsigned long flags;
 	const struct dax_operations *ops;
+	void *holder_data;
+	const struct dax_holder_operations *holder_ops;
 };
 
 static dev_t dax_devt;
@@ -71,8 +75,11 @@ EXPORT_SYMBOL_GPL(dax_remove_host);
  * fs_dax_get_by_bdev() - temporary lookup mechanism for filesystem-dax
  * @bdev: block device to find a dax_device for
  * @start_off: returns the byte offset into the dax_device that @bdev starts
+ * @holder: filesystem or mapped device inside the dax_device
+ * @ops: operations for the inner holder
  */
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev, u64 *start_off)
+struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev, u64 *start_off,
+		void *holder, const struct dax_holder_operations *ops)
 {
 	struct dax_device *dax_dev;
 	u64 part_size;
@@ -92,11 +99,26 @@ struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev, u64 *start_off)
 	dax_dev = xa_load(&dax_hosts, (unsigned long)bdev->bd_disk);
 	if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode))
 		dax_dev = NULL;
+	else if (holder) {
+		if (!cmpxchg(&dax_dev->holder_data, NULL, holder))
+			dax_dev->holder_ops = ops;
+		else
+			dax_dev = NULL;
+	}
 	dax_read_unlock(id);
 
 	return dax_dev;
 }
 EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
+
+void fs_put_dax(struct dax_device *dax_dev, void *holder)
+{
+	if (dax_dev && holder &&
+	    cmpxchg(&dax_dev->holder_data, holder, NULL) == holder)
+		dax_dev->holder_ops = NULL;
+	put_dax(dax_dev);
+}
+EXPORT_SYMBOL_GPL(fs_put_dax);
 #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
 
 enum dax_device_flags {
@@ -194,6 +216,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 }
 EXPORT_SYMBOL_GPL(dax_zero_page_range);
 
+int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
+			      u64 len, int mf_flags)
+{
+	int rc, id;
+
+	id = dax_read_lock();
+	if (!dax_alive(dax_dev)) {
+		rc = -ENXIO;
+		goto out;
+	}
+
+	if (!dax_dev->holder_ops) {
+		rc = -EOPNOTSUPP;
+		goto out;
+	}
+
+	rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
+out:
+	dax_read_unlock(id);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
+
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
@@ -267,8 +312,15 @@ void kill_dax(struct dax_device *dax_dev)
 	if (!dax_dev)
 		return;
 
+	if (dax_dev->holder_data != NULL)
+		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+
 	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
 	synchronize_srcu(&dax_srcu);
+
+	/* clear holder data */
+	dax_dev->holder_ops = NULL;
+	dax_dev->holder_data = NULL;
 }
 EXPORT_SYMBOL_GPL(kill_dax);
 
@@ -410,6 +462,19 @@ void put_dax(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(put_dax);
 
+/**
+ * dax_holder() - obtain the holder of a dax device
+ * @dax_dev: a dax_device instance
+
+ * Return: the holder's data which represents the holder if registered,
+ * otherwize NULL.
+ */
+void *dax_holder(struct dax_device *dax_dev)
+{
+	return dax_dev->holder_data;
+}
+EXPORT_SYMBOL_GPL(dax_holder);
+
 /**
  * inode_dax: convert a public inode into its dax_dev
  * @inode: An inode with i_cdev pointing to a dax_dev
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 9650ba2075b8..953d8b0ced84 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -755,7 +755,7 @@ static int open_table_device(struct table_device *td, dev_t dev,
 	}
 
 	td->dm_dev.bdev = bdev;
-	td->dm_dev.dax_dev = fs_dax_get_by_bdev(bdev, &part_off);
+	td->dm_dev.dax_dev = fs_dax_get_by_bdev(bdev, &part_off, NULL, NULL);
 	return 0;
 }
 
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 37b39ec9c9c2..f112ae04f1e0 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -274,8 +274,8 @@ static int erofs_init_devices(struct super_block *sb,
 				break;
 			}
 			dif->bdev = bdev;
-			dif->dax_dev = fs_dax_get_by_bdev(bdev,
-							  &dif->dax_part_off);
+			dif->dax_dev = fs_dax_get_by_bdev(bdev, &dif->dax_part_off,
+							  NULL, NULL);
 		}
 
 		dif->blocks = le32_to_cpu(dis->blocks);
@@ -658,6 +658,8 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
 	sb->s_fs_info = sbi;
 	sbi->opt = ctx->opt;
 	ctx->opt.fsid = NULL;
+	sbi->dax_dev = fs_dax_get_by_bdev(sb->s_bdev, &sbi->dax_part_off,
+					  NULL, NULL);
 	sbi->devs = ctx->devs;
 	ctx->devs = NULL;
 
@@ -684,7 +686,8 @@ static int erofs_fc_fill_super(struct super_block *sb, struct fs_context *fc)
 		}
 
 		sbi->dax_dev = fs_dax_get_by_bdev(sb->s_bdev,
-						  &sbi->dax_part_off);
+						  &sbi->dax_part_off,
+						  NULL, NULL);
 	}
 
 	err = erofs_read_superblock(sb);
@@ -776,7 +779,7 @@ static int erofs_release_device_info(int id, void *ptr, void *data)
 {
 	struct erofs_device_info *dif = ptr;
 
-	fs_put_dax(dif->dax_dev);
+	fs_put_dax(dif->dax_dev, NULL);
 	if (dif->bdev)
 		blkdev_put(dif->bdev, FMODE_READ | FMODE_EXCL);
 	erofs_fscache_unregister_cookie(&dif->fscache);
@@ -850,7 +853,7 @@ static void erofs_kill_sb(struct super_block *sb)
 		return;
 
 	erofs_free_dev_context(sbi->devs);
-	fs_put_dax(sbi->dax_dev);
+	fs_put_dax(sbi->dax_dev, NULL);
 	erofs_fscache_unregister_cookie(&sbi->s_fscache);
 	erofs_fscache_unregister_fs(sb);
 	kfree(sbi->opt.fsid);
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index f6a19f6d9f6d..4638946251b9 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -171,7 +171,7 @@ static void ext2_put_super (struct super_block * sb)
 	brelse (sbi->s_sbh);
 	sb->s_fs_info = NULL;
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_put_dax(sbi->s_daxdev, NULL);
 	kfree(sbi);
 }
 
@@ -835,7 +835,8 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 	}
 	sb->s_fs_info = sbi;
 	sbi->s_sb_block = sb_block;
-	sbi->s_daxdev = fs_dax_get_by_bdev(sb->s_bdev, &sbi->s_dax_part_off);
+	sbi->s_daxdev = fs_dax_get_by_bdev(sb->s_bdev, &sbi->s_dax_part_off,
+					   NULL, NULL);
 
 	spin_lock_init(&sbi->s_lock);
 	ret = -EINVAL;
@@ -1204,7 +1205,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 failed_mount:
 	brelse(bh);
 failed_sbi:
-	fs_put_dax(sbi->s_daxdev);
+	fs_put_dax(sbi->s_daxdev, NULL);
 	sb->s_fs_info = NULL;
 	kfree(sbi->s_blockgroup_lock);
 	kfree(sbi);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6900da973ce2..801ba7f6b088 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1304,7 +1304,7 @@ static void ext4_put_super(struct super_block *sb)
 	if (sbi->s_chksum_driver)
 		crypto_free_shash(sbi->s_chksum_driver);
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_put_dax(sbi->s_daxdev, NULL);
 	fscrypt_free_dummy_policy(&sbi->s_dummy_enc_policy);
 #if IS_ENABLED(CONFIG_UNICODE)
 	utf8_unload(sb->s_encoding);
@@ -4357,7 +4357,7 @@ static void ext4_free_sbi(struct ext4_sb_info *sbi)
 		return;
 
 	kfree(sbi->s_blockgroup_lock);
-	fs_put_dax(sbi->s_daxdev);
+	fs_put_dax(sbi->s_daxdev, NULL);
 	kfree(sbi);
 }
 
@@ -4369,7 +4369,8 @@ static struct ext4_sb_info *ext4_alloc_sbi(struct super_block *sb)
 	if (!sbi)
 		return NULL;
 
-	sbi->s_daxdev = fs_dax_get_by_bdev(sb->s_bdev, &sbi->s_dax_part_off);
+	sbi->s_daxdev = fs_dax_get_by_bdev(sb->s_bdev, &sbi->s_dax_part_off,
+					   NULL, NULL);
 
 	sbi->s_blockgroup_lock =
 		kzalloc(sizeof(struct blockgroup_lock), GFP_KERNEL);
@@ -4381,7 +4382,7 @@ static struct ext4_sb_info *ext4_alloc_sbi(struct super_block *sb)
 	sbi->s_sb = sb;
 	return sbi;
 err_out:
-	fs_put_dax(sbi->s_daxdev);
+	fs_put_dax(sbi->s_daxdev, NULL);
 	kfree(sbi);
 	return NULL;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index bf4e60871068..63b8d9b5096f 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1911,7 +1911,7 @@ xfs_free_buftarg(
 	list_lru_destroy(&btp->bt_lru);
 
 	blkdev_issue_flush(btp->bt_bdev);
-	fs_put_dax(btp->bt_daxdev);
+	fs_put_dax(btp->bt_daxdev, NULL);
 
 	kmem_free(btp);
 }
@@ -1964,7 +1964,8 @@ xfs_alloc_buftarg(
 	btp->bt_mount = mp;
 	btp->bt_dev =  bdev->bd_dev;
 	btp->bt_bdev = bdev;
-	btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off);
+	btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off, NULL,
+					    NULL);
 
 	/*
 	 * Buffer IO error rate limiting. Limit it to no more than 10 messages
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9fc5f99a0ae2..9c426a207ba8 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -32,8 +32,21 @@ struct dax_operations {
 	int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
 };
 
+struct dax_holder_operations {
+	/*
+	 * notify_failure - notify memory failure into inner holder device
+	 * @dax_dev: the dax device which contains the holder
+	 * @offset: offset on this dax device where memory failure occurs
+	 * @len: length of this memory failure event
+	 * @flags: action flags for memory failure handler
+	 */
+	int (*notify_failure)(struct dax_device *dax_dev, u64 offset,
+			u64 len, int mf_flags);
+};
+
 #if IS_ENABLED(CONFIG_DAX)
 struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
+void *dax_holder(struct dax_device *dax_dev);
 void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void dax_write_cache(struct dax_device *dax_dev, bool wc);
@@ -53,6 +66,10 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
 	return dax_synchronous(dax_dev);
 }
 #else
+static inline void *dax_holder(struct dax_device *dax_dev)
+{
+	return NULL;
+}
 static inline struct dax_device *alloc_dax(void *private,
 		const struct dax_operations *ops)
 {
@@ -96,12 +113,9 @@ struct writeback_control;
 #if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX)
 int dax_add_host(struct dax_device *dax_dev, struct gendisk *disk);
 void dax_remove_host(struct gendisk *disk);
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev,
-		u64 *start_off);
-static inline void fs_put_dax(struct dax_device *dax_dev)
-{
-	put_dax(dax_dev);
-}
+struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev, u64 *start_off,
+		void *holder, const struct dax_holder_operations *ops);
+void fs_put_dax(struct dax_device *dax_dev, void *holder);
 #else
 static inline int dax_add_host(struct dax_device *dax_dev, struct gendisk *disk)
 {
@@ -111,11 +125,12 @@ static inline void dax_remove_host(struct gendisk *disk)
 {
 }
 static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev,
-		u64 *start_off)
+		u64 *start_off, void *holder,
+		const struct dax_holder_operations *ops)
 {
 	return NULL;
 }
-static inline void fs_put_dax(struct dax_device *dax_dev)
+static inline void fs_put_dax(struct dax_device *dax_dev, void *holder)
 {
 }
 #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
@@ -185,6 +200,8 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 		size_t bytes, struct iov_iter *i);
 int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 			size_t nr_pages);
+int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off, u64 len,
+		int mf_flags);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
-- 
2.35.1




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v14 02/07] mm: factor helpers for memory_failure_dev_pagemap
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v14 01/07] dax: Introduce holder for dax_device Shiyang Ruan
@ 2022-05-08 14:36 ` Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v14 03/07] pagemap,pmem: Introduce ->memory_failure() Shiyang Ruan
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-08 14:36 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe, Christoph Hellwig

memory_failure_dev_pagemap code is a bit complex before introduce RMAP
feature for fsdax.  So it is needed to factor some helper functions to
simplify these code.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/memory-failure.c | 159 ++++++++++++++++++++++++--------------------
 1 file changed, 88 insertions(+), 71 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d9343cb28a58..9e4728103c0d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1495,6 +1495,91 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
 	return 0;
 }
 
+static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
+		struct address_space *mapping, pgoff_t index, int flags)
+{
+	struct to_kill *tk;
+	unsigned long size = 0;
+
+	list_for_each_entry(tk, to_kill, nd)
+		if (tk->size_shift)
+			size = max(size, 1UL << tk->size_shift);
+
+	if (size) {
+		/*
+		 * Unmap the largest mapping to avoid breaking up device-dax
+		 * mappings which are constant size. The actual size of the
+		 * mapping being torn down is communicated in siginfo, see
+		 * kill_proc()
+		 */
+		loff_t start = (index << PAGE_SHIFT) & ~(size - 1);
+
+		unmap_mapping_range(mapping, start, size, 0);
+	}
+
+	kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
+}
+
+static int mf_generic_kill_procs(unsigned long long pfn, int flags,
+		struct dev_pagemap *pgmap)
+{
+	struct page *page = pfn_to_page(pfn);
+	LIST_HEAD(to_kill);
+	dax_entry_t cookie;
+	int rc = 0;
+
+	/*
+	 * Pages instantiated by device-dax (not filesystem-dax)
+	 * may be compound pages.
+	 */
+	page = compound_head(page);
+
+	/*
+	 * Prevent the inode from being freed while we are interrogating
+	 * the address_space, typically this would be handled by
+	 * lock_page(), but dax pages do not use the page lock. This
+	 * also prevents changes to the mapping of this pfn until
+	 * poison signaling is complete.
+	 */
+	cookie = dax_lock_page(page);
+	if (!cookie)
+		return -EBUSY;
+
+	if (hwpoison_filter(page)) {
+		rc = -EOPNOTSUPP;
+		goto unlock;
+	}
+
+	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+		/*
+		 * TODO: Handle HMM pages which may need coordination
+		 * with device-side memory.
+		 */
+		rc = -EBUSY;
+		goto unlock;
+	}
+
+	/*
+	 * Use this flag as an indication that the dax page has been
+	 * remapped UC to prevent speculative consumption of poison.
+	 */
+	SetPageHWPoison(page);
+
+	/*
+	 * Unlike System-RAM there is no possibility to swap in a
+	 * different physical page at a given virtual address, so all
+	 * userspace consumption of ZONE_DEVICE memory necessitates
+	 * SIGBUS (i.e. MF_MUST_KILL)
+	 */
+	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+	collect_procs(page, &to_kill, true);
+
+	unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
+unlock:
+	dax_unlock_page(page, cookie);
+	return rc;
+}
+
 /*
  * Called from hugetlb code with hugetlb_lock held.
  *
@@ -1643,12 +1728,7 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		struct dev_pagemap *pgmap)
 {
 	struct page *page = pfn_to_page(pfn);
-	unsigned long size = 0;
-	struct to_kill *tk;
-	LIST_HEAD(tokill);
-	int rc = -EBUSY;
-	loff_t start;
-	dax_entry_t cookie;
+	int rc = -ENXIO;
 
 	if (flags & MF_COUNT_INCREASED)
 		/*
@@ -1657,73 +1737,10 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		put_page(page);
 
 	/* device metadata space is not recoverable */
-	if (!pgmap_pfn_valid(pgmap, pfn)) {
-		rc = -ENXIO;
-		goto out;
-	}
-
-	/*
-	 * Pages instantiated by device-dax (not filesystem-dax)
-	 * may be compound pages.
-	 */
-	page = compound_head(page);
-
-	/*
-	 * Prevent the inode from being freed while we are interrogating
-	 * the address_space, typically this would be handled by
-	 * lock_page(), but dax pages do not use the page lock. This
-	 * also prevents changes to the mapping of this pfn until
-	 * poison signaling is complete.
-	 */
-	cookie = dax_lock_page(page);
-	if (!cookie)
+	if (!pgmap_pfn_valid(pgmap, pfn))
 		goto out;
 
-	if (hwpoison_filter(page)) {
-		rc = -EOPNOTSUPP;
-		goto unlock;
-	}
-
-	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-		/*
-		 * TODO: Handle HMM pages which may need coordination
-		 * with device-side memory.
-		 */
-		goto unlock;
-	}
-
-	/*
-	 * Use this flag as an indication that the dax page has been
-	 * remapped UC to prevent speculative consumption of poison.
-	 */
-	SetPageHWPoison(page);
-
-	/*
-	 * Unlike System-RAM there is no possibility to swap in a
-	 * different physical page at a given virtual address, so all
-	 * userspace consumption of ZONE_DEVICE memory necessitates
-	 * SIGBUS (i.e. MF_MUST_KILL)
-	 */
-	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
-	collect_procs(page, &tokill, true);
-
-	list_for_each_entry(tk, &tokill, nd)
-		if (tk->size_shift)
-			size = max(size, 1UL << tk->size_shift);
-	if (size) {
-		/*
-		 * Unmap the largest mapping to avoid breaking up
-		 * device-dax mappings which are constant size. The
-		 * actual size of the mapping being torn down is
-		 * communicated in siginfo, see kill_proc()
-		 */
-		start = (page->index << PAGE_SHIFT) & ~(size - 1);
-		unmap_mapping_range(page->mapping, start, size, 0);
-	}
-	kill_procs(&tokill, true, false, pfn, flags);
-	rc = 0;
-unlock:
-	dax_unlock_page(page, cookie);
+	rc = mf_generic_kill_procs(pfn, flags, pgmap);
 out:
 	/* drop pgmap ref acquired in caller */
 	put_dev_pagemap(pgmap);
-- 
2.35.1




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v14 03/07] pagemap,pmem: Introduce ->memory_failure()
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v14 01/07] dax: Introduce holder for dax_device Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v14 02/07] mm: factor helpers for memory_failure_dev_pagemap Shiyang Ruan
@ 2022-05-08 14:36 ` Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v14 04/07] fsdax: Introduce dax_lock_mapping_entry() Shiyang Ruan
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-08 14:36 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe, Christoph Hellwig

When memory-failure occurs, we call this function which is implemented
by each kind of devices.  For the fsdax case, pmem device driver
implements it.  Pmem device driver will find out the filesystem in which
the corrupted page located in.

With dax_holder notify support, we are able to notify the memory failure
from pmem driver to upper layers.  If there is something not support in
the notify routine, memory_failure will fall back to the generic hanlder.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 drivers/nvdimm/pmem.c    | 17 +++++++++++++++++
 include/linux/memremap.h | 12 ++++++++++++
 mm/memory-failure.c      | 14 ++++++++++++++
 3 files changed, 43 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 58d95242a836..bd502957cfdf 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -366,6 +366,21 @@ static void pmem_release_disk(void *__pmem)
 	blk_cleanup_disk(pmem->disk);
 }
 
+static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
+		unsigned long pfn, unsigned long nr_pages, int mf_flags)
+{
+	struct pmem_device *pmem =
+			container_of(pgmap, struct pmem_device, pgmap);
+	u64 offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;
+	u64 len = nr_pages << PAGE_SHIFT;
+
+	return dax_holder_notify_failure(pmem->dax_dev, offset, len, mf_flags);
+}
+
+static const struct dev_pagemap_ops fsdax_pagemap_ops = {
+	.memory_failure		= pmem_pagemap_memory_failure,
+};
+
 static int pmem_attach_disk(struct device *dev,
 		struct nd_namespace_common *ndns)
 {
@@ -427,6 +442,7 @@ static int pmem_attach_disk(struct device *dev,
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
 		pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+		pmem->pgmap.ops = &fsdax_pagemap_ops;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
 		pfn_sb = nd_pfn->pfn_sb;
 		pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
@@ -440,6 +456,7 @@ static int pmem_attach_disk(struct device *dev,
 		pmem->pgmap.range.end = res->end;
 		pmem->pgmap.nr_range = 1;
 		pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+		pmem->pgmap.ops = &fsdax_pagemap_ops;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
 		pmem->pfn_flags |= PFN_MAP;
 		bb_range = pmem->pgmap.range;
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 8af304f6b504..f1d413ef7c0d 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -79,6 +79,18 @@ struct dev_pagemap_ops {
 	 * the page back to a CPU accessible page.
 	 */
 	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+
+	/*
+	 * Handle the memory failure happens on a range of pfns.  Notify the
+	 * processes who are using these pfns, and try to recover the data on
+	 * them if necessary.  The mf_flags is finally passed to the recover
+	 * function through the whole notify routine.
+	 *
+	 * When this is not implemented, or it returns -EOPNOTSUPP, the caller
+	 * will fall back to a common handler called mf_generic_kill_procs().
+	 */
+	int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
+			      unsigned long nr_pages, int mf_flags);
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 9e4728103c0d..aeb19593af9c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1740,6 +1740,20 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	if (!pgmap_pfn_valid(pgmap, pfn))
 		goto out;
 
+	/*
+	 * Call driver's implementation to handle the memory failure, otherwise
+	 * fall back to generic handler.
+	 */
+	if (pgmap->ops->memory_failure) {
+		rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags);
+		/*
+		 * Fall back to generic handler too if operation is not
+		 * supported inside the driver/device/filesystem.
+		 */
+		if (rc != -EOPNOTSUPP)
+			goto out;
+	}
+
 	rc = mf_generic_kill_procs(pfn, flags, pgmap);
 out:
 	/* drop pgmap ref acquired in caller */
-- 
2.35.1




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v14 04/07] fsdax: Introduce dax_lock_mapping_entry()
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
                   ` (2 preceding siblings ...)
  2022-05-08 14:36 ` [PATCH v14 03/07] pagemap,pmem: Introduce ->memory_failure() Shiyang Ruan
@ 2022-05-08 14:36 ` Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v14 05/07] mm: Introduce mf_dax_kill_procs() for fsdax case Shiyang Ruan
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-08 14:36 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe, Christoph Hellwig

The current dax_lock_page() locks dax entry by obtaining mapping and
index in page.  To support 1-to-N RMAP in NVDIMM, we need a new function
to lock a specific dax entry corresponding to this file's mapping,index.
And output the page corresponding to the specific dax entry for caller
use.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/dax.c            | 63 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h | 15 +++++++++++
 2 files changed, 78 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 1ac12e877f4f..57efd3f73655 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -455,6 +455,69 @@ void dax_unlock_page(struct page *page, dax_entry_t cookie)
 	dax_unlock_entry(&xas, (void *)cookie);
 }
 
+/*
+ * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping
+ * @mapping: the file's mapping whose entry we want to lock
+ * @index: the offset within this file
+ * @page: output the dax page corresponding to this dax entry
+ *
+ * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry
+ * could not be locked.
+ */
+dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t index,
+		struct page **page)
+{
+	XA_STATE(xas, NULL, 0);
+	void *entry;
+
+	rcu_read_lock();
+	for (;;) {
+		entry = NULL;
+		if (!dax_mapping(mapping))
+			break;
+
+		xas.xa = &mapping->i_pages;
+		xas_lock_irq(&xas);
+		xas_set(&xas, index);
+		entry = xas_load(&xas);
+		if (dax_is_locked(entry)) {
+			rcu_read_unlock();
+			wait_entry_unlocked(&xas, entry);
+			rcu_read_lock();
+			continue;
+		}
+		if (!entry ||
+		    dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+			/*
+			 * Because we are looking for entry from file's mapping
+			 * and index, so the entry may not be inserted for now,
+			 * or even a zero/empty entry.  We don't think this is
+			 * an error case.  So, return a special value and do
+			 * not output @page.
+			 */
+			entry = (void *)~0UL;
+		} else {
+			*page = pfn_to_page(dax_to_pfn(entry));
+			dax_lock_entry(&xas, entry);
+		}
+		xas_unlock_irq(&xas);
+		break;
+	}
+	rcu_read_unlock();
+	return (dax_entry_t)entry;
+}
+
+void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index,
+		dax_entry_t cookie)
+{
+	XA_STATE(xas, &mapping->i_pages, index);
+
+	if (cookie == ~0UL)
+		return;
+
+	dax_unlock_entry(&xas, (void *)cookie);
+}
+
 /*
  * Find page cache entry at given index. If it is a DAX entry, return it
  * with the entry locked. If the page cache doesn't contain an entry at
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9c426a207ba8..c152f315d1c9 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -143,6 +143,10 @@ struct page *dax_layout_busy_page(struct address_space *mapping);
 struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
+dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
+		unsigned long index, struct page **page);
+void dax_unlock_mapping_entry(struct address_space *mapping,
+		unsigned long index, dax_entry_t cookie);
 #else
 static inline struct page *dax_layout_busy_page(struct address_space *mapping)
 {
@@ -170,6 +174,17 @@ static inline dax_entry_t dax_lock_page(struct page *page)
 static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
 {
 }
+
+static inline dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
+		unsigned long index, struct page **page)
+{
+	return 0;
+}
+
+static inline void dax_unlock_mapping_entry(struct address_space *mapping,
+		unsigned long index, dax_entry_t cookie)
+{
+}
 #endif
 
 int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
-- 
2.35.1




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v14 05/07] mm: Introduce mf_dax_kill_procs() for fsdax case
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
                   ` (3 preceding siblings ...)
  2022-05-08 14:36 ` [PATCH v14 04/07] fsdax: Introduce dax_lock_mapping_entry() Shiyang Ruan
@ 2022-05-08 14:36 ` Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v14 06/07] xfs: Implement ->notify_failure() for XFS Shiyang Ruan
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-08 14:36 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe, Christoph Hellwig

This new function is a variant of mf_generic_kill_procs that accepts a
file, offset pair instead of a struct to support multiple files sharing
a DAX mapping.  It is intended to be called by the file systems as part
of the memory_failure handler after the file system performed a reverse
mapping from the storage address to the file and file offset.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
---
 include/linux/mm.h  |  2 +
 mm/memory-failure.c | 96 ++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 88 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index de32c0383387..e2c0f69f0660 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3227,6 +3227,8 @@ enum mf_flags {
 	MF_SOFT_OFFLINE = 1 << 3,
 	MF_UNPOISON = 1 << 4,
 };
+int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
+		      unsigned long count, int mf_flags);
 extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index aeb19593af9c..aedfc5097420 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -295,10 +295,9 @@ void shake_page(struct page *p)
 }
 EXPORT_SYMBOL_GPL(shake_page);
 
-static unsigned long dev_pagemap_mapping_shift(struct page *page,
-		struct vm_area_struct *vma)
+static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
+		unsigned long address)
 {
-	unsigned long address = vma_address(page, vma);
 	unsigned long ret = 0;
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -338,10 +337,14 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
 /*
  * Schedule a process for later kill.
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
+ *
+ * Notice: @fsdax_pgoff is used only when @p is a fsdax page.
+ *   In other cases, such as anonymous and file-backend page, the address to be
+ *   killed can be caculated by @p itself.
  */
 static void add_to_kill(struct task_struct *tsk, struct page *p,
-		       struct vm_area_struct *vma,
-		       struct list_head *to_kill)
+			pgoff_t fsdax_pgoff, struct vm_area_struct *vma,
+			struct list_head *to_kill)
 {
 	struct to_kill *tk;
 
@@ -352,9 +355,15 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
 	}
 
 	tk->addr = page_address_in_vma(p, vma);
-	if (is_zone_device_page(p))
-		tk->size_shift = dev_pagemap_mapping_shift(p, vma);
-	else
+	if (is_zone_device_page(p)) {
+		/*
+		 * Since page->mapping is not used for fsdax, we need
+		 * calculate the address based on the vma.
+		 */
+		if (p->pgmap->type == MEMORY_DEVICE_FS_DAX)
+			tk->addr = vma_pgoff_address(fsdax_pgoff, 1, vma);
+		tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
+	} else
 		tk->size_shift = page_shift(compound_head(p));
 
 	/*
@@ -503,7 +512,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 			if (!page_mapped_in_vma(page, vma))
 				continue;
 			if (vma->vm_mm == t->mm)
-				add_to_kill(t, page, vma, to_kill);
+				add_to_kill(t, page, 0, vma, to_kill);
 		}
 	}
 	read_unlock(&tasklist_lock);
@@ -539,13 +548,41 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 			 * to be informed of all such data corruptions.
 			 */
 			if (vma->vm_mm == t->mm)
-				add_to_kill(t, page, vma, to_kill);
+				add_to_kill(t, page, 0, vma, to_kill);
 		}
 	}
 	read_unlock(&tasklist_lock);
 	i_mmap_unlock_read(mapping);
 }
 
+#ifdef CONFIG_FS_DAX
+/*
+ * Collect processes when the error hit a fsdax page.
+ */
+static void collect_procs_fsdax(struct page *page,
+		struct address_space *mapping, pgoff_t pgoff,
+		struct list_head *to_kill)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+
+	i_mmap_lock_read(mapping);
+	read_lock(&tasklist_lock);
+	for_each_process(tsk) {
+		struct task_struct *t = task_early_kill(tsk, true);
+
+		if (!t)
+			continue;
+		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+			if (vma->vm_mm == t->mm)
+				add_to_kill(t, page, pgoff, vma, to_kill);
+		}
+	}
+	read_unlock(&tasklist_lock);
+	i_mmap_unlock_read(mapping);
+}
+#endif /* CONFIG_FS_DAX */
+
 /*
  * Collect the processes who have the corrupted page mapped to kill.
  */
@@ -1580,6 +1617,45 @@ static int mf_generic_kill_procs(unsigned long long pfn, int flags,
 	return rc;
 }
 
+#ifdef CONFIG_FS_DAX
+/**
+ * mf_dax_kill_procs - Collect and kill processes who are using this file range
+ * @mapping:	address_space of the file in use
+ * @index:	start pgoff of the range within the file
+ * @count:	length of the range, in unit of PAGE_SIZE
+ * @mf_flags:	memory failure flags
+ */
+int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
+		unsigned long count, int mf_flags)
+{
+	LIST_HEAD(to_kill);
+	dax_entry_t cookie;
+	struct page *page;
+	size_t end = index + count;
+
+	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+
+	for (; index < end; index++) {
+		page = NULL;
+		cookie = dax_lock_mapping_entry(mapping, index, &page);
+		if (!cookie)
+			return -EBUSY;
+		if (!page)
+			goto unlock;
+
+		SetPageHWPoison(page);
+
+		collect_procs_fsdax(page, mapping, index, &to_kill);
+		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
+				index, mf_flags);
+unlock:
+		dax_unlock_mapping_entry(mapping, index, cookie);
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
+#endif /* CONFIG_FS_DAX */
+
 /*
  * Called from hugetlb code with hugetlb_lock held.
  *
-- 
2.35.1




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v14 06/07] xfs: Implement ->notify_failure() for XFS
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
                   ` (4 preceding siblings ...)
  2022-05-08 14:36 ` [PATCH v14 05/07] mm: Introduce mf_dax_kill_procs() for fsdax case Shiyang Ruan
@ 2022-05-08 14:36 ` Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v14 07/07] fsdax: set a CoW flag when associate reflink mappings Shiyang Ruan
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-08 14:36 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe, Christoph Hellwig

Introduce xfs_notify_failure.c to handle failure related works, such as
implement ->notify_failure(), register/unregister dax holder in xfs, and
so on.

If the rmap feature of XFS enabled, we can query it to find files and
metadata which are associated with the corrupt data.  For now all we do
is kill processes with that file mapped into their address spaces, but
future patches could actually do something about corrupt metadata.

After that, the memory failure needs to notify the processes who are
using those files.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile             |   5 +
 fs/xfs/xfs_buf.c            |  11 +-
 fs/xfs/xfs_fsops.c          |   3 +
 fs/xfs/xfs_mount.h          |   1 +
 fs/xfs/xfs_notify_failure.c | 220 ++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_super.h          |   1 +
 6 files changed, 238 insertions(+), 3 deletions(-)
 create mode 100644 fs/xfs/xfs_notify_failure.c

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 04611a1068b4..09f5560e29f2 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -128,6 +128,11 @@ xfs-$(CONFIG_SYSCTL)		+= xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)		+= xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)	+= xfs_pnfs.o
 
+# notify failure
+ifeq ($(CONFIG_MEMORY_FAILURE),y)
+xfs-$(CONFIG_FS_DAX)		+= xfs_notify_failure.o
+endif
+
 # online scrub/repair
 ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
 
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 63b8d9b5096f..93e4248240fc 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -5,6 +5,7 @@
  */
 #include "xfs.h"
 #include <linux/backing-dev.h>
+#include <linux/dax.h>
 
 #include "xfs_shared.h"
 #include "xfs_format.h"
@@ -1911,7 +1912,7 @@ xfs_free_buftarg(
 	list_lru_destroy(&btp->bt_lru);
 
 	blkdev_issue_flush(btp->bt_bdev);
-	fs_put_dax(btp->bt_daxdev, NULL);
+	fs_put_dax(btp->bt_daxdev, btp->bt_mount);
 
 	kmem_free(btp);
 }
@@ -1958,14 +1959,18 @@ xfs_alloc_buftarg(
 	struct block_device	*bdev)
 {
 	xfs_buftarg_t		*btp;
+	const struct dax_holder_operations *ops = NULL;
 
+#if defined(CONFIG_FS_DAX) && defined(CONFIG_MEMORY_FAILURE)
+	ops = &xfs_dax_holder_operations;
+#endif
 	btp = kmem_zalloc(sizeof(*btp), KM_NOFS);
 
 	btp->bt_mount = mp;
 	btp->bt_dev =  bdev->bd_dev;
 	btp->bt_bdev = bdev;
-	btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off, NULL,
-					    NULL);
+	btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off,
+					    mp, ops);
 
 	/*
 	 * Buffer IO error rate limiting. Limit it to no more than 10 messages
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 888839e75d11..ea9159967eaa 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -533,6 +533,9 @@ xfs_do_force_shutdown(
 	} else if (flags & SHUTDOWN_CORRUPT_INCORE) {
 		tag = XFS_PTAG_SHUTDOWN_CORRUPT;
 		why = "Corruption of in-memory data";
+	} else if (flags & SHUTDOWN_CORRUPT_ONDISK) {
+		tag = XFS_PTAG_SHUTDOWN_CORRUPT;
+		why = "Corruption of on-disk metadata";
 	} else {
 		tag = XFS_PTAG_SHUTDOWN_IOERROR;
 		why = "Metadata I/O Error";
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 8c42786e4942..540924b9e583 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -438,6 +438,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, uint32_t flags, char *fname,
 #define SHUTDOWN_LOG_IO_ERROR	(1u << 1) /* write attempt to the log failed */
 #define SHUTDOWN_FORCE_UMOUNT	(1u << 2) /* shutdown from a forced unmount */
 #define SHUTDOWN_CORRUPT_INCORE	(1u << 3) /* corrupt in-memory structures */
+#define SHUTDOWN_CORRUPT_ONDISK	(1u << 4)  /* corrupt metadata on device */
 
 #define XFS_SHUTDOWN_STRINGS \
 	{ SHUTDOWN_META_IO_ERROR,	"metadata_io" }, \
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
new file mode 100644
index 000000000000..aa8dc27c599c
--- /dev/null
+++ b/fs/xfs/xfs_notify_failure.c
@@ -0,0 +1,220 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022 Fujitsu.  All Rights Reserved.
+ */
+
+#include "xfs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_alloc.h"
+#include "xfs_bit.h"
+#include "xfs_btree.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_rtalloc.h"
+#include "xfs_trans.h"
+
+#include <linux/mm.h>
+#include <linux/dax.h>
+
+struct failure_info {
+	xfs_agblock_t		startblock;
+	xfs_extlen_t		blockcount;
+	int			mf_flags;
+};
+
+static pgoff_t
+xfs_failure_pgoff(
+	struct xfs_mount		*mp,
+	const struct xfs_rmap_irec	*rec,
+	const struct failure_info	*notify)
+{
+	loff_t				pos = XFS_FSB_TO_B(mp, rec->rm_offset);
+
+	if (notify->startblock > rec->rm_startblock)
+		pos += XFS_FSB_TO_B(mp,
+				notify->startblock - rec->rm_startblock);
+	return pos >> PAGE_SHIFT;
+}
+
+static unsigned long
+xfs_failure_pgcnt(
+	struct xfs_mount		*mp,
+	const struct xfs_rmap_irec	*rec,
+	const struct failure_info	*notify)
+{
+	xfs_agblock_t			end_rec;
+	xfs_agblock_t			end_notify;
+	xfs_agblock_t			start_cross;
+	xfs_agblock_t			end_cross;
+
+	start_cross = max(rec->rm_startblock, notify->startblock);
+
+	end_rec = rec->rm_startblock + rec->rm_blockcount;
+	end_notify = notify->startblock + notify->blockcount;
+	end_cross = min(end_rec, end_notify);
+
+	return XFS_FSB_TO_B(mp, end_cross - start_cross) >> PAGE_SHIFT;
+}
+
+static int
+xfs_dax_failure_fn(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rec,
+	void				*data)
+{
+	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_inode		*ip;
+	struct failure_info		*notify = data;
+	int				error = 0;
+
+	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
+	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
+		return -EFSCORRUPTED;
+	}
+
+	/* Get files that incore, filter out others that are not in use. */
+	error = xfs_iget(mp, cur->bc_tp, rec->rm_owner, XFS_IGET_INCORE,
+			 0, &ip);
+	/* Continue the rmap query if the inode isn't incore */
+	if (error == -ENODATA)
+		return 0;
+	if (error)
+		return error;
+
+	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
+				  xfs_failure_pgoff(mp, rec, notify),
+				  xfs_failure_pgcnt(mp, rec, notify),
+				  notify->mf_flags);
+	xfs_irele(ip);
+	return error;
+}
+
+static int
+xfs_dax_notify_ddev_failure(
+	struct xfs_mount	*mp,
+	xfs_daddr_t		daddr,
+	xfs_daddr_t		bblen,
+	int			mf_flags)
+{
+	struct xfs_trans	*tp = NULL;
+	struct xfs_btree_cur	*cur = NULL;
+	struct xfs_buf		*agf_bp = NULL;
+	int			error = 0;
+	xfs_fsblock_t		fsbno = XFS_DADDR_TO_FSB(mp, daddr);
+	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
+	xfs_fsblock_t		end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen);
+	xfs_agnumber_t		end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
+
+	error = xfs_trans_alloc_empty(mp, &tp);
+	if (error)
+		return error;
+
+	for (; agno <= end_agno; agno++) {
+		struct xfs_rmap_irec	ri_low = { };
+		struct xfs_rmap_irec	ri_high;
+		struct failure_info	notify;
+		struct xfs_agf		*agf;
+		xfs_agblock_t		agend;
+
+		error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
+		if (error)
+			break;
+
+		cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agf_bp->b_pag);
+
+		/*
+		 * Set the rmap range from ri_low to ri_high, which represents
+		 * a [start, end] where we looking for the files or metadata.
+		 */
+		memset(&ri_high, 0xFF, sizeof(ri_high));
+		ri_low.rm_startblock = XFS_FSB_TO_AGBNO(mp, fsbno);
+		if (agno == end_agno)
+			ri_high.rm_startblock = XFS_FSB_TO_AGBNO(mp, end_fsbno);
+
+		agf = agf_bp->b_addr;
+		agend = min(be32_to_cpu(agf->agf_length),
+				ri_high.rm_startblock);
+		notify.startblock = ri_low.rm_startblock;
+		notify.blockcount = agend - ri_low.rm_startblock;
+
+		error = xfs_rmap_query_range(cur, &ri_low, &ri_high,
+				xfs_dax_failure_fn, &notify);
+		xfs_btree_del_cursor(cur, error);
+		xfs_trans_brelse(tp, agf_bp);
+		if (error)
+			break;
+
+		fsbno = XFS_AGB_TO_FSB(mp, agno + 1, 0);
+	}
+
+	xfs_trans_cancel(tp);
+	return error;
+}
+
+static int
+xfs_dax_notify_failure(
+	struct dax_device	*dax_dev,
+	u64			offset,
+	u64			len,
+	int			mf_flags)
+{
+	struct xfs_mount	*mp = dax_holder(dax_dev);
+	u64			ddev_start;
+	u64			ddev_end;
+
+	if (!(mp->m_sb.sb_flags & SB_BORN)) {
+		xfs_warn(mp, "filesystem is not ready for notify_failure()!");
+		return -EIO;
+	}
+
+	if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
+		xfs_warn(mp,
+			 "notify_failure() not supported on realtime device!");
+		return -EOPNOTSUPP;
+	}
+
+	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
+	    mp->m_logdev_targp != mp->m_ddev_targp) {
+		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
+		return -EFSCORRUPTED;
+	}
+
+	if (!xfs_has_rmapbt(mp)) {
+		xfs_warn(mp, "notify_failure() needs rmapbt enabled!");
+		return -EOPNOTSUPP;
+	}
+
+	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
+	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
+
+	/* Ignore the range out of filesystem area */
+	if (offset + len < ddev_start)
+		return -ENXIO;
+	if (offset > ddev_end)
+		return -ENXIO;
+
+	/* Calculate the real range when it touches the boundary */
+	if (offset > ddev_start)
+		offset -= ddev_start;
+	else {
+		len -= ddev_start - offset;
+		offset = 0;
+	}
+	if (offset + len > ddev_end)
+		len -= ddev_end - offset;
+
+	return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
+			mf_flags);
+}
+
+const struct dax_holder_operations xfs_dax_holder_operations = {
+	.notify_failure		= xfs_dax_notify_failure,
+};
diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h
index 167d23f92ffe..27ab5087d0b3 100644
--- a/fs/xfs/xfs_super.h
+++ b/fs/xfs/xfs_super.h
@@ -93,6 +93,7 @@ extern xfs_agnumber_t xfs_set_inode_alloc(struct xfs_mount *,
 extern const struct export_operations xfs_export_operations;
 extern const struct xattr_handler *xfs_xattr_handlers[];
 extern const struct quotactl_ops xfs_quotactl_operations;
+extern const struct dax_holder_operations xfs_dax_holder_operations;
 
 extern void xfs_reinit_percpu_counters(struct xfs_mount *mp);
 
-- 
2.35.1




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v14 07/07] fsdax: set a CoW flag when associate reflink mappings
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
                   ` (5 preceding siblings ...)
  2022-05-08 14:36 ` [PATCH v14 06/07] xfs: Implement ->notify_failure() for XFS Shiyang Ruan
@ 2022-05-08 14:36 ` Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v11 01/07] fsdax: Output address in dax_iomap_pfn() and rename it Shiyang Ruan
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-08 14:36 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe, Christoph Hellwig

Introduce a PAGE_MAPPING_DAX_COW flag to support association with CoW file
mappings.  In this case, since the dax-rmap has already took the
responsibility to look up for shared files by given dax page,
the page->mapping is no longer to used for rmap but for marking that
this dax page is shared.  And to make sure disassociation works fine, we
use page->index as refcount, and clear page->mapping to the initial
state when page->index is decreased to 0.

With the help of this new flag, it is able to distinguish normal case
and CoW case, and keep the warning in normal case.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/dax.c                   | 50 +++++++++++++++++++++++++++++++-------
 include/linux/page-flags.h |  6 +++++
 2 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 57efd3f73655..4d3dfc8bee33 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -334,13 +334,35 @@ static unsigned long dax_end_pfn(void *entry)
 	for (pfn = dax_to_pfn(entry); \
 			pfn < dax_end_pfn(entry); pfn++)
 
+static inline bool dax_mapping_is_cow(struct address_space *mapping)
+{
+	return (unsigned long)mapping == PAGE_MAPPING_DAX_COW;
+}
+
 /*
- * TODO: for reflink+dax we need a way to associate a single page with
- * multiple address_space instances at different linear_page_index()
- * offsets.
+ * Set the page->mapping with FS_DAX_MAPPING_COW flag, increase the refcount.
+ */
+static inline void dax_mapping_set_cow(struct page *page)
+{
+	if ((uintptr_t)page->mapping != PAGE_MAPPING_DAX_COW) {
+		/*
+		 * Reset the index if the page was already mapped
+		 * regularly before.
+		 */
+		if (page->mapping)
+			page->index = 1;
+		page->mapping = (void *)PAGE_MAPPING_DAX_COW;
+	}
+	page->index++;
+}
+
+/*
+ * When it is called in dax_insert_entry(), the cow flag will indicate that
+ * whether this entry is shared by multiple files.  If so, set the page->mapping
+ * FS_DAX_MAPPING_COW, and use page->index as refcount.
  */
 static void dax_associate_entry(void *entry, struct address_space *mapping,
-		struct vm_area_struct *vma, unsigned long address)
+		struct vm_area_struct *vma, unsigned long address, bool cow)
 {
 	unsigned long size = dax_entry_size(entry), pfn, index;
 	int i = 0;
@@ -352,9 +374,13 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		WARN_ON_ONCE(page->mapping);
-		page->mapping = mapping;
-		page->index = index + i++;
+		if (cow) {
+			dax_mapping_set_cow(page);
+		} else {
+			WARN_ON_ONCE(page->mapping);
+			page->mapping = mapping;
+			page->index = index + i++;
+		}
 	}
 }
 
@@ -370,7 +396,12 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		struct page *page = pfn_to_page(pfn);
 
 		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
-		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
+		if (dax_mapping_is_cow(page->mapping)) {
+			/* keep the CoW flag if this page is still shared */
+			if (page->index-- > 0)
+				continue;
+		} else
+			WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
 		page->index = 0;
 	}
@@ -829,7 +860,8 @@ static void *dax_insert_entry(struct xa_state *xas,
 		void *old;
 
 		dax_disassociate_entry(entry, mapping, false);
-		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
+		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
+				false);
 		/*
 		 * Only swap our new entry into the page cache if the current
 		 * entry is a zero page or an empty entry.  If a normal PTE or
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index fe47ee8dc258..cad9aeb5e75c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -650,6 +650,12 @@ __PAGEFLAG(Reported, reported, PF_NO_COMPOUND)
 #define PAGE_MAPPING_KSM	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
 #define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
 
+/*
+ * Different with flags above, this flag is used only for fsdax mode.  It
+ * indicates that this page->mapping is now under reflink case.
+ */
+#define PAGE_MAPPING_DAX_COW	0x1
+
 static __always_inline int PageMappingFlags(struct page *page)
 {
 	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) != 0;
-- 
2.35.1




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v11 01/07] fsdax: Output address in dax_iomap_pfn() and rename it
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
                   ` (6 preceding siblings ...)
  2022-05-08 14:36 ` [PATCH v14 07/07] fsdax: set a CoW flag when associate reflink mappings Shiyang Ruan
@ 2022-05-08 14:36 ` Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v11 02/07] fsdax: Introduce dax_iomap_cow_copy() Shiyang Ruan
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-08 14:36 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe, Christoph Hellwig,
	Ritesh Harjani

Add address output in dax_iomap_pfn() in order to perform a memcpy() in
CoW case.  Since this function both output address and pfn, rename it to
dax_iomap_direct_access().

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/dax.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4d3dfc8bee33..d4f195aeaa12 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1025,8 +1025,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
 
-static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
-			 pfn_t *pfnp)
+static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
+		size_t size, void **kaddr, pfn_t *pfnp)
 {
 	pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
 	int id, rc;
@@ -1034,11 +1034,13 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
 
 	id = dax_read_lock();
 	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
-				   NULL, pfnp);
+				   kaddr, pfnp);
 	if (length < 0) {
 		rc = length;
 		goto out;
 	}
+	if (!pfnp)
+		goto out_check_addr;
 	rc = -EINVAL;
 	if (PFN_PHYS(length) < size)
 		goto out;
@@ -1048,6 +1050,12 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
 	if (length > 1 && !pfn_t_devmap(*pfnp))
 		goto out;
 	rc = 0;
+
+out_check_addr:
+	if (!kaddr)
+		goto out;
+	if (!*kaddr)
+		rc = -EFAULT;
 out:
 	dax_read_unlock(id);
 	return rc;
@@ -1444,7 +1452,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 		return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
 	}
 
-	err = dax_iomap_pfn(&iter->iomap, pos, size, &pfn);
+	err = dax_iomap_direct_access(&iter->iomap, pos, size, NULL, &pfn);
 	if (err)
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 
-- 
2.35.1




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v11 02/07] fsdax: Introduce dax_iomap_cow_copy()
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
                   ` (7 preceding siblings ...)
  2022-05-08 14:36 ` [PATCH v11 01/07] fsdax: Output address in dax_iomap_pfn() and rename it Shiyang Ruan
@ 2022-05-08 14:36 ` Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v11 03/07] fsdax: Replace mmap entry in case of CoW Shiyang Ruan
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-08 14:36 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe, Christoph Hellwig

In the case where the iomap is a write operation and iomap is not equal
to srcmap after iomap_begin, we consider it is a CoW operation.

In this case, the destination (iomap->addr) points to a newly allocated
extent.  It is needed to copy the data from srcmap to the extent.  In
theory, it is better to copy the head and tail ranges which is outside
of the non-aligned area instead of copying the whole aligned range. But
in dax page fault, it will always be an aligned range. So copy the whole
range in this case.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/dax.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 83 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index d4f195aeaa12..a4d56cfa33d0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1061,6 +1061,60 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
 	return rc;
 }
 
+/**
+ * dax_iomap_cow_copy - Copy the data from source to destination before write
+ * @pos:	address to do copy from.
+ * @length:	size of copy operation.
+ * @align_size:	aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
+ * @srcmap:	iomap srcmap
+ * @daddr:	destination address to copy to.
+ *
+ * This can be called from two places. Either during DAX write fault (page
+ * aligned), to copy the length size data to daddr. Or, while doing normal DAX
+ * write operation, dax_iomap_actor() might call this to do the copy of either
+ * start or end unaligned address. In the latter case the rest of the copy of
+ * aligned ranges is taken care by dax_iomap_actor() itself.
+ */
+static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
+		const struct iomap *srcmap, void *daddr)
+{
+	loff_t head_off = pos & (align_size - 1);
+	size_t size = ALIGN(head_off + length, align_size);
+	loff_t end = pos + length;
+	loff_t pg_end = round_up(end, align_size);
+	bool copy_all = head_off == 0 && end == pg_end;
+	void *saddr = 0;
+	int ret = 0;
+
+	ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
+	if (ret)
+		return ret;
+
+	if (copy_all) {
+		ret = copy_mc_to_kernel(daddr, saddr, length);
+		return ret ? -EIO : 0;
+	}
+
+	/* Copy the head part of the range */
+	if (head_off) {
+		ret = copy_mc_to_kernel(daddr, saddr, head_off);
+		if (ret)
+			return -EIO;
+	}
+
+	/* Copy the tail part of the range */
+	if (end < pg_end) {
+		loff_t tail_off = head_off + length;
+		loff_t tail_len = pg_end - end;
+
+		ret = copy_mc_to_kernel(daddr + tail_off, saddr + tail_off,
+					tail_len);
+		if (ret)
+			return -EIO;
+	}
+	return 0;
+}
+
 /*
  * The user has performed a load from a hole in the file.  Allocating a new
  * page in the file would cause excessive storage usage for workloads with
@@ -1231,15 +1285,17 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 		struct iov_iter *iter)
 {
 	const struct iomap *iomap = &iomi->iomap;
+	const struct iomap *srcmap = &iomi->srcmap;
 	loff_t length = iomap_length(iomi);
 	loff_t pos = iomi->pos;
 	struct dax_device *dax_dev = iomap->dax_dev;
 	loff_t end = pos + length, done = 0;
+	bool write = iov_iter_rw(iter) == WRITE;
 	ssize_t ret = 0;
 	size_t xfer;
 	int id;
 
-	if (iov_iter_rw(iter) == READ) {
+	if (!write) {
 		end = min(end, i_size_read(iomi->inode));
 		if (pos >= end)
 			return 0;
@@ -1248,7 +1304,12 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 			return iov_iter_zero(min(length, end - pos), iter);
 	}
 
-	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
+	/*
+	 * In DAX mode, enforce either pure overwrites of written extents, or
+	 * writes to unwritten extents as part of a copy-on-write operation.
+	 */
+	if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED &&
+			!(iomap->flags & IOMAP_F_SHARED)))
 		return -EIO;
 
 	/*
@@ -1282,13 +1343,21 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 			break;
 		}
 
+		if (write &&
+		    srcmap->type != IOMAP_HOLE && srcmap->addr != iomap->addr) {
+			ret = dax_iomap_cow_copy(pos, length, PAGE_SIZE, srcmap,
+						 kaddr);
+			if (ret)
+				break;
+		}
+
 		map_len = PFN_PHYS(map_len);
 		kaddr += offset;
 		map_len -= offset;
 		if (map_len > end - pos)
 			map_len = end - pos;
 
-		if (iov_iter_rw(iter) == WRITE)
+		if (write)
 			xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr,
 					map_len, iter);
 		else
@@ -1428,6 +1497,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	const struct iomap *iomap = &iter->iomap;
+	const struct iomap *srcmap = &iter->srcmap;
 	size_t size = pmd ? PMD_SIZE : PAGE_SIZE;
 	loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -1435,6 +1505,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 	unsigned long entry_flags = pmd ? DAX_PMD : 0;
 	int err = 0;
 	pfn_t pfn;
+	void *kaddr;
 
 	if (!pmd && vmf->cow_page)
 		return dax_fault_cow_page(vmf, iter);
@@ -1447,18 +1518,25 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 		return dax_pmd_load_hole(xas, vmf, iomap, entry);
 	}
 
-	if (iomap->type != IOMAP_MAPPED) {
+	if (iomap->type != IOMAP_MAPPED && !(iomap->flags & IOMAP_F_SHARED)) {
 		WARN_ON_ONCE(1);
 		return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
 	}
 
-	err = dax_iomap_direct_access(&iter->iomap, pos, size, NULL, &pfn);
+	err = dax_iomap_direct_access(iomap, pos, size, &kaddr, &pfn);
 	if (err)
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 
 	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, entry_flags,
 				  write && !sync);
 
+	if (write &&
+	    srcmap->type != IOMAP_HOLE && srcmap->addr != iomap->addr) {
+		err = dax_iomap_cow_copy(pos, size, size, srcmap, kaddr);
+		if (err)
+			return dax_fault_return(err);
+	}
+
 	if (sync)
 		return dax_fault_synchronous_pfnp(pfnp, pfn);
 
-- 
2.35.1




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v11 03/07] fsdax: Replace mmap entry in case of CoW
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
                   ` (8 preceding siblings ...)
  2022-05-08 14:36 ` [PATCH v11 02/07] fsdax: Introduce dax_iomap_cow_copy() Shiyang Ruan
@ 2022-05-08 14:36 ` Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v11 04/07] fsdax: Add dax_iomap_cow_copy() for dax zero Shiyang Ruan
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-08 14:36 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe, Goldwyn Rodrigues,
	Christoph Hellwig, Ritesh Harjani

Replace the existing entry to the newly allocated one in case of CoW.
Also, we mark the entry as PAGECACHE_TAG_TOWRITE so writeback marks this
entry as writeprotected.  This helps us snapshots so new write
pagefaults after snapshots trigger a CoW.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/dax.c | 77 ++++++++++++++++++++++++++++++--------------------------
 1 file changed, 42 insertions(+), 35 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 00d2cb72ec58..78e26204697b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -828,6 +828,23 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter *iter
 	return 0;
 }
 
+/*
+ * MAP_SYNC on a dax mapping guarantees dirty metadata is
+ * flushed on write-faults (non-cow), but not read-faults.
+ */
+static bool dax_fault_is_synchronous(const struct iomap_iter *iter,
+		struct vm_area_struct *vma)
+{
+	return (iter->flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC) &&
+		(iter->iomap.flags & IOMAP_F_DIRTY);
+}
+
+static bool dax_fault_is_cow(const struct iomap_iter *iter)
+{
+	return (iter->flags & IOMAP_WRITE) &&
+		(iter->iomap.flags & IOMAP_F_SHARED);
+}
+
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
  * of the appropriate size so we don't have to worry about downgrading PMDs to
@@ -835,16 +852,19 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter *iter
  * already in the tree, we will skip the insertion and just dirty the PMD as
  * appropriate.
  */
-static void *dax_insert_entry(struct xa_state *xas,
-		struct address_space *mapping, struct vm_fault *vmf,
-		void *entry, pfn_t pfn, unsigned long flags, bool dirty)
+static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
+		const struct iomap_iter *iter, void *entry, pfn_t pfn,
+		unsigned long flags)
 {
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	void *new_entry = dax_make_entry(pfn, flags);
+	bool dirty = !dax_fault_is_synchronous(iter, vmf->vma);
+	bool cow = dax_fault_is_cow(iter);
 
 	if (dirty)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
-	if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
+	if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
 		unsigned long index = xas->xa_index;
 		/* we are replacing a zero page with block mapping */
 		if (dax_is_pmd_entry(entry))
@@ -856,12 +876,12 @@ static void *dax_insert_entry(struct xa_state *xas,
 
 	xas_reset(xas);
 	xas_lock_irq(xas);
-	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+	if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		void *old;
 
 		dax_disassociate_entry(entry, mapping, false);
 		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
-				false);
+				cow);
 		/*
 		 * Only swap our new entry into the page cache if the current
 		 * entry is a zero page or an empty entry.  If a normal PTE or
@@ -881,6 +901,9 @@ static void *dax_insert_entry(struct xa_state *xas,
 	if (dirty)
 		xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
 
+	if (cow)
+		xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
+
 	xas_unlock_irq(xas);
 	return entry;
 }
@@ -1122,17 +1145,15 @@ static int dax_iomap_cow_copy(loff_t pos, uint64_t length, size_t align_size,
  * If this page is ever written to we will re-fault and change the mapping to
  * point to real DAX storage instead.
  */
-static vm_fault_t dax_load_hole(struct xa_state *xas,
-		struct address_space *mapping, void **entry,
-		struct vm_fault *vmf)
+static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
+		const struct iomap_iter *iter, void **entry)
 {
-	struct inode *inode = mapping->host;
+	struct inode *inode = iter->inode;
 	unsigned long vaddr = vmf->address;
 	pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
 	vm_fault_t ret;
 
-	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-			DAX_ZERO_PAGE, false);
+	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
 
 	ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
 	trace_dax_load_hole(inode, vmf, ret);
@@ -1141,7 +1162,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 
 #ifdef CONFIG_FS_DAX_PMD
 static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
-		const struct iomap *iomap, void **entry)
+		const struct iomap_iter *iter, void **entry)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
@@ -1159,8 +1180,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 		goto fallback;
 
 	pfn = page_to_pfn_t(zero_page);
-	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
-			DAX_PMD | DAX_ZERO_PAGE, false);
+	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn,
+				  DAX_PMD | DAX_ZERO_PAGE);
 
 	if (arch_needs_pgtable_deposit()) {
 		pgtable = pte_alloc_one(vma->vm_mm);
@@ -1193,7 +1214,7 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 }
 #else
 static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
-		const struct iomap *iomap, void **entry)
+		const struct iomap_iter *iter, void **entry)
 {
 	return VM_FAULT_FALLBACK;
 }
@@ -1427,17 +1448,6 @@ static vm_fault_t dax_fault_return(int error)
 	return vmf_error(error);
 }
 
-/*
- * MAP_SYNC on a dax mapping guarantees dirty metadata is
- * flushed on write-faults (non-cow), but not read-faults.
- */
-static bool dax_fault_is_synchronous(unsigned long flags,
-		struct vm_area_struct *vma, const struct iomap *iomap)
-{
-	return (flags & IOMAP_WRITE) && (vma->vm_flags & VM_SYNC)
-		&& (iomap->flags & IOMAP_F_DIRTY);
-}
-
 /*
  * When handling a synchronous page fault and the inode need a fsync, we can
  * insert the PTE/PMD into page tables only after that fsync happened. Skip
@@ -1495,13 +1505,11 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 		const struct iomap_iter *iter, pfn_t *pfnp,
 		struct xa_state *xas, void **entry, bool pmd)
 {
-	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	const struct iomap *iomap = &iter->iomap;
 	const struct iomap *srcmap = &iter->srcmap;
 	size_t size = pmd ? PMD_SIZE : PAGE_SIZE;
 	loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
-	bool write = vmf->flags & FAULT_FLAG_WRITE;
-	bool sync = dax_fault_is_synchronous(iter->flags, vmf->vma, iomap);
+	bool write = iter->flags & IOMAP_WRITE;
 	unsigned long entry_flags = pmd ? DAX_PMD : 0;
 	int err = 0;
 	pfn_t pfn;
@@ -1514,8 +1522,8 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 	if (!write &&
 	    (iomap->type == IOMAP_UNWRITTEN || iomap->type == IOMAP_HOLE)) {
 		if (!pmd)
-			return dax_load_hole(xas, mapping, entry, vmf);
-		return dax_pmd_load_hole(xas, vmf, iomap, entry);
+			return dax_load_hole(xas, vmf, iter, entry);
+		return dax_pmd_load_hole(xas, vmf, iter, entry);
 	}
 
 	if (iomap->type != IOMAP_MAPPED && !(iomap->flags & IOMAP_F_SHARED)) {
@@ -1527,8 +1535,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 	if (err)
 		return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
 
-	*entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, entry_flags,
-				  write && !sync);
+	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, entry_flags);
 
 	if (write &&
 	    srcmap->type != IOMAP_HOLE && srcmap->addr != iomap->addr) {
@@ -1537,7 +1544,7 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
 			return dax_fault_return(err);
 	}
 
-	if (sync)
+	if (dax_fault_is_synchronous(iter, vmf->vma))
 		return dax_fault_synchronous_pfnp(pfnp, pfn);
 
 	/* insert PMD pfn */
-- 
2.35.1




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v11 04/07] fsdax: Add dax_iomap_cow_copy() for dax zero
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
                   ` (9 preceding siblings ...)
  2022-05-08 14:36 ` [PATCH v11 03/07] fsdax: Replace mmap entry in case of CoW Shiyang Ruan
@ 2022-05-08 14:36 ` Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v11 05/07] fsdax: Dedup file range to use a compare function Shiyang Ruan
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-08 14:36 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe, Ritesh Harjani,
	Christoph Hellwig

Punch hole on a reflinked file needs dax_iomap_cow_copy() too.
Otherwise, data in not aligned area will be not correct.  So, add the
CoW operation for not aligned case in dax_memzero().

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

==
This patch changed a lot when rebasing to next-20220504 branch.  Though it
has been tested by myself, I think it needs a re-review.
==
---
 fs/dax.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 78e26204697b..b3aa863e9fec 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1220,17 +1220,27 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 }
 #endif /* CONFIG_FS_DAX_PMD */
 
-static int dax_memzero(struct dax_device *dax_dev, pgoff_t pgoff,
-		unsigned int offset, size_t size)
+static int dax_memzero(struct iomap_iter *iter, loff_t pos, size_t size)
 {
+	const struct iomap *iomap = &iter->iomap;
+	const struct iomap *srcmap = iomap_iter_srcmap(iter);
+	unsigned offset = offset_in_page(pos);
+	pgoff_t pgoff = dax_iomap_pgoff(iomap, pos);
 	void *kaddr;
 	long ret;
 
-	ret = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
-	if (ret > 0) {
-		memset(kaddr + offset, 0, size);
-		dax_flush(dax_dev, kaddr + offset, size);
-	}
+	ret = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
+	if (ret < 0)
+		return ret;
+	memset(kaddr + offset, 0, size);
+	if (srcmap->addr != iomap->addr) {
+		ret = dax_iomap_cow_copy(pos, size, PAGE_SIZE, srcmap,
+					 kaddr);
+		if (ret < 0)
+			return ret;
+		dax_flush(iomap->dax_dev, kaddr, PAGE_SIZE);
+	} else
+		dax_flush(iomap->dax_dev, kaddr + offset, size);
 	return ret;
 }
 
@@ -1257,7 +1267,7 @@ static s64 dax_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (IS_ALIGNED(pos, PAGE_SIZE) && size == PAGE_SIZE)
 			rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
 		else
-			rc = dax_memzero(iomap->dax_dev, pgoff, offset, size);
+			rc = dax_memzero(iter, pos, size);
 		dax_read_unlock(id);
 
 		if (rc < 0)
-- 
2.35.1




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v11 05/07] fsdax: Dedup file range to use a compare function
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
                   ` (10 preceding siblings ...)
  2022-05-08 14:36 ` [PATCH v11 04/07] fsdax: Add dax_iomap_cow_copy() for dax zero Shiyang Ruan
@ 2022-05-08 14:36 ` Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v11 06/07] xfs: support CoW in fsdax mode Shiyang Ruan
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-08 14:36 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe, Goldwyn Rodrigues,
	Christoph Hellwig

With dax we cannot deal with readpage() etc. So, we create a dax
comparison function which is similar with
vfs_dedupe_file_range_compare().
And introduce dax_remap_file_range_prep() for filesystem use.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/dax.c             | 82 ++++++++++++++++++++++++++++++++++++++++++++
 fs/remap_range.c     | 31 ++++++++++++++---
 fs/xfs/xfs_reflink.c |  8 +++--
 include/linux/dax.h  |  8 +++++
 include/linux/fs.h   | 12 ++++---
 5 files changed, 130 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b3aa863e9fec..601a23c6378c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1860,3 +1860,85 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 	return dax_insert_pfn_mkwrite(vmf, pfn, order);
 }
 EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+static loff_t dax_range_compare_iter(struct iomap_iter *it_src,
+		struct iomap_iter *it_dest, u64 len, bool *same)
+{
+	const struct iomap *smap = &it_src->iomap;
+	const struct iomap *dmap = &it_dest->iomap;
+	loff_t pos1 = it_src->pos, pos2 = it_dest->pos;
+	void *saddr, *daddr;
+	int id, ret;
+
+	len = min(len, min(smap->length, dmap->length));
+
+	if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
+		*same = true;
+		return len;
+	}
+
+	if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
+		*same = false;
+		return 0;
+	}
+
+	id = dax_read_lock();
+	ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
+				      &saddr, NULL);
+	if (ret < 0)
+		goto out_unlock;
+
+	ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
+				      &daddr, NULL);
+	if (ret < 0)
+		goto out_unlock;
+
+	*same = !memcmp(saddr, daddr, len);
+	if (!*same)
+		len = 0;
+	dax_read_unlock(id);
+	return len;
+
+out_unlock:
+	dax_read_unlock(id);
+	return -EIO;
+}
+
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+		struct inode *dst, loff_t dstoff, loff_t len, bool *same,
+		const struct iomap_ops *ops)
+{
+	struct iomap_iter src_iter = {
+		.inode		= src,
+		.pos		= srcoff,
+		.len		= len,
+		.flags		= IOMAP_DAX,
+	};
+	struct iomap_iter dst_iter = {
+		.inode		= dst,
+		.pos		= dstoff,
+		.len		= len,
+		.flags		= IOMAP_DAX,
+	};
+	int ret;
+
+	while ((ret = iomap_iter(&src_iter, ops)) > 0) {
+		while ((ret = iomap_iter(&dst_iter, ops)) > 0) {
+			dst_iter.processed = dax_range_compare_iter(&src_iter,
+						&dst_iter, len, same);
+		}
+		if (ret <= 0)
+			src_iter.processed = ret;
+	}
+	return ret;
+}
+
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      loff_t *len, unsigned int remap_flags,
+			      const struct iomap_ops *ops)
+{
+	return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+					       pos_out, len, remap_flags, ops);
+}
+EXPORT_SYMBOL_GPL(dax_remap_file_range_prep);
diff --git a/fs/remap_range.c b/fs/remap_range.c
index e112b5424cdb..231de627c1b9 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -14,6 +14,7 @@
 #include <linux/compat.h>
 #include <linux/mount.h>
 #include <linux/fs.h>
+#include <linux/dax.h>
 #include "internal.h"
 
 #include <linux/uaccess.h>
@@ -271,9 +272,11 @@ static int vfs_dedupe_file_range_compare(struct file *src, loff_t srcoff,
  * If there's an error, then the usual negative error code is returned.
  * Otherwise returns 0 with *len set to the request length.
  */
-int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
-				  struct file *file_out, loff_t pos_out,
-				  loff_t *len, unsigned int remap_flags)
+int
+__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				struct file *file_out, loff_t pos_out,
+				loff_t *len, unsigned int remap_flags,
+				const struct iomap_ops *dax_read_ops)
 {
 	struct inode *inode_in = file_inode(file_in);
 	struct inode *inode_out = file_inode(file_out);
@@ -333,8 +336,18 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	if (remap_flags & REMAP_FILE_DEDUP) {
 		bool		is_same = false;
 
-		ret = vfs_dedupe_file_range_compare(file_in, pos_in,
-				file_out, pos_out, *len, &is_same);
+		if (*len == 0)
+			return 0;
+
+		if (!IS_DAX(inode_in))
+			ret = vfs_dedupe_file_range_compare(file_in, pos_in,
+					file_out, pos_out, *len, &is_same);
+		else if (dax_read_ops)
+			ret = dax_dedupe_file_range_compare(inode_in, pos_in,
+					inode_out, pos_out, *len, &is_same,
+					dax_read_ops);
+		else
+			return -EINVAL;
 		if (ret)
 			return ret;
 		if (!is_same)
@@ -352,6 +365,14 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 
 	return ret;
 }
+
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				  struct file *file_out, loff_t pos_out,
+				  loff_t *len, unsigned int remap_flags)
+{
+	return __generic_remap_file_range_prep(file_in, pos_in, file_out,
+					       pos_out, len, remap_flags, NULL);
+}
 EXPORT_SYMBOL(generic_remap_file_range_prep);
 
 loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 1ae6d3434ad2..10a9947e35d9 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1342,8 +1342,12 @@ xfs_reflink_remap_prep(
 	if (IS_DAX(inode_in) || IS_DAX(inode_out))
 		goto out_unlock;
 
-	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
-			len, remap_flags);
+	if (!IS_DAX(inode_in))
+		ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
+				pos_out, len, remap_flags);
+	else
+		ret = dax_remap_file_range_prep(file_in, pos_in, file_out,
+				pos_out, len, remap_flags, &xfs_read_iomap_ops);
 	if (ret || *len == 0)
 		goto out_unlock;
 
diff --git a/include/linux/dax.h b/include/linux/dax.h
index c152f315d1c9..f955a2dc96cb 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -228,6 +228,14 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
+int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
+				  struct inode *dest, loff_t destoff,
+				  loff_t len, bool *is_same,
+				  const struct iomap_ops *ops);
+int dax_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      loff_t *len, unsigned int remap_flags,
+			      const struct iomap_ops *ops);
 static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c2a61e0f43de..dfc981fef3c3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -74,6 +74,7 @@ struct fsverity_operations;
 struct fs_context;
 struct fs_parameter_spec;
 struct fileattr;
+struct iomap_ops;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -2055,10 +2056,13 @@ extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
 				       struct file *file_out, loff_t pos_out,
 				       size_t len, unsigned int flags);
-extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
-					 struct file *file_out, loff_t pos_out,
-					 loff_t *count,
-					 unsigned int remap_flags);
+int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				    struct file *file_out, loff_t pos_out,
+				    loff_t *len, unsigned int remap_flags,
+				    const struct iomap_ops *dax_read_ops);
+int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
+				  struct file *file_out, loff_t pos_out,
+				  loff_t *count, unsigned int remap_flags);
 extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
 				  loff_t len, unsigned int remap_flags);
-- 
2.35.1




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v11 06/07] xfs: support CoW in fsdax mode
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
                   ` (11 preceding siblings ...)
  2022-05-08 14:36 ` [PATCH v11 05/07] fsdax: Dedup file range to use a compare function Shiyang Ruan
@ 2022-05-08 14:36 ` Shiyang Ruan
  2022-05-10  5:45   ` Christoph Hellwig
  2022-05-11  2:25   ` [PATCH v11.1 " Shiyang Ruan
  2022-05-08 14:36 ` [PATCH v11 07/07] xfs: Add dax dedupe support Shiyang Ruan
                   ` (3 subsequent siblings)
  16 siblings, 2 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-08 14:36 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe

In fsdax mode, WRITE and ZERO on a shared extent need CoW performed.
After that, new allocated extents needs to be remapped to the file.
So, add a CoW identification in ->iomap_begin(), and implement
->iomap_end() to do the remapping work.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_file.c  |  7 ++-----
 fs/xfs/xfs_iomap.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_iomap.h |  3 +++
 3 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index af954a5b71f8..5a4508b23b51 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -669,7 +669,7 @@ xfs_file_dax_write(
 	pos = iocb->ki_pos;
 
 	trace_xfs_file_dax_write(iocb, from);
-	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		error = xfs_setfilesize(ip, pos, ret);
@@ -1285,10 +1285,7 @@ __xfs_filemap_fault(
 		pfn_t pfn;
 
 		xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
-				(write_fault && !vmf->cow_page) ?
-				 &xfs_direct_write_iomap_ops :
-				 &xfs_read_iomap_ops);
+		ret = xfs_dax_fault(vmf, pe_size, write_fault, &pfn);
 		if (ret & VM_FAULT_NEEDDSYNC)
 			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
 		xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 5a393259a3a3..e35842215d22 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -27,6 +27,7 @@
 #include "xfs_dquot_item.h"
 #include "xfs_dquot.h"
 #include "xfs_reflink.h"
+#include "linux/dax.h"
 
 #define XFS_ALLOC_ALIGN(mp, off) \
 	(((off) >> mp->m_allocsize_log) << mp->m_allocsize_log)
@@ -773,7 +774,8 @@ xfs_direct_write_iomap_begin(
 
 		/* may drop and re-acquire the ilock */
 		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
-				&lockmode, flags & IOMAP_DIRECT);
+				&lockmode,
+				(flags & IOMAP_DIRECT) || IS_DAX(inode));
 		if (error)
 			goto out_unlock;
 		if (shared)
@@ -867,6 +869,33 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
 	.iomap_begin		= xfs_direct_write_iomap_begin,
 };
 
+static int
+xfs_dax_write_iomap_end(
+	struct inode		*inode,
+	loff_t			pos,
+	loff_t			length,
+	ssize_t			written,
+	unsigned		flags,
+	struct iomap		*iomap)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+
+	if (!xfs_is_cow_inode(ip))
+		return 0;
+
+	if (!written) {
+		xfs_reflink_cancel_cow_range(ip, pos, length, true);
+		return 0;
+	}
+
+	return xfs_reflink_end_cow(ip, pos, written);
+}
+
+const struct iomap_ops xfs_dax_write_iomap_ops = {
+	.iomap_begin	= xfs_direct_write_iomap_begin,
+	.iomap_end	= xfs_dax_write_iomap_end,
+};
+
 static int
 xfs_buffered_write_iomap_begin(
 	struct inode		*inode,
@@ -1358,3 +1387,18 @@ xfs_truncate_page(
 	return iomap_truncate_page(inode, pos, did_zero,
 				   &xfs_buffered_write_iomap_ops);
 }
+
+#ifdef CONFIG_FS_DAX
+int
+xfs_dax_fault(
+	struct vm_fault		*vmf,
+	enum page_entry_size	pe_size,
+	bool			write_fault,
+	pfn_t			*pfn)
+{
+	return dax_iomap_fault(vmf, pe_size, pfn, NULL,
+			(write_fault && !vmf->cow_page) ?
+				&xfs_dax_write_iomap_ops :
+				&xfs_read_iomap_ops);
+}
+#endif
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index e88dc162c785..89dfa3bb099f 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -25,6 +25,8 @@ int xfs_bmbt_to_iomap(struct xfs_inode *ip, struct iomap *iomap,
 int xfs_zero_range(struct xfs_inode *ip, loff_t pos, loff_t len,
 		bool *did_zero);
 int xfs_truncate_page(struct xfs_inode *ip, loff_t pos, bool *did_zero);
+int xfs_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
+		bool write_fault, pfn_t *pfn);
 
 static inline xfs_filblks_t
 xfs_aligned_fsb_count(
@@ -51,5 +53,6 @@ extern const struct iomap_ops xfs_direct_write_iomap_ops;
 extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;
 extern const struct iomap_ops xfs_xattr_iomap_ops;
+extern const struct iomap_ops xfs_dax_write_iomap_ops;
 
 #endif /* __XFS_IOMAP_H__*/
-- 
2.35.1




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v11 07/07] xfs: Add dax dedupe support
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
                   ` (12 preceding siblings ...)
  2022-05-08 14:36 ` [PATCH v11 06/07] xfs: support CoW in fsdax mode Shiyang Ruan
@ 2022-05-08 14:36 ` Shiyang Ruan
  2022-05-10  9:32 ` [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Christoph Hellwig
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-08 14:36 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe, Christoph Hellwig

Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files
who are going to be deduped.  After that, call compare range function
only when files are both DAX or not.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c    |  2 +-
 fs/xfs/xfs_inode.c   | 69 +++++++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_inode.h   |  1 +
 fs/xfs/xfs_reflink.c |  4 +--
 4 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5a4508b23b51..cf78eb393258 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -807,7 +807,7 @@ xfs_wait_dax_page(
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
 }
 
-static int
+int
 xfs_break_dax_layouts(
 	struct inode		*inode,
 	bool			*retry)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b2879870a17e..96308065a2b3 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3767,6 +3767,50 @@ xfs_iolock_two_inodes_and_break_layout(
 	return 0;
 }
 
+static int
+xfs_mmaplock_two_inodes_and_break_dax_layout(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	int			error;
+	bool			retry;
+	struct page		*page;
+
+	if (ip1->i_ino > ip2->i_ino)
+		swap(ip1, ip2);
+
+again:
+	retry = false;
+	/* Lock the first inode */
+	xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
+	error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
+	if (error || retry) {
+		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+		if (error == 0 && retry)
+			goto again;
+		return error;
+	}
+
+	if (ip1 == ip2)
+		return 0;
+
+	/* Nested lock the second inode */
+	xfs_ilock(ip2, xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1));
+	/*
+	 * We cannot use xfs_break_dax_layouts() directly here because it may
+	 * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
+	 * for this nested lock case.
+	 */
+	page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
+	if (page && page_ref_count(page) != 1) {
+		xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
+		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+		goto again;
+	}
+
+	return 0;
+}
+
 /*
  * Lock two inodes so that userspace cannot initiate I/O via file syscalls or
  * mmap activity.
@@ -3781,8 +3825,19 @@ xfs_ilock2_io_mmap(
 	ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
 	if (ret)
 		return ret;
-	filemap_invalidate_lock_two(VFS_I(ip1)->i_mapping,
-				    VFS_I(ip2)->i_mapping);
+
+	if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2))) {
+		ret = xfs_mmaplock_two_inodes_and_break_dax_layout(ip1, ip2);
+		if (ret) {
+			inode_unlock(VFS_I(ip2));
+			if (ip1 != ip2)
+				inode_unlock(VFS_I(ip1));
+			return ret;
+		}
+	} else
+		filemap_invalidate_lock_two(VFS_I(ip1)->i_mapping,
+					    VFS_I(ip2)->i_mapping);
+
 	return 0;
 }
 
@@ -3792,8 +3847,14 @@ xfs_iunlock2_io_mmap(
 	struct xfs_inode	*ip1,
 	struct xfs_inode	*ip2)
 {
-	filemap_invalidate_unlock_two(VFS_I(ip1)->i_mapping,
-				      VFS_I(ip2)->i_mapping);
+	if (IS_DAX(VFS_I(ip1)) && IS_DAX(VFS_I(ip2))) {
+		xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
+		if (ip1 != ip2)
+			xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+	} else
+		filemap_invalidate_unlock_two(VFS_I(ip1)->i_mapping,
+					      VFS_I(ip2)->i_mapping);
+
 	inode_unlock(VFS_I(ip2));
 	if (ip1 != ip2)
 		inode_unlock(VFS_I(ip1));
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 7be6f8e705ab..8313cc83b6ee 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -467,6 +467,7 @@ xfs_itruncate_extents(
 }
 
 /* from xfs_file.c */
+int	xfs_break_dax_layouts(struct inode *inode, bool *retry);
 int	xfs_break_layouts(struct inode *inode, uint *iolock,
 		enum layout_break_reason reason);
 
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 10a9947e35d9..7cceea510a01 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1338,8 +1338,8 @@ xfs_reflink_remap_prep(
 	if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
 		goto out_unlock;
 
-	/* Don't share DAX file data for now. */
-	if (IS_DAX(inode_in) || IS_DAX(inode_out))
+	/* Don't share DAX file data with non-DAX file. */
+	if (IS_DAX(inode_in) != IS_DAX(inode_out))
 		goto out_unlock;
 
 	if (!IS_DAX(inode_in))
-- 
2.35.1




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v11 06/07] xfs: support CoW in fsdax mode
  2022-05-08 14:36 ` [PATCH v11 06/07] xfs: support CoW in fsdax mode Shiyang Ruan
@ 2022-05-10  5:45   ` Christoph Hellwig
  2022-05-10 10:06     ` Shiyang Ruan
  2022-05-11  2:25   ` [PATCH v11.1 " Shiyang Ruan
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2022-05-10  5:45 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel, djwong,
	dan.j.williams, david, hch, jane.chu, rgoldwyn, viro, willy,
	naoya.horiguchi, linmiaohe

> +#ifdef CONFIG_FS_DAX
> +int
> +xfs_dax_fault(
> +	struct vm_fault		*vmf,
> +	enum page_entry_size	pe_size,
> +	bool			write_fault,
> +	pfn_t			*pfn)
> +{
> +	return dax_iomap_fault(vmf, pe_size, pfn, NULL,
> +			(write_fault && !vmf->cow_page) ?
> +				&xfs_dax_write_iomap_ops :
> +				&xfs_read_iomap_ops);
> +}
> +#endif

Is there any reason this is in xfs_iomap.c and not xfs_file.c?

Otherwise the patch looks good:


Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
                   ` (13 preceding siblings ...)
  2022-05-08 14:36 ` [PATCH v11 07/07] xfs: Add dax dedupe support Shiyang Ruan
@ 2022-05-10  9:32 ` Christoph Hellwig
  2022-05-11  0:03 ` Darrick J. Wong
  2022-06-02 18:56 ` Andrew Morton
  16 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2022-05-10  9:32 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel, djwong,
	dan.j.williams, david, hch, jane.chu, rgoldwyn, viro, willy,
	naoya.horiguchi, linmiaohe

The patch numbering due looks odd due to the combination of the
two series.  But otherwise this looks good to me modulo the one
minor nitpick.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v11 06/07] xfs: support CoW in fsdax mode
  2022-05-10  5:45   ` Christoph Hellwig
@ 2022-05-10 10:06     ` Shiyang Ruan
  0 siblings, 0 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-10 10:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel, djwong,
	dan.j.williams, david, jane.chu, rgoldwyn, viro, willy,
	naoya.horiguchi, linmiaohe



在 2022/5/10 13:45, Christoph Hellwig 写道:
>> +#ifdef CONFIG_FS_DAX
>> +int
>> +xfs_dax_fault(
>> +	struct vm_fault		*vmf,
>> +	enum page_entry_size	pe_size,
>> +	bool			write_fault,
>> +	pfn_t			*pfn)
>> +{
>> +	return dax_iomap_fault(vmf, pe_size, pfn, NULL,
>> +			(write_fault && !vmf->cow_page) ?
>> +				&xfs_dax_write_iomap_ops :
>> +				&xfs_read_iomap_ops);
>> +}
>> +#endif
> 
> Is there any reason this is in xfs_iomap.c and not xfs_file.c?

Yes, It's better to put it in xfs_file.c since it's the only caller.  I 
didn't notice it...


--
Thanks,
Ruan.

> 
> Otherwise the patch looks good:
> 
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
                   ` (14 preceding siblings ...)
  2022-05-10  9:32 ` [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Christoph Hellwig
@ 2022-05-11  0:03 ` Darrick J. Wong
  2022-05-11  1:48   ` Dave Chinner
  2022-06-02 18:56 ` Andrew Morton
  16 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2022-05-11  0:03 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, jane.chu, rgoldwyn, viro, willy,
	naoya.horiguchi, linmiaohe

On Sun, May 08, 2022 at 10:36:06PM +0800, Shiyang Ruan wrote:
> This is a combination of two patchsets:
>  1.fsdax-rmap: https://lore.kernel.org/linux-xfs/20220419045045.1664996-1-ruansy.fnst@fujitsu.com/
>  2.fsdax-reflink: https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.fnst@fujitsu.com/
> 
>  Changes since v13 of fsdax-rmap:
>   1. Fixed mistakes during rebasing code to latest next-
>   2. Rebased to next-20220504
> 
>  Changes since v10 of fsdax-reflink:
>   1. Rebased to next-20220504 and fsdax-rmap
>   2. Dropped a needless cleanup patch: 'fsdax: Convert dax_iomap_zero to
>       iter model'
>   3. Fixed many conflicts during rebasing
>   4. Fixed a dedupe bug in Patch 05: the actuall length to compare could be
>       shorter than smap->length or dmap->length.
>   PS: There are many changes during rebasing.  I think it's better to
>       review again.
> 
> ==
> Shiyang Ruan (14):
>   fsdax-rmap:
>     dax: Introduce holder for dax_device
>     mm: factor helpers for memory_failure_dev_pagemap
>     pagemap,pmem: Introduce ->memory_failure()
>     fsdax: Introduce dax_lock_mapping_entry()
>     mm: Introduce mf_dax_kill_procs() for fsdax case

Hmm.  This patchset touches at least the dax, pagecache, and xfs
subsystems.  Assuming it's too late for 5.19, how should we stage this
for 5.20?

I could just add the entire series to iomap-5.20-merge and base the
xfs-5.20-merge off of that?  But I'm not sure what else might be landing
in the other subsystems, so I'm open to input.

--D

>     xfs: Implement ->notify_failure() for XFS
>     fsdax: set a CoW flag when associate reflink mappings
>   fsdax-reflink:
>     fsdax: Output address in dax_iomap_pfn() and rename it
>     fsdax: Introduce dax_iomap_cow_copy()
>     fsdax: Replace mmap entry in case of CoW
>     fsdax: Add dax_iomap_cow_copy() for dax zero
>     fsdax: Dedup file range to use a compare function
>     xfs: support CoW in fsdax mode
>     xfs: Add dax dedupe support
> 
>  drivers/dax/super.c         |  67 +++++-
>  drivers/md/dm.c             |   2 +-
>  drivers/nvdimm/pmem.c       |  17 ++
>  fs/dax.c                    | 398 ++++++++++++++++++++++++++++++------
>  fs/erofs/super.c            |  13 +-
>  fs/ext2/super.c             |   7 +-
>  fs/ext4/super.c             |   9 +-
>  fs/remap_range.c            |  31 ++-
>  fs/xfs/Makefile             |   5 +
>  fs/xfs/xfs_buf.c            |  10 +-
>  fs/xfs/xfs_file.c           |   9 +-
>  fs/xfs/xfs_fsops.c          |   3 +
>  fs/xfs/xfs_inode.c          |  69 ++++++-
>  fs/xfs/xfs_inode.h          |   1 +
>  fs/xfs/xfs_iomap.c          |  46 ++++-
>  fs/xfs/xfs_iomap.h          |   3 +
>  fs/xfs/xfs_mount.h          |   1 +
>  fs/xfs/xfs_notify_failure.c | 220 ++++++++++++++++++++
>  fs/xfs/xfs_reflink.c        |  12 +-
>  fs/xfs/xfs_super.h          |   1 +
>  include/linux/dax.h         |  56 ++++-
>  include/linux/fs.h          |  12 +-
>  include/linux/memremap.h    |  12 ++
>  include/linux/mm.h          |   2 +
>  include/linux/page-flags.h  |   6 +
>  mm/memory-failure.c         | 257 ++++++++++++++++-------
>  26 files changed, 1087 insertions(+), 182 deletions(-)
>  create mode 100644 fs/xfs/xfs_notify_failure.c
> 
> -- 
> 2.35.1
> 
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-05-11  0:03 ` Darrick J. Wong
@ 2022-05-11  1:48   ` Dave Chinner
  2022-05-11  1:55     ` Dan Williams
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2022-05-11  1:48 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Shiyang Ruan, linux-kernel, linux-xfs, nvdimm, linux-mm,
	linux-fsdevel, dan.j.williams, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe

On Tue, May 10, 2022 at 05:03:52PM -0700, Darrick J. Wong wrote:
> On Sun, May 08, 2022 at 10:36:06PM +0800, Shiyang Ruan wrote:
> > This is a combination of two patchsets:
> >  1.fsdax-rmap: https://lore.kernel.org/linux-xfs/20220419045045.1664996-1-ruansy.fnst@fujitsu.com/
> >  2.fsdax-reflink: https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.fnst@fujitsu.com/
> > 
> >  Changes since v13 of fsdax-rmap:
> >   1. Fixed mistakes during rebasing code to latest next-
> >   2. Rebased to next-20220504
> > 
> >  Changes since v10 of fsdax-reflink:
> >   1. Rebased to next-20220504 and fsdax-rmap
> >   2. Dropped a needless cleanup patch: 'fsdax: Convert dax_iomap_zero to
> >       iter model'
> >   3. Fixed many conflicts during rebasing
> >   4. Fixed a dedupe bug in Patch 05: the actuall length to compare could be
> >       shorter than smap->length or dmap->length.
> >   PS: There are many changes during rebasing.  I think it's better to
> >       review again.
> > 
> > ==
> > Shiyang Ruan (14):
> >   fsdax-rmap:
> >     dax: Introduce holder for dax_device
> >     mm: factor helpers for memory_failure_dev_pagemap
> >     pagemap,pmem: Introduce ->memory_failure()
> >     fsdax: Introduce dax_lock_mapping_entry()
> >     mm: Introduce mf_dax_kill_procs() for fsdax case
> 
> Hmm.  This patchset touches at least the dax, pagecache, and xfs
> subsystems.  Assuming it's too late for 5.19, how should we stage this
> for 5.20?

Yeah, it's past my "last date for this merge cycle" which was
-rc6. I expected stuff might slip a little - as it has with the LARP
code - but I don't have the time and bandwidth to start working
on merging another feature from scratch before the merge window
comes around.

Getting the dax+reflink stuff in this cycle was always an optimistic
stretch, but I wanted to try so that there was no doubt it would be
ready for merge in the next cycle...

> I could just add the entire series to iomap-5.20-merge and base the
> xfs-5.20-merge off of that?  But I'm not sure what else might be landing
> in the other subsystems, so I'm open to input.

It'll need to be a stable branch somewhere, but I don't think it
really matters where al long as it's merged into the xfs for-next
tree so it gets filesystem test coverage...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-05-11  1:48   ` Dave Chinner
@ 2022-05-11  1:55     ` Dan Williams
  2022-05-11  2:19       ` Dave Chinner
  2022-05-11  2:28       ` Andrew Morton
  0 siblings, 2 replies; 38+ messages in thread
From: Dan Williams @ 2022-05-11  1:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Shiyang Ruan, Linux Kernel Mailing List,
	linux-xfs, Linux NVDIMM, Linux MM, linux-fsdevel,
	Christoph Hellwig, Jane Chu, Goldwyn Rodrigues, Al Viro,
	Matthew Wilcox, Naoya Horiguchi, linmiaohe, Andrew Morton

[ add Andrew ]


On Tue, May 10, 2022 at 6:49 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, May 10, 2022 at 05:03:52PM -0700, Darrick J. Wong wrote:
> > On Sun, May 08, 2022 at 10:36:06PM +0800, Shiyang Ruan wrote:
> > > This is a combination of two patchsets:
> > >  1.fsdax-rmap: https://lore.kernel.org/linux-xfs/20220419045045.1664996-1-ruansy.fnst@fujitsu.com/
> > >  2.fsdax-reflink: https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.fnst@fujitsu.com/
> > >
> > >  Changes since v13 of fsdax-rmap:
> > >   1. Fixed mistakes during rebasing code to latest next-
> > >   2. Rebased to next-20220504
> > >
> > >  Changes since v10 of fsdax-reflink:
> > >   1. Rebased to next-20220504 and fsdax-rmap
> > >   2. Dropped a needless cleanup patch: 'fsdax: Convert dax_iomap_zero to
> > >       iter model'
> > >   3. Fixed many conflicts during rebasing
> > >   4. Fixed a dedupe bug in Patch 05: the actuall length to compare could be
> > >       shorter than smap->length or dmap->length.
> > >   PS: There are many changes during rebasing.  I think it's better to
> > >       review again.
> > >
> > > ==
> > > Shiyang Ruan (14):
> > >   fsdax-rmap:
> > >     dax: Introduce holder for dax_device
> > >     mm: factor helpers for memory_failure_dev_pagemap
> > >     pagemap,pmem: Introduce ->memory_failure()
> > >     fsdax: Introduce dax_lock_mapping_entry()
> > >     mm: Introduce mf_dax_kill_procs() for fsdax case
> >
> > Hmm.  This patchset touches at least the dax, pagecache, and xfs
> > subsystems.  Assuming it's too late for 5.19, how should we stage this
> > for 5.20?
>
> Yeah, it's past my "last date for this merge cycle" which was
> -rc6. I expected stuff might slip a little - as it has with the LARP
> code - but I don't have the time and bandwidth to start working
> on merging another feature from scratch before the merge window
> comes around.
>
> Getting the dax+reflink stuff in this cycle was always an optimistic
> stretch, but I wanted to try so that there was no doubt it would be
> ready for merge in the next cycle...
>
> > I could just add the entire series to iomap-5.20-merge and base the
> > xfs-5.20-merge off of that?  But I'm not sure what else might be landing
> > in the other subsystems, so I'm open to input.
>
> It'll need to be a stable branch somewhere, but I don't think it
> really matters where al long as it's merged into the xfs for-next
> tree so it gets filesystem test coverage...

So how about let the notify_failure() bits go through -mm this cycle,
if Andrew will have it, and then the reflnk work has a clean v5.19-rc1
baseline to build from?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-05-11  1:55     ` Dan Williams
@ 2022-05-11  2:19       ` Dave Chinner
  2022-05-11  2:28       ` Andrew Morton
  1 sibling, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2022-05-11  2:19 UTC (permalink / raw)
  To: Dan Williams
  Cc: Darrick J. Wong, Shiyang Ruan, Linux Kernel Mailing List,
	linux-xfs, Linux NVDIMM, Linux MM, linux-fsdevel,
	Christoph Hellwig, Jane Chu, Goldwyn Rodrigues, Al Viro,
	Matthew Wilcox, Naoya Horiguchi, linmiaohe, Andrew Morton

On Tue, May 10, 2022 at 06:55:50PM -0700, Dan Williams wrote:
> [ add Andrew ]
> 
> 
> On Tue, May 10, 2022 at 6:49 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, May 10, 2022 at 05:03:52PM -0700, Darrick J. Wong wrote:
> > > On Sun, May 08, 2022 at 10:36:06PM +0800, Shiyang Ruan wrote:
> > > > This is a combination of two patchsets:
> > > >  1.fsdax-rmap: https://lore.kernel.org/linux-xfs/20220419045045.1664996-1-ruansy.fnst@fujitsu.com/
> > > >  2.fsdax-reflink: https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.fnst@fujitsu.com/
> > > >
> > > >  Changes since v13 of fsdax-rmap:
> > > >   1. Fixed mistakes during rebasing code to latest next-
> > > >   2. Rebased to next-20220504
> > > >
> > > >  Changes since v10 of fsdax-reflink:
> > > >   1. Rebased to next-20220504 and fsdax-rmap
> > > >   2. Dropped a needless cleanup patch: 'fsdax: Convert dax_iomap_zero to
> > > >       iter model'
> > > >   3. Fixed many conflicts during rebasing
> > > >   4. Fixed a dedupe bug in Patch 05: the actuall length to compare could be
> > > >       shorter than smap->length or dmap->length.
> > > >   PS: There are many changes during rebasing.  I think it's better to
> > > >       review again.
> > > >
> > > > ==
> > > > Shiyang Ruan (14):
> > > >   fsdax-rmap:
> > > >     dax: Introduce holder for dax_device
> > > >     mm: factor helpers for memory_failure_dev_pagemap
> > > >     pagemap,pmem: Introduce ->memory_failure()
> > > >     fsdax: Introduce dax_lock_mapping_entry()
> > > >     mm: Introduce mf_dax_kill_procs() for fsdax case
> > >
> > > Hmm.  This patchset touches at least the dax, pagecache, and xfs
> > > subsystems.  Assuming it's too late for 5.19, how should we stage this
> > > for 5.20?
> >
> > Yeah, it's past my "last date for this merge cycle" which was
> > -rc6. I expected stuff might slip a little - as it has with the LARP
> > code - but I don't have the time and bandwidth to start working
> > on merging another feature from scratch before the merge window
> > comes around.
> >
> > Getting the dax+reflink stuff in this cycle was always an optimistic
> > stretch, but I wanted to try so that there was no doubt it would be
> > ready for merge in the next cycle...
> >
> > > I could just add the entire series to iomap-5.20-merge and base the
> > > xfs-5.20-merge off of that?  But I'm not sure what else might be landing
> > > in the other subsystems, so I'm open to input.
> >
> > It'll need to be a stable branch somewhere, but I don't think it
> > really matters where al long as it's merged into the xfs for-next
> > tree so it gets filesystem test coverage...
> 
> So how about let the notify_failure() bits go through -mm this cycle,
> if Andrew will have it, and then the reflnk work has a clean v5.19-rc1
> baseline to build from?

Sure, if you want to push them that way I'm not going to complain
or stop you. :)

Anything that makes the eventual XFS feature merge simpler counts as
a win in my books.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v11.1 06/07] xfs: support CoW in fsdax mode
  2022-05-08 14:36 ` [PATCH v11 06/07] xfs: support CoW in fsdax mode Shiyang Ruan
  2022-05-10  5:45   ` Christoph Hellwig
@ 2022-05-11  2:25   ` Shiyang Ruan
  1 sibling, 0 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-11  2:25 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu, rgoldwyn, viro,
	willy, naoya.horiguchi, linmiaohe, Christoph Hellwig

In fsdax mode, WRITE and ZERO on a shared extent need CoW performed.
After that, new allocated extents needs to be remapped to the file.
So, add a CoW identification in ->iomap_begin(), and implement
->iomap_end() to do the remapping work.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c  | 33 ++++++++++++++++++++++++++++-----
 fs/xfs/xfs_iomap.c | 30 +++++++++++++++++++++++++++++-
 fs/xfs/xfs_iomap.h |  1 +
 3 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index af954a5b71f8..fe9f92586acf 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -25,6 +25,7 @@
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
 
+#include <linux/dax.h>
 #include <linux/falloc.h>
 #include <linux/backing-dev.h>
 #include <linux/mman.h>
@@ -669,7 +670,7 @@ xfs_file_dax_write(
 	pos = iocb->ki_pos;
 
 	trace_xfs_file_dax_write(iocb, from);
-	ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
+	ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
 	if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
 		i_size_write(inode, iocb->ki_pos);
 		error = xfs_setfilesize(ip, pos, ret);
@@ -1254,6 +1255,31 @@ xfs_file_llseek(
 	return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
 }
 
+#ifdef CONFIG_FS_DAX
+int
+xfs_dax_fault(
+	struct vm_fault		*vmf,
+	enum page_entry_size	pe_size,
+	bool			write_fault,
+	pfn_t			*pfn)
+{
+	return dax_iomap_fault(vmf, pe_size, pfn, NULL,
+			(write_fault && !vmf->cow_page) ?
+				&xfs_dax_write_iomap_ops :
+				&xfs_read_iomap_ops);
+}
+#else
+int
+xfs_dax_fault(
+	struct vm_fault		*vmf,
+	enum page_entry_size	pe_size,
+	bool			write_fault,
+	pfn_t			*pfn)
+{
+	return 0;
+}
+#endif
+
 /*
  * Locking for serialisation of IO during page faults. This results in a lock
  * ordering of:
@@ -1285,10 +1311,7 @@ __xfs_filemap_fault(
 		pfn_t pfn;
 
 		xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
-				(write_fault && !vmf->cow_page) ?
-				 &xfs_direct_write_iomap_ops :
-				 &xfs_read_iomap_ops);
+		ret = xfs_dax_fault(vmf, pe_size, write_fault, &pfn);
 		if (ret & VM_FAULT_NEEDDSYNC)
 			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
 		xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 5a393259a3a3..4c07f5e718fb 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -773,7 +773,8 @@ xfs_direct_write_iomap_begin(
 
 		/* may drop and re-acquire the ilock */
 		error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
-				&lockmode, flags & IOMAP_DIRECT);
+				&lockmode,
+				(flags & IOMAP_DIRECT) || IS_DAX(inode));
 		if (error)
 			goto out_unlock;
 		if (shared)
@@ -867,6 +868,33 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
 	.iomap_begin		= xfs_direct_write_iomap_begin,
 };
 
+static int
+xfs_dax_write_iomap_end(
+	struct inode		*inode,
+	loff_t			pos,
+	loff_t			length,
+	ssize_t			written,
+	unsigned		flags,
+	struct iomap		*iomap)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+
+	if (!xfs_is_cow_inode(ip))
+		return 0;
+
+	if (!written) {
+		xfs_reflink_cancel_cow_range(ip, pos, length, true);
+		return 0;
+	}
+
+	return xfs_reflink_end_cow(ip, pos, written);
+}
+
+const struct iomap_ops xfs_dax_write_iomap_ops = {
+	.iomap_begin	= xfs_direct_write_iomap_begin,
+	.iomap_end	= xfs_dax_write_iomap_end,
+};
+
 static int
 xfs_buffered_write_iomap_begin(
 	struct inode		*inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index e88dc162c785..c782e8c0479c 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -51,5 +51,6 @@ extern const struct iomap_ops xfs_direct_write_iomap_ops;
 extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;
 extern const struct iomap_ops xfs_xattr_iomap_ops;
+extern const struct iomap_ops xfs_dax_write_iomap_ops;
 
 #endif /* __XFS_IOMAP_H__*/
-- 
2.35.1




^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-05-11  1:55     ` Dan Williams
  2022-05-11  2:19       ` Dave Chinner
@ 2022-05-11  2:28       ` Andrew Morton
  2022-05-11  2:43         ` Darrick J. Wong
  2022-05-11  4:20         ` Dan Williams
  1 sibling, 2 replies; 38+ messages in thread
From: Andrew Morton @ 2022-05-11  2:28 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dave Chinner, Darrick J. Wong, Shiyang Ruan,
	Linux Kernel Mailing List, linux-xfs, Linux NVDIMM, Linux MM,
	linux-fsdevel, Christoph Hellwig, Jane Chu, Goldwyn Rodrigues,
	Al Viro, Matthew Wilcox, Naoya Horiguchi, linmiaohe

On Tue, 10 May 2022 18:55:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:

> > It'll need to be a stable branch somewhere, but I don't think it
> > really matters where al long as it's merged into the xfs for-next
> > tree so it gets filesystem test coverage...
> 
> So how about let the notify_failure() bits go through -mm this cycle,
> if Andrew will have it, and then the reflnk work has a clean v5.19-rc1
> baseline to build from?

What are we referring to here?  I think a minimal thing would be the
memremap.h and memory-failure.c changes from
https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.fnst@fujitsu.com ?

Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
would probably be straining things to slip it into 5.19.

The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
right thing, but it's a networking errno.  I suppose livable with if it
never escapes the kernel, but if it can get back to userspace then a
user would be justified in wondering how the heck a filesystem
operation generated a networking errno?


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-05-11  2:28       ` Andrew Morton
@ 2022-05-11  2:43         ` Darrick J. Wong
  2022-05-11  5:24           ` Andrew Morton
  2022-05-11  4:20         ` Dan Williams
  1 sibling, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2022-05-11  2:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dan Williams, Dave Chinner, Shiyang Ruan,
	Linux Kernel Mailing List, linux-xfs, Linux NVDIMM, Linux MM,
	linux-fsdevel, Christoph Hellwig, Jane Chu, Goldwyn Rodrigues,
	Al Viro, Matthew Wilcox, Naoya Horiguchi, linmiaohe

On Tue, May 10, 2022 at 07:28:53PM -0700, Andrew Morton wrote:
> On Tue, 10 May 2022 18:55:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> 
> > > It'll need to be a stable branch somewhere, but I don't think it
> > > really matters where al long as it's merged into the xfs for-next
> > > tree so it gets filesystem test coverage...
> > 
> > So how about let the notify_failure() bits go through -mm this cycle,
> > if Andrew will have it, and then the reflnk work has a clean v5.19-rc1
> > baseline to build from?
> 
> What are we referring to here?  I think a minimal thing would be the
> memremap.h and memory-failure.c changes from
> https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.fnst@fujitsu.com ?
> 
> Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
> would probably be straining things to slip it into 5.19.
> 
> The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
> right thing, but it's a networking errno.  I suppose livable with if it
> never escapes the kernel, but if it can get back to userspace then a
> user would be justified in wondering how the heck a filesystem
> operation generated a networking errno?

<shrug> most filesystems return EOPNOTSUPP rather enthusiastically when
they don't know how to do something...

--D

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-05-11  2:28       ` Andrew Morton
  2022-05-11  2:43         ` Darrick J. Wong
@ 2022-05-11  4:20         ` Dan Williams
  2022-05-11  4:34           ` Darrick J. Wong
  1 sibling, 1 reply; 38+ messages in thread
From: Dan Williams @ 2022-05-11  4:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Darrick J. Wong, Shiyang Ruan,
	Linux Kernel Mailing List, linux-xfs, Linux NVDIMM, Linux MM,
	linux-fsdevel, Christoph Hellwig, Jane Chu, Goldwyn Rodrigues,
	Al Viro, Matthew Wilcox, Naoya Horiguchi, linmiaohe

On Tue, May 10, 2022 at 7:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 10 May 2022 18:55:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
>
> > > It'll need to be a stable branch somewhere, but I don't think it
> > > really matters where al long as it's merged into the xfs for-next
> > > tree so it gets filesystem test coverage...
> >
> > So how about let the notify_failure() bits go through -mm this cycle,
> > if Andrew will have it, and then the reflnk work has a clean v5.19-rc1
> > baseline to build from?
>
> What are we referring to here?  I think a minimal thing would be the
> memremap.h and memory-failure.c changes from
> https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.fnst@fujitsu.com ?

Latest is here:
https://lore.kernel.org/all/20220508143620.1775214-1-ruansy.fnst@fujitsu.com/

> Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
> would probably be straining things to slip it into 5.19.

Hmm, if it's straining things and XFS will also target v5.20 I think
the best course for all involved is just wait. Let some of the current
conflicts in -mm land in v5.19 and then I can merge the DAX baseline
and publish a stable branch for XFS and BTRFS to build upon for v5.20.

> The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
> right thing, but it's a networking errno.  I suppose livable with if it
> never escapes the kernel, but if it can get back to userspace then a
> user would be justified in wondering how the heck a filesystem
> operation generated a networking errno?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-05-11  4:20         ` Dan Williams
@ 2022-05-11  4:34           ` Darrick J. Wong
  0 siblings, 0 replies; 38+ messages in thread
From: Darrick J. Wong @ 2022-05-11  4:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Dave Chinner, Shiyang Ruan,
	Linux Kernel Mailing List, linux-xfs, Linux NVDIMM, Linux MM,
	linux-fsdevel, Christoph Hellwig, Jane Chu, Goldwyn Rodrigues,
	Al Viro, Matthew Wilcox, Naoya Horiguchi, linmiaohe

On Tue, May 10, 2022 at 09:20:57PM -0700, Dan Williams wrote:
> On Tue, May 10, 2022 at 7:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Tue, 10 May 2022 18:55:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > > > It'll need to be a stable branch somewhere, but I don't think it
> > > > really matters where al long as it's merged into the xfs for-next
> > > > tree so it gets filesystem test coverage...
> > >
> > > So how about let the notify_failure() bits go through -mm this cycle,
> > > if Andrew will have it, and then the reflnk work has a clean v5.19-rc1
> > > baseline to build from?
> >
> > What are we referring to here?  I think a minimal thing would be the
> > memremap.h and memory-failure.c changes from
> > https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.fnst@fujitsu.com ?
> 
> Latest is here:
> https://lore.kernel.org/all/20220508143620.1775214-1-ruansy.fnst@fujitsu.com/
> 
> > Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
> > would probably be straining things to slip it into 5.19.
> 
> Hmm, if it's straining things and XFS will also target v5.20 I think
> the best course for all involved is just wait. Let some of the current
> conflicts in -mm land in v5.19 and then I can merge the DAX baseline
> and publish a stable branch for XFS and BTRFS to build upon for v5.20.

Sounds good to /me...

--D

> > The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
> > right thing, but it's a networking errno.  I suppose livable with if it
> > never escapes the kernel, but if it can get back to userspace then a
> > user would be justified in wondering how the heck a filesystem
> > operation generated a networking errno?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-05-11  2:43         ` Darrick J. Wong
@ 2022-05-11  5:24           ` Andrew Morton
  2022-05-11  6:21             ` Dave Chinner
  2022-05-11 15:19             ` Darrick J. Wong
  0 siblings, 2 replies; 38+ messages in thread
From: Andrew Morton @ 2022-05-11  5:24 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dan Williams, Dave Chinner, Shiyang Ruan,
	Linux Kernel Mailing List, linux-xfs, Linux NVDIMM, Linux MM,
	linux-fsdevel, Christoph Hellwig, Jane Chu, Goldwyn Rodrigues,
	Al Viro, Matthew Wilcox, Naoya Horiguchi, linmiaohe

On Tue, 10 May 2022 19:43:01 -0700 "Darrick J. Wong" <djwong@kernel.org> wrote:

> On Tue, May 10, 2022 at 07:28:53PM -0700, Andrew Morton wrote:
> > On Tue, 10 May 2022 18:55:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> > 
> > > > It'll need to be a stable branch somewhere, but I don't think it
> > > > really matters where al long as it's merged into the xfs for-next
> > > > tree so it gets filesystem test coverage...
> > > 
> > > So how about let the notify_failure() bits go through -mm this cycle,
> > > if Andrew will have it, and then the reflnk work has a clean v5.19-rc1
> > > baseline to build from?
> > 
> > What are we referring to here?  I think a minimal thing would be the
> > memremap.h and memory-failure.c changes from
> > https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.fnst@fujitsu.com ?
> > 
> > Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
> > would probably be straining things to slip it into 5.19.
> > 
> > The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
> > right thing, but it's a networking errno.  I suppose livable with if it
> > never escapes the kernel, but if it can get back to userspace then a
> > user would be justified in wondering how the heck a filesystem
> > operation generated a networking errno?
> 
> <shrug> most filesystems return EOPNOTSUPP rather enthusiastically when
> they don't know how to do something...

Can it propagate back to userspace?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-05-11  5:24           ` Andrew Morton
@ 2022-05-11  6:21             ` Dave Chinner
  2022-05-11 15:19             ` Darrick J. Wong
  1 sibling, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2022-05-11  6:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Darrick J. Wong, Dan Williams, Shiyang Ruan,
	Linux Kernel Mailing List, linux-xfs, Linux NVDIMM, Linux MM,
	linux-fsdevel, Christoph Hellwig, Jane Chu, Goldwyn Rodrigues,
	Al Viro, Matthew Wilcox, Naoya Horiguchi, linmiaohe

On Tue, May 10, 2022 at 10:24:28PM -0700, Andrew Morton wrote:
> On Tue, 10 May 2022 19:43:01 -0700 "Darrick J. Wong" <djwong@kernel.org> wrote:
> 
> > On Tue, May 10, 2022 at 07:28:53PM -0700, Andrew Morton wrote:
> > > On Tue, 10 May 2022 18:55:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> > > 
> > > > > It'll need to be a stable branch somewhere, but I don't think it
> > > > > really matters where al long as it's merged into the xfs for-next
> > > > > tree so it gets filesystem test coverage...
> > > > 
> > > > So how about let the notify_failure() bits go through -mm this cycle,
> > > > if Andrew will have it, and then the reflnk work has a clean v5.19-rc1
> > > > baseline to build from?
> > > 
> > > What are we referring to here?  I think a minimal thing would be the
> > > memremap.h and memory-failure.c changes from
> > > https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.fnst@fujitsu.com ?
> > > 
> > > Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
> > > would probably be straining things to slip it into 5.19.
> > > 
> > > The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
> > > right thing, but it's a networking errno.  I suppose livable with if it
> > > never escapes the kernel, but if it can get back to userspace then a
> > > user would be justified in wondering how the heck a filesystem
> > > operation generated a networking errno?
> > 
> > <shrug> most filesystems return EOPNOTSUPP rather enthusiastically when
> > they don't know how to do something...
> 
> Can it propagate back to userspace?

Maybe not this one, but the point Darrick is making is that we
really don't care if it does because we've been propagating it to
userspace in documented filesystem APIs for at least 15 years now.

e.g:

$ man 2 fallocate
.....
Errors
.....
       EOPNOTSUPP
	     The filesystem containing the file referred to by fd
	     does not support this operation; or the mode is not
	     supported by the filesystem containing the file
	     referred to by fd.
.....

Other random examples:

pwritev2(RWF_NOWAIT) can return -EOPNOTSUPP on buffered writes.
Documented in the man page.

FICLONERANGE on an filesystem that doesn't support reflink will
return -EOPNOTSUPP. Documented in the man page.

mmap(MAP_SYNC) returns -EOPNOTSUPP if the underlying filesystem
and/or storage doesn't support DAX. Documented in the man page.

I could go on, but I think I've made the point already...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-05-11  5:24           ` Andrew Morton
  2022-05-11  6:21             ` Dave Chinner
@ 2022-05-11 15:19             ` Darrick J. Wong
  2022-05-11 15:46               ` Dan Williams
  1 sibling, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2022-05-11 15:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dan Williams, Dave Chinner, Shiyang Ruan,
	Linux Kernel Mailing List, linux-xfs, Linux NVDIMM, Linux MM,
	linux-fsdevel, Christoph Hellwig, Jane Chu, Goldwyn Rodrigues,
	Al Viro, Matthew Wilcox, Naoya Horiguchi, linmiaohe

Oan Tue, May 10, 2022 at 10:24:28PM -0700, Andrew Morton wrote:
> On Tue, 10 May 2022 19:43:01 -0700 "Darrick J. Wong" <djwong@kernel.org> wrote:
> 
> > On Tue, May 10, 2022 at 07:28:53PM -0700, Andrew Morton wrote:
> > > On Tue, 10 May 2022 18:55:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> > > 
> > > > > It'll need to be a stable branch somewhere, but I don't think it
> > > > > really matters where al long as it's merged into the xfs for-next
> > > > > tree so it gets filesystem test coverage...
> > > > 
> > > > So how about let the notify_failure() bits go through -mm this cycle,
> > > > if Andrew will have it, and then the reflnk work has a clean v5.19-rc1
> > > > baseline to build from?
> > > 
> > > What are we referring to here?  I think a minimal thing would be the
> > > memremap.h and memory-failure.c changes from
> > > https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.fnst@fujitsu.com ?
> > > 
> > > Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
> > > would probably be straining things to slip it into 5.19.
> > > 
> > > The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
> > > right thing, but it's a networking errno.  I suppose livable with if it
> > > never escapes the kernel, but if it can get back to userspace then a
> > > user would be justified in wondering how the heck a filesystem
> > > operation generated a networking errno?
> > 
> > <shrug> most filesystems return EOPNOTSUPP rather enthusiastically when
> > they don't know how to do something...
> 
> Can it propagate back to userspace?

AFAICT, the new code falls back to the current (mf_generic_kill_procs)
failure code if the filesystem doesn't provide a ->memory_failure
function or if it returns -EOPNOSUPP.  mf_generic_kill_procs can also
return -EOPNOTSUPP, but all the memory_failure() callers (madvise, etc.)
convert that to 0 before returning it to userspace.

I suppose the weirder question is going to be what happens when madvise
starts returning filesystem errors like EIO or EFSCORRUPTED when pmem
loses half its brains and even the fs can't deal with it.

--D

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-05-11 15:19             ` Darrick J. Wong
@ 2022-05-11 15:46               ` Dan Williams
  2022-05-12 12:27                 ` Shiyang Ruan
  0 siblings, 1 reply; 38+ messages in thread
From: Dan Williams @ 2022-05-11 15:46 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Andrew Morton, Dave Chinner, Shiyang Ruan,
	Linux Kernel Mailing List, linux-xfs, Linux NVDIMM, Linux MM,
	linux-fsdevel, Christoph Hellwig, Jane Chu, Goldwyn Rodrigues,
	Al Viro, Matthew Wilcox, Naoya Horiguchi, linmiaohe

On Wed, May 11, 2022 at 8:21 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> Oan Tue, May 10, 2022 at 10:24:28PM -0700, Andrew Morton wrote:
> > On Tue, 10 May 2022 19:43:01 -0700 "Darrick J. Wong" <djwong@kernel.org> wrote:
> >
> > > On Tue, May 10, 2022 at 07:28:53PM -0700, Andrew Morton wrote:
> > > > On Tue, 10 May 2022 18:55:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
> > > >
> > > > > > It'll need to be a stable branch somewhere, but I don't think it
> > > > > > really matters where al long as it's merged into the xfs for-next
> > > > > > tree so it gets filesystem test coverage...
> > > > >
> > > > > So how about let the notify_failure() bits go through -mm this cycle,
> > > > > if Andrew will have it, and then the reflnk work has a clean v5.19-rc1
> > > > > baseline to build from?
> > > >
> > > > What are we referring to here?  I think a minimal thing would be the
> > > > memremap.h and memory-failure.c changes from
> > > > https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.fnst@fujitsu.com ?
> > > >
> > > > Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
> > > > would probably be straining things to slip it into 5.19.
> > > >
> > > > The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
> > > > right thing, but it's a networking errno.  I suppose livable with if it
> > > > never escapes the kernel, but if it can get back to userspace then a
> > > > user would be justified in wondering how the heck a filesystem
> > > > operation generated a networking errno?
> > >
> > > <shrug> most filesystems return EOPNOTSUPP rather enthusiastically when
> > > they don't know how to do something...
> >
> > Can it propagate back to userspace?
>
> AFAICT, the new code falls back to the current (mf_generic_kill_procs)
> failure code if the filesystem doesn't provide a ->memory_failure
> function or if it returns -EOPNOSUPP.  mf_generic_kill_procs can also
> return -EOPNOTSUPP, but all the memory_failure() callers (madvise, etc.)
> convert that to 0 before returning it to userspace.
>
> I suppose the weirder question is going to be what happens when madvise
> starts returning filesystem errors like EIO or EFSCORRUPTED when pmem
> loses half its brains and even the fs can't deal with it.

Even then that notification is not in a system call context so it
would still result in a SIGBUS notification not a EOPNOTSUPP return
code. The only potential gap I see are what are the possible error
codes that MADV_SOFT_OFFLINE might see? The man page is silent on soft
offline failure codes. Shiyang, that's something to check / update if
necessary.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-05-11 15:46               ` Dan Williams
@ 2022-05-12 12:27                 ` Shiyang Ruan
  2022-06-02  9:42                   ` Shiyang Ruan
  0 siblings, 1 reply; 38+ messages in thread
From: Shiyang Ruan @ 2022-05-12 12:27 UTC (permalink / raw)
  To: Dan Williams, Darrick J. Wong
  Cc: Andrew Morton, Dave Chinner, Linux Kernel Mailing List,
	linux-xfs, Linux NVDIMM, Linux MM, linux-fsdevel,
	Christoph Hellwig, Jane Chu, Goldwyn Rodrigues, Al Viro,
	Matthew Wilcox, Naoya Horiguchi, linmiaohe



在 2022/5/11 23:46, Dan Williams 写道:
> On Wed, May 11, 2022 at 8:21 AM Darrick J. Wong <djwong@kernel.org> wrote:
>>
>> Oan Tue, May 10, 2022 at 10:24:28PM -0700, Andrew Morton wrote:
>>> On Tue, 10 May 2022 19:43:01 -0700 "Darrick J. Wong" <djwong@kernel.org> wrote:
>>>
>>>> On Tue, May 10, 2022 at 07:28:53PM -0700, Andrew Morton wrote:
>>>>> On Tue, 10 May 2022 18:55:50 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
>>>>>
>>>>>>> It'll need to be a stable branch somewhere, but I don't think it
>>>>>>> really matters where al long as it's merged into the xfs for-next
>>>>>>> tree so it gets filesystem test coverage...
>>>>>>
>>>>>> So how about let the notify_failure() bits go through -mm this cycle,
>>>>>> if Andrew will have it, and then the reflnk work has a clean v5.19-rc1
>>>>>> baseline to build from?
>>>>>
>>>>> What are we referring to here?  I think a minimal thing would be the
>>>>> memremap.h and memory-failure.c changes from
>>>>> https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.fnst@fujitsu.com ?
>>>>>
>>>>> Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
>>>>> would probably be straining things to slip it into 5.19.
>>>>>
>>>>> The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
>>>>> right thing, but it's a networking errno.  I suppose livable with if it
>>>>> never escapes the kernel, but if it can get back to userspace then a
>>>>> user would be justified in wondering how the heck a filesystem
>>>>> operation generated a networking errno?
>>>>
>>>> <shrug> most filesystems return EOPNOTSUPP rather enthusiastically when
>>>> they don't know how to do something...
>>>
>>> Can it propagate back to userspace?
>>
>> AFAICT, the new code falls back to the current (mf_generic_kill_procs)
>> failure code if the filesystem doesn't provide a ->memory_failure
>> function or if it returns -EOPNOSUPP.  mf_generic_kill_procs can also
>> return -EOPNOTSUPP, but all the memory_failure() callers (madvise, etc.)
>> convert that to 0 before returning it to userspace.
>>
>> I suppose the weirder question is going to be what happens when madvise
>> starts returning filesystem errors like EIO or EFSCORRUPTED when pmem
>> loses half its brains and even the fs can't deal with it.
> 
> Even then that notification is not in a system call context so it
> would still result in a SIGBUS notification not a EOPNOTSUPP return
> code. The only potential gap I see are what are the possible error
> codes that MADV_SOFT_OFFLINE might see? The man page is silent on soft
> offline failure codes. Shiyang, that's something to check / update if
> necessary.

According to the code around MADV_SOFT_OFFLINE, it will return -EIO when 
the backend is NVDIMM.

Here is the logic:
  madvise_inject_error() {
      ...
      if (MADV_SOFT_OFFLINE) {
          ret = soft_offline_page() {
              ...
              /* Only online pages can be soft-offlined (esp., not 
ZONE_DEVICE). */
              page = pfn_to_online_page(pfn);
              if (!page) {
                  put_ref_page(ref_page);
                  return -EIO;
              }
              ...
          }
      } else {
          ret = memory_failure()
      }
      return ret
  }


--
Thanks,
Ruan.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-05-12 12:27                 ` Shiyang Ruan
@ 2022-06-02  9:42                   ` Shiyang Ruan
  2022-06-02 17:18                     ` Darrick J. Wong
  0 siblings, 1 reply; 38+ messages in thread
From: Shiyang Ruan @ 2022-06-02  9:42 UTC (permalink / raw)
  To: Dan Williams
  Cc: Naoya Horiguchi, Matthew Wilcox, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Darrick J. Wong,
	Linux Kernel Mailing List, linux-xfs, Linux NVDIMM, Linux MM,
	linux-fsdevel, Jane Chu, Goldwyn Rodrigues, Al Viro, linmiaohe

Hi,

Is there any other work I should do with these two patchsets?  I think 
they are good for now.  So... since the 5.19-rc1 is coming, could the 
notify_failure() part be merged as your plan?


--
Thanks,
Ruan.


在 2022/5/12 20:27, Shiyang Ruan 写道:
> 
> 
> 在 2022/5/11 23:46, Dan Williams 写道:
>> On Wed, May 11, 2022 at 8:21 AM Darrick J. Wong <djwong@kernel.org> 
>> wrote:
>>>
>>> Oan Tue, May 10, 2022 at 10:24:28PM -0700, Andrew Morton wrote:
>>>> On Tue, 10 May 2022 19:43:01 -0700 "Darrick J. Wong" 
>>>> <djwong@kernel.org> wrote:
>>>>
>>>>> On Tue, May 10, 2022 at 07:28:53PM -0700, Andrew Morton wrote:
>>>>>> On Tue, 10 May 2022 18:55:50 -0700 Dan Williams 
>>>>>> <dan.j.williams@intel.com> wrote:
>>>>>>
>>>>>>>> It'll need to be a stable branch somewhere, but I don't think it
>>>>>>>> really matters where al long as it's merged into the xfs for-next
>>>>>>>> tree so it gets filesystem test coverage...
>>>>>>>
>>>>>>> So how about let the notify_failure() bits go through -mm this 
>>>>>>> cycle,
>>>>>>> if Andrew will have it, and then the reflnk work has a clean 
>>>>>>> v5.19-rc1
>>>>>>> baseline to build from?
>>>>>>
>>>>>> What are we referring to here?  I think a minimal thing would be the
>>>>>> memremap.h and memory-failure.c changes from
>>>>>> https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.fnst@fujitsu.com 
>>>>>> ?
>>>>>>
>>>>>> Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
>>>>>> would probably be straining things to slip it into 5.19.
>>>>>>
>>>>>> The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
>>>>>> right thing, but it's a networking errno.  I suppose livable with 
>>>>>> if it
>>>>>> never escapes the kernel, but if it can get back to userspace then a
>>>>>> user would be justified in wondering how the heck a filesystem
>>>>>> operation generated a networking errno?
>>>>>
>>>>> <shrug> most filesystems return EOPNOTSUPP rather enthusiastically 
>>>>> when
>>>>> they don't know how to do something...
>>>>
>>>> Can it propagate back to userspace?
>>>
>>> AFAICT, the new code falls back to the current (mf_generic_kill_procs)
>>> failure code if the filesystem doesn't provide a ->memory_failure
>>> function or if it returns -EOPNOSUPP.  mf_generic_kill_procs can also
>>> return -EOPNOTSUPP, but all the memory_failure() callers (madvise, etc.)
>>> convert that to 0 before returning it to userspace.
>>>
>>> I suppose the weirder question is going to be what happens when madvise
>>> starts returning filesystem errors like EIO or EFSCORRUPTED when pmem
>>> loses half its brains and even the fs can't deal with it.
>>
>> Even then that notification is not in a system call context so it
>> would still result in a SIGBUS notification not a EOPNOTSUPP return
>> code. The only potential gap I see are what are the possible error
>> codes that MADV_SOFT_OFFLINE might see? The man page is silent on soft
>> offline failure codes. Shiyang, that's something to check / update if
>> necessary.
> 
> According to the code around MADV_SOFT_OFFLINE, it will return -EIO when 
> the backend is NVDIMM.
> 
> Here is the logic:
>   madvise_inject_error() {
>       ...
>       if (MADV_SOFT_OFFLINE) {
>           ret = soft_offline_page() {
>               ...
>               /* Only online pages can be soft-offlined (esp., not 
> ZONE_DEVICE). */
>               page = pfn_to_online_page(pfn);
>               if (!page) {
>                   put_ref_page(ref_page);
>                   return -EIO;
>               }
>               ...
>           }
>       } else {
>           ret = memory_failure()
>       }
>       return ret
>   }
> 
> 
> -- 
> Thanks,
> Ruan.
> 
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-06-02  9:42                   ` Shiyang Ruan
@ 2022-06-02 17:18                     ` Darrick J. Wong
  2022-06-02 17:47                       ` Andrew Morton
  0 siblings, 1 reply; 38+ messages in thread
From: Darrick J. Wong @ 2022-06-02 17:18 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: Dan Williams, Naoya Horiguchi, Matthew Wilcox, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Linux Kernel Mailing List,
	linux-xfs, Linux NVDIMM, Linux MM, linux-fsdevel, Jane Chu,
	Goldwyn Rodrigues, Al Viro, linmiaohe

On Thu, Jun 02, 2022 at 05:42:13PM +0800, Shiyang Ruan wrote:
> Hi,
> 
> Is there any other work I should do with these two patchsets?  I think they
> are good for now.  So... since the 5.19-rc1 is coming, could the
> notify_failure() part be merged as your plan?

Hmm.  I don't see any of the patches 1-5,7-13 in current upstream, so
I'm guessing this means Andrew isn't taking it for 5.19?

--D

> 
> 
> --
> Thanks,
> Ruan.
> 
> 
> 在 2022/5/12 20:27, Shiyang Ruan 写道:
> > 
> > 
> > 在 2022/5/11 23:46, Dan Williams 写道:
> > > On Wed, May 11, 2022 at 8:21 AM Darrick J. Wong <djwong@kernel.org>
> > > wrote:
> > > > 
> > > > Oan Tue, May 10, 2022 at 10:24:28PM -0700, Andrew Morton wrote:
> > > > > On Tue, 10 May 2022 19:43:01 -0700 "Darrick J. Wong"
> > > > > <djwong@kernel.org> wrote:
> > > > > 
> > > > > > On Tue, May 10, 2022 at 07:28:53PM -0700, Andrew Morton wrote:
> > > > > > > On Tue, 10 May 2022 18:55:50 -0700 Dan Williams
> > > > > > > <dan.j.williams@intel.com> wrote:
> > > > > > > 
> > > > > > > > > It'll need to be a stable branch somewhere, but I don't think it
> > > > > > > > > really matters where al long as it's merged into the xfs for-next
> > > > > > > > > tree so it gets filesystem test coverage...
> > > > > > > > 
> > > > > > > > So how about let the notify_failure() bits go
> > > > > > > > through -mm this cycle,
> > > > > > > > if Andrew will have it, and then the reflnk work
> > > > > > > > has a clean v5.19-rc1
> > > > > > > > baseline to build from?
> > > > > > > 
> > > > > > > What are we referring to here?  I think a minimal thing would be the
> > > > > > > memremap.h and memory-failure.c changes from
> > > > > > > https://lkml.kernel.org/r/20220508143620.1775214-4-ruansy.fnst@fujitsu.com
> > > > > > > ?
> > > > > > > 
> > > > > > > Sure, I can scoot that into 5.19-rc1 if you think that's best.  It
> > > > > > > would probably be straining things to slip it into 5.19.
> > > > > > > 
> > > > > > > The use of EOPNOTSUPP is a bit suspect, btw.  It *sounds* like the
> > > > > > > right thing, but it's a networking errno.  I suppose
> > > > > > > livable with if it
> > > > > > > never escapes the kernel, but if it can get back to userspace then a
> > > > > > > user would be justified in wondering how the heck a filesystem
> > > > > > > operation generated a networking errno?
> > > > > > 
> > > > > > <shrug> most filesystems return EOPNOTSUPP rather
> > > > > > enthusiastically when
> > > > > > they don't know how to do something...
> > > > > 
> > > > > Can it propagate back to userspace?
> > > > 
> > > > AFAICT, the new code falls back to the current (mf_generic_kill_procs)
> > > > failure code if the filesystem doesn't provide a ->memory_failure
> > > > function or if it returns -EOPNOSUPP.  mf_generic_kill_procs can also
> > > > return -EOPNOTSUPP, but all the memory_failure() callers (madvise, etc.)
> > > > convert that to 0 before returning it to userspace.
> > > > 
> > > > I suppose the weirder question is going to be what happens when madvise
> > > > starts returning filesystem errors like EIO or EFSCORRUPTED when pmem
> > > > loses half its brains and even the fs can't deal with it.
> > > 
> > > Even then that notification is not in a system call context so it
> > > would still result in a SIGBUS notification not a EOPNOTSUPP return
> > > code. The only potential gap I see are what are the possible error
> > > codes that MADV_SOFT_OFFLINE might see? The man page is silent on soft
> > > offline failure codes. Shiyang, that's something to check / update if
> > > necessary.
> > 
> > According to the code around MADV_SOFT_OFFLINE, it will return -EIO when
> > the backend is NVDIMM.
> > 
> > Here is the logic:
> >   madvise_inject_error() {
> >       ...
> >       if (MADV_SOFT_OFFLINE) {
> >           ret = soft_offline_page() {
> >               ...
> >               /* Only online pages can be soft-offlined (esp., not
> > ZONE_DEVICE). */
> >               page = pfn_to_online_page(pfn);
> >               if (!page) {
> >                   put_ref_page(ref_page);
> >                   return -EIO;
> >               }
> >               ...
> >           }
> >       } else {
> >           ret = memory_failure()
> >       }
> >       return ret
> >   }
> > 
> > 
> > -- 
> > Thanks,
> > Ruan.
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-06-02 17:18                     ` Darrick J. Wong
@ 2022-06-02 17:47                       ` Andrew Morton
  0 siblings, 0 replies; 38+ messages in thread
From: Andrew Morton @ 2022-06-02 17:47 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Shiyang Ruan, Dan Williams, Naoya Horiguchi, Matthew Wilcox,
	Christoph Hellwig, Dave Chinner, Linux Kernel Mailing List,
	linux-xfs, Linux NVDIMM, Linux MM, linux-fsdevel, Jane Chu,
	Goldwyn Rodrigues, Al Viro, linmiaohe

On Thu, 2 Jun 2022 10:18:09 -0700 "Darrick J. Wong" <djwong@kernel.org> wrote:

> On Thu, Jun 02, 2022 at 05:42:13PM +0800, Shiyang Ruan wrote:
> > Hi,
> > 
> > Is there any other work I should do with these two patchsets?  I think they
> > are good for now.  So... since the 5.19-rc1 is coming, could the
> > notify_failure() part be merged as your plan?
> 
> Hmm.  I don't see any of the patches 1-5,7-13 in current upstream, so
> I'm guessing this means Andrew isn't taking it for 5.19?

Sorry, the volume of commentary led me to believe that it wasn't
considered finalized.  Shall take a look now.




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
                   ` (15 preceding siblings ...)
  2022-05-11  0:03 ` Darrick J. Wong
@ 2022-06-02 18:56 ` Andrew Morton
  2022-06-03  1:07   ` Shiyang Ruan
  16 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2022-06-02 18:56 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel, djwong,
	dan.j.williams, david, hch, jane.chu, rgoldwyn, viro, willy,
	naoya.horiguchi, linmiaohe

On Sun, 8 May 2022 22:36:06 +0800 Shiyang Ruan <ruansy.fnst@fujitsu.com> wrote:

> This is a combination of two patchsets:
>  1.fsdax-rmap: https://lore.kernel.org/linux-xfs/20220419045045.1664996-1-ruansy.fnst@fujitsu.com/
>  2.fsdax-reflink: https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.fnst@fujitsu.com/

I'm getting lost in conflicts trying to get this merged up.  Mainly
memory-failure.c due to patch series "mm, hwpoison: enable 1GB hugepage
support".

Could you please take a look at what's in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm a few hours from
now?  Or the next linux-next.

And I suggest that converting it all into a single 14-patch series
would be more straightforward.

Thanks.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-06-02 18:56 ` Andrew Morton
@ 2022-06-03  1:07   ` Shiyang Ruan
  2022-06-03  1:36     ` Shiyang Ruan
  0 siblings, 1 reply; 38+ messages in thread
From: Shiyang Ruan @ 2022-06-03  1:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel, djwong,
	dan.j.williams, david, hch, jane.chu, rgoldwyn, viro, willy,
	naoya.horiguchi, linmiaohe



在 2022/6/3 2:56, Andrew Morton 写道:
> On Sun, 8 May 2022 22:36:06 +0800 Shiyang Ruan <ruansy.fnst@fujitsu.com> wrote:
> 
>> This is a combination of two patchsets:
>>   1.fsdax-rmap: https://lore.kernel.org/linux-xfs/20220419045045.1664996-1-ruansy.fnst@fujitsu.com/
>>   2.fsdax-reflink: https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.fnst@fujitsu.com/
> 
> I'm getting lost in conflicts trying to get this merged up.  Mainly
> memory-failure.c due to patch series "mm, hwpoison: enable 1GB hugepage
> support".
> 
> Could you please take a look at what's in the mm-unstable branch at
> git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm a few hours from
> now?  Or the next linux-next.
> 
> And I suggest that converting it all into a single 14-patch series
> would be more straightforward.

The patchset in this thread is the 14-patch series.  I have solved many 
conflicts.  It's an updated / newest version, and a combination of the 2 
urls quoted above.  In an other word, instead of using this two:

 >> This is a combination of two patchsets:
 >>   1.fsdax-rmap: https://...
 >>   2.fsdax-reflink: https://...

you could take this (the url of the current thread):
https://lore.kernel.org/linux-xfs/20220508143620.1775214-1-ruansy.fnst@fujitsu.com/

My description misleaded you.  Sorry for that.


--
Thanks,
Ruan.

> 
> Thanks.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink
  2022-06-03  1:07   ` Shiyang Ruan
@ 2022-06-03  1:36     ` Shiyang Ruan
  0 siblings, 0 replies; 38+ messages in thread
From: Shiyang Ruan @ 2022-06-03  1:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel, djwong,
	dan.j.williams, david, hch, jane.chu, rgoldwyn, viro, willy,
	naoya.horiguchi, linmiaohe



在 2022/6/3 9:07, Shiyang Ruan 写道:
> 
> 
> 在 2022/6/3 2:56, Andrew Morton 写道:
>> On Sun, 8 May 2022 22:36:06 +0800 Shiyang Ruan 
>> <ruansy.fnst@fujitsu.com> wrote:
>>
>>> This is a combination of two patchsets:
>>>   1.fsdax-rmap: 
>>> https://lore.kernel.org/linux-xfs/20220419045045.1664996-1-ruansy.fnst@fujitsu.com/ 
>>>
>>>   2.fsdax-reflink: 
>>> https://lore.kernel.org/linux-xfs/20210928062311.4012070-1-ruansy.fnst@fujitsu.com/ 
>>>
>>
>> I'm getting lost in conflicts trying to get this merged up.  Mainly
>> memory-failure.c due to patch series "mm, hwpoison: enable 1GB hugepage
>> support".
>>
>> Could you please take a look at what's in the mm-unstable branch at
>> git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm a few hours from
>> now?  Or the next linux-next.

OK, let me rebase this patchset on your mm-unstable branch.


--
Thanks,
Ruan.

>>
>> And I suggest that converting it all into a single 14-patch series
>> would be more straightforward.
> 
> The patchset in this thread is the 14-patch series.  I have solved many 
> conflicts.  It's an updated / newest version, and a combination of the 2 
> urls quoted above.  In an other word, instead of using this two:
> 
>  >> This is a combination of two patchsets:
>  >>   1.fsdax-rmap: https://...
>  >>   2.fsdax-reflink: https://...
> 
> you could take this (the url of the current thread):
> https://lore.kernel.org/linux-xfs/20220508143620.1775214-1-ruansy.fnst@fujitsu.com/ 
> 
> 
> My description misleaded you.  Sorry for that.
> 
> 
> -- 
> Thanks,
> Ruan.
> 
>>
>> Thanks.
> 
> 
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2022-06-03  1:36 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-08 14:36 [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Shiyang Ruan
2022-05-08 14:36 ` [PATCH v14 01/07] dax: Introduce holder for dax_device Shiyang Ruan
2022-05-08 14:36 ` [PATCH v14 02/07] mm: factor helpers for memory_failure_dev_pagemap Shiyang Ruan
2022-05-08 14:36 ` [PATCH v14 03/07] pagemap,pmem: Introduce ->memory_failure() Shiyang Ruan
2022-05-08 14:36 ` [PATCH v14 04/07] fsdax: Introduce dax_lock_mapping_entry() Shiyang Ruan
2022-05-08 14:36 ` [PATCH v14 05/07] mm: Introduce mf_dax_kill_procs() for fsdax case Shiyang Ruan
2022-05-08 14:36 ` [PATCH v14 06/07] xfs: Implement ->notify_failure() for XFS Shiyang Ruan
2022-05-08 14:36 ` [PATCH v14 07/07] fsdax: set a CoW flag when associate reflink mappings Shiyang Ruan
2022-05-08 14:36 ` [PATCH v11 01/07] fsdax: Output address in dax_iomap_pfn() and rename it Shiyang Ruan
2022-05-08 14:36 ` [PATCH v11 02/07] fsdax: Introduce dax_iomap_cow_copy() Shiyang Ruan
2022-05-08 14:36 ` [PATCH v11 03/07] fsdax: Replace mmap entry in case of CoW Shiyang Ruan
2022-05-08 14:36 ` [PATCH v11 04/07] fsdax: Add dax_iomap_cow_copy() for dax zero Shiyang Ruan
2022-05-08 14:36 ` [PATCH v11 05/07] fsdax: Dedup file range to use a compare function Shiyang Ruan
2022-05-08 14:36 ` [PATCH v11 06/07] xfs: support CoW in fsdax mode Shiyang Ruan
2022-05-10  5:45   ` Christoph Hellwig
2022-05-10 10:06     ` Shiyang Ruan
2022-05-11  2:25   ` [PATCH v11.1 " Shiyang Ruan
2022-05-08 14:36 ` [PATCH v11 07/07] xfs: Add dax dedupe support Shiyang Ruan
2022-05-10  9:32 ` [PATCHSETS] v14 fsdax-rmap + v11 fsdax-reflink Christoph Hellwig
2022-05-11  0:03 ` Darrick J. Wong
2022-05-11  1:48   ` Dave Chinner
2022-05-11  1:55     ` Dan Williams
2022-05-11  2:19       ` Dave Chinner
2022-05-11  2:28       ` Andrew Morton
2022-05-11  2:43         ` Darrick J. Wong
2022-05-11  5:24           ` Andrew Morton
2022-05-11  6:21             ` Dave Chinner
2022-05-11 15:19             ` Darrick J. Wong
2022-05-11 15:46               ` Dan Williams
2022-05-12 12:27                 ` Shiyang Ruan
2022-06-02  9:42                   ` Shiyang Ruan
2022-06-02 17:18                     ` Darrick J. Wong
2022-06-02 17:47                       ` Andrew Morton
2022-05-11  4:20         ` Dan Williams
2022-05-11  4:34           ` Darrick J. Wong
2022-06-02 18:56 ` Andrew Morton
2022-06-03  1:07   ` Shiyang Ruan
2022-06-03  1:36     ` Shiyang Ruan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).