linux-mm.kvack.org archive mirror
* [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink
@ 2020-11-23  0:41 Shiyang Ruan
  2020-11-23  0:41 ` [RFC PATCH v2 1/6] fs: introduce ->storage_lost() for memory-failure Shiyang Ruan
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Shiyang Ruan @ 2020-11-23  0:41 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm, linux-mm
  Cc: linux-fsdevel, linux-raid, darrick.wong, dan.j.williams, david,
	hch, song, rgoldwyn, qi.fuli, y-goto

This patchset is an attempt to resolve the problem of tracking shared
pages for fsdax.

Changes from v1:
  - Introduce ->block_lost() for block devices
  - Support mapped devices
  - Add a 'not available' warning for the realtime device in XFS
  - Rebased to v5.10-rc1

This patchset moves owner tracking from dax_associate_entry() to the
pmem device, by introducing an interface, ->memory_failure(), in struct
dev_pagemap_ops.  The interface is called by memory_failure() in mm, and
implemented by the pmem device.  The pmem device then calls its
->block_lost() to find the filesystem in which the damaged page is
located, and calls ->storage_lost() to track the files or metadata
associated with this page.  Finally we are able to try to fix the
damaged data in the filesystem and do other necessary processing, such
as killing the processes that are using the affected files.

The call trace is like this:
 memory_failure()
   pgmap->ops->memory_failure()   => pmem_pgmap_memory_failure()
    gendisk->fops->block_lost()   => pmem_block_lost() or
                                         md_blk_block_lost()
     sb->s_ops->storage_lost()    => xfs_fs_storage_lost()
      xfs_rmap_query_range()
       xfs_storage_lost_helper()
        mf_recover_controller->recover_fn => \ 
                            memory_failure_dev_pagemap_kill_procs()

collect_procs() and kill_procs() are moved into a callback that is
passed from memory_failure() to xfs_storage_lost_helper(), so it can be
called as soon as an associated file is found, instead of building a
file list and iterating over it.
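
The shape of that handoff, reduced to a minimal userspace sketch (the
struct mirrors the patchset's mf_recover_controller, but with the
mapping/index arguments dropped for brevity; the helper bodies are
illustrative only, not the kernel code):

```c
#include <assert.h>

/*
 * Simplified stand-in for the patchset's mf_recover_controller:
 * memory_failure() fills in the callback and its parameters, and the
 * filesystem helper only invokes it when an associated file is found,
 * instead of returning a file list to iterate.
 */
struct mf_recover_controller {
	int (*recover_fn)(unsigned long pfn, int flags);
	unsigned long pfn;
	int flags;
};

static int recover_calls;

/* Hypothetical recover_fn: the real one collects and kills procs. */
static int count_recover(unsigned long pfn, int flags)
{
	(void)pfn;
	(void)flags;
	recover_calls++;
	return 0;
}

/* Stand-in for xfs_storage_lost_helper(): fire the callback per hit. */
static int storage_lost_helper_sketch(struct mf_recover_controller *mfrc)
{
	return mfrc->recover_fn(mfrc->pfn, mfrc->flags);
}
```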

The fsdax & reflink support for XFS is not included in this patchset.

(Rebased on v5.10-rc1)

Shiyang Ruan (6):
  fs: introduce ->storage_lost() for memory-failure
  blk: introduce ->block_lost() to handle memory-failure
  md: implement ->block_lost() for memory-failure
  pagemap: introduce ->memory_failure()
  mm, fsdax: refactor dax handler in memory-failure
  fsdax: remove useless (dis)associate functions

 block/genhd.c                 |  12 ++++
 drivers/md/dm-linear.c        |   8 +++
 drivers/md/dm.c               |  64 +++++++++++++++++
 drivers/nvdimm/pmem.c         |  50 +++++++++++++
 fs/block_dev.c                |  23 ++++++
 fs/dax.c                      |  64 ++---------------
 fs/xfs/xfs_super.c            |  87 +++++++++++++++++++++++
 include/linux/blkdev.h        |   2 +
 include/linux/dax.h           |   5 +-
 include/linux/device-mapper.h |   2 +
 include/linux/fs.h            |   2 +
 include/linux/genhd.h         |   9 +++
 include/linux/memremap.h      |   3 +
 include/linux/mm.h            |  14 ++++
 mm/memory-failure.c           | 127 +++++++++++++++++++++-------------
 15 files changed, 362 insertions(+), 110 deletions(-)

-- 
2.29.2

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH v2 1/6] fs: introduce ->storage_lost() for memory-failure
  2020-11-23  0:41 [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink Shiyang Ruan
@ 2020-11-23  0:41 ` Shiyang Ruan
  2020-11-23  0:41 ` [RFC PATCH v2 2/6] blk: introduce ->block_lost() to handle memory-failure Shiyang Ruan
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Shiyang Ruan @ 2020-11-23  0:41 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm, linux-mm
  Cc: linux-fsdevel, linux-raid, darrick.wong, dan.j.williams, david,
	hch, song, rgoldwyn, qi.fuli, y-goto

This function is used to handle errors that may cause data loss in the
filesystem, such as a memory failure in fsdax mode.

In XFS, it requires the "rmapbt" feature in order to query for the
files or metadata associated with the broken block.  Then we could call
filesystem recovery functions to try to repair the damaged data (not
implemented in this patch).

After that, memory-failure also needs to kill the processes that are
using those files.  The struct mf_recover_controller is created to
store the necessary parameters.

Only the data device is supported; the realtime device is not supported
for now.
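
The query this patch builds over the reverse-mapping btree can be
illustrated with a small userspace sketch (struct rmap_key is a
simplified stand-in for struct xfs_rmap_irec, keeping only the fields
the key construction touches):

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for struct xfs_rmap_irec. */
struct rmap_key {
	unsigned int  rm_startblock;	/* AG block number */
	unsigned long rm_owner;		/* inode number or metadata owner */
	unsigned long rm_offset;	/* offset within the owner */
};

/*
 * Mirror of the key setup in xfs_fs_storage_lost(): an all-zero low
 * key and an all-0xFF high key, with both rm_startblock fields pinned
 * to the damaged AG block, select every rmap record (any owner, any
 * offset) that covers that single block.
 */
static void make_point_query(unsigned int agbno,
			     struct rmap_key *low, struct rmap_key *high)
{
	memset(low, 0, sizeof(*low));
	memset(high, 0xFF, sizeof(*high));
	low->rm_startblock = agbno;
	high->rm_startblock = agbno;
}
```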

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/xfs/xfs_super.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  2 ++
 include/linux/mm.h |  6 ++++
 3 files changed, 95 insertions(+)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e3e229e52512..f9a109217a80 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -35,6 +35,11 @@
 #include "xfs_refcount_item.h"
 #include "xfs_bmap_item.h"
 #include "xfs_reflink.h"
+#include "xfs_alloc.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_rtalloc.h"
+#include "xfs_bit.h"
 
 #include <linux/magic.h>
 #include <linux/fs_context.h>
@@ -1103,6 +1108,87 @@ xfs_fs_free_cached_objects(
 	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
 }
 
+static int
+xfs_storage_lost_helper(
+	struct xfs_btree_cur		*cur,
+	struct xfs_rmap_irec		*rec,
+	void				*priv)
+{
+	struct xfs_inode		*ip;
+	struct mf_recover_controller	*mfrc = priv;
+	int rc = 0;
+
+	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner)) {
+		// TODO check and try to fix metadata
+	} else {
+		/*
+		 * Get files that incore, filter out others that are not in use.
+		 */
+		rc = xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner,
+			      XFS_IGET_INCORE, 0, &ip);
+		if (rc || !ip)
+			return rc;
+		if (!VFS_I(ip)->i_mapping)
+			goto out;
+
+		rc = mfrc->recover_fn(mfrc->pfn, mfrc->flags,
+				      VFS_I(ip)->i_mapping, rec->rm_offset);
+
+		// TODO try to fix data
+out:
+		xfs_irele(ip);
+	}
+
+	return rc;
+}
+
+static int
+xfs_fs_storage_lost(
+	struct super_block	*sb,
+	struct block_device	*bdev,
+	loff_t			offset,
+	void			*priv)
+{
+	struct xfs_mount	*mp = XFS_M(sb);
+	struct xfs_trans	*tp = NULL;
+	struct xfs_btree_cur	*cur = NULL;
+	struct xfs_rmap_irec	rmap_low, rmap_high;
+	struct xfs_buf		*agf_bp = NULL;
+	xfs_fsblock_t		fsbno = XFS_B_TO_FSB(mp, offset);
+	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
+	xfs_agblock_t		agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
+	int			error = 0;
+
+	if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_bdev == bdev) {
+		xfs_warn(mp, "storage lost support not available for realtime device!");
+		return 0;
+	}
+
+	error = xfs_trans_alloc_empty(mp, &tp);
+	if (error)
+		return error;
+
+	error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
+	if (error)
+		return error;
+
+	cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agno);
+
+	/* Construct a range for rmap query */
+	memset(&rmap_low, 0, sizeof(rmap_low));
+	memset(&rmap_high, 0xFF, sizeof(rmap_high));
+	rmap_low.rm_startblock = rmap_high.rm_startblock = agbno;
+
+	error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
+				     xfs_storage_lost_helper, priv);
+	if (error == -ECANCELED)
+		error = 0;
+
+	xfs_btree_del_cursor(cur, error);
+	xfs_trans_brelse(tp, agf_bp);
+	return error;
+}
+
 static const struct super_operations xfs_super_operations = {
 	.alloc_inode		= xfs_fs_alloc_inode,
 	.destroy_inode		= xfs_fs_destroy_inode,
@@ -1116,6 +1202,7 @@ static const struct super_operations xfs_super_operations = {
 	.show_options		= xfs_fs_show_options,
 	.nr_cached_objects	= xfs_fs_nr_cached_objects,
 	.free_cached_objects	= xfs_fs_free_cached_objects,
+	.storage_lost		= xfs_fs_storage_lost,
 };
 
 static int
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0bd126418bb6..8c4f753e3ed9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1949,6 +1949,8 @@ struct super_operations {
 				  struct shrink_control *);
 	long (*free_cached_objects)(struct super_block *,
 				    struct shrink_control *);
+	int (*storage_lost)(struct super_block *sb, struct block_device *bdev,
+			    loff_t offset, void *priv);
 };
 
 /*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef360fe70aaf..872b51ebe57b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3031,6 +3031,12 @@ extern void shake_page(struct page *p, int access);
 extern atomic_long_t num_poisoned_pages __read_mostly;
 extern int soft_offline_page(unsigned long pfn, int flags);
 
+struct mf_recover_controller {
+	int (*recover_fn)(unsigned long pfn, int flags,
+		struct address_space *mapping, pgoff_t index);
+	unsigned long pfn;
+	int flags;
+};
 
 /*
  * Error handlers for various types of pages.
-- 
2.29.2


* [RFC PATCH v2 2/6] blk: introduce ->block_lost() to handle memory-failure
  2020-11-23  0:41 [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink Shiyang Ruan
  2020-11-23  0:41 ` [RFC PATCH v2 1/6] fs: introduce ->storage_lost() for memory-failure Shiyang Ruan
@ 2020-11-23  0:41 ` Shiyang Ruan
  2020-11-23  0:41 ` [RFC PATCH v2 3/6] md: implement ->block_lost() for memory-failure Shiyang Ruan
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Shiyang Ruan @ 2020-11-23  0:41 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm, linux-mm
  Cc: linux-fsdevel, linux-raid, darrick.wong, dan.j.williams, david,
	hch, song, rgoldwyn, qi.fuli, y-goto

In fsdax mode, the memory failure happens on the block device, so an
interface needs to be introduced for block devices.  Each kind of block
device can handle the memory failure in its own way.

Usually, a block device is used directly, with a filesystem made on it,
and that filesystem is easy to obtain via 'get_super()'.  But the block
device can also be divided into several partitions, or used as a target
by a mapped device, and in these two cases it is hard to get the
filesystem's superblock.  So, add 'bdget_disk_sector()' to get the
block device of the partition where the broken sector is located, and
add 'bd_disk_holder_block_lost()' to iterate over the mapped devices
built on it.
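
The sector arithmetic in pmem_block_lost() is simple enough to check in
isolation; a userspace sketch of it (SECTOR_SHIFT is 9, i.e. 512-byte
sectors; the helper name is ours, not a kernel function):

```c
#include <assert.h>

#define SECTOR_SHIFT 9

/*
 * Mirror of the translation in pmem_block_lost(): a byte offset on
 * the whole disk becomes a sector, and subtracting the partition's
 * start sector (what get_start_sect() returns) yields the sector the
 * filesystem on that partition sees.
 */
static unsigned long long disk_offset_to_bdev_sector(
		unsigned long long disk_offset,
		unsigned long long part_start_sector)
{
	unsigned long long disk_sector = disk_offset >> SECTOR_SHIFT;

	return disk_sector - part_start_sector;
}
```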

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 block/genhd.c          | 12 ++++++++++++
 drivers/nvdimm/pmem.c  | 27 +++++++++++++++++++++++++++
 fs/block_dev.c         | 23 +++++++++++++++++++++++
 include/linux/blkdev.h |  2 ++
 include/linux/genhd.h  |  9 +++++++++
 5 files changed, 73 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 0a273211fec2..2c7304f123fa 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1055,6 +1055,18 @@ struct block_device *bdget_disk(struct gendisk *disk, int partno)
 }
 EXPORT_SYMBOL(bdget_disk);
 
+struct block_device *bdget_disk_sector(struct gendisk *disk, sector_t sector)
+{
+	struct block_device *bdev = NULL;
+	struct hd_struct *part = disk_map_sector_rcu(disk, sector);
+
+	if (part)
+		bdev = bdget_part(part);
+
+	return bdev;
+}
+EXPORT_SYMBOL(bdget_disk_sector);
+
 /*
  * print a full list of all partitions - intended for places where the root
  * filesystem can't be mounted and thus to give the victim some idea of what
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 875076b0ea6c..d75a3f370f3c 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -253,6 +253,32 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
 	return blk_status_to_errno(rc);
 }
 
+int pmem_block_lost(struct gendisk *disk, struct block_device *bdev,
+		    loff_t disk_offset, void *data)
+{
+	struct super_block *sb;
+	sector_t bdev_sector, disk_sector = disk_offset >> SECTOR_SHIFT;
+	int rc = 0;
+
+	bdev = bdget_disk_sector(disk, disk_sector);
+	if (!bdev)
+		return -ENODEV;
+
+	bdev_sector = disk_sector - get_start_sect(bdev);
+	sb = get_super(bdev);
+	if (!sb) {
+		rc = bd_disk_holder_block_lost(bdev, bdev_sector, data);
+		goto out;
+	} else if (sb->s_op->storage_lost)
+		rc = sb->s_op->storage_lost(sb, bdev,
+					    bdev_sector << SECTOR_SHIFT, data);
+	drop_super(sb);
+
+out:
+	bdput(bdev);
+	return rc;
+}
+
 /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */
 __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
 		long nr_pages, void **kaddr, pfn_t *pfn)
@@ -281,6 +307,7 @@ static const struct block_device_operations pmem_fops = {
 	.owner =		THIS_MODULE,
 	.submit_bio =		pmem_submit_bio,
 	.rw_page =		pmem_rw_page,
+	.block_lost =		pmem_block_lost,
 };
 
 static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9e84b1928b94..e1e30828fb9f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1171,6 +1171,29 @@ struct bd_holder_disk {
 	int			refcnt;
 };
 
+int bd_disk_holder_block_lost(struct block_device *bdev, sector_t offset,
+			      void *data)
+{
+	struct bd_holder_disk *holder;
+	struct gendisk *disk;
+	int rc = 0;
+
+	if (list_empty(&(bdev->bd_holder_disks)))
+		return -ENODEV;
+
+	list_for_each_entry(holder, &bdev->bd_holder_disks, list) {
+		disk = holder->disk;
+		if (disk->fops->block_lost) {
+			rc = disk->fops->block_lost(disk, bdev,
+					       offset << SECTOR_SHIFT, data);
+			if (rc != -ENODEV)
+				break;
+		}
+	}
+	return rc;
+}
+EXPORT_SYMBOL_GPL(bd_disk_holder_block_lost);
+
 static struct bd_holder_disk *bd_find_holder_disk(struct block_device *bdev,
 						  struct gendisk *disk)
 {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 639cae2c158b..ddeb268cc938 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1855,6 +1855,8 @@ struct block_device_operations {
 	int (*report_zones)(struct gendisk *, sector_t sector,
 			unsigned int nr_zones, report_zones_cb cb, void *data);
 	char *(*devnode)(struct gendisk *disk, umode_t *mode);
+	int (*block_lost)(struct gendisk *disk, struct block_device *bdev,
+			  loff_t offset, void *data);
 	struct module *owner;
 	const struct pr_ops *pr_ops;
 };
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 38f23d757013..9d8f5b5dab9f 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -303,6 +303,8 @@ static inline void add_disk_no_queue_reg(struct gendisk *disk)
 extern void del_gendisk(struct gendisk *gp);
 extern struct gendisk *get_gendisk(dev_t dev, int *partno);
 extern struct block_device *bdget_disk(struct gendisk *disk, int partno);
+extern struct block_device *bdget_disk_sector(struct gendisk *disk,
+					      sector_t sector);
 
 extern void set_device_ro(struct block_device *bdev, int flag);
 extern void set_disk_ro(struct gendisk *disk, int flag);
@@ -381,9 +383,16 @@ int blkdev_ioctl(struct block_device *, fmode_t, unsigned, unsigned long);
 long compat_blkdev_ioctl(struct file *, unsigned, unsigned long);
 
 #ifdef CONFIG_SYSFS
+int bd_disk_holder_block_lost(struct block_device *bdev, sector_t offset,
+			      void *data);
 int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk);
 void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk);
 #else
+static inline int bd_disk_holder_block_lost(struct block_device *bdev,
+					    sector_t offset, void *data)
+{
+	return 0;
+}
 static inline int bd_link_disk_holder(struct block_device *bdev,
 				      struct gendisk *disk)
 {
-- 
2.29.2


* [RFC PATCH v2 3/6] md: implement ->block_lost() for memory-failure
  2020-11-23  0:41 [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink Shiyang Ruan
  2020-11-23  0:41 ` [RFC PATCH v2 1/6] fs: introduce ->storage_lost() for memory-failure Shiyang Ruan
  2020-11-23  0:41 ` [RFC PATCH v2 2/6] blk: introduce ->block_lost() to handle memory-failure Shiyang Ruan
@ 2020-11-23  0:41 ` Shiyang Ruan
  2020-11-23  0:41 ` [RFC PATCH v2 4/6] pagemap: introduce ->memory_failure() Shiyang Ruan
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Shiyang Ruan @ 2020-11-23  0:41 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm, linux-mm
  Cc: linux-fsdevel, linux-raid, darrick.wong, dan.j.williams, david,
	hch, song, rgoldwyn, qi.fuli, y-goto

A mapped device is a type of block device.  It can be created on a
pmem device where a memory failure happens; an LVM volume created on a
pmem device is one such use case.

What we need to do is:
 1. find the filesystem in which the lost block is located:
      iterate over all targets and test whether the target is the
      broken device;
 2. translate the offset on one of the target devices to an offset on
    the whole mapped device:
      introduce ->remap() for each target, to reverse-map the offset.
      It is implemented by each target; the linear target's
      implementation is included in this patch.
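
For the linear target, the forward and reverse mappings are plain
offset arithmetic; a userspace sketch (helper names are ours; ti_begin
is the target's start sector on the mapped device, lc_start its start
sector on the underlying device):

```c
#include <assert.h>

typedef unsigned long long sector_t;

/* Forward direction, as dm-linear's map path does: mapped-device
 * sector to underlying-device sector. */
static sector_t linear_map_sketch(sector_t md_sector,
				  sector_t ti_begin, sector_t lc_start)
{
	return lc_start + (md_sector - ti_begin);
}

/* Reverse direction, the job of the new ->remap(): underlying-device
 * sector back to mapped-device sector. */
static sector_t linear_remap_sketch(sector_t dev_sector,
				    sector_t ti_begin, sector_t lc_start)
{
	return ti_begin + (dev_sector - lc_start);
}
```

Round-tripping a sector through the forward and reverse helpers should
be an identity, which is exactly the property ->block_lost() relies on.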

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 drivers/md/dm-linear.c        |  8 +++++
 drivers/md/dm.c               | 64 +++++++++++++++++++++++++++++++++++
 include/linux/device-mapper.h |  2 ++
 3 files changed, 74 insertions(+)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 00774b5d7668..6595701800e6 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -119,6 +119,13 @@ static void linear_status(struct dm_target *ti, status_type_t type,
 	}
 }
 
+static sector_t linear_remap(struct dm_target *ti, sector_t offset)
+{
+	struct linear_c *lc = (struct linear_c *) ti->private;
+
+	return offset - dm_target_offset(ti, lc->start);
+}
+
 static int linear_prepare_ioctl(struct dm_target *ti, struct block_device **bdev)
 {
 	struct linear_c *lc = (struct linear_c *) ti->private;
@@ -238,6 +245,7 @@ static struct target_type linear_target = {
 	.ctr    = linear_ctr,
 	.dtr    = linear_dtr,
 	.map    = linear_map,
+	.remap  = linear_remap,
 	.status = linear_status,
 	.prepare_ioctl = linear_prepare_ioctl,
 	.iterate_devices = linear_iterate_devices,
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index c18fc2548518..00e075fb4cdb 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -505,6 +505,69 @@ static int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
 #define dm_blk_report_zones		NULL
 #endif /* CONFIG_BLK_DEV_ZONED */
 
+struct dm_blk_block_lost {
+	struct block_device *bdev;
+	sector_t offset;
+};
+
+static int dm_blk_block_lost_fn(struct dm_target *ti, struct dm_dev *dev,
+				sector_t start, sector_t len, void *data)
+{
+	struct dm_blk_block_lost *bl = data;
+
+	return bl->bdev == (void *)dev->bdev &&
+			(start <= bl->offset && bl->offset < start + len);
+}
+
+static int dm_blk_block_lost(struct gendisk *disk,
+			     struct block_device *target_bdev,
+			     loff_t target_offset, void *data)
+{
+	struct mapped_device *md = disk->private_data;
+	struct block_device *md_bdev = md->bdev;
+	struct dm_table *map;
+	struct dm_target *ti;
+	struct super_block *sb;
+	int srcu_idx, i, rc = 0;
+	bool found = false;
+	sector_t disk_sec, target_sec = to_sector(target_offset);
+
+	map = dm_get_live_table(md, &srcu_idx);
+	if (!map)
+		return -ENODEV;
+
+	for (i = 0; i < dm_table_get_num_targets(map); i++) {
+		ti = dm_table_get_target(map, i);
+		if (ti->type->iterate_devices && ti->type->remap) {
+			struct dm_blk_block_lost bl = {target_bdev, target_sec};
+			found = ti->type->iterate_devices(ti, dm_blk_block_lost_fn, &bl);
+			if (!found)
+				continue;
+			disk_sec = ti->type->remap(ti, target_sec);
+			break;
+		}
+	}
+
+	if (!found) {
+		rc = -ENODEV;
+		goto out;
+	}
+
+	sb = get_super(md_bdev);
+	if (!sb) {
+		rc = bd_disk_holder_block_lost(md_bdev, disk_sec, data);
+		goto out;
+	} else if (sb->s_op->storage_lost) {
+		loff_t off = to_bytes(disk_sec - get_start_sect(md_bdev));
+		rc = sb->s_op->storage_lost(sb, md_bdev, off, data);
+	}
+	drop_super(sb);
+
+out:
+	dm_put_live_table(md, srcu_idx);
+	return rc;
+}
+
 static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
 			    struct block_device **bdev)
 	__acquires(md->io_barrier)
@@ -3082,6 +3145,7 @@ static const struct block_device_operations dm_blk_dops = {
 	.ioctl = dm_blk_ioctl,
 	.getgeo = dm_blk_getgeo,
 	.report_zones = dm_blk_report_zones,
+	.block_lost = dm_blk_block_lost,
 	.pr_ops = &dm_pr_ops,
 	.owner = THIS_MODULE
 };
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 61a66fb8ebb3..bc28bd5bc436 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -58,6 +58,7 @@ typedef void (*dm_dtr_fn) (struct dm_target *ti);
  * = 2: The target wants to push back the io
  */
 typedef int (*dm_map_fn) (struct dm_target *ti, struct bio *bio);
+typedef sector_t (*dm_remap_fn) (struct dm_target *ti, sector_t offset);
 typedef int (*dm_clone_and_map_request_fn) (struct dm_target *ti,
 					    struct request *rq,
 					    union map_info *map_context,
@@ -175,6 +176,7 @@ struct target_type {
 	dm_ctr_fn ctr;
 	dm_dtr_fn dtr;
 	dm_map_fn map;
+	dm_remap_fn remap;
 	dm_clone_and_map_request_fn clone_and_map_rq;
 	dm_release_clone_request_fn release_clone_rq;
 	dm_endio_fn end_io;
-- 
2.29.2


* [RFC PATCH v2 4/6] pagemap: introduce ->memory_failure()
  2020-11-23  0:41 [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink Shiyang Ruan
                   ` (2 preceding siblings ...)
  2020-11-23  0:41 ` [RFC PATCH v2 3/6] md: implement ->block_lost() for memory-failure Shiyang Ruan
@ 2020-11-23  0:41 ` Shiyang Ruan
  2020-11-23  0:41 ` [RFC PATCH v2 5/6] mm, fsdax: refactor dax handler in memory-failure Shiyang Ruan
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Shiyang Ruan @ 2020-11-23  0:41 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm, linux-mm
  Cc: linux-fsdevel, linux-raid, darrick.wong, dan.j.williams, david,
	hch, song, rgoldwyn, qi.fuli, y-goto

When a memory failure occurs, we call this function, which is
implemented by each device.  For fsdax, the pmem device implements it:
it finds the block device where the error page is located, gets the
filesystem on that block device, and finally calls ->storage_lost() to
handle the error at the filesystem layer.
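
The address translation this implies, from a poisoned pfn down to a
byte offset on the pmem disk, can be sketched in userspace (PAGE_SHIFT
assumed 12, i.e. 4 KiB pages; field names follow the pmem driver):

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PFN_PHYS(pfn) ((unsigned long long)(pfn) << PAGE_SHIFT)

/*
 * Mirror of pmem_pagemap_memory_failure()'s calculation: the physical
 * address of the poisoned pfn, minus the pmem region's base physical
 * address (pmem->phys_addr) and the space reserved at its start for
 * metadata (pmem->data_offset), gives the disk-relative byte offset
 * handed to ->block_lost().
 */
static unsigned long long pfn_to_disk_offset(unsigned long pfn,
		unsigned long long phys_addr, unsigned long long data_offset)
{
	return PFN_PHYS(pfn) - phys_addr - data_offset;
}
```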

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 drivers/nvdimm/pmem.c    | 23 +++++++++++++++++++++++
 include/linux/memremap.h |  3 +++
 2 files changed, 26 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index d75a3f370f3c..1b1dbca090fa 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -390,9 +390,32 @@ static void pmem_release_disk(void *__pmem)
 	put_disk(pmem->disk);
 }
 
+static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
+		struct mf_recover_controller *mfrc)
+{
+	struct pmem_device *pdev;
+	struct gendisk *disk;
+	loff_t disk_offset;
+	int rc = 0;
+
+	pdev = container_of(pgmap, struct pmem_device, pgmap);
+	disk = pdev->disk;
+	if (!disk)
+		return -ENXIO;
+
+	disk_offset = PFN_PHYS(mfrc->pfn) - pdev->phys_addr - pdev->data_offset;
+	if (disk->fops->block_lost) {
+		rc = disk->fops->block_lost(disk, NULL, disk_offset, mfrc);
+		if (rc == -ENODEV)
+			rc = -ENXIO;
+	}
+	return rc;
+}
+
 static const struct dev_pagemap_ops fsdax_pagemap_ops = {
 	.kill			= pmem_pagemap_kill,
 	.cleanup		= pmem_pagemap_cleanup,
+	.memory_failure		= pmem_pagemap_memory_failure,
 };
 
 static int pmem_attach_disk(struct device *dev,
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 79c49e7f5c30..6aae0fe394ba 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -7,6 +7,7 @@
 
 struct resource;
 struct device;
+struct mf_recover_controller;
 
 /**
  * struct vmem_altmap - pre-allocated storage for vmemmap_populate
@@ -87,6 +88,8 @@ struct dev_pagemap_ops {
 	 * the page back to a CPU accessible page.
 	 */
 	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+	int (*memory_failure)(struct dev_pagemap *pgmap,
+			      struct mf_recover_controller *mfrc);
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
-- 
2.29.2


* [RFC PATCH v2 5/6] mm, fsdax: refactor dax handler in memory-failure
  2020-11-23  0:41 [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink Shiyang Ruan
                   ` (3 preceding siblings ...)
  2020-11-23  0:41 ` [RFC PATCH v2 4/6] pagemap: introduce ->memory_failure() Shiyang Ruan
@ 2020-11-23  0:41 ` Shiyang Ruan
  2020-11-23  0:41 ` [RFC PATCH v2 6/6] fsdax: remove useless (dis)associate functions Shiyang Ruan
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Shiyang Ruan @ 2020-11-23  0:41 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm, linux-mm
  Cc: linux-fsdevel, linux-raid, darrick.wong, dan.j.williams, david,
	hch, song, rgoldwyn, qi.fuli, y-goto

With ->memory_failure() implemented in the pmem device and
->storage_lost() in XFS, we are able to track files or metadata
and process them further.

We no longer track files via page->mapping and page->index, so the
functions that obtained ->mapping and ->index from a struct page
parameter need to be changed to take the mapping and index directly.
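
The key replacement in add_to_kill() — computing the user address from
the passed-in pgoff instead of page_address_in_vma() — is a one-liner
worth sketching in userspace (PAGE_SHIFT assumed 12; helper name ours):

```c
#include <assert.h>

#define PAGE_SHIFT 12

/*
 * Mirror of the fsdax branch in the refactored add_to_kill(): the
 * file page offset of the bad page, relative to the vma's starting
 * file offset (vm_pgoff), scaled to bytes and added to vm_start,
 * gives the user virtual address to signal.
 */
static unsigned long fsdax_addr_in_vma(unsigned long vm_start,
		unsigned long vm_pgoff, unsigned long pgoff)
{
	return vm_start + ((pgoff - vm_pgoff) << PAGE_SHIFT);
}
```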

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/dax.c            |  18 +++----
 include/linux/dax.h |   5 +-
 include/linux/mm.h  |   8 +++
 mm/memory-failure.c | 127 +++++++++++++++++++++++++++-----------------
 4 files changed, 94 insertions(+), 64 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5b47834f2e1b..34471acde683 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -379,14 +379,14 @@ static struct page *dax_busy_page(void *entry)
 }
 
 /*
- * dax_lock_mapping_entry - Lock the DAX entry corresponding to a page
+ * dax_lock - Lock the DAX entry corresponding to a page
  * @page: The page whose entry we want to lock
  *
  * Context: Process context.
  * Return: A cookie to pass to dax_unlock_page() or 0 if the entry could
  * not be locked.
  */
-dax_entry_t dax_lock_page(struct page *page)
+dax_entry_t dax_lock(struct address_space *mapping, pgoff_t index)
 {
 	XA_STATE(xas, NULL, 0);
 	void *entry;
@@ -394,8 +394,6 @@ dax_entry_t dax_lock_page(struct page *page)
 	/* Ensure page->mapping isn't freed while we look at it */
 	rcu_read_lock();
 	for (;;) {
-		struct address_space *mapping = READ_ONCE(page->mapping);
-
 		entry = NULL;
 		if (!mapping || !dax_mapping(mapping))
 			break;
@@ -413,11 +411,7 @@ dax_entry_t dax_lock_page(struct page *page)
 
 		xas.xa = &mapping->i_pages;
 		xas_lock_irq(&xas);
-		if (mapping != page->mapping) {
-			xas_unlock_irq(&xas);
-			continue;
-		}
-		xas_set(&xas, page->index);
+		xas_set(&xas, index);
 		entry = xas_load(&xas);
 		if (dax_is_locked(entry)) {
 			rcu_read_unlock();
@@ -433,10 +427,10 @@ dax_entry_t dax_lock_page(struct page *page)
 	return (dax_entry_t)entry;
 }
 
-void dax_unlock_page(struct page *page, dax_entry_t cookie)
+void dax_unlock(struct address_space *mapping, pgoff_t index,
+		dax_entry_t cookie)
 {
-	struct address_space *mapping = page->mapping;
-	XA_STATE(xas, &mapping->i_pages, page->index);
+	XA_STATE(xas, &mapping->i_pages, index);
 
 	if (S_ISCHR(mapping->host->i_mode))
 		return;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..a8d697eb886c 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -150,8 +150,9 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 
 struct page *dax_layout_busy_page(struct address_space *mapping);
 struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
-dax_entry_t dax_lock_page(struct page *page);
-void dax_unlock_page(struct page *page, dax_entry_t cookie);
+dax_entry_t dax_lock(struct address_space *mapping, pgoff_t index);
+void dax_unlock(struct address_space *mapping, pgoff_t index,
+		dax_entry_t cookie);
 #else
 static inline bool bdev_dax_supported(struct block_device *bdev,
 		int blocksize)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 872b51ebe57b..729448ed10b7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1141,6 +1141,14 @@ static inline bool is_device_private_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
 
+static inline bool is_device_fsdax_page(const struct page *page)
+{
+	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+		IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
+		is_zone_device_page(page) &&
+		page->pgmap->type == MEMORY_DEVICE_FS_DAX;
+}
+
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c0bb186bba62..f8f80458746e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -120,6 +120,9 @@ static int hwpoison_filter_dev(struct page *p)
 	if (PageSlab(p))
 		return -EINVAL;
 
+	if (is_device_fsdax_page(p))
+		return 0;
+
 	mapping = page_mapping(p);
 	if (mapping == NULL || mapping->host == NULL)
 		return -EINVAL;
@@ -290,9 +293,8 @@ void shake_page(struct page *p, int access)
 EXPORT_SYMBOL_GPL(shake_page);
 
 static unsigned long dev_pagemap_mapping_shift(struct page *page,
-		struct vm_area_struct *vma)
+		struct vm_area_struct *vma, unsigned long address)
 {
-	unsigned long address = vma_address(page, vma);
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
@@ -333,8 +335,8 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
  */
 static void add_to_kill(struct task_struct *tsk, struct page *p,
-		       struct vm_area_struct *vma,
-		       struct list_head *to_kill)
+		       struct address_space *mapping, pgoff_t pgoff,
+		       struct vm_area_struct *vma, struct list_head *to_kill)
 {
 	struct to_kill *tk;
 
@@ -344,12 +346,18 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
 		return;
 	}
 
-	tk->addr = page_address_in_vma(p, vma);
-	if (is_zone_device_page(p))
-		tk->size_shift = dev_pagemap_mapping_shift(p, vma);
-	else
-		tk->size_shift = page_shift(compound_head(p));
-
+	if (is_device_fsdax_page(p)) {
+		tk->addr = vma->vm_start +
+				((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+		tk->size_shift = dev_pagemap_mapping_shift(p, vma, tk->addr);
+	} else {
+		tk->addr = page_address_in_vma(p, vma);
+		if (is_zone_device_page(p)) {
+			tk->size_shift = dev_pagemap_mapping_shift(p, vma,
+							vma_address(p, vma));
+		} else
+			tk->size_shift = page_shift(compound_head(p));
+	}
 	/*
 	 * Send SIGKILL if "tk->addr == -EFAULT". Also, as
 	 * "tk->size_shift" is always non-zero for !is_zone_device_page(),
@@ -495,7 +503,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 			if (!page_mapped_in_vma(page, vma))
 				continue;
 			if (vma->vm_mm == t->mm)
-				add_to_kill(t, page, vma, to_kill);
+				add_to_kill(t, page, NULL, 0, vma, to_kill);
 		}
 	}
 	read_unlock(&tasklist_lock);
@@ -505,24 +513,20 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 /*
  * Collect processes when the error hit a file mapped page.
  */
-static void collect_procs_file(struct page *page, struct list_head *to_kill,
-				int force_early)
+static void collect_procs_file(struct page *page, struct address_space *mapping,
+		pgoff_t pgoff, struct list_head *to_kill, int force_early)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
-	struct address_space *mapping = page->mapping;
-	pgoff_t pgoff;
 
 	i_mmap_lock_read(mapping);
 	read_lock(&tasklist_lock);
 	pgoff = page_to_pgoff(page);
 	for_each_process(tsk) {
 		struct task_struct *t = task_early_kill(tsk, force_early);
-
 		if (!t)
 			continue;
-		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
-				      pgoff) {
+		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
 			/*
 			 * Send early kill signal to tasks where a vma covers
 			 * the page but the corrupted page is not necessarily
@@ -530,8 +534,10 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 			 * Assume applications who requested early kill want
 			 * to be informed of all such data corruptions.
 			 */
-			if (vma->vm_mm == t->mm)
-				add_to_kill(t, page, vma, to_kill);
+			if (vma->vm_mm == t->mm) {
+				add_to_kill(t, page, mapping, pgoff, vma,
+					    to_kill);
+			}
 		}
 	}
 	read_unlock(&tasklist_lock);
@@ -550,7 +556,8 @@ static void collect_procs(struct page *page, struct list_head *tokill,
 	if (PageAnon(page))
 		collect_procs_anon(page, tokill, force_early);
 	else
-		collect_procs_file(page, tokill, force_early);
+		collect_procs_file(page, page->mapping, page_to_pgoff(page),
+				   tokill, force_early);
 }
 
 static const char *action_name[] = {
@@ -1221,14 +1228,14 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	return res;
 }
 
-static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
-		struct dev_pagemap *pgmap)
+static int memory_failure_dev_pagemap_kill_procs(unsigned long pfn, int flags,
+		struct address_space *mapping, pgoff_t index)
 {
 	struct page *page = pfn_to_page(pfn);
 	const bool unmap_success = true;
 	unsigned long size = 0;
 	struct to_kill *tk;
-	LIST_HEAD(tokill);
+	LIST_HEAD(to_kill);
 	int rc = -EBUSY;
 	loff_t start;
 	dax_entry_t cookie;
@@ -1240,28 +1247,9 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	 * also prevents changes to the mapping of this pfn until
 	 * poison signaling is complete.
 	 */
-	cookie = dax_lock_page(page);
+	cookie = dax_lock(mapping, index);
 	if (!cookie)
-		goto out;
-
-	if (hwpoison_filter(page)) {
-		rc = 0;
-		goto unlock;
-	}
-
-	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-		/*
-		 * TODO: Handle HMM pages which may need coordination
-		 * with device-side memory.
-		 */
 		goto unlock;
-	}
-
-	/*
-	 * Use this flag as an indication that the dax page has been
-	 * remapped UC to prevent speculative consumption of poison.
-	 */
-	SetPageHWPoison(page);
 
 	/*
 	 * Unlike System-RAM there is no possibility to swap in a
@@ -1270,9 +1258,10 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	 * SIGBUS (i.e. MF_MUST_KILL)
 	 */
 	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
-	collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
+	collect_procs_file(page, mapping, index, &to_kill,
+			   flags & MF_ACTION_REQUIRED);
 
-	list_for_each_entry(tk, &tokill, nd)
+	list_for_each_entry(tk, &to_kill, nd)
 		if (tk->size_shift)
 			size = max(size, 1UL << tk->size_shift);
 	if (size) {
@@ -1282,13 +1271,51 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		 * actual size of the mapping being torn down is
 		 * communicated in siginfo, see kill_proc()
 		 */
-		start = (page->index << PAGE_SHIFT) & ~(size - 1);
-		unmap_mapping_range(page->mapping, start, start + size, 0);
+		start = (index << PAGE_SHIFT) & ~(size - 1);
+		unmap_mapping_range(mapping, start, start + size, 0);
 	}
-	kill_procs(&tokill, flags & MF_MUST_KILL, !unmap_success, pfn, flags);
+
+	kill_procs(&to_kill, flags & MF_MUST_KILL, !unmap_success, pfn, flags);
 	rc = 0;
 unlock:
-	dax_unlock_page(page, cookie);
+	dax_unlock(mapping, index, cookie);
+	return rc;
+}
+
+static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
+		struct dev_pagemap *pgmap)
+{
+	struct page *page = pfn_to_page(pfn);
+	struct mf_recover_controller mfrc = {
+		.recover_fn = memory_failure_dev_pagemap_kill_procs,
+		.pfn = pfn,
+		.flags = flags,
+	};
+	int rc;
+
+	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+		/*
+		 * TODO: Handle HMM pages which may need coordination
+		 * with device-side memory.
+		 */
+		goto out;
+	}
+
+	if (hwpoison_filter(page)) {
+		rc = 0;
+		goto out;
+	}
+
+	/*
+	 * Use this flag as an indication that the dax page has been
+	 * remapped UC to prevent speculative consumption of poison.
+	 */
+	SetPageHWPoison(page);
+
+	/* call driver to handle the memory failure */
+	if (pgmap->ops->memory_failure)
+		rc = pgmap->ops->memory_failure(pgmap, &mfrc);
+
 out:
 	/* drop pgmap ref acquired in caller */
 	put_dev_pagemap(pgmap);
-- 
2.29.2





^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH v2 6/6] fsdax: remove useless (dis)associate functions
  2020-11-23  0:41 [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink Shiyang Ruan
                   ` (4 preceding siblings ...)
  2020-11-23  0:41 ` [RFC PATCH v2 5/6] mm, fsdax: refactor dax handler in memory-failure Shiyang Ruan
@ 2020-11-23  0:41 ` Shiyang Ruan
  2020-11-29 22:47 ` [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink Dave Chinner
  2020-12-14 20:58 ` Jane Chu
  7 siblings, 0 replies; 15+ messages in thread
From: Shiyang Ruan @ 2020-11-23  0:41 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, linux-nvdimm, linux-mm
  Cc: linux-fsdevel, linux-raid, darrick.wong, dan.j.williams, david,
	hch, song, rgoldwyn, qi.fuli, y-goto

Since owner tracking is triggered by the pmem device, these functions are
useless.  So remove them.

Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com>
---
 fs/dax.c | 46 ----------------------------------------------
 1 file changed, 46 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 34471acde683..6c037287cb04 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -323,48 +323,6 @@ static unsigned long dax_end_pfn(void *entry)
 	for (pfn = dax_to_pfn(entry); \
 			pfn < dax_end_pfn(entry); pfn++)
 
-/*
- * TODO: for reflink+dax we need a way to associate a single page with
- * multiple address_space instances at different linear_page_index()
- * offsets.
- */
-static void dax_associate_entry(void *entry, struct address_space *mapping,
-		struct vm_area_struct *vma, unsigned long address)
-{
-	unsigned long size = dax_entry_size(entry), pfn, index;
-	int i = 0;
-
-	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-		return;
-
-	index = linear_page_index(vma, address & ~(size - 1));
-	for_each_mapped_pfn(entry, pfn) {
-		struct page *page = pfn_to_page(pfn);
-
-		WARN_ON_ONCE(page->mapping);
-		page->mapping = mapping;
-		page->index = index + i++;
-	}
-}
-
-static void dax_disassociate_entry(void *entry, struct address_space *mapping,
-		bool trunc)
-{
-	unsigned long pfn;
-
-	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
-		return;
-
-	for_each_mapped_pfn(entry, pfn) {
-		struct page *page = pfn_to_page(pfn);
-
-		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
-		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
-		page->mapping = NULL;
-		page->index = 0;
-	}
-}
-
 static struct page *dax_busy_page(void *entry)
 {
 	unsigned long pfn;
@@ -516,7 +474,6 @@ static void *grab_mapping_entry(struct xa_state *xas,
 			xas_lock_irq(xas);
 		}
 
-		dax_disassociate_entry(entry, mapping, false);
 		xas_store(xas, NULL);	/* undo the PMD join */
 		dax_wake_entry(xas, entry, true);
 		mapping->nrexceptional--;
@@ -653,7 +610,6 @@ static int __dax_invalidate_entry(struct address_space *mapping,
 	    (xas_get_mark(&xas, PAGECACHE_TAG_DIRTY) ||
 	     xas_get_mark(&xas, PAGECACHE_TAG_TOWRITE)))
 		goto out;
-	dax_disassociate_entry(entry, mapping, trunc);
 	xas_store(&xas, NULL);
 	mapping->nrexceptional--;
 	ret = 1;
@@ -747,8 +703,6 @@ static void *dax_insert_entry(struct xa_state *xas,
 	if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
 		void *old;
 
-		dax_disassociate_entry(entry, mapping, false);
-		dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
 		/*
 		 * Only swap our new entry into the page cache if the current
 		 * entry is a zero page or an empty entry.  If a normal PTE or
-- 
2.29.2






* Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink
  2020-11-23  0:41 [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink Shiyang Ruan
                   ` (5 preceding siblings ...)
  2020-11-23  0:41 ` [RFC PATCH v2 6/6] fsdax: remove useless (dis)associate functions Shiyang Ruan
@ 2020-11-29 22:47 ` Dave Chinner
  2020-12-02  7:12   ` Ruan Shiyang
  2020-12-14 20:58 ` Jane Chu
  7 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2020-11-29 22:47 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, linux-nvdimm, linux-mm, linux-fsdevel,
	linux-raid, darrick.wong, dan.j.williams, hch, song, rgoldwyn,
	qi.fuli, y-goto

On Mon, Nov 23, 2020 at 08:41:10AM +0800, Shiyang Ruan wrote:
> This patchset is a try to resolve the problem of tracking shared page
> for fsdax.
> 
> Change from v1:
>   - Introduce ->block_lost() for block device
>   - Support mapped device
>   - Add 'not available' warning for realtime device in XFS
>   - Rebased to v5.10-rc1
> 
> This patchset moves owner tracking from dax_associate_entry() to the
> pmem device, by introducing an interface ->memory_failure() of struct
> pagemap.  The interface is called by memory_failure() in mm, and
> implemented by the pmem device.  The pmem device then calls its
> ->block_lost() to find the filesystem in which the damaged page is
> located, and calls ->storage_lost() to track files or metadata
> associated with this page.  Finally we are able to try to fix the
> damaged data in the filesystem and do other necessary processing, such
> as killing processes that are using the affected files.
> 
> The call trace is like this:
>  memory_failure()
>    pgmap->ops->memory_failure()   => pmem_pgmap_memory_failure()
>     gendisk->fops->block_lost()   => pmem_block_lost() or
>                                          md_blk_block_lost()
>      sb->s_ops->storage_lost()    => xfs_fs_storage_lost()
>       xfs_rmap_query_range()
>        xfs_storage_lost_helper()
>         mf_recover_controller->recover_fn => \ 
>                             memory_failure_dev_pagemap_kill_procs()
> 
> The collect_procs() and kill_procs() are moved into a callback which
> is passed from memory_failure() to xfs_storage_lost_helper().  So we
> can call it when an associated file is found, instead of creating a
> file list and iterating over it.
> 
> The fsdax & reflink support for XFS is not contained in this patchset.

This looks promising - the overall architecture is a lot more
generic and less dependent on knowing about memory, dax or memory
failures. A few comments that I think would further improve
understanding the patchset and the implementation:

- the order of the patches is inverted. It should start with a
  single patch introducing the mf_recover_controller structure for
  callbacks, then introduce pgmap->ops->memory_failure, then
  ->block_lost, then the pmem and md implementations of ->block
  list, then ->storage_lost and the XFS implementations of
  ->storage_lost.

- I think the names "block_lost" and "storage_lost" are misleading.
  It's more like a "media failure" or a general "data corruption"
  event at a specific physical location. The data may not be "lost"
  but only damaged, so we might be able to recover from it without
  "losing" anything. Hence I think they could be better named,
  perhaps just "->corrupt_range"

- need to pass a {offset,len} pair through the chain, not just a
  single offset. This will allow other types of devices to report
  different ranges of failures, from a single sector to an entire
  device.

- I'm not sure that passing the mf_recover_controller structure
  through the corruption event chain is the right thing to do here.
  A block device could generate this storage failure callback if it
  detects an unrecoverable error (e.g. during a MD media scrub or
  rebuild/resilver failure) and in that case we don't have PFNs or
  memory device failure functions to perform.

  IOWs, I think the action that is taken needs to be independent of
  the source that generated the error. Even for a pmem device, we
  can be using the page cache, so it may be possible to recover the
  pmem error by writing the cached page (if it exists) back over the
  pmem.

  Hence I think that the recover function probably needs to be moved
  to the address space ops, because what we do to recover from the
  error is going to be dependent on type of mapping the filesystem
  is using. If it's a DAX mapping, we call back into a generic DAX
  function that does the vma walk and process kill functions. If it
  is a page cache mapping, then if the page is cached then we can
  try to re-write it to disk to fix the bad data, otherwise we treat
  it like a writeback error and report it on the next
  write/fsync/close operation done on that file.

  This gets rid of the mf_recover_controller altogether and allows
  the interface to be used by any sort of block device for any sort
  of bottom-up reporting of media/device failures.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink
  2020-11-29 22:47 ` [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink Dave Chinner
@ 2020-12-02  7:12   ` Ruan Shiyang
  2020-12-06 22:55     ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Ruan Shiyang @ 2020-12-02  7:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-xfs, linux-nvdimm, linux-mm, linux-fsdevel,
	linux-raid, darrick.wong, dan.j.williams, hch, song, rgoldwyn,
	qi.fuli, y-goto

Hi Dave,

On 2020/11/30 上午6:47, Dave Chinner wrote:
> On Mon, Nov 23, 2020 at 08:41:10AM +0800, Shiyang Ruan wrote:
>> 
>> The call trace is like this:
>>   memory_failure()
>>     pgmap->ops->memory_failure()   => pmem_pgmap_memory_failure()
>>      gendisk->fops->block_lost()   => pmem_block_lost() or
>>                                           md_blk_block_lost()
>>       sb->s_ops->storage_lost()    => xfs_fs_storage_lost()
>>        xfs_rmap_query_range()
>>         xfs_storage_lost_helper()
>>          mf_recover_controller->recover_fn => \
>>                              memory_failure_dev_pagemap_kill_procs()
>>
>> The collect_procs() and kill_procs() are moved into a callback which
>> is passed from memory_failure() to xfs_storage_lost_helper().  So we
>> can call it when an associated file is found, instead of creating a
>> file list and iterating over it.
>>
>> The fsdax & reflink support for XFS is not contained in this patchset.
> 
> This looks promising - the overall architecture is a lot more
> generic and less dependent on knowing about memory, dax or memory
> failures. A few comments that I think would further improve
> understanding the patchset and the implementation:

Thanks for your kindly comment.  It gives me confidence.

> 
> - the order of the patches is inverted. It should start with a
>    single patch introducing the mf_recover_controller structure for
>    callbacks, then introduce pgmap->ops->memory_failure, then
>    ->block_lost, then the pmem and md implementations of ->block
>    list, then ->storage_lost and the XFS implementations of
>    ->storage_lost.

Yes, it will be easier to understand the patchset in this order.

But there is something I am unsure about: for example, if I introduce 
->memory_failure() first, its implementation needs to call 
->block_lost(), which is supposed to be introduced in the next patch. 
So I am not sure what the pmem implementation of ->memory_failure() 
should contain at that point.  To avoid this situation, I committed the 
patches in the inverted order: lowest level first, then its caller, and 
then the caller's caller.

I am trying to sort out the order.  How about this:
  Patch i.
    Introduce ->memory_failure()
       - just introduce interface, without implementation
  Patch i++.
    Introduce ->block_lost()
       - introduce interface and implement ->memory_failure()
          in pmem, so that it can call ->block_lost()
  Patch i++.
    (similar with above, skip...)

> 
> - I think the names "block_lost" and "storage_lost" are misleading.
>    It's more like a "media failure" or a general "data corruption"
>    event at a specific physical location. The data may not be "lost"
>    but only damaged, so we might be able to recover from it without
>    "losing" anything. Hence I think they could be better named,
>    perhaps just "->corrupt_range"

'corrupt' sounds better.  (I'm not good at naming functions...)

> 
> - need to pass a {offset,len} pair through the chain, not just a
>    single offset. This will allow other types of devices to report
>    different ranges of failures, from a single sector to an entire
>    device.

Yes, it's better to add the length.  I had narrowly assumed that a 
memory failure on pmem affects only a single page at a time.

> 
> - I'm not sure that passing the mf_recover_controller structure
>    through the corruption event chain is the right thing to do here.
>    A block device could generate this storage failure callback if it
>    detects an unrecoverable error (e.g. during a MD media scrub or
>    rebuild/resilver failure) and in that case we don't have PFNs or
>    memory device failure functions to perform.
> 
>    IOWs, I think the action that is taken needs to be independent of
>    the source that generated the error. Even for a pmem device, we
>    can be using the page cache, so it may be possible to recover the
>    pmem error by writing the cached page (if it exists) back over the
>    pmem.
> 
>    Hence I think that the recover function probably needs to be moved
>    to the address space ops, because what we do to recover from the
>    error is going to be dependent on type of mapping the filesystem
>    is using. If it's a DAX mapping, we call back into a generic DAX
>    function that does the vma walk and process kill functions. If it
>    is a page cache mapping, then if the page is cached then we can
>    try to re-write it to disk to fix the bad data, otherwise we treat
>    it like a writeback error and report it on the next
>    write/fsync/close operation done on that file.
> 
>    This gets rid of the mf_recover_controller altogether and allows
>    the interface to be used by any sort of block device for any sort
>    of bottom-up reporting of media/device failures.

Moving the recover function to the address_space ops looks like a 
better idea.  But I think the error handling for page cache mappings is 
already done well in memory-failure, and memory-failure is also reused 
to handle anonymous pages.  If we move the recover function to the 
address_space ops, I think we also need to refactor the existing 
handler for page cache mappings, which may affect anonymous page 
handling.  This makes me confused...


I rewrote the call trace:
memory_failure()
  * dax mapping case
  pgmap->ops->memory_failure()          =>
                                    pmem_pgmap_memory_failure()
   gendisk->fops->block_corrupt_range() =>
                                    - pmem_block_corrupt_range()
                                    - md_blk_block_corrupt_range()
    sb->s_ops->storage_corrupt_range()  =>
                                    xfs_fs_storage_corrupt_range()
     xfs_rmap_query_range()
      xfs_storage_lost_helper()
       mapping->a_ops->corrupt_range()  =>
                                    xfs_dax_aops.xfs_dax_corrupt_range
        memory_failure_dev_pagemap_kill_procs()

  * page cache mapping case
  mapping->a_ops->corrupt_range()       =>
                                    xfs_address_space_operations.xfs_xxx
   memory_failure_generic_kill_procs()

It's rough and not completed yet.  Hope for your comment.

-- 
Thanks,
Ruan Shiyang.

> 
> Cheers,
> 
> Dave.
> 

* Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink
  2020-12-02  7:12   ` Ruan Shiyang
@ 2020-12-06 22:55     ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2020-12-06 22:55 UTC (permalink / raw)
  To: Ruan Shiyang
  Cc: linux-kernel, linux-xfs, linux-nvdimm, linux-mm, linux-fsdevel,
	linux-raid, darrick.wong, dan.j.williams, hch, song, rgoldwyn,
	qi.fuli, y-goto

On Wed, Dec 02, 2020 at 03:12:20PM +0800, Ruan Shiyang wrote:
> Hi Dave,
> 
> On 2020/11/30 上午6:47, Dave Chinner wrote:
> > On Mon, Nov 23, 2020 at 08:41:10AM +0800, Shiyang Ruan wrote:
> > > 
> > > The call trace is like this:
> > >   memory_failure()
> > >     pgmap->ops->memory_failure()   => pmem_pgmap_memory_failure()
> > >      gendisk->fops->block_lost()   => pmem_block_lost() or
> > >                                           md_blk_block_lost()
> > >       sb->s_ops->storage_lost()    => xfs_fs_storage_lost()
> > >        xfs_rmap_query_range()
> > >         xfs_storage_lost_helper()
> > >          mf_recover_controller->recover_fn => \
> > >                              memory_failure_dev_pagemap_kill_procs()
> > > 
> > > The collect_procs() and kill_procs() are moved into a callback which
> > > is passed from memory_failure() to xfs_storage_lost_helper().  So we
> > > can call it when an associated file is found, instead of creating a
> > > file list and iterating over it.
> > > 
> > > The fsdax & reflink support for XFS is not contained in this patchset.
> > 
> > This looks promising - the overall architecture is a lot more
> > generic and less dependent on knowing about memory, dax or memory
> > failures. A few comments that I think would further improve
> > understanding the patchset and the implementation:
> 
> Thanks for your kindly comment.  It gives me confidence.
> 
> > 
> > - the order of the patches is inverted. It should start with a
> >    single patch introducing the mf_recover_controller structure for
> >    callbacks, then introduce pgmap->ops->memory_failure, then
> >    ->block_lost, then the pmem and md implementations of ->block
> >    list, then ->storage_lost and the XFS implementations of
> >    ->storage_lost.
> 
> Yes, it will be easier to understand the patchset in this order.
> 
> But there is something I am unsure about: for example, if I introduce
> ->memory_failure() first, its implementation needs to call
> ->block_lost(), which is supposed to be introduced in the next patch.
> So I am not sure what the pmem implementation of ->memory_failure()
> should contain at that point.  To avoid this situation, I committed the
> patches in the inverted order: lowest level first, then its caller, and
> then the caller's caller.

Well, there's two things here. The first is the infrastructure, the
second is the drivers that use the infrastructure. You can introduce
a method in one patch, and then the driver that uses it in another.
Or you can introduce a driver skeleton that does nothing until
more infrastructure is added. So...

> 
> I am trying to sort out the order.  How about this:
>  Patch i.
>    Introduce ->memory_failure()
>       - just introduce interface, without implementation
>  Patch i++.
>    Introduce ->block_lost()
>       - introduce interface and implement ->memory_failure()
>          in pmem, so that it can call ->block_lost()
>  Patch i++.
>    (similar with above, skip...)

So this is better, but you don't need to add the pmem driver use of
"->block_lost" in the patch that adds the method. IOWs, something
like:

P1: introduce ->memory_failure API, all the required documentation
and add the call sites in the infrastructure that trigger it

P2: introduce ->corrupted_range to the block device API, all the
required documentation and any generic block infrastructure that
needs to call it.

P3: introduce ->corrupted_range to the superblock ops API, all the
required documentation

P4: add ->corrupted_range() API to the address space ops, all the
required documentation

P5: factor the existing kill procs stuff to be able to be called
via generic_mapping_kill_range()

P6: add dax_mapping_kill_range()

P7: add the pmem driver support for ->memory_failure

P8: add the block device driver support for ->corrupted_range

P9: add filesystem support for sb_ops->corrupted_range.

P10: add filesystem support for aops->corrupted_range.

> >    This gets rid of the mf_recover_controller altogether and allows
> >    the interface to be used by any sort of block device for any sort
> >    of bottom-up reporting of media/device failures.
> 
> Moving the recover function to the address_space ops looks a better idea.
> But I think that the error handler for page cache mapping is finished well
> in memory-failure.  The memory-failure is also reused to handles anonymous
> page.

Yes, anonymous page handling can remain there, we're only concerned
about handling file mapped pages here right now. If we end up
sharing page cache pages across reflink mappings, we'll have exactly
the same issue we have now with DAX....

> If we move the recover function to address_space ops, I think we also
> need to refactor the existing handler for page cache mapping, which may
> affect anonymous page handling.  This makes me confused...

Make the handling of the page the error occurred in conditional on
!PageAnon().

> I rewrote the call trace:
> memory_failure()
>  * dax mapping case
>  pgmap->ops->memory_failure()          =>
>                                    pmem_pgmap_memory_failure()
>   gendisk->fops->block_corrupt_range() =>
>                                    - pmem_block_corrupt_range()
>                                    - md_blk_block_corrupt_range()
>    sb->s_ops->storage_corrupt_range()  =>
>                                    xfs_fs_storage_corrupt_range()

No need for block/storage prefixes in these...

>     xfs_rmap_query_range()
>      xfs_storage_lost_helper()
>       mapping->a_ops->corrupt_range()  =>
>                                    xfs_dax_aops.xfs_dax_corrupt_range
>        memory_failure_dev_pagemap_kill_procs()

This assumes we find a user data mapping. We might find that the
corrupted storage contained metadata, in which case we'll be
shutting down the filesystem, not trying to kill user procs...

Also, we don't need aops->corrupt_range() here as we are already in
the filesystem code and if we find a mapping in memory we can just
do "if (IS_DAX(mapping->host))" to call the right kill procs
implementation for the mapping we've found. aops->corrupt_range is
for generic code working on a mapping to inform the filesystem that
there is a bad range in the mapping (my apologies for getting that
all mixed up in my last email).

>  * page cache mapping case
>  mapping->a_ops->corrupt_range()       =>
>                                    xfs_address_space_operations.xfs_xxx
>   memory_failure_generic_kill_procs()

We need the aops->corrupted_range() to call into the filesystem so
it can do a similar reverse mapping lookup to
sb->s_ops->corrupted_range.  Yes, the page cache should already have
a mapping attached to the page, but we do not know whether it is the
only mapping that exists for that page. e.g. if/when we implement
multiple-mapped shared read-only reflink pages in the page cache
which results in the same problem we have with DAX pages right now.

Overall, though, it seems like you're on the right path. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink
  2020-11-23  0:41 [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink Shiyang Ruan
                   ` (6 preceding siblings ...)
  2020-11-29 22:47 ` [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink Dave Chinner
@ 2020-12-14 20:58 ` Jane Chu
  2020-12-15 11:58   ` Ruan Shiyang
  7 siblings, 1 reply; 15+ messages in thread
From: Jane Chu @ 2020-12-14 20:58 UTC (permalink / raw)
  To: Shiyang Ruan, linux-kernel, linux-xfs, linux-nvdimm, linux-mm
  Cc: linux-fsdevel, linux-raid, darrick.wong, dan.j.williams, david,
	hch, song, rgoldwyn, qi.fuli, y-goto

Hi, Shiyang,

On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
> This patchset is a try to resolve the problem of tracking shared page
> for fsdax.
> 
> Change from v1:
>    - Introduce ->block_lost() for block device
>    - Support mapped device
>    - Add 'not available' warning for realtime device in XFS
>    - Rebased to v5.10-rc1
> 
> This patchset moves owner tracking from dax_associate_entry() to the
> pmem device, by introducing an interface ->memory_failure() of struct
> pagemap.  The interface is called by memory_failure() in mm, and
> implemented by the pmem device.  The pmem device then calls its
> ->block_lost() to find the filesystem in which the damaged page is
> located, and calls ->storage_lost() to track files or metadata
> associated with this page.  Finally we are able to try to fix the
> damaged data in the filesystem and do

Does that mean clearing poison? if so, would you mind to elaborate 
specifically which change does that?

Thanks!
-jane

> other necessary processing, such as killing processes that are using
> the affected files.



* Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink
  2020-12-14 20:58 ` Jane Chu
@ 2020-12-15 11:58   ` Ruan Shiyang
  2020-12-15 19:05     ` Jane Chu
  0 siblings, 1 reply; 15+ messages in thread
From: Ruan Shiyang @ 2020-12-15 11:58 UTC (permalink / raw)
  To: Jane Chu, linux-kernel, linux-xfs, linux-nvdimm, linux-mm
  Cc: linux-fsdevel, linux-raid, darrick.wong, dan.j.williams, david,
	hch, song, rgoldwyn, qi.fuli, y-goto

Hi Jane

On 2020/12/15 上午4:58, Jane Chu wrote:
> Hi, Shiyang,
> 
> On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
>> This patchset is a try to resolve the problem of tracking shared page
>> for fsdax.
>>
>> Change from v1:
>>    - Introduce ->block_lost() for block device
>>    - Support mapped device
>>    - Add 'not available' warning for realtime device in XFS
>>    - Rebased to v5.10-rc1
>>
>> This patchset moves owner tracking from dax_associate_entry() to the
>> pmem device, by introducing an interface ->memory_failure() of struct
>> pagemap.  The interface is called by memory_failure() in mm, and
>> implemented by the pmem device.  The pmem device then calls its
>> ->block_lost() to find the filesystem in which the damaged page is
>> located, and calls ->storage_lost() to track files or metadata
>> associated with this page.  Finally we are able to try to fix the
>> damaged data in the filesystem and do
> 
> Does that mean clearing poison? if so, would you mind to elaborate 
> specifically which change does that?

Recovering data for the filesystem (or pmem device) has not been done 
in this patchset...  Here I just triggered the handler for the files 
sharing the corrupted page.


--
Thanks,
Ruan Shiyang.

> 
> Thanks!
> -jane
> 
>> other necessary processing, such as killing processes who are using the
>> files affected.
> 
> 





* Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink
  2020-12-15 11:58   ` Ruan Shiyang
@ 2020-12-15 19:05     ` Jane Chu
  2020-12-15 23:10       ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Jane Chu @ 2020-12-15 19:05 UTC (permalink / raw)
  To: Ruan Shiyang, linux-kernel, linux-xfs, linux-nvdimm, linux-mm
  Cc: linux-fsdevel, linux-raid, darrick.wong, dan.j.williams, david,
	hch, song, rgoldwyn, qi.fuli, y-goto

On 12/15/2020 3:58 AM, Ruan Shiyang wrote:
> Hi Jane
> 
> On 2020/12/15 上午4:58, Jane Chu wrote:
>> Hi, Shiyang,
>>
>> On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
>>> This patchset is a try to resolve the problem of tracking shared page
>>> for fsdax.
>>>
>>> Change from v1:
>>>    - Introduce ->block_lost() for block device
>>>    - Support mapped device
>>>    - Add 'not available' warning for realtime device in XFS
>>>    - Rebased to v5.10-rc1
>>>
>>> This patchset moves owner tracking from dax_associate_entry() to the
>>> pmem device, by introducing an interface ->memory_failure() of struct
>>> pagemap.  The interface is called by memory_failure() in mm, and
>>> implemented by the pmem device.  The pmem device then calls its
>>> ->block_lost() to find the filesystem in which the damaged page is
>>> located, and calls ->storage_lost() to track files or metadata
>>> associated with this page.  Finally we are able to try to fix the
>>> damaged data in the filesystem and do
>>
>> Does that mean clearing poison?  If so, would you mind elaborating on
>> specifically which change does that?
> 
> Recovering data for the filesystem (or pmem device) has not been done in
> this patchset...  Here I only trigger the handlers for the files sharing
> the corrupted page.

Thanks! That confirms my understanding.

With the framework provided by the patchset, how do you envision it to
ease/simplify poison recovery from the user's perspective?

And how does it help in dealing with page faults upon poisoned
dax page?

thanks!
-jane



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink
  2020-12-15 19:05     ` Jane Chu
@ 2020-12-15 23:10       ` Dave Chinner
  2020-12-16  2:46         ` Darrick J. Wong
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2020-12-15 23:10 UTC (permalink / raw)
  To: Jane Chu
  Cc: Ruan Shiyang, linux-kernel, linux-xfs, linux-nvdimm, linux-mm,
	linux-fsdevel, linux-raid, darrick.wong, dan.j.williams, hch,
	song, rgoldwyn, qi.fuli, y-goto

On Tue, Dec 15, 2020 at 11:05:07AM -0800, Jane Chu wrote:
> On 12/15/2020 3:58 AM, Ruan Shiyang wrote:
> > Hi Jane
> > 
> > On 2020/12/15 4:58 AM, Jane Chu wrote:
> > > Hi, Shiyang,
> > > 
> > > On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
> > > > This patchset is an attempt to resolve the problem of tracking
> > > > shared pages for fsdax.
> > > > 
> > > > Change from v1:
> > > >    - Introduce ->block_lost() for block device
> > > >    - Support mapped device
> > > >    - Add 'not available' warning for realtime device in XFS
> > > >    - Rebased to v5.10-rc1
> > > > 
> > > > This patchset moves owner tracking from dax_associate_entry() to the
> > > > pmem device, by introducing an interface ->memory_failure() of struct
> > > > pagemap.  The interface is called by memory_failure() in mm, and
> > > > implemented by the pmem device.  The pmem device then calls its
> > > > ->block_lost() to find the filesystem in which the damaged page is
> > > > located, and calls ->storage_lost() to track files or metadata
> > > > associated with this page.  Finally we are able to try to fix the
> > > > damaged data in the filesystem and do
> > > 
> > > Does that mean clearing poison?  If so, would you mind elaborating on
> > > specifically which change does that?
> > 
> > Recovering data for the filesystem (or pmem device) has not been done in
> > this patchset...  Here I only trigger the handlers for the files sharing
> > the corrupted page.
> 
> Thanks! That confirms my understanding.
> 
> With the framework provided by the patchset, how do you envision it to
> ease/simplify poison recovery from the user's perspective?

At the moment, I'd say no change whatsoever.  The behaviour is
necessary so that we can kill whatever user application maps
multiply-shared physical blocks if there's a memory error.  The
recovery method from that is unchanged.  The only advantage may be
that the filesystem (if rmap is enabled) can tell you the exact file
and offset into the file where the data was corrupted.

However, it can be worse, too: it may also now completely shut down
the filesystem if the filesystem discovers the error is in metadata
rather than user data.  That's much more complex to recover from, and
right now will require downtime to take the filesystem offline and
run fsck to correct the error.  That may trash whatever the
unrecoverable metadata points to, so you still have a user data
recovery process to perform after this...

> And how does it help in dealing with page faults upon poisoned
> dax page?

It doesn't. If the page is poisoned, the same behaviour will occur
as does now. This is simply error reporting infrastructure, not
error handling.

Future work might change how we correct the faults found in the
storage, but I think the user visible behaviour is going to be "kill
apps mapping corrupted data" for a long time yet....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink
  2020-12-15 23:10       ` Dave Chinner
@ 2020-12-16  2:46         ` Darrick J. Wong
  0 siblings, 0 replies; 15+ messages in thread
From: Darrick J. Wong @ 2020-12-16  2:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jane Chu, Ruan Shiyang, linux-kernel, linux-xfs, linux-nvdimm,
	linux-mm, linux-fsdevel, linux-raid, dan.j.williams, hch, song,
	rgoldwyn, qi.fuli, y-goto

On Wed, Dec 16, 2020 at 10:10:22AM +1100, Dave Chinner wrote:
> On Tue, Dec 15, 2020 at 11:05:07AM -0800, Jane Chu wrote:
> > On 12/15/2020 3:58 AM, Ruan Shiyang wrote:
> > > Hi Jane
> > > 
> > > On 2020/12/15 4:58 AM, Jane Chu wrote:
> > > > Hi, Shiyang,
> > > > 
> > > > On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
> > > > > This patchset is an attempt to resolve the problem of tracking
> > > > > shared pages for fsdax.
> > > > > 
> > > > > Change from v1:
> > > > >    - Introduce ->block_lost() for block device
> > > > >    - Support mapped device
> > > > >    - Add 'not available' warning for realtime device in XFS
> > > > >    - Rebased to v5.10-rc1
> > > > > 
> > > > > This patchset moves owner tracking from dax_associate_entry() to the
> > > > > pmem device, by introducing an interface ->memory_failure() of struct
> > > > > pagemap.  The interface is called by memory_failure() in mm, and
> > > > > implemented by the pmem device.  The pmem device then calls its
> > > > > ->block_lost() to find the filesystem in which the damaged page is
> > > > > located, and calls ->storage_lost() to track files or metadata
> > > > > associated with this page.  Finally we are able to try to fix the
> > > > > damaged data in the filesystem and do
> > > > 
> > > > Does that mean clearing poison?  If so, would you mind elaborating
> > > > on specifically which change does that?
> > > 
> > > Recovering data for the filesystem (or pmem device) has not been done
> > > in this patchset...  Here I only trigger the handlers for the files
> > > sharing the corrupted page.
> > 
> > Thanks! That confirms my understanding.
> > 
> > With the framework provided by the patchset, how do you envision it to
> > ease/simplify poison recovery from the user's perspective?
> 
> At the moment, I'd say no change whatsoever.  The behaviour is
> necessary so that we can kill whatever user application maps
> multiply-shared physical blocks if there's a memory error.  The
> recovery method from that is unchanged.  The only advantage may be
> that the filesystem (if rmap is enabled) can tell you the exact file
> and offset into the file where the data was corrupted.
> 
> However, it can be worse, too: it may also now completely shut down
> the filesystem if the filesystem discovers the error is in metadata
> rather than user data.  That's much more complex to recover from, and
> right now will require downtime to take the filesystem offline and
> run fsck to correct the error.  That may trash whatever the
> unrecoverable metadata points to, so you still have a user data
> recovery process to perform after this...

...though for the future future I'd like to bypass the default behaviors
if there's somebody watching the sb notification that will also kick off
the appropriate repair activities.  The xfs auto-repair parts are coming
along nicely.  Dunno about userspace, though I figure if we can do
userspace page faults then some people could probably do autorepair
too.

--D

> > And how does it help in dealing with page faults upon poisoned
> > dax page?
> 
> It doesn't. If the page is poisoned, the same behaviour will occur
> as does now. This is simply error reporting infrastructure, not
> error handling.
> 
> Future work might change how we correct the faults found in the
> storage, but I think the user visible behaviour is going to be "kill
> apps mapping corrupted data" for a long time yet....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-12-16  2:46 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-23  0:41 [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink Shiyang Ruan
2020-11-23  0:41 ` [RFC PATCH v2 1/6] fs: introduce ->storage_lost() for memory-failure Shiyang Ruan
2020-11-23  0:41 ` [RFC PATCH v2 2/6] blk: introduce ->block_lost() to handle memory-failure Shiyang Ruan
2020-11-23  0:41 ` [RFC PATCH v2 3/6] md: implement ->block_lost() for memory-failure Shiyang Ruan
2020-11-23  0:41 ` [RFC PATCH v2 4/6] pagemap: introduce ->memory_failure() Shiyang Ruan
2020-11-23  0:41 ` [RFC PATCH v2 5/6] mm, fsdax: refactor dax handler in memory-failure Shiyang Ruan
2020-11-23  0:41 ` [RFC PATCH v2 6/6] fsdax: remove useless (dis)associate functions Shiyang Ruan
2020-11-29 22:47 ` [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink Dave Chinner
2020-12-02  7:12   ` Ruan Shiyang
2020-12-06 22:55     ` Dave Chinner
2020-12-14 20:58 ` Jane Chu
2020-12-15 11:58   ` Ruan Shiyang
2020-12-15 19:05     ` Jane Chu
2020-12-15 23:10       ` Dave Chinner
2020-12-16  2:46         ` Darrick J. Wong
