[PATCH v7 0/8] [PATCH v7 0/8] fsdax: introduce fs query to support reflink

nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed

* [PATCH v7 0/8] [PATCH v7 0/8] fsdax: introduce fs query to support reflink
@ 2021-09-24 13:09 Shiyang Ruan
  2021-09-24 13:09 ` [PATCH v7 1/8] dax: Use rwsem for dax_{read,write}_lock() Shiyang Ruan
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Shiyang Ruan @ 2021-09-24 13:09 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu

This patchset is aimed to support shared pages tracking for fsdax.

Changes from [V6 RESEND]:
  - Move ->memory_failure() into the patch who implements it
  - Change the parameter of ->memory_failure():
      unsigned long nr_pfns -> size_t size
  - Remove changes(2 patches) for mapped device
  - Add some necessary comments for functions or interfaces
  - P1: Make a pre-patch for changes for dax_{read,write}_lock()
  - P2: Change the parameter of ->notify_failure():
      void *data -> int flags
  - P3: keep the original dax_lock_page() logic
  - P5: Rewrite the lock function for file's mapping and index:
      dax_lock_mapping_entry()
  - P6: use the new dax_lock_mapping_entry() to lock dax entry, and add
      size parameter to handle a range of failure
  - P7: add a cross range calculation between memory_failure range and
      founded extent range
  - Rebased to v5.15-rc1

This patchset moves owner tracking from dax_assocaite_entry() to pmem
device driver, by introducing an interface ->memory_failure() for struct
pagemap.  This interface is called by memory_failure() in mm, and
implemented by pmem device.

Then call holder operations to find the filesystem which the corrupted
data located in, and call filesystem handler to track files or metadata
associated with this page.

Finally we are able to try to fix the corrupted data in filesystem and
do other necessary processing, such as killing processes who are using
the files affected.

The call trace is like this:
memory_failure()
|* fsdax case
|------------
|pgmap->ops->memory_failure()      => pmem_pgmap_memory_failure()
| dax_holder_notify_failure()      =>
|  dax_device->holder_ops->notify_failure() =>
|                                     - xfs_dax_notify_failure()
|  |* xfs_dax_notify_failure()
|  |--------------------------
|  |   xfs_rmap_query_range()
|  |    xfs_dax_notify_failure_fn()
|  |    * corrupted on metadata
|  |       try to recover data, call xfs_force_shutdown()
|  |    * corrupted on file data
|  |       try to recover data, call mf_dax_kill_procs()
|* normal case
|-------------
 mf_generic_kill_procs()

The fsdax & reflink support for XFS is not contained in this patchset.

(Rebased on v5.15-rc1)
==

Shiyang Ruan (8):
  dax: Use rwsem for dax_{read,write}_lock()
  dax: Introduce holder for dax_device
  mm: factor helpers for memory_failure_dev_pagemap
  pagemap,pmem: Introduce ->memory_failure()
  fsdax: Introduce dax_lock_mapping_entry()
  mm: Introduce mf_dax_kill_procs() for fsdax case
  xfs: Implement ->notify_failure() for XFS
  fsdax: add exception for reflinked files

 drivers/dax/device.c       |  11 +-
 drivers/dax/super.c        | 121 +++++++++++++++++---
 drivers/md/dm-writecache.c |   7 +-
 drivers/nvdimm/pmem.c      |  11 ++
 fs/dax.c                   | 115 ++++++++++++++-----
 fs/xfs/xfs_fsops.c         |   3 +
 fs/xfs/xfs_mount.h         |   1 +
 fs/xfs/xfs_super.c         | 188 +++++++++++++++++++++++++++++++
 include/linux/dax.h        |  80 ++++++++++++-
 include/linux/memremap.h   |   9 ++
 include/linux/mm.h         |   2 +
 mm/memory-failure.c        | 225 ++++++++++++++++++++++++++-----------
 12 files changed, 643 insertions(+), 130 deletions(-)

-- 
2.33.0




^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v7 1/8] dax: Use rwsem for dax_{read,write}_lock()
  2021-09-24 13:09 [PATCH v7 0/8] [PATCH v7 0/8] fsdax: introduce fs query to support reflink Shiyang Ruan
@ 2021-09-24 13:09 ` Shiyang Ruan
  2021-10-14 17:48   ` Darrick J. Wong
  2021-10-15  6:30   ` Christoph Hellwig
  2021-09-24 13:09 ` [PATCH v7 2/8] dax: Introduce holder for dax_device Shiyang Ruan
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Shiyang Ruan @ 2021-09-24 13:09 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu

In order to introduce dax holder registration, we need a write lock for
dax.  Because of the rarity of notification failures and the infrequency
of registration events, it would be better to be a global lock rather
than per-device.  So, change the current lock to rwsem and introduce a
write lock for registration.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 drivers/dax/device.c       | 11 +++++-----
 drivers/dax/super.c        | 43 ++++++++++++++++++++++----------------
 drivers/md/dm-writecache.c |  7 +++----
 fs/dax.c                   | 26 +++++++++++------------
 include/linux/dax.h        |  9 ++++----
 5 files changed, 49 insertions(+), 47 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index dd8222a42808..cc7b835509f9 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -198,7 +198,6 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 	struct file *filp = vmf->vma->vm_file;
 	unsigned long fault_size;
 	vm_fault_t rc = VM_FAULT_SIGBUS;
-	int id;
 	pfn_t pfn;
 	struct dev_dax *dev_dax = filp->private_data;
 
@@ -206,7 +205,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 			(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
 			vmf->vma->vm_start, vmf->vma->vm_end, pe_size);
 
-	id = dax_read_lock();
+	dax_read_lock();
 	switch (pe_size) {
 	case PE_SIZE_PTE:
 		fault_size = PAGE_SIZE;
@@ -246,7 +245,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 			page->index = pgoff + i;
 		}
 	}
-	dax_read_unlock(id);
+	dax_read_unlock();
 
 	return rc;
 }
@@ -284,7 +283,7 @@ static const struct vm_operations_struct dax_vm_ops = {
 static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 {
 	struct dev_dax *dev_dax = filp->private_data;
-	int rc, id;
+	int rc;
 
 	dev_dbg(&dev_dax->dev, "trace\n");
 
@@ -292,9 +291,9 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 	 * We lock to check dax_dev liveness and will re-check at
 	 * fault time.
 	 */
-	id = dax_read_lock();
+	dax_read_lock();
 	rc = check_vma(dev_dax, vma, __func__);
-	dax_read_unlock(id);
+	dax_read_unlock();
 	if (rc)
 		return rc;
 
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index fc89e91beea7..48ce86501d93 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -36,7 +36,7 @@ struct dax_device {
 };
 
 static dev_t dax_devt;
-DEFINE_STATIC_SRCU(dax_srcu);
+static DECLARE_RWSEM(dax_rwsem);
 static struct vfsmount *dax_mnt;
 static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
@@ -46,18 +46,28 @@ static struct super_block *dax_superblock __read_mostly;
 static struct hlist_head dax_host_list[DAX_HASH_SIZE];
 static DEFINE_SPINLOCK(dax_host_lock);
 
-int dax_read_lock(void)
+void dax_read_lock(void)
 {
-	return srcu_read_lock(&dax_srcu);
+	down_read(&dax_rwsem);
 }
 EXPORT_SYMBOL_GPL(dax_read_lock);
 
-void dax_read_unlock(int id)
+void dax_read_unlock(void)
 {
-	srcu_read_unlock(&dax_srcu, id);
+	up_read(&dax_rwsem);
 }
 EXPORT_SYMBOL_GPL(dax_read_unlock);
 
+void dax_write_lock(void)
+{
+	down_write(&dax_rwsem);
+}
+
+void dax_write_unlock(void)
+{
+	up_write(&dax_rwsem);
+}
+
 static int dax_host_hash(const char *host)
 {
 	return hashlen_hash(hashlen_string("DAX", host)) % DAX_HASH_SIZE;
@@ -70,14 +80,14 @@ static int dax_host_hash(const char *host)
 static struct dax_device *dax_get_by_host(const char *host)
 {
 	struct dax_device *dax_dev, *found = NULL;
-	int hash, id;
+	int hash;
 
 	if (!host)
 		return NULL;
 
 	hash = dax_host_hash(host);
 
-	id = dax_read_lock();
+	dax_read_lock();
 	spin_lock(&dax_host_lock);
 	hlist_for_each_entry(dax_dev, &dax_host_list[hash], list) {
 		if (!dax_alive(dax_dev)
@@ -89,7 +99,7 @@ static struct dax_device *dax_get_by_host(const char *host)
 		break;
 	}
 	spin_unlock(&dax_host_lock);
-	dax_read_unlock(id);
+	dax_read_unlock();
 
 	return found;
 }
@@ -130,7 +140,7 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
 	pfn_t pfn, end_pfn;
 	sector_t last_page;
 	long len, len2;
-	int err, id;
+	int err;
 
 	if (blocksize != PAGE_SIZE) {
 		pr_info("%pg: error: unsupported blocksize for dax\n", bdev);
@@ -155,14 +165,14 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
 		return false;
 	}
 
-	id = dax_read_lock();
+	dax_read_lock();
 	len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
 	len2 = dax_direct_access(dax_dev, pgoff_end, 1, &end_kaddr, &end_pfn);
 
 	if (len < 1 || len2 < 1) {
 		pr_info("%pg: error: dax access failed (%ld)\n",
 				bdev, len < 1 ? len : len2);
-		dax_read_unlock(id);
+		dax_read_unlock();
 		return false;
 	}
 
@@ -192,7 +202,7 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
 		put_dev_pagemap(end_pgmap);
 
 	}
-	dax_read_unlock(id);
+	dax_read_unlock();
 
 	if (!dax_enabled) {
 		pr_info("%pg: error: dax support not enabled\n", bdev);
@@ -206,16 +216,15 @@ bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
 		int blocksize, sector_t start, sector_t len)
 {
 	bool ret = false;
-	int id;
 
 	if (!dax_dev)
 		return false;
 
-	id = dax_read_lock();
+	dax_read_lock();
 	if (dax_alive(dax_dev) && dax_dev->ops->dax_supported)
 		ret = dax_dev->ops->dax_supported(dax_dev, bdev, blocksize,
 						  start, len);
-	dax_read_unlock(id);
+	dax_read_unlock();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(dax_supported);
@@ -410,7 +419,7 @@ EXPORT_SYMBOL_GPL(__set_dax_synchronous);
 
 bool dax_alive(struct dax_device *dax_dev)
 {
-	lockdep_assert_held(&dax_srcu);
+	lockdep_assert_held(&dax_rwsem);
 	return test_bit(DAXDEV_ALIVE, &dax_dev->flags);
 }
 EXPORT_SYMBOL_GPL(dax_alive);
@@ -428,8 +437,6 @@ void kill_dax(struct dax_device *dax_dev)
 
 	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
 
-	synchronize_srcu(&dax_srcu);
-
 	spin_lock(&dax_host_lock);
 	hlist_del_init(&dax_dev->list);
 	spin_unlock(&dax_host_lock);
diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
index 18320444fb0a..1067b3e98220 100644
--- a/drivers/md/dm-writecache.c
+++ b/drivers/md/dm-writecache.c
@@ -260,7 +260,6 @@ static int persistent_memory_claim(struct dm_writecache *wc)
 	loff_t s;
 	long p, da;
 	pfn_t pfn;
-	int id;
 	struct page **pages;
 	sector_t offset;
 
@@ -284,7 +283,7 @@ static int persistent_memory_claim(struct dm_writecache *wc)
 	}
 	offset >>= PAGE_SHIFT - 9;
 
-	id = dax_read_lock();
+	dax_read_lock();
 
 	da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn);
 	if (da < 0) {
@@ -334,7 +333,7 @@ static int persistent_memory_claim(struct dm_writecache *wc)
 		wc->memory_vmapped = true;
 	}
 
-	dax_read_unlock(id);
+	dax_read_unlock();
 
 	wc->memory_map += (size_t)wc->start_sector << SECTOR_SHIFT;
 	wc->memory_map_size -= (size_t)wc->start_sector << SECTOR_SHIFT;
@@ -343,7 +342,7 @@ static int persistent_memory_claim(struct dm_writecache *wc)
 err3:
 	kvfree(pages);
 err2:
-	dax_read_unlock(id);
+	dax_read_unlock();
 err1:
 	return r;
 }
diff --git a/fs/dax.c b/fs/dax.c
index 4e3e5a283a91..798c43f09eee 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -715,22 +715,21 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
 	void *vto, *kaddr;
 	pgoff_t pgoff;
 	long rc;
-	int id;
 
 	rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
 	if (rc)
 		return rc;
 
-	id = dax_read_lock();
+	dax_read_lock();
 	rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
 	if (rc < 0) {
-		dax_read_unlock(id);
+		dax_read_unlock();
 		return rc;
 	}
 	vto = kmap_atomic(to);
 	copy_user_page(vto, (void __force *)kaddr, vaddr, to);
 	kunmap_atomic(vto);
-	dax_read_unlock(id);
+	dax_read_unlock();
 	return 0;
 }
 
@@ -1015,13 +1014,13 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
 {
 	const sector_t sector = dax_iomap_sector(iomap, pos);
 	pgoff_t pgoff;
-	int id, rc;
+	int rc;
 	long length;
 
 	rc = bdev_dax_pgoff(iomap->bdev, sector, size, &pgoff);
 	if (rc)
 		return rc;
-	id = dax_read_lock();
+	dax_read_lock();
 	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
 				   NULL, pfnp);
 	if (length < 0) {
@@ -1038,7 +1037,7 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
 		goto out;
 	rc = 0;
 out:
-	dax_read_unlock(id);
+	dax_read_unlock();
 	return rc;
 }
 
@@ -1130,7 +1129,7 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
 {
 	sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
 	pgoff_t pgoff;
-	long rc, id;
+	long rc;
 	void *kaddr;
 	bool page_aligned = false;
 	unsigned offset = offset_in_page(pos);
@@ -1144,14 +1143,14 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
 	if (rc)
 		return rc;
 
-	id = dax_read_lock();
+	dax_read_lock();
 
 	if (page_aligned)
 		rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
 	else
 		rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
 	if (rc < 0) {
-		dax_read_unlock(id);
+		dax_read_unlock();
 		return rc;
 	}
 
@@ -1159,7 +1158,7 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
 		memset(kaddr + offset, 0, size);
 		dax_flush(iomap->dax_dev, kaddr + offset, size);
 	}
-	dax_read_unlock(id);
+	dax_read_unlock();
 	return size;
 }
 
@@ -1174,7 +1173,6 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 	loff_t end = pos + length, done = 0;
 	ssize_t ret = 0;
 	size_t xfer;
-	int id;
 
 	if (iov_iter_rw(iter) == READ) {
 		end = min(end, i_size_read(iomi->inode));
@@ -1199,7 +1197,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 					      (end - 1) >> PAGE_SHIFT);
 	}
 
-	id = dax_read_lock();
+	dax_read_lock();
 	while (pos < end) {
 		unsigned offset = pos & (PAGE_SIZE - 1);
 		const size_t size = ALIGN(length + offset, PAGE_SIZE);
@@ -1251,7 +1249,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 		if (xfer < map_len)
 			break;
 	}
-	dax_read_unlock(id);
+	dax_read_unlock();
 
 	return done ? done : ret;
 }
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 2619d94c308d..097b3304f9b9 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -177,15 +177,14 @@ static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
 #endif
 
 #if IS_ENABLED(CONFIG_DAX)
-int dax_read_lock(void);
-void dax_read_unlock(int id);
+void dax_read_lock(void);
+void dax_read_unlock(void);
 #else
-static inline int dax_read_lock(void)
+static inline void dax_read_lock(void)
 {
-	return 0;
 }
 
-static inline void dax_read_unlock(int id)
+static inline void dax_read_unlock(void)
 {
 }
 #endif /* CONFIG_DAX */
-- 
2.33.0




^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 2/8] dax: Introduce holder for dax_device
  2021-09-24 13:09 [PATCH v7 0/8] [PATCH v7 0/8] fsdax: introduce fs query to support reflink Shiyang Ruan
  2021-09-24 13:09 ` [PATCH v7 1/8] dax: Use rwsem for dax_{read,write}_lock() Shiyang Ruan
@ 2021-09-24 13:09 ` Shiyang Ruan
  2021-10-14 18:00   ` Darrick J. Wong
  2021-09-24 13:09 ` [PATCH v7 3/8] mm: factor helpers for memory_failure_dev_pagemap Shiyang Ruan
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Shiyang Ruan @ 2021-09-24 13:09 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu

To easily track filesystem from a pmem device, we introduce a holder for
dax_device structure, and also its operation.  This holder is used to
remember who is using this dax_device:
 - When it is the backend of a filesystem, the holder will be the
   superblock of this filesystem.
 - When this pmem device is one of the targets in a mapped device, the
   holder will be this mapped device.  In this case, the mapped device
   has its own dax_device and it will follow the first rule.  So that we
   can finally track to the filesystem we needed.

The holder and holder_ops will be set when filesystem is being mounted,
or an target device is being activated.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 drivers/dax/super.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h | 29 ++++++++++++++++++++++
 2 files changed, 88 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 48ce86501d93..7d4a11dcba90 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -23,7 +23,10 @@
  * @cdev: optional character interface for "device dax"
  * @host: optional name for lookups where the device path is not available
  * @private: dax driver private data
+ * @holder_data: holder of a dax_device: could be filesystem or mapped device
  * @flags: state and boolean properties
+ * @ops: operations for dax_device
+ * @holder_ops: operations for the inner holder
  */
 struct dax_device {
 	struct hlist_node list;
@@ -31,8 +34,10 @@ struct dax_device {
 	struct cdev cdev;
 	const char *host;
 	void *private;
+	void *holder_data;
 	unsigned long flags;
 	const struct dax_operations *ops;
+	const struct dax_holder_operations *holder_ops;
 };
 
 static dev_t dax_devt;
@@ -374,6 +379,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 }
 EXPORT_SYMBOL_GPL(dax_zero_page_range);
 
+int dax_holder_notify_failure(struct dax_device *dax_dev, loff_t offset,
+			      size_t size, int flags)
+{
+	int rc;
+
+	dax_read_lock();
+	if (!dax_alive(dax_dev)) {
+		rc = -ENXIO;
+		goto out;
+	}
+
+	if (!dax_dev->holder_data) {
+		rc = -EOPNOTSUPP;
+		goto out;
+	}
+
+	rc = dax_dev->holder_ops->notify_failure(dax_dev, offset, size, flags);
+out:
+	dax_read_unlock();
+	return rc;
+}
+EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
+
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
@@ -618,6 +646,37 @@ void put_dax(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(put_dax);
 
+void dax_set_holder(struct dax_device *dax_dev, void *holder,
+		const struct dax_holder_operations *ops)
+{
+	dax_write_lock();
+	if (!dax_alive(dax_dev)) {
+		dax_write_unlock();
+		return;
+	}
+
+	dax_dev->holder_data = holder;
+	dax_dev->holder_ops = ops;
+	dax_write_unlock();
+}
+EXPORT_SYMBOL_GPL(dax_set_holder);
+
+void *dax_get_holder(struct dax_device *dax_dev)
+{
+	void *holder;
+
+	dax_read_lock();
+	if (!dax_alive(dax_dev)) {
+		dax_read_unlock();
+		return NULL;
+	}
+
+	holder = dax_dev->holder_data;
+	dax_read_unlock();
+	return holder;
+}
+EXPORT_SYMBOL_GPL(dax_get_holder);
+
 /**
  * inode_dax: convert a public inode into its dax_dev
  * @inode: An inode with i_cdev pointing to a dax_dev
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 097b3304f9b9..d273d59723cd 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -38,9 +38,24 @@ struct dax_operations {
 	int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
 };
 
+struct dax_holder_operations {
+	/*
+	 * notify_failure - notify memory failure into inner holder device
+	 * @dax_dev: the dax device which contains the holder
+	 * @offset: offset on this dax device where memory failure occurs
+	 * @size: length of this memory failure event
+	 * @flags: action flags for memory failure handler
+	 */
+	int (*notify_failure)(struct dax_device *dax_dev, loff_t offset,
+			size_t size, int flags);
+};
+
 extern struct attribute_group dax_attribute_group;
 
 #if IS_ENABLED(CONFIG_DAX)
+void dax_set_holder(struct dax_device *dax_dev, void *holder,
+		const struct dax_holder_operations *ops);
+void *dax_get_holder(struct dax_device *dax_dev);
 struct dax_device *alloc_dax(void *private, const char *host,
 		const struct dax_operations *ops, unsigned long flags);
 void put_dax(struct dax_device *dax_dev);
@@ -70,6 +85,18 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
 	return dax_synchronous(dax_dev);
 }
 #else
+static inline struct dax_device *dax_get_by_host(const char *host)
+{
+	return NULL;
+}
+static inline void dax_set_holder(struct dax_device *dax_dev, void *holder,
+		const struct dax_holder_operations *ops)
+{
+}
+static inline void *dax_get_holder(struct dax_device *dax_dev)
+{
+	return NULL;
+}
 static inline struct dax_device *alloc_dax(void *private, const char *host,
 		const struct dax_operations *ops, unsigned long flags)
 {
@@ -198,6 +225,8 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 		size_t bytes, struct iov_iter *i);
 int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 			size_t nr_pages);
+int dax_holder_notify_failure(struct dax_device *dax_dev, loff_t offset,
+		size_t size, int flags);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
 
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
-- 
2.33.0




^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 3/8] mm: factor helpers for memory_failure_dev_pagemap
  2021-09-24 13:09 [PATCH v7 0/8] [PATCH v7 0/8] fsdax: introduce fs query to support reflink Shiyang Ruan
  2021-09-24 13:09 ` [PATCH v7 1/8] dax: Use rwsem for dax_{read,write}_lock() Shiyang Ruan
  2021-09-24 13:09 ` [PATCH v7 2/8] dax: Introduce holder for dax_device Shiyang Ruan
@ 2021-09-24 13:09 ` Shiyang Ruan
  2021-10-14 18:02   ` Darrick J. Wong
  2021-10-15  6:33   ` Christoph Hellwig
  2021-09-24 13:09 ` [PATCH v7 4/8] pagemap,pmem: Introduce ->memory_failure() Shiyang Ruan
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Shiyang Ruan @ 2021-09-24 13:09 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu

memory_failure_dev_pagemap code is a bit complex before introduce RMAP
feature for fsdax.  So it is needed to factor some helper functions to
simplify these code.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 mm/memory-failure.c | 140 ++++++++++++++++++++++++--------------------
 1 file changed, 76 insertions(+), 64 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 54879c339024..8ff9b52823c0 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1430,6 +1430,79 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
 	return 0;
 }
 
+static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
+		struct address_space *mapping, pgoff_t index, int flags)
+{
+	struct to_kill *tk;
+	unsigned long size = 0;
+
+	list_for_each_entry(tk, to_kill, nd)
+		if (tk->size_shift)
+			size = max(size, 1UL << tk->size_shift);
+	if (size) {
+		/*
+		 * Unmap the largest mapping to avoid breaking up device-dax
+		 * mappings which are constant size. The actual size of the
+		 * mapping being torn down is communicated in siginfo, see
+		 * kill_proc()
+		 */
+		loff_t start = (index << PAGE_SHIFT) & ~(size - 1);
+
+		unmap_mapping_range(mapping, start, size, 0);
+	}
+
+	kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
+}
+
+static int mf_generic_kill_procs(unsigned long long pfn, int flags,
+		struct dev_pagemap *pgmap)
+{
+	struct page *page = pfn_to_page(pfn);
+	LIST_HEAD(to_kill);
+	dax_entry_t cookie;
+
+	/*
+	 * Prevent the inode from being freed while we are interrogating
+	 * the address_space, typically this would be handled by
+	 * lock_page(), but dax pages do not use the page lock. This
+	 * also prevents changes to the mapping of this pfn until
+	 * poison signaling is complete.
+	 */
+	cookie = dax_lock_page(page);
+	if (!cookie)
+		return -EBUSY;
+
+	if (hwpoison_filter(page))
+		return 0;
+
+	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+		/*
+		 * TODO: Handle HMM pages which may need coordination
+		 * with device-side memory.
+		 */
+		return -EBUSY;
+	}
+
+	/*
+	 * Use this flag as an indication that the dax page has been
+	 * remapped UC to prevent speculative consumption of poison.
+	 */
+	SetPageHWPoison(page);
+
+	/*
+	 * Unlike System-RAM there is no possibility to swap in a
+	 * different physical page at a given virtual address, so all
+	 * userspace consumption of ZONE_DEVICE memory necessitates
+	 * SIGBUS (i.e. MF_MUST_KILL)
+	 */
+	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+	collect_procs(page, &to_kill, true);
+
+	unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
+	dax_unlock_page(page, cookie);
+	return 0;
+}
+
 static int memory_failure_hugetlb(unsigned long pfn, int flags)
 {
 	struct page *p = pfn_to_page(pfn);
@@ -1519,12 +1592,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		struct dev_pagemap *pgmap)
 {
 	struct page *page = pfn_to_page(pfn);
-	unsigned long size = 0;
-	struct to_kill *tk;
 	LIST_HEAD(tokill);
-	int rc = -EBUSY;
-	loff_t start;
-	dax_entry_t cookie;
+	int rc = -ENXIO;
 
 	if (flags & MF_COUNT_INCREASED)
 		/*
@@ -1533,67 +1602,10 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 		put_page(page);
 
 	/* device metadata space is not recoverable */
-	if (!pgmap_pfn_valid(pgmap, pfn)) {
-		rc = -ENXIO;
-		goto out;
-	}
-
-	/*
-	 * Prevent the inode from being freed while we are interrogating
-	 * the address_space, typically this would be handled by
-	 * lock_page(), but dax pages do not use the page lock. This
-	 * also prevents changes to the mapping of this pfn until
-	 * poison signaling is complete.
-	 */
-	cookie = dax_lock_page(page);
-	if (!cookie)
+	if (!pgmap_pfn_valid(pgmap, pfn))
 		goto out;
 
-	if (hwpoison_filter(page)) {
-		rc = 0;
-		goto unlock;
-	}
-
-	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-		/*
-		 * TODO: Handle HMM pages which may need coordination
-		 * with device-side memory.
-		 */
-		goto unlock;
-	}
-
-	/*
-	 * Use this flag as an indication that the dax page has been
-	 * remapped UC to prevent speculative consumption of poison.
-	 */
-	SetPageHWPoison(page);
-
-	/*
-	 * Unlike System-RAM there is no possibility to swap in a
-	 * different physical page at a given virtual address, so all
-	 * userspace consumption of ZONE_DEVICE memory necessitates
-	 * SIGBUS (i.e. MF_MUST_KILL)
-	 */
-	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
-	collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
-
-	list_for_each_entry(tk, &tokill, nd)
-		if (tk->size_shift)
-			size = max(size, 1UL << tk->size_shift);
-	if (size) {
-		/*
-		 * Unmap the largest mapping to avoid breaking up
-		 * device-dax mappings which are constant size. The
-		 * actual size of the mapping being torn down is
-		 * communicated in siginfo, see kill_proc()
-		 */
-		start = (page->index << PAGE_SHIFT) & ~(size - 1);
-		unmap_mapping_range(page->mapping, start, size, 0);
-	}
-	kill_procs(&tokill, flags & MF_MUST_KILL, false, pfn, flags);
-	rc = 0;
-unlock:
-	dax_unlock_page(page, cookie);
+	rc = mf_generic_kill_procs(pfn, flags, pgmap);
 out:
 	/* drop pgmap ref acquired in caller */
 	put_dev_pagemap(pgmap);
-- 
2.33.0




^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 4/8] pagemap,pmem: Introduce ->memory_failure()
  2021-09-24 13:09 [PATCH v7 0/8] [PATCH v7 0/8] fsdax: introduce fs query to support reflink Shiyang Ruan
                   ` (2 preceding siblings ...)
  2021-09-24 13:09 ` [PATCH v7 3/8] mm: factor helpers for memory_failure_dev_pagemap Shiyang Ruan
@ 2021-09-24 13:09 ` Shiyang Ruan
  2021-10-14 18:05   ` Darrick J. Wong
  2021-10-15  6:36   ` Christoph Hellwig
  2021-09-24 13:09 ` [PATCH v7 5/8] fsdax: Introduce dax_lock_mapping_entry() Shiyang Ruan
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Shiyang Ruan @ 2021-09-24 13:09 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu

When memory-failure occurs, we call this function which is implemented
by each kind of devices.  For the fsdax case, pmem device driver
implements it.  Pmem device driver will find out the filesystem in which
the corrupted page located in.

With dax_holder notify support, we are able to notify the memory failure
from pmem driver to upper layers.  If there is something not support in
the notify routine, memory_failure will fall back to the generic hanlder.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 drivers/nvdimm/pmem.c    | 11 +++++++++++
 include/linux/memremap.h |  9 +++++++++
 mm/memory-failure.c      | 14 ++++++++++++++
 3 files changed, 34 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 72de88ff0d30..0dfafad8fcc5 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -362,9 +362,20 @@ static void pmem_release_disk(void *__pmem)
 	del_gendisk(pmem->disk);
 }
 
+static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
+		unsigned long pfn, size_t size, int flags)
+{
+	struct pmem_device *pmem =
+			container_of(pgmap, struct pmem_device, pgmap);
+	loff_t offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;
+
+	return dax_holder_notify_failure(pmem->dax_dev, offset, size, flags);
+}
+
 static const struct dev_pagemap_ops fsdax_pagemap_ops = {
 	.kill			= pmem_pagemap_kill,
 	.cleanup		= pmem_pagemap_cleanup,
+	.memory_failure		= pmem_pagemap_memory_failure,
 };
 
 static int pmem_attach_disk(struct device *dev,
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index c0e9d35889e8..36d47bacd46d 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -87,6 +87,15 @@ struct dev_pagemap_ops {
 	 * the page back to a CPU accessible page.
 	 */
 	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+
+	/*
+	 * Handle the memory failure happens on a range of pfns.  Notify the
+	 * processes who are using these pfns, and try to recover the data on
+	 * them if necessary.  The flag is finally passed to the recover
+	 * function through the whole notify routine.
+	 */
+	int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
+			      size_t size, int flags);
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8ff9b52823c0..85eab206b68f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1605,6 +1605,20 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	if (!pgmap_pfn_valid(pgmap, pfn))
 		goto out;
 
+	/*
+	 * Call driver's implementation to handle the memory failure, otherwise
+	 * fall back to generic handler.
+	 */
+	if (pgmap->ops->memory_failure) {
+		rc = pgmap->ops->memory_failure(pgmap, pfn, PAGE_SIZE, flags);
+		/*
+		 * Fall back to generic handler too if operation is not
+		 * supported inside the driver/device/filesystem.
+		 */
+		if (rc != EOPNOTSUPP)
+			goto out;
+	}
+
 	rc = mf_generic_kill_procs(pfn, flags, pgmap);
 out:
 	/* drop pgmap ref acquired in caller */
-- 
2.33.0




^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 5/8] fsdax: Introduce dax_lock_mapping_entry()
  2021-09-24 13:09 [PATCH v7 0/8] [PATCH v7 0/8] fsdax: introduce fs query to support reflink Shiyang Ruan
                   ` (3 preceding siblings ...)
  2021-09-24 13:09 ` [PATCH v7 4/8] pagemap,pmem: Introduce ->memory_failure() Shiyang Ruan
@ 2021-09-24 13:09 ` Shiyang Ruan
  2021-10-14 18:17   ` Darrick J. Wong
  2021-09-24 13:09 ` [PATCH v7 6/8] mm: Introduce mf_dax_kill_procs() for fsdax case Shiyang Ruan
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Shiyang Ruan @ 2021-09-24 13:09 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu

The current dax_lock_page() locks dax entry by obtaining mapping and
index in page.  To support 1-to-N RMAP in NVDIMM, we need a new function
to lock a specific dax entry corresponding to this file's mapping,index.
And BTW, output the page corresponding to the specific dax entry for
caller use.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 fs/dax.c            | 65 ++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/dax.h | 15 +++++++++++
 2 files changed, 79 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 798c43f09eee..509b65e60478 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -390,7 +390,7 @@ static struct page *dax_busy_page(void *entry)
 }
 
 /*
- * dax_lock_mapping_entry - Lock the DAX entry corresponding to a page
+ * dax_lock_page - Lock the DAX entry corresponding to a page
  * @page: The page whose entry we want to lock
  *
  * Context: Process context.
@@ -455,6 +455,69 @@ void dax_unlock_page(struct page *page, dax_entry_t cookie)
 	dax_unlock_entry(&xas, (void *)cookie);
 }
 
+/*
+ * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping
+ * @mapping: the file's mapping whose entry we want to lock
+ * @index: the offset within this file
+ * @page: output the dax page corresponding to this dax entry
+ *
+ * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry
+ * could not be locked.
+ */
+dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t index,
+		struct page **page)
+{
+	XA_STATE(xas, NULL, 0);
+	void *entry;
+
+	rcu_read_lock();
+	for (;;) {
+		entry = NULL;
+		if (!dax_mapping(mapping))
+			break;
+
+		xas.xa = &mapping->i_pages;
+		xas_lock_irq(&xas);
+		xas_set(&xas, index);
+		entry = xas_load(&xas);
+		if (dax_is_locked(entry)) {
+			rcu_read_unlock();
+			wait_entry_unlocked(&xas, entry);
+			rcu_read_lock();
+			continue;
+		}
+		if (!entry ||
+		    dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
+			/*
+			 * Because we are looking for entry from file's mapping
+			 * and index, so the entry may not be inserted for now,
+			 * or even a zero/empty entry.  We don't think this is
+			 * an error case.  So, return a special value and do
+			 * not output @page.
+			 */
+			entry = (void *)~0UL;
+		} else {
+			*page = pfn_to_page(dax_to_pfn(entry));
+			dax_lock_entry(&xas, entry);
+		}
+		xas_unlock_irq(&xas);
+		break;
+	}
+	rcu_read_unlock();
+	return (dax_entry_t)entry;
+}
+
+void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index,
+		dax_entry_t cookie)
+{
+	XA_STATE(xas, &mapping->i_pages, index);
+
+	if (cookie == ~0UL)
+		return;
+
+	dax_unlock_entry(&xas, (void *)cookie);
+}
+
 /*
  * Find page cache entry at given index. If it is a DAX entry, return it
  * with the entry locked. If the page cache doesn't contain an entry at
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d273d59723cd..65411bee4312 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -156,6 +156,10 @@ struct page *dax_layout_busy_page(struct address_space *mapping);
 struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
+dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
+		unsigned long index, struct page **page);
+void dax_unlock_mapping_entry(struct address_space *mapping,
+		unsigned long index, dax_entry_t cookie);
 #else
 #define generic_fsdax_supported		NULL
 
@@ -201,6 +205,17 @@ static inline dax_entry_t dax_lock_page(struct page *page)
 static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
 {
 }
+
+static inline dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
+		unsigned long index, struct page **page)
+{
+	return 0;
+}
+
+static inline void dax_unlock_mapping_entry(struct address_space *mapping,
+		unsigned long index, dax_entry_t cookie)
+{
+}
 #endif
 
 #if IS_ENABLED(CONFIG_DAX)
-- 
2.33.0




^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 6/8] mm: Introduce mf_dax_kill_procs() for fsdax case
  2021-09-24 13:09 [PATCH v7 0/8] [PATCH v7 0/8] fsdax: introduce fs query to support reflink Shiyang Ruan
                   ` (4 preceding siblings ...)
  2021-09-24 13:09 ` [PATCH v7 5/8] fsdax: Introduce dax_lock_mapping_entry() Shiyang Ruan
@ 2021-09-24 13:09 ` Shiyang Ruan
  2021-10-14 19:32   ` Darrick J. Wong
  2021-09-24 13:09 ` [PATCH v7 7/8] xfs: Implement ->notify_failure() for XFS Shiyang Ruan
  2021-09-24 13:09 ` [PATCH v7 8/8] fsdax: add exception for reflinked files Shiyang Ruan
  7 siblings, 1 reply; 26+ messages in thread
From: Shiyang Ruan @ 2021-09-24 13:09 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu

This function is called at the end of RMAP routine, i.e. filesystem
recovery function, to collect and kill processes using a shared page of
DAX file.  The difference between mf_generic_kill_procs() is,
it accepts file's mapping,offset instead of struct page.  Because
different file's mappings and offsets may share the same page in fsdax
mode.  So, it is called when filesystem RMAP results are found.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 fs/dax.c            | 10 ------
 include/linux/dax.h |  9 +++++
 include/linux/mm.h  |  2 ++
 mm/memory-failure.c | 83 ++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 86 insertions(+), 18 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 509b65e60478..2536c105ec7f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -852,16 +852,6 @@ static void *dax_insert_entry(struct xa_state *xas,
 	return entry;
 }
 
-static inline
-unsigned long pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
-{
-	unsigned long address;
-
-	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
-	return address;
-}
-
 /* Walk all mappings of a given index of a file and writeprotect them */
 static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index,
 		unsigned long pfn)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 65411bee4312..3d90becbd160 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -258,6 +258,15 @@ static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
 }
+static inline unsigned long pgoff_address(pgoff_t pgoff,
+		struct vm_area_struct *vma)
+{
+	unsigned long address;
+
+	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+	return address;
+}
 
 #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
 void hmem_register_device(int target_nid, struct resource *r);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73a52aba448f..d06af0051e53 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3114,6 +3114,8 @@ enum mf_flags {
 	MF_MUST_KILL = 1 << 2,
 	MF_SOFT_OFFLINE = 1 << 3,
 };
+extern int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
+			     size_t size, int flags);
 extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 85eab206b68f..a9d0d487d205 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -302,10 +302,9 @@ void shake_page(struct page *p)
 }
 EXPORT_SYMBOL_GPL(shake_page);
 
-static unsigned long dev_pagemap_mapping_shift(struct page *page,
+static unsigned long dev_pagemap_mapping_shift(unsigned long address,
 		struct vm_area_struct *vma)
 {
-	unsigned long address = vma_address(page, vma);
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
@@ -345,7 +344,7 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
  * Schedule a process for later kill.
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
  */
-static void add_to_kill(struct task_struct *tsk, struct page *p,
+static void add_to_kill(struct task_struct *tsk, struct page *p, pgoff_t pgoff,
 		       struct vm_area_struct *vma,
 		       struct list_head *to_kill)
 {
@@ -358,9 +357,15 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
 	}
 
 	tk->addr = page_address_in_vma(p, vma);
-	if (is_zone_device_page(p))
-		tk->size_shift = dev_pagemap_mapping_shift(p, vma);
-	else
+	if (is_zone_device_page(p)) {
+		/*
+		 * Since page->mapping is no more used for fsdax, we should
+		 * calculate the address in a fsdax way.
+		 */
+		if (p->pgmap->type == MEMORY_DEVICE_FS_DAX)
+			tk->addr = pgoff_address(pgoff, vma);
+		tk->size_shift = dev_pagemap_mapping_shift(tk->addr, vma);
+	} else
 		tk->size_shift = page_shift(compound_head(p));
 
 	/*
@@ -508,7 +513,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 			if (!page_mapped_in_vma(page, vma))
 				continue;
 			if (vma->vm_mm == t->mm)
-				add_to_kill(t, page, vma, to_kill);
+				add_to_kill(t, page, 0, vma, to_kill);
 		}
 	}
 	read_unlock(&tasklist_lock);
@@ -544,7 +549,32 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 			 * to be informed of all such data corruptions.
 			 */
 			if (vma->vm_mm == t->mm)
-				add_to_kill(t, page, vma, to_kill);
+				add_to_kill(t, page, 0, vma, to_kill);
+		}
+	}
+	read_unlock(&tasklist_lock);
+	i_mmap_unlock_read(mapping);
+}
+
+/*
+ * Collect processes when the error hit a fsdax page.
+ */
+static void collect_procs_fsdax(struct page *page, struct address_space *mapping,
+		pgoff_t pgoff, struct list_head *to_kill)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *tsk;
+
+	i_mmap_lock_read(mapping);
+	read_lock(&tasklist_lock);
+	for_each_process(tsk) {
+		struct task_struct *t = task_early_kill(tsk, true);
+
+		if (!t)
+			continue;
+		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+			if (vma->vm_mm == t->mm)
+				add_to_kill(t, page, pgoff, vma, to_kill);
 		}
 	}
 	read_unlock(&tasklist_lock);
@@ -1503,6 +1533,43 @@ static int mf_generic_kill_procs(unsigned long long pfn, int flags,
 	return 0;
 }
 
+/**
+ * mf_dax_kill_procs - Collect and kill processes who are using this file range
+ * @mapping:	the file in use
+ * @index:	start offset of the range
+ * @size:	length of the range
+ * @flags:	memory failure flags
+ */
+int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
+		size_t size, int flags)
+{
+	LIST_HEAD(to_kill);
+	dax_entry_t cookie;
+	struct page *page;
+	size_t end = (index << PAGE_SHIFT) + size;
+
+	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+
+	for (; (index << PAGE_SHIFT) < end; index++) {
+		page = NULL;
+		cookie = dax_lock_mapping_entry(mapping, index, &page);
+		if (!cookie)
+			return -EBUSY;
+		if (!page)
+			goto unlock;
+
+		SetPageHWPoison(page);
+
+		collect_procs_fsdax(page, mapping, index, &to_kill);
+		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
+				index, flags);
+unlock:
+		dax_unlock_mapping_entry(mapping, index, cookie);
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
+
 static int memory_failure_hugetlb(unsigned long pfn, int flags)
 {
 	struct page *p = pfn_to_page(pfn);
-- 
2.33.0




^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 7/8] xfs: Implement ->notify_failure() for XFS
  2021-09-24 13:09 [PATCH v7 0/8] [PATCH v7 0/8] fsdax: introduce fs query to support reflink Shiyang Ruan
                   ` (5 preceding siblings ...)
  2021-09-24 13:09 ` [PATCH v7 6/8] mm: Introduce mf_dax_kill_procs() for fsdax case Shiyang Ruan
@ 2021-09-24 13:09 ` Shiyang Ruan
  2021-10-14 19:21   ` Darrick J. Wong
  2021-10-15  6:41   ` Christoph Hellwig
  2021-09-24 13:09 ` [PATCH v7 8/8] fsdax: add exception for reflinked files Shiyang Ruan
  7 siblings, 2 replies; 26+ messages in thread
From: Shiyang Ruan @ 2021-09-24 13:09 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu

This function is used to handle errors which may cause data lost in
filesystem.  Such as memory failure in fsdax mode.

If the rmap feature of XFS enabled, we can query it to find files and
metadata which are associated with the corrupt data.  For now all we do
is kill processes with that file mapped into their address spaces, but
future patches could actually do something about corrupt metadata.

After that, the memory failure needs to notify the processes who are
using those files.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 drivers/dax/super.c |  19 +++++
 fs/xfs/xfs_fsops.c  |   3 +
 fs/xfs/xfs_mount.h  |   1 +
 fs/xfs/xfs_super.c  | 188 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h |  18 +++++
 5 files changed, 229 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 7d4a11dcba90..22091e7fb0ef 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -135,6 +135,25 @@ struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
 }
 EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
 
+void fs_dax_register_holder(struct dax_device *dax_dev, void *holder,
+		const struct dax_holder_operations *ops)
+{
+	dax_set_holder(dax_dev, holder, ops);
+}
+EXPORT_SYMBOL_GPL(fs_dax_register_holder);
+
+void fs_dax_unregister_holder(struct dax_device *dax_dev)
+{
+	dax_set_holder(dax_dev, NULL, NULL);
+}
+EXPORT_SYMBOL_GPL(fs_dax_unregister_holder);
+
+void *fs_dax_get_holder(struct dax_device *dax_dev)
+{
+	return dax_get_holder(dax_dev);
+}
+EXPORT_SYMBOL_GPL(fs_dax_get_holder);
+
 bool generic_fsdax_supported(struct dax_device *dax_dev,
 		struct block_device *bdev, int blocksize, sector_t start,
 		sector_t sectors)
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 33e26690a8c4..4c2d3d4ca5a5 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -542,6 +542,9 @@ xfs_do_force_shutdown(
 	} else if (flags & SHUTDOWN_CORRUPT_INCORE) {
 		tag = XFS_PTAG_SHUTDOWN_CORRUPT;
 		why = "Corruption of in-memory data";
+	} else if (flags & SHUTDOWN_CORRUPT_META) {
+		tag = XFS_PTAG_SHUTDOWN_CORRUPT;
+		why = "Corruption of on-disk metadata";
 	} else {
 		tag = XFS_PTAG_SHUTDOWN_IOERROR;
 		why = "Metadata I/O Error";
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e091f3b3fa15..d0f6da23e3df 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -434,6 +434,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
 #define SHUTDOWN_LOG_IO_ERROR	0x0002	/* write attempt to the log failed */
 #define SHUTDOWN_FORCE_UMOUNT	0x0004	/* shutdown from a forced unmount */
 #define SHUTDOWN_CORRUPT_INCORE	0x0008	/* corrupt in-memory data structures */
+#define SHUTDOWN_CORRUPT_META	0x0010  /* corrupt metadata on device */
 
 #define XFS_SHUTDOWN_STRINGS \
 	{ SHUTDOWN_META_IO_ERROR,	"metadata_io" }, \
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index c4e0cd1c1c8c..46fdf44b5ec2 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -37,11 +37,19 @@
 #include "xfs_reflink.h"
 #include "xfs_pwork.h"
 #include "xfs_ag.h"
+#include "xfs_alloc.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_rtalloc.h"
+#include "xfs_bit.h"
 
 #include <linux/magic.h>
 #include <linux/fs_context.h>
 #include <linux/fs_parser.h>
+#include <linux/mm.h>
+#include <linux/dax.h>
 
+static const struct dax_holder_operations xfs_dax_holder_operations;
 static const struct super_operations xfs_super_operations;
 
 static struct kset *xfs_kset;		/* top-level xfs sysfs dir */
@@ -377,6 +385,8 @@ xfs_close_devices(
 
 		xfs_free_buftarg(mp->m_logdev_targp);
 		xfs_blkdev_put(logdev);
+		if (dax_logdev)
+			fs_dax_unregister_holder(dax_logdev);
 		fs_put_dax(dax_logdev);
 	}
 	if (mp->m_rtdev_targp) {
@@ -385,9 +395,13 @@ xfs_close_devices(
 
 		xfs_free_buftarg(mp->m_rtdev_targp);
 		xfs_blkdev_put(rtdev);
+		if (dax_rtdev)
+			fs_dax_unregister_holder(dax_rtdev);
 		fs_put_dax(dax_rtdev);
 	}
 	xfs_free_buftarg(mp->m_ddev_targp);
+	if (dax_ddev)
+		fs_dax_unregister_holder(dax_ddev);
 	fs_put_dax(dax_ddev);
 }
 
@@ -411,6 +425,9 @@ xfs_open_devices(
 	struct block_device	*logdev = NULL, *rtdev = NULL;
 	int			error;
 
+	if (dax_ddev)
+		fs_dax_register_holder(dax_ddev, mp,
+				&xfs_dax_holder_operations);
 	/*
 	 * Open real time and log devices - order is important.
 	 */
@@ -419,6 +436,9 @@ xfs_open_devices(
 		if (error)
 			goto out;
 		dax_logdev = fs_dax_get_by_bdev(logdev);
+		if (dax_logdev != dax_ddev && dax_logdev)
+			fs_dax_register_holder(dax_logdev, mp,
+					&xfs_dax_holder_operations);
 	}
 
 	if (mp->m_rtname) {
@@ -433,6 +453,9 @@ xfs_open_devices(
 			goto out_close_rtdev;
 		}
 		dax_rtdev = fs_dax_get_by_bdev(rtdev);
+		if (dax_rtdev)
+			fs_dax_register_holder(dax_rtdev, mp,
+					&xfs_dax_holder_operations);
 	}
 
 	/*
@@ -1110,6 +1133,171 @@ xfs_fs_free_cached_objects(
 	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
 }
 
+struct notify_failure_info {
+	xfs_agblock_t startblock;
+	xfs_filblks_t blockcount;
+	int flags;
+};
+
+static loff_t
+xfs_notify_failure_start(
+	struct xfs_mount			*mp,
+	const struct xfs_rmap_irec		*rec,
+	const struct notify_failure_info	*notify)
+{
+	loff_t start = rec->rm_offset;
+
+	if (notify->startblock > rec->rm_startblock)
+		start += XFS_FSB_TO_B(mp,
+				notify->startblock - rec->rm_startblock);
+	return start;
+}
+
+static size_t
+xfs_notify_failure_size(
+	struct xfs_mount			*mp,
+	const struct xfs_rmap_irec		*rec,
+	const struct notify_failure_info	*notify)
+{
+	xfs_agblock_t rec_start = rec->rm_startblock;
+	xfs_agblock_t rec_end = rec->rm_startblock + rec->rm_blockcount;
+	xfs_agblock_t notify_start = notify->startblock;
+	xfs_agblock_t notify_end = notify->startblock + notify->blockcount;
+	xfs_agblock_t cross_start = max(rec_start, notify_start);
+	xfs_agblock_t cross_end = min(rec_end, notify_end);
+
+	return XFS_FSB_TO_B(mp, cross_end - cross_start);
+}
+
+static int
+xfs_dax_notify_failure_fn(
+	struct xfs_btree_cur		*cur,
+	const struct xfs_rmap_irec	*rec,
+	void				*data)
+{
+	struct xfs_mount		*mp = cur->bc_mp;
+	struct xfs_inode		*ip;
+	struct address_space		*mapping;
+	struct notify_failure_info	*notify = data;
+	int				error = 0;
+
+	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
+	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+		// TODO check and try to fix metadata
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_META);
+		return -EFSCORRUPTED;
+	}
+
+	/* Get files that incore, filter out others that are not in use. */
+	error = xfs_iget(mp, cur->bc_tp, rec->rm_owner, XFS_IGET_INCORE,
+			 0, &ip);
+	if (error)
+		return error;
+
+	mapping = VFS_I(ip)->i_mapping;
+	if (IS_ENABLED(CONFIG_MEMORY_FAILURE)) {
+		loff_t offset = xfs_notify_failure_start(mp, rec, notify);
+		size_t size = xfs_notify_failure_size(mp, rec, notify);
+
+		error = mf_dax_kill_procs(mapping, offset >> PAGE_SHIFT, size,
+					  notify->flags);
+	}
+	// TODO try to fix data
+	xfs_irele(ip);
+
+	return error;
+}
+
+static loff_t
+xfs_dax_bdev_offset(
+	struct xfs_mount *mp,
+	struct dax_device *dax_dev,
+	loff_t disk_offset)
+{
+	struct block_device *bdev;
+
+	if (mp->m_ddev_targp->bt_daxdev == dax_dev)
+		bdev = mp->m_ddev_targp->bt_bdev;
+	else if (mp->m_logdev_targp->bt_daxdev == dax_dev)
+		bdev = mp->m_logdev_targp->bt_bdev;
+	else
+		bdev = mp->m_rtdev_targp->bt_bdev;
+
+	return disk_offset - (get_start_sect(bdev) << SECTOR_SHIFT);
+}
+
+static int
+xfs_dax_notify_failure(
+	struct dax_device	*dax_dev,
+	loff_t			offset,
+	size_t			len,
+	int			flags)
+{
+	struct xfs_mount	*mp = fs_dax_get_holder(dax_dev);
+	struct xfs_trans	*tp = NULL;
+	struct xfs_btree_cur	*cur = NULL;
+	struct xfs_buf		*agf_bp = NULL;
+	struct xfs_rmap_irec	rmap_low, rmap_high;
+	loff_t			bdev_offset = xfs_dax_bdev_offset(mp, dax_dev,
+								  offset);
+	xfs_fsblock_t		fsbno = XFS_B_TO_FSB(mp, bdev_offset);
+	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
+	int			error = 0;
+	struct notify_failure_info notify = {
+		.startblock	= XFS_FSB_TO_AGBNO(mp, fsbno),
+		.blockcount	= XFS_B_TO_FSB(mp, len),
+		.flags		= flags,
+	};
+
+	if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
+		xfs_warn(mp,
+			 "notify_failure() not supported on realtime device!");
+		return -EOPNOTSUPP;
+	}
+
+	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
+	    mp->m_logdev_targp != mp->m_ddev_targp) {
+		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_META);
+		return -EFSCORRUPTED;
+	}
+
+	if (!xfs_has_rmapbt(mp)) {
+		xfs_warn(mp, "notify_failure() needs rmapbt enabled!");
+		return -EOPNOTSUPP;
+	}
+
+	error = xfs_trans_alloc_empty(mp, &tp);
+	if (error)
+		return error;
+
+	error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
+	if (error)
+		goto out_cancel_tp;
+
+	cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agf_bp->b_pag);
+
+	/* Construct a range for rmap query */
+	memset(&rmap_low, 0, sizeof(rmap_low));
+	memset(&rmap_high, 0xFF, sizeof(rmap_high));
+	rmap_low.rm_startblock = rmap_high.rm_startblock = notify.startblock;
+	rmap_low.rm_blockcount = rmap_high.rm_blockcount = notify.blockcount;
+
+	error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
+				     xfs_dax_notify_failure_fn, &notify);
+
+	xfs_btree_del_cursor(cur, error);
+	xfs_trans_brelse(tp, agf_bp);
+
+out_cancel_tp:
+	xfs_trans_cancel(tp);
+	return error;
+}
+
+static const struct dax_holder_operations xfs_dax_holder_operations = {
+	.notify_failure		= xfs_dax_notify_failure,
+};
+
 static const struct super_operations xfs_super_operations = {
 	.alloc_inode		= xfs_fs_alloc_inode,
 	.destroy_inode		= xfs_fs_destroy_inode,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 3d90becbd160..0f1fa7a4a616 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -149,6 +149,10 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 }
 
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
+void fs_dax_register_holder(struct dax_device *dax_dev, void *holder,
+		const struct dax_holder_operations *ops);
+void fs_dax_unregister_holder(struct dax_device *dax_dev);
+void *fs_dax_get_holder(struct dax_device *dax_dev);
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct dax_device *dax_dev, struct writeback_control *wbc);
 
@@ -179,6 +183,20 @@ static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
 	return NULL;
 }
 
+static inline void fs_dax_register_holder(struct dax_device *dax_dev,
+		void *holder, const struct dax_holder_operations *ops)
+{
+}
+
+static inline void fs_dax_unregister_holder(struct dax_device *dax_dev)
+{
+}
+
+static inline void *fs_dax_get_holder(struct dax_device *dax_dev)
+{
+	return NULL;
+}
+
 static inline struct page *dax_layout_busy_page(struct address_space *mapping)
 {
 	return NULL;
-- 
2.33.0




^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v7 8/8] fsdax: add exception for reflinked files
  2021-09-24 13:09 [PATCH v7 0/8] [PATCH v7 0/8] fsdax: introduce fs query to support reflink Shiyang Ruan
                   ` (6 preceding siblings ...)
  2021-09-24 13:09 ` [PATCH v7 7/8] xfs: Implement ->notify_failure() for XFS Shiyang Ruan
@ 2021-09-24 13:09 ` Shiyang Ruan
  2021-10-14 19:24   ` Darrick J. Wong
  7 siblings, 1 reply; 26+ messages in thread
From: Shiyang Ruan @ 2021-09-24 13:09 UTC (permalink / raw)
  To: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel
  Cc: djwong, dan.j.williams, david, hch, jane.chu

For reflinked files, one dax page may be associated more than once with
different fime mapping and index.  It will report warning.  Now, since
we have introduced dax-RMAP for this case and also have to keep its
functionality for other filesystems who are not support rmap, I add this
exception here.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 fs/dax.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 2536c105ec7f..1a57211b1bc9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -352,9 +352,10 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		WARN_ON_ONCE(page->mapping);
-		page->mapping = mapping;
-		page->index = index + i++;
+		if (!page->mapping) {
+			page->mapping = mapping;
+			page->index = index + i++;
+		}
 	}
 }
 
@@ -370,9 +371,10 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 		struct page *page = pfn_to_page(pfn);
 
 		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
-		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
-		page->mapping = NULL;
-		page->index = 0;
+		if (page->mapping == mapping) {
+			page->mapping = NULL;
+			page->index = 0;
+		}
 	}
 }
 
-- 
2.33.0




^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/8] dax: Use rwsem for dax_{read,write}_lock()
  2021-09-24 13:09 ` [PATCH v7 1/8] dax: Use rwsem for dax_{read,write}_lock() Shiyang Ruan
@ 2021-10-14 17:48   ` Darrick J. Wong
  2021-10-20  5:19     ` Shiyang Ruan
  2021-10-15  6:30   ` Christoph Hellwig
  1 sibling, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2021-10-14 17:48 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, jane.chu

On Fri, Sep 24, 2021 at 09:09:52PM +0800, Shiyang Ruan wrote:
> In order to introduce dax holder registration, we need a write lock for
> dax.  Because of the rarity of notification failures and the infrequency
> of registration events, it would be better to be a global lock rather
> than per-device.  So, change the current lock to rwsem and introduce a
> write lock for registration.

Urgh, I totally thought dax_read_lock was a global lock on something
relating to the global dax_device state until I noticed this comment
above kill_dax():

/*
 * Note, rcu is not protecting the liveness of dax_dev, rcu is ensuring
 * that any fault handlers or operations that might have seen
 * dax_alive(), have completed.  Any operations that start after
 * synchronize_srcu() has run will abort upon seeing !dax_alive().
 */

So dax_srcu ensures stability in the dax_device's ALIVE state while any
code that relies on that aliveness runs.  As a side effect, it'll block
kill_dax (and I guess run_dax) while those functions run.  It doesn't
protect any global state at all... but this isn't made obvious in the
API by (for example) passing the dax_device into dax_read_lock.

IOWs, It's not protecting against the dax_device getting freed or
anything resembling global state.  So that's probably why you note above
that this /could/ be a per-device synchronization primitive, right?

If that's the case, then why shouldn't this be a per-device item?  As
written here, any code that takes dax_write_lock() will block every dax
device in the system while it does some work on a single dax device.
Being an rwsem, it  will also have to wait for every other dax device
access to complete before it can begin.  That seems excessive,
particularly if in the future we start hooking up lots of pmem to a
single host.

I have more to say around kill_dax() below.

> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  drivers/dax/device.c       | 11 +++++-----
>  drivers/dax/super.c        | 43 ++++++++++++++++++++++----------------
>  drivers/md/dm-writecache.c |  7 +++----
>  fs/dax.c                   | 26 +++++++++++------------
>  include/linux/dax.h        |  9 ++++----
>  5 files changed, 49 insertions(+), 47 deletions(-)
> 
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index dd8222a42808..cc7b835509f9 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -198,7 +198,6 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>  	struct file *filp = vmf->vma->vm_file;
>  	unsigned long fault_size;
>  	vm_fault_t rc = VM_FAULT_SIGBUS;
> -	int id;
>  	pfn_t pfn;
>  	struct dev_dax *dev_dax = filp->private_data;
>  
> @@ -206,7 +205,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>  			(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
>  			vmf->vma->vm_start, vmf->vma->vm_end, pe_size);
>  
> -	id = dax_read_lock();
> +	dax_read_lock();
>  	switch (pe_size) {
>  	case PE_SIZE_PTE:
>  		fault_size = PAGE_SIZE;
> @@ -246,7 +245,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>  			page->index = pgoff + i;
>  		}
>  	}
> -	dax_read_unlock(id);
> +	dax_read_unlock();
>  
>  	return rc;
>  }
> @@ -284,7 +283,7 @@ static const struct vm_operations_struct dax_vm_ops = {
>  static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
>  {
>  	struct dev_dax *dev_dax = filp->private_data;
> -	int rc, id;
> +	int rc;
>  
>  	dev_dbg(&dev_dax->dev, "trace\n");
>  
> @@ -292,9 +291,9 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
>  	 * We lock to check dax_dev liveness and will re-check at
>  	 * fault time.
>  	 */
> -	id = dax_read_lock();
> +	dax_read_lock();
>  	rc = check_vma(dev_dax, vma, __func__);
> -	dax_read_unlock(id);
> +	dax_read_unlock();
>  	if (rc)
>  		return rc;
>  
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index fc89e91beea7..48ce86501d93 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -36,7 +36,7 @@ struct dax_device {
>  };
>  
>  static dev_t dax_devt;
> -DEFINE_STATIC_SRCU(dax_srcu);
> +static DECLARE_RWSEM(dax_rwsem);
>  static struct vfsmount *dax_mnt;
>  static DEFINE_IDA(dax_minor_ida);
>  static struct kmem_cache *dax_cache __read_mostly;
> @@ -46,18 +46,28 @@ static struct super_block *dax_superblock __read_mostly;
>  static struct hlist_head dax_host_list[DAX_HASH_SIZE];
>  static DEFINE_SPINLOCK(dax_host_lock);
>  
> -int dax_read_lock(void)
> +void dax_read_lock(void)
>  {
> -	return srcu_read_lock(&dax_srcu);
> +	down_read(&dax_rwsem);
>  }
>  EXPORT_SYMBOL_GPL(dax_read_lock);
>  
> -void dax_read_unlock(int id)
> +void dax_read_unlock(void)
>  {
> -	srcu_read_unlock(&dax_srcu, id);
> +	up_read(&dax_rwsem);
>  }
>  EXPORT_SYMBOL_GPL(dax_read_unlock);
>  
> +void dax_write_lock(void)
> +{
> +	down_write(&dax_rwsem);
> +}
> +
> +void dax_write_unlock(void)
> +{
> +	up_write(&dax_rwsem);
> +}
> +
>  static int dax_host_hash(const char *host)
>  {
>  	return hashlen_hash(hashlen_string("DAX", host)) % DAX_HASH_SIZE;
> @@ -70,14 +80,14 @@ static int dax_host_hash(const char *host)
>  static struct dax_device *dax_get_by_host(const char *host)
>  {
>  	struct dax_device *dax_dev, *found = NULL;
> -	int hash, id;
> +	int hash;
>  
>  	if (!host)
>  		return NULL;
>  
>  	hash = dax_host_hash(host);
>  
> -	id = dax_read_lock();
> +	dax_read_lock();
>  	spin_lock(&dax_host_lock);
>  	hlist_for_each_entry(dax_dev, &dax_host_list[hash], list) {
>  		if (!dax_alive(dax_dev)
> @@ -89,7 +99,7 @@ static struct dax_device *dax_get_by_host(const char *host)
>  		break;
>  	}
>  	spin_unlock(&dax_host_lock);
> -	dax_read_unlock(id);
> +	dax_read_unlock();
>  
>  	return found;
>  }
> @@ -130,7 +140,7 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
>  	pfn_t pfn, end_pfn;
>  	sector_t last_page;
>  	long len, len2;
> -	int err, id;
> +	int err;
>  
>  	if (blocksize != PAGE_SIZE) {
>  		pr_info("%pg: error: unsupported blocksize for dax\n", bdev);
> @@ -155,14 +165,14 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
>  		return false;
>  	}
>  
> -	id = dax_read_lock();
> +	dax_read_lock();
>  	len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
>  	len2 = dax_direct_access(dax_dev, pgoff_end, 1, &end_kaddr, &end_pfn);
>  
>  	if (len < 1 || len2 < 1) {
>  		pr_info("%pg: error: dax access failed (%ld)\n",
>  				bdev, len < 1 ? len : len2);
> -		dax_read_unlock(id);
> +		dax_read_unlock();
>  		return false;
>  	}
>  
> @@ -192,7 +202,7 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
>  		put_dev_pagemap(end_pgmap);
>  
>  	}
> -	dax_read_unlock(id);
> +	dax_read_unlock();
>  
>  	if (!dax_enabled) {
>  		pr_info("%pg: error: dax support not enabled\n", bdev);
> @@ -206,16 +216,15 @@ bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
>  		int blocksize, sector_t start, sector_t len)
>  {
>  	bool ret = false;
> -	int id;
>  
>  	if (!dax_dev)
>  		return false;
>  
> -	id = dax_read_lock();
> +	dax_read_lock();
>  	if (dax_alive(dax_dev) && dax_dev->ops->dax_supported)
>  		ret = dax_dev->ops->dax_supported(dax_dev, bdev, blocksize,
>  						  start, len);
> -	dax_read_unlock(id);
> +	dax_read_unlock();
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(dax_supported);
> @@ -410,7 +419,7 @@ EXPORT_SYMBOL_GPL(__set_dax_synchronous);
>  
>  bool dax_alive(struct dax_device *dax_dev)
>  {
> -	lockdep_assert_held(&dax_srcu);
> +	lockdep_assert_held(&dax_rwsem);
>  	return test_bit(DAXDEV_ALIVE, &dax_dev->flags);
>  }
>  EXPORT_SYMBOL_GPL(dax_alive);
> @@ -428,8 +437,6 @@ void kill_dax(struct dax_device *dax_dev)
>  
>  	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>  
> -	synchronize_srcu(&dax_srcu);

Shouldn't you take the dax_write_lock() around the clear_bit call to
maintain the behavior that kill_dax can't proceed until all the
functions that depend on DAXDEV_ALIVE state have finished?

--D

> -
>  	spin_lock(&dax_host_lock);
>  	hlist_del_init(&dax_dev->list);
>  	spin_unlock(&dax_host_lock);
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index 18320444fb0a..1067b3e98220 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -260,7 +260,6 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>  	loff_t s;
>  	long p, da;
>  	pfn_t pfn;
> -	int id;
>  	struct page **pages;
>  	sector_t offset;
>  
> @@ -284,7 +283,7 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>  	}
>  	offset >>= PAGE_SHIFT - 9;
>  
> -	id = dax_read_lock();
> +	dax_read_lock();
>  
>  	da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn);
>  	if (da < 0) {
> @@ -334,7 +333,7 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>  		wc->memory_vmapped = true;
>  	}
>  
> -	dax_read_unlock(id);
> +	dax_read_unlock();
>  
>  	wc->memory_map += (size_t)wc->start_sector << SECTOR_SHIFT;
>  	wc->memory_map_size -= (size_t)wc->start_sector << SECTOR_SHIFT;
> @@ -343,7 +342,7 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>  err3:
>  	kvfree(pages);
>  err2:
> -	dax_read_unlock(id);
> +	dax_read_unlock();
>  err1:
>  	return r;
>  }
> diff --git a/fs/dax.c b/fs/dax.c
> index 4e3e5a283a91..798c43f09eee 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -715,22 +715,21 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
>  	void *vto, *kaddr;
>  	pgoff_t pgoff;
>  	long rc;
> -	int id;
>  
>  	rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
>  	if (rc)
>  		return rc;
>  
> -	id = dax_read_lock();
> +	dax_read_lock();
>  	rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
>  	if (rc < 0) {
> -		dax_read_unlock(id);
> +		dax_read_unlock();
>  		return rc;
>  	}
>  	vto = kmap_atomic(to);
>  	copy_user_page(vto, (void __force *)kaddr, vaddr, to);
>  	kunmap_atomic(vto);
> -	dax_read_unlock(id);
> +	dax_read_unlock();
>  	return 0;
>  }
>  
> @@ -1015,13 +1014,13 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
>  {
>  	const sector_t sector = dax_iomap_sector(iomap, pos);
>  	pgoff_t pgoff;
> -	int id, rc;
> +	int rc;
>  	long length;
>  
>  	rc = bdev_dax_pgoff(iomap->bdev, sector, size, &pgoff);
>  	if (rc)
>  		return rc;
> -	id = dax_read_lock();
> +	dax_read_lock();
>  	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
>  				   NULL, pfnp);
>  	if (length < 0) {
> @@ -1038,7 +1037,7 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
>  		goto out;
>  	rc = 0;
>  out:
> -	dax_read_unlock(id);
> +	dax_read_unlock();
>  	return rc;
>  }
>  
> @@ -1130,7 +1129,7 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>  {
>  	sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
>  	pgoff_t pgoff;
> -	long rc, id;
> +	long rc;
>  	void *kaddr;
>  	bool page_aligned = false;
>  	unsigned offset = offset_in_page(pos);
> @@ -1144,14 +1143,14 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>  	if (rc)
>  		return rc;
>  
> -	id = dax_read_lock();
> +	dax_read_lock();
>  
>  	if (page_aligned)
>  		rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
>  	else
>  		rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
>  	if (rc < 0) {
> -		dax_read_unlock(id);
> +		dax_read_unlock();
>  		return rc;
>  	}
>  
> @@ -1159,7 +1158,7 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>  		memset(kaddr + offset, 0, size);
>  		dax_flush(iomap->dax_dev, kaddr + offset, size);
>  	}
> -	dax_read_unlock(id);
> +	dax_read_unlock();
>  	return size;
>  }
>  
> @@ -1174,7 +1173,6 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
>  	loff_t end = pos + length, done = 0;
>  	ssize_t ret = 0;
>  	size_t xfer;
> -	int id;
>  
>  	if (iov_iter_rw(iter) == READ) {
>  		end = min(end, i_size_read(iomi->inode));
> @@ -1199,7 +1197,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
>  					      (end - 1) >> PAGE_SHIFT);
>  	}
>  
> -	id = dax_read_lock();
> +	dax_read_lock();
>  	while (pos < end) {
>  		unsigned offset = pos & (PAGE_SIZE - 1);
>  		const size_t size = ALIGN(length + offset, PAGE_SIZE);
> @@ -1251,7 +1249,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
>  		if (xfer < map_len)
>  			break;
>  	}
> -	dax_read_unlock(id);
> +	dax_read_unlock();
>  
>  	return done ? done : ret;
>  }
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 2619d94c308d..097b3304f9b9 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -177,15 +177,14 @@ static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
>  #endif
>  
>  #if IS_ENABLED(CONFIG_DAX)
> -int dax_read_lock(void);
> -void dax_read_unlock(int id);
> +void dax_read_lock(void);
> +void dax_read_unlock(void);
>  #else
> -static inline int dax_read_lock(void)
> +static inline void dax_read_lock(void)
>  {
> -	return 0;
>  }
>  
> -static inline void dax_read_unlock(int id)
> +static inline void dax_read_unlock(void)
>  {
>  }
>  #endif /* CONFIG_DAX */
> -- 
> 2.33.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 2/8] dax: Introduce holder for dax_device
  2021-09-24 13:09 ` [PATCH v7 2/8] dax: Introduce holder for dax_device Shiyang Ruan
@ 2021-10-14 18:00   ` Darrick J. Wong
  2021-10-20  6:58     ` Shiyang Ruan
  0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2021-10-14 18:00 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, jane.chu

On Fri, Sep 24, 2021 at 09:09:53PM +0800, Shiyang Ruan wrote:
> To easily track filesystem from a pmem device, we introduce a holder for
> dax_device structure, and also its operation.  This holder is used to
> remember who is using this dax_device:
>  - When it is the backend of a filesystem, the holder will be the
>    superblock of this filesystem.
>  - When this pmem device is one of the targets in a mapped device, the
>    holder will be this mapped device.  In this case, the mapped device
>    has its own dax_device and it will follow the first rule.  So that we
>    can finally track to the filesystem we needed.
> 
> The holder and holder_ops will be set when filesystem is being mounted,
> or an target device is being activated.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  drivers/dax/super.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/dax.h | 29 ++++++++++++++++++++++
>  2 files changed, 88 insertions(+)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 48ce86501d93..7d4a11dcba90 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -23,7 +23,10 @@
>   * @cdev: optional character interface for "device dax"
>   * @host: optional name for lookups where the device path is not available
>   * @private: dax driver private data
> + * @holder_data: holder of a dax_device: could be filesystem or mapped device
>   * @flags: state and boolean properties
> + * @ops: operations for dax_device
> + * @holder_ops: operations for the inner holder
>   */
>  struct dax_device {
>  	struct hlist_node list;
> @@ -31,8 +34,10 @@ struct dax_device {
>  	struct cdev cdev;
>  	const char *host;
>  	void *private;
> +	void *holder_data;
>  	unsigned long flags;
>  	const struct dax_operations *ops;
> +	const struct dax_holder_operations *holder_ops;
>  };
>  
>  static dev_t dax_devt;
> @@ -374,6 +379,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
>  }
>  EXPORT_SYMBOL_GPL(dax_zero_page_range);
>  
> +int dax_holder_notify_failure(struct dax_device *dax_dev, loff_t offset,
> +			      size_t size, int flags)
> +{
> +	int rc;
> +
> +	dax_read_lock();
> +	if (!dax_alive(dax_dev)) {
> +		rc = -ENXIO;
> +		goto out;
> +	}
> +
> +	if (!dax_dev->holder_data) {
> +		rc = -EOPNOTSUPP;
> +		goto out;
> +	}
> +
> +	rc = dax_dev->holder_ops->notify_failure(dax_dev, offset, size, flags);

Shouldn't this check if dax_dev->holder_ops != NULL before dereferencing
it for the function call?  Imagine an implementation that wants to
attach a ->notify_failure function to a dax_device, maintains its own
lookup table, and decides that it doesn't need to set holder_data.

(Or, imagine someone who writes a garbage into holder_data and *boom*)

How does the locking work here?  If there's a media failure, we'll take
dax_rwsem and call ->notify_failure.  If the ->notify_failure function
wants to access the pmem to handle the error by calling back into the
dax code, will that cause nested locking on dax_rwsem?

Jumping ahead a bit, I think the rmap btree accesses that the xfs
implementation performs can cause xfs_buf(fer) cache IO, which would
trigger that if the buffers aren't already in memory, if I'm reading
this correctly?

> +out:
> +	dax_read_unlock();
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
> +
>  #ifdef CONFIG_ARCH_HAS_PMEM_API
>  void arch_wb_cache_pmem(void *addr, size_t size);
>  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
> @@ -618,6 +646,37 @@ void put_dax(struct dax_device *dax_dev)
>  }
>  EXPORT_SYMBOL_GPL(put_dax);
>  
> +void dax_set_holder(struct dax_device *dax_dev, void *holder,
> +		const struct dax_holder_operations *ops)
> +{
> +	dax_write_lock();
> +	if (!dax_alive(dax_dev)) {
> +		dax_write_unlock();
> +		return;
> +	}
> +
> +	dax_dev->holder_data = holder;
> +	dax_dev->holder_ops = ops;
> +	dax_write_unlock();

I guess this means that the holder has to detach itself before anyone
calls kill_dax, or else a dead dax device ends up with a dangling
reference to the holder?

> +}
> +EXPORT_SYMBOL_GPL(dax_set_holder);
> +
> +void *dax_get_holder(struct dax_device *dax_dev)
> +{
> +	void *holder;
> +
> +	dax_read_lock();
> +	if (!dax_alive(dax_dev)) {
> +		dax_read_unlock();
> +		return NULL;
> +	}
> +
> +	holder = dax_dev->holder_data;
> +	dax_read_unlock();
> +	return holder;
> +}
> +EXPORT_SYMBOL_GPL(dax_get_holder);
> +
>  /**
>   * inode_dax: convert a public inode into its dax_dev
>   * @inode: An inode with i_cdev pointing to a dax_dev
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 097b3304f9b9..d273d59723cd 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -38,9 +38,24 @@ struct dax_operations {
>  	int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
>  };
>  
> +struct dax_holder_operations {
> +	/*
> +	 * notify_failure - notify memory failure into inner holder device
> +	 * @dax_dev: the dax device which contains the holder
> +	 * @offset: offset on this dax device where memory failure occurs
> +	 * @size: length of this memory failure event
> +	 * @flags: action flags for memory failure handler
> +	 */
> +	int (*notify_failure)(struct dax_device *dax_dev, loff_t offset,
> +			size_t size, int flags);

Shouldn't size be u64 or something?  Let's say that 8GB of your pmem go
bad, wouldn't you want a single call?  Though I guess the current
implementation only goes a single page at a time, doesn't it?

> +};
> +
>  extern struct attribute_group dax_attribute_group;
>  
>  #if IS_ENABLED(CONFIG_DAX)
> +void dax_set_holder(struct dax_device *dax_dev, void *holder,
> +		const struct dax_holder_operations *ops);
> +void *dax_get_holder(struct dax_device *dax_dev);
>  struct dax_device *alloc_dax(void *private, const char *host,
>  		const struct dax_operations *ops, unsigned long flags);
>  void put_dax(struct dax_device *dax_dev);
> @@ -70,6 +85,18 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
>  	return dax_synchronous(dax_dev);
>  }
>  #else
> +static inline struct dax_device *dax_get_by_host(const char *host)

Not sure why this is being added here?  AFAICT none of the patches call
this function...?

--D

> +{
> +	return NULL;
> +}
> +static inline void dax_set_holder(struct dax_device *dax_dev, void *holder,
> +		const struct dax_holder_operations *ops)
> +{
> +}
> +static inline void *dax_get_holder(struct dax_device *dax_dev)
> +{
> +	return NULL;
> +}
>  static inline struct dax_device *alloc_dax(void *private, const char *host,
>  		const struct dax_operations *ops, unsigned long flags)
>  {
> @@ -198,6 +225,8 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
>  		size_t bytes, struct iov_iter *i);
>  int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
>  			size_t nr_pages);
> +int dax_holder_notify_failure(struct dax_device *dax_dev, loff_t offset,
> +		size_t size, int flags);
>  void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
>  
>  ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
> -- 
> 2.33.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 3/8] mm: factor helpers for memory_failure_dev_pagemap
  2021-09-24 13:09 ` [PATCH v7 3/8] mm: factor helpers for memory_failure_dev_pagemap Shiyang Ruan
@ 2021-10-14 18:02   ` Darrick J. Wong
  2021-10-15  6:33   ` Christoph Hellwig
  1 sibling, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2021-10-14 18:02 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, jane.chu

On Fri, Sep 24, 2021 at 09:09:54PM +0800, Shiyang Ruan wrote:
> memory_failure_dev_pagemap code is a bit complex before introduce RMAP
> feature for fsdax.  So it is needed to factor some helper functions to
> simplify these code.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>

This looks like a reasonable hoist...
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  mm/memory-failure.c | 140 ++++++++++++++++++++++++--------------------
>  1 file changed, 76 insertions(+), 64 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 54879c339024..8ff9b52823c0 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1430,6 +1430,79 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
>  	return 0;
>  }
>  
> +static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
> +		struct address_space *mapping, pgoff_t index, int flags)
> +{
> +	struct to_kill *tk;
> +	unsigned long size = 0;
> +
> +	list_for_each_entry(tk, to_kill, nd)
> +		if (tk->size_shift)
> +			size = max(size, 1UL << tk->size_shift);
> +	if (size) {
> +		/*
> +		 * Unmap the largest mapping to avoid breaking up device-dax
> +		 * mappings which are constant size. The actual size of the
> +		 * mapping being torn down is communicated in siginfo, see
> +		 * kill_proc()
> +		 */
> +		loff_t start = (index << PAGE_SHIFT) & ~(size - 1);
> +
> +		unmap_mapping_range(mapping, start, size, 0);
> +	}
> +
> +	kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
> +}
> +
> +static int mf_generic_kill_procs(unsigned long long pfn, int flags,
> +		struct dev_pagemap *pgmap)
> +{
> +	struct page *page = pfn_to_page(pfn);
> +	LIST_HEAD(to_kill);
> +	dax_entry_t cookie;
> +
> +	/*
> +	 * Prevent the inode from being freed while we are interrogating
> +	 * the address_space, typically this would be handled by
> +	 * lock_page(), but dax pages do not use the page lock. This
> +	 * also prevents changes to the mapping of this pfn until
> +	 * poison signaling is complete.
> +	 */
> +	cookie = dax_lock_page(page);
> +	if (!cookie)
> +		return -EBUSY;
> +
> +	if (hwpoison_filter(page))
> +		return 0;
> +
> +	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> +		/*
> +		 * TODO: Handle HMM pages which may need coordination
> +		 * with device-side memory.
> +		 */
> +		return -EBUSY;
> +	}
> +
> +	/*
> +	 * Use this flag as an indication that the dax page has been
> +	 * remapped UC to prevent speculative consumption of poison.
> +	 */
> +	SetPageHWPoison(page);
> +
> +	/*
> +	 * Unlike System-RAM there is no possibility to swap in a
> +	 * different physical page at a given virtual address, so all
> +	 * userspace consumption of ZONE_DEVICE memory necessitates
> +	 * SIGBUS (i.e. MF_MUST_KILL)
> +	 */
> +	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> +	collect_procs(page, &to_kill, true);
> +
> +	unmap_and_kill(&to_kill, pfn, page->mapping, page->index, flags);
> +	dax_unlock_page(page, cookie);
> +	return 0;
> +}
> +
>  static int memory_failure_hugetlb(unsigned long pfn, int flags)
>  {
>  	struct page *p = pfn_to_page(pfn);
> @@ -1519,12 +1592,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  		struct dev_pagemap *pgmap)
>  {
>  	struct page *page = pfn_to_page(pfn);
> -	unsigned long size = 0;
> -	struct to_kill *tk;
>  	LIST_HEAD(tokill);
> -	int rc = -EBUSY;
> -	loff_t start;
> -	dax_entry_t cookie;
> +	int rc = -ENXIO;
>  
>  	if (flags & MF_COUNT_INCREASED)
>  		/*
> @@ -1533,67 +1602,10 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  		put_page(page);
>  
>  	/* device metadata space is not recoverable */
> -	if (!pgmap_pfn_valid(pgmap, pfn)) {
> -		rc = -ENXIO;
> -		goto out;
> -	}
> -
> -	/*
> -	 * Prevent the inode from being freed while we are interrogating
> -	 * the address_space, typically this would be handled by
> -	 * lock_page(), but dax pages do not use the page lock. This
> -	 * also prevents changes to the mapping of this pfn until
> -	 * poison signaling is complete.
> -	 */
> -	cookie = dax_lock_page(page);
> -	if (!cookie)
> +	if (!pgmap_pfn_valid(pgmap, pfn))
>  		goto out;
>  
> -	if (hwpoison_filter(page)) {
> -		rc = 0;
> -		goto unlock;
> -	}
> -
> -	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> -		/*
> -		 * TODO: Handle HMM pages which may need coordination
> -		 * with device-side memory.
> -		 */
> -		goto unlock;
> -	}
> -
> -	/*
> -	 * Use this flag as an indication that the dax page has been
> -	 * remapped UC to prevent speculative consumption of poison.
> -	 */
> -	SetPageHWPoison(page);
> -
> -	/*
> -	 * Unlike System-RAM there is no possibility to swap in a
> -	 * different physical page at a given virtual address, so all
> -	 * userspace consumption of ZONE_DEVICE memory necessitates
> -	 * SIGBUS (i.e. MF_MUST_KILL)
> -	 */
> -	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> -	collect_procs(page, &tokill, flags & MF_ACTION_REQUIRED);
> -
> -	list_for_each_entry(tk, &tokill, nd)
> -		if (tk->size_shift)
> -			size = max(size, 1UL << tk->size_shift);
> -	if (size) {
> -		/*
> -		 * Unmap the largest mapping to avoid breaking up
> -		 * device-dax mappings which are constant size. The
> -		 * actual size of the mapping being torn down is
> -		 * communicated in siginfo, see kill_proc()
> -		 */
> -		start = (page->index << PAGE_SHIFT) & ~(size - 1);
> -		unmap_mapping_range(page->mapping, start, size, 0);
> -	}
> -	kill_procs(&tokill, flags & MF_MUST_KILL, false, pfn, flags);
> -	rc = 0;
> -unlock:
> -	dax_unlock_page(page, cookie);
> +	rc = mf_generic_kill_procs(pfn, flags, pgmap);
>  out:
>  	/* drop pgmap ref acquired in caller */
>  	put_dev_pagemap(pgmap);
> -- 
> 2.33.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 4/8] pagemap,pmem: Introduce ->memory_failure()
  2021-09-24 13:09 ` [PATCH v7 4/8] pagemap,pmem: Introduce ->memory_failure() Shiyang Ruan
@ 2021-10-14 18:05   ` Darrick J. Wong
  2021-10-20  5:25     ` Shiyang Ruan
  2021-10-15  6:36   ` Christoph Hellwig
  1 sibling, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2021-10-14 18:05 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, jane.chu

On Fri, Sep 24, 2021 at 09:09:55PM +0800, Shiyang Ruan wrote:
> When memory-failure occurs, we call this function which is implemented
> by each kind of devices.  For the fsdax case, pmem device driver
> implements it.  Pmem device driver will find out the filesystem in which
> the corrupted page located in.
> 
> With dax_holder notify support, we are able to notify the memory failure
> from pmem driver to upper layers.  If there is something not support in
> the notify routine, memory_failure will fall back to the generic hanlder.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  drivers/nvdimm/pmem.c    | 11 +++++++++++
>  include/linux/memremap.h |  9 +++++++++
>  mm/memory-failure.c      | 14 ++++++++++++++
>  3 files changed, 34 insertions(+)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 72de88ff0d30..0dfafad8fcc5 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -362,9 +362,20 @@ static void pmem_release_disk(void *__pmem)
>  	del_gendisk(pmem->disk);
>  }
>  
> +static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
> +		unsigned long pfn, size_t size, int flags)
> +{
> +	struct pmem_device *pmem =
> +			container_of(pgmap, struct pmem_device, pgmap);
> +	loff_t offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;
> +
> +	return dax_holder_notify_failure(pmem->dax_dev, offset, size, flags);
> +}
> +
>  static const struct dev_pagemap_ops fsdax_pagemap_ops = {
>  	.kill			= pmem_pagemap_kill,
>  	.cleanup		= pmem_pagemap_cleanup,
> +	.memory_failure		= pmem_pagemap_memory_failure,
>  };
>  
>  static int pmem_attach_disk(struct device *dev,
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index c0e9d35889e8..36d47bacd46d 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -87,6 +87,15 @@ struct dev_pagemap_ops {
>  	 * the page back to a CPU accessible page.
>  	 */
>  	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
> +
> +	/*
> +	 * Handle the memory failure happens on a range of pfns.  Notify the
> +	 * processes who are using these pfns, and try to recover the data on
> +	 * them if necessary.  The flag is finally passed to the recover
> +	 * function through the whole notify routine.
> +	 */
> +	int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
> +			      size_t size, int flags);
>  };
>  
>  #define PGMAP_ALTMAP_VALID	(1 << 0)
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 8ff9b52823c0..85eab206b68f 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1605,6 +1605,20 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  	if (!pgmap_pfn_valid(pgmap, pfn))
>  		goto out;
>  
> +	/*
> +	 * Call driver's implementation to handle the memory failure, otherwise
> +	 * fall back to generic handler.
> +	 */
> +	if (pgmap->ops->memory_failure) {
> +		rc = pgmap->ops->memory_failure(pgmap, pfn, PAGE_SIZE, flags);
> +		/*
> +		 * Fall back to generic handler too if operation is not
> +		 * supported inside the driver/device/filesystem.
> +		 */
> +		if (rc != EOPNOTSUPP)

-EOPNOTSUPP?  (negative errno)

--D

> +			goto out;
> +	}
> +
>  	rc = mf_generic_kill_procs(pfn, flags, pgmap);
>  out:
>  	/* drop pgmap ref acquired in caller */
> -- 
> 2.33.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 5/8] fsdax: Introduce dax_lock_mapping_entry()
  2021-09-24 13:09 ` [PATCH v7 5/8] fsdax: Introduce dax_lock_mapping_entry() Shiyang Ruan
@ 2021-10-14 18:17   ` Darrick J. Wong
  0 siblings, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2021-10-14 18:17 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, jane.chu

On Fri, Sep 24, 2021 at 09:09:56PM +0800, Shiyang Ruan wrote:
> The current dax_lock_page() locks dax entry by obtaining mapping and
> index in page.  To support 1-to-N RMAP in NVDIMM, we need a new function
> to lock a specific dax entry corresponding to this file's mapping,index.
> And BTW, output the page corresponding to the specific dax entry for
> caller use.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  fs/dax.c            | 65 ++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/dax.h | 15 +++++++++++
>  2 files changed, 79 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 798c43f09eee..509b65e60478 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -390,7 +390,7 @@ static struct page *dax_busy_page(void *entry)
>  }
>  
>  /*
> - * dax_lock_mapping_entry - Lock the DAX entry corresponding to a page
> + * dax_lock_page - Lock the DAX entry corresponding to a page
>   * @page: The page whose entry we want to lock
>   *
>   * Context: Process context.
> @@ -455,6 +455,69 @@ void dax_unlock_page(struct page *page, dax_entry_t cookie)
>  	dax_unlock_entry(&xas, (void *)cookie);
>  }
>  
> +/*
> + * dax_lock_mapping_entry - Lock the DAX entry corresponding to a mapping
> + * @mapping: the file's mapping whose entry we want to lock
> + * @index: the offset within this file
> + * @page: output the dax page corresponding to this dax entry
> + *
> + * Return: A cookie to pass to dax_unlock_mapping_entry() or 0 if the entry
> + * could not be locked.
> + */
> +dax_entry_t dax_lock_mapping_entry(struct address_space *mapping, pgoff_t index,
> +		struct page **page)
> +{
> +	XA_STATE(xas, NULL, 0);
> +	void *entry;
> +
> +	rcu_read_lock();
> +	for (;;) {
> +		entry = NULL;
> +		if (!dax_mapping(mapping))
> +			break;
> +
> +		xas.xa = &mapping->i_pages;
> +		xas_lock_irq(&xas);
> +		xas_set(&xas, index);
> +		entry = xas_load(&xas);
> +		if (dax_is_locked(entry)) {
> +			rcu_read_unlock();
> +			wait_entry_unlocked(&xas, entry);
> +			rcu_read_lock();
> +			continue;
> +		}
> +		if (!entry ||
> +		    dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> +			/*
> +			 * Because we are looking for entry from file's mapping
> +			 * and index, so the entry may not be inserted for now,
> +			 * or even a zero/empty entry.  We don't think this is
> +			 * an error case.  So, return a special value and do
> +			 * not output @page.
> +			 */
> +			entry = (void *)~0UL;

I kinda wonder if these open-coded magic values ~0UL (no entry) and 0
(cannot lock) should be #defines that force-cast the magic value to
dax_entry_t...

...but then I'm not really an expert in the design behind fs/dax.c --
this part looks reasonable enough to me, but I think Dan or Matthew
ought to look this over.

--D

> +		} else {
> +			*page = pfn_to_page(dax_to_pfn(entry));
> +			dax_lock_entry(&xas, entry);
> +		}
> +		xas_unlock_irq(&xas);
> +		break;
> +	}
> +	rcu_read_unlock();
> +	return (dax_entry_t)entry;
> +}
> +
> +void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index,
> +		dax_entry_t cookie)
> +{
> +	XA_STATE(xas, &mapping->i_pages, index);
> +
> +	if (cookie == ~0UL)
> +		return;
> +
> +	dax_unlock_entry(&xas, (void *)cookie);
> +}
> +
>  /*
>   * Find page cache entry at given index. If it is a DAX entry, return it
>   * with the entry locked. If the page cache doesn't contain an entry at
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index d273d59723cd..65411bee4312 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -156,6 +156,10 @@ struct page *dax_layout_busy_page(struct address_space *mapping);
>  struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
>  dax_entry_t dax_lock_page(struct page *page);
>  void dax_unlock_page(struct page *page, dax_entry_t cookie);
> +dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
> +		unsigned long index, struct page **page);
> +void dax_unlock_mapping_entry(struct address_space *mapping,
> +		unsigned long index, dax_entry_t cookie);
>  #else
>  #define generic_fsdax_supported		NULL
>  
> @@ -201,6 +205,17 @@ static inline dax_entry_t dax_lock_page(struct page *page)
>  static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
>  {
>  }
> +
> +static inline dax_entry_t dax_lock_mapping_entry(struct address_space *mapping,
> +		unsigned long index, struct page **page)
> +{
> +	return 0;
> +}
> +
> +static inline void dax_unlock_mapping_entry(struct address_space *mapping,
> +		unsigned long index, dax_entry_t cookie)
> +{
> +}
>  #endif
>  
>  #if IS_ENABLED(CONFIG_DAX)
> -- 
> 2.33.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 7/8] xfs: Implement ->notify_failure() for XFS
  2021-09-24 13:09 ` [PATCH v7 7/8] xfs: Implement ->notify_failure() for XFS Shiyang Ruan
@ 2021-10-14 19:21   ` Darrick J. Wong
  2021-10-15  6:41   ` Christoph Hellwig
  1 sibling, 0 replies; 26+ messages in thread
From: Darrick J. Wong @ 2021-10-14 19:21 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, jane.chu

On Fri, Sep 24, 2021 at 09:09:58PM +0800, Shiyang Ruan wrote:
> This function is used to handle errors which may cause data lost in
> filesystem.  Such as memory failure in fsdax mode.
> 
> If the rmap feature of XFS enabled, we can query it to find files and
> metadata which are associated with the corrupt data.  For now all we do
> is kill processes with that file mapped into their address spaces, but
> future patches could actually do something about corrupt metadata.
> 
> After that, the memory failure needs to notify the processes who are
> using those files.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  drivers/dax/super.c |  19 +++++
>  fs/xfs/xfs_fsops.c  |   3 +
>  fs/xfs/xfs_mount.h  |   1 +
>  fs/xfs/xfs_super.c  | 188 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/dax.h |  18 +++++
>  5 files changed, 229 insertions(+)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 7d4a11dcba90..22091e7fb0ef 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -135,6 +135,25 @@ struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
>  }
>  EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
>  
> +void fs_dax_register_holder(struct dax_device *dax_dev, void *holder,
> +		const struct dax_holder_operations *ops)
> +{
> +	dax_set_holder(dax_dev, holder, ops);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_register_holder);
> +
> +void fs_dax_unregister_holder(struct dax_device *dax_dev)
> +{
> +	dax_set_holder(dax_dev, NULL, NULL);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_unregister_holder);
> +
> +void *fs_dax_get_holder(struct dax_device *dax_dev)
> +{
> +	return dax_get_holder(dax_dev);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_get_holder);
> +
>  bool generic_fsdax_supported(struct dax_device *dax_dev,
>  		struct block_device *bdev, int blocksize, sector_t start,
>  		sector_t sectors)
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 33e26690a8c4..4c2d3d4ca5a5 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -542,6 +542,9 @@ xfs_do_force_shutdown(
>  	} else if (flags & SHUTDOWN_CORRUPT_INCORE) {
>  		tag = XFS_PTAG_SHUTDOWN_CORRUPT;
>  		why = "Corruption of in-memory data";
> +	} else if (flags & SHUTDOWN_CORRUPT_META) {
> +		tag = XFS_PTAG_SHUTDOWN_CORRUPT;
> +		why = "Corruption of on-disk metadata";
>  	} else {
>  		tag = XFS_PTAG_SHUTDOWN_IOERROR;
>  		why = "Metadata I/O Error";
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index e091f3b3fa15..d0f6da23e3df 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -434,6 +434,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int lags, char *fname,
>  #define SHUTDOWN_LOG_IO_ERROR	0x0002	/* write attempt to the log failed */
>  #define SHUTDOWN_FORCE_UMOUNT	0x0004	/* shutdown from a forced unmount */
>  #define SHUTDOWN_CORRUPT_INCORE	0x0008	/* corrupt in-memory data structures */
> +#define SHUTDOWN_CORRUPT_META	0x0010  /* corrupt metadata on device */
>  
>  #define XFS_SHUTDOWN_STRINGS \
>  	{ SHUTDOWN_META_IO_ERROR,	"metadata_io" }, \
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index c4e0cd1c1c8c..46fdf44b5ec2 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -37,11 +37,19 @@
>  #include "xfs_reflink.h"
>  #include "xfs_pwork.h"
>  #include "xfs_ag.h"
> +#include "xfs_alloc.h"
> +#include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_rtalloc.h"
> +#include "xfs_bit.h"
>  
>  #include <linux/magic.h>
>  #include <linux/fs_context.h>
>  #include <linux/fs_parser.h>
> +#include <linux/mm.h>
> +#include <linux/dax.h>
>  
> +static const struct dax_holder_operations xfs_dax_holder_operations;
>  static const struct super_operations xfs_super_operations;
>  
>  static struct kset *xfs_kset;		/* top-level xfs sysfs dir */
> @@ -377,6 +385,8 @@ xfs_close_devices(
>  
>  		xfs_free_buftarg(mp->m_logdev_targp);
>  		xfs_blkdev_put(logdev);
> +		if (dax_logdev)
> +			fs_dax_unregister_holder(dax_logdev);
>  		fs_put_dax(dax_logdev);
>  	}
>  	if (mp->m_rtdev_targp) {
> @@ -385,9 +395,13 @@ xfs_close_devices(
>  
>  		xfs_free_buftarg(mp->m_rtdev_targp);
>  		xfs_blkdev_put(rtdev);
> +		if (dax_rtdev)
> +			fs_dax_unregister_holder(dax_rtdev);
>  		fs_put_dax(dax_rtdev);
>  	}
>  	xfs_free_buftarg(mp->m_ddev_targp);
> +	if (dax_ddev)
> +		fs_dax_unregister_holder(dax_ddev);
>  	fs_put_dax(dax_ddev);
>  }
>  
> @@ -411,6 +425,9 @@ xfs_open_devices(
>  	struct block_device	*logdev = NULL, *rtdev = NULL;
>  	int			error;
>  
> +	if (dax_ddev)
> +		fs_dax_register_holder(dax_ddev, mp,
> +				&xfs_dax_holder_operations);
>  	/*
>  	 * Open real time and log devices - order is important.
>  	 */
> @@ -419,6 +436,9 @@ xfs_open_devices(
>  		if (error)
>  			goto out;
>  		dax_logdev = fs_dax_get_by_bdev(logdev);
> +		if (dax_logdev != dax_ddev && dax_logdev)
> +			fs_dax_register_holder(dax_logdev, mp,
> +					&xfs_dax_holder_operations);
>  	}
>  
>  	if (mp->m_rtname) {
> @@ -433,6 +453,9 @@ xfs_open_devices(
>  			goto out_close_rtdev;
>  		}
>  		dax_rtdev = fs_dax_get_by_bdev(rtdev);
> +		if (dax_rtdev)
> +			fs_dax_register_holder(dax_rtdev, mp,
> +					&xfs_dax_holder_operations);
>  	}
>  
>  	/*
> @@ -1110,6 +1133,171 @@ xfs_fs_free_cached_objects(
>  	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
>  }
>  
> +struct notify_failure_info {
> +	xfs_agblock_t startblock;

Style nit: tabs between type name and variable name.

> +	xfs_filblks_t blockcount;
> +	int flags;

Are these MF_ flags to be passed from memory_failure to
mf_dax_kill_procs?

> +};
> +
> +static loff_t
> +xfs_notify_failure_start(
> +	struct xfs_mount			*mp,
> +	const struct xfs_rmap_irec		*rec,
> +	const struct notify_failure_info	*notify)
> +{
> +	loff_t start = rec->rm_offset;
> +
> +	if (notify->startblock > rec->rm_startblock)
> +		start += XFS_FSB_TO_B(mp,
> +				notify->startblock - rec->rm_startblock);

I'm confused, is this supposed to return the file pos(ition) of the
failed range in units of bytes or in fs blocks?

If it's units of bytes (like the loff_t return value implies) then this
should be called xfs_notify_failure_pos.

> +	return start;
> +}
> +
> +static size_t
> +xfs_notify_failure_size(
> +	struct xfs_mount			*mp,
> +	const struct xfs_rmap_irec		*rec,
> +	const struct notify_failure_info	*notify)
> +{
> +	xfs_agblock_t rec_start = rec->rm_startblock;
> +	xfs_agblock_t rec_end = rec->rm_startblock + rec->rm_blockcount;

These are "next" variables, not "end".  The end of the record is
startblock + blockcount - 1.

> +	xfs_agblock_t notify_start = notify->startblock;
> +	xfs_agblock_t notify_end = notify->startblock + notify->blockcount;
> +	xfs_agblock_t cross_start = max(rec_start, notify_start);
> +	xfs_agblock_t cross_end = min(rec_end, notify_end);
> +
> +	return XFS_FSB_TO_B(mp, cross_end - cross_start);
> +}
> +
> +static int
> +xfs_dax_notify_failure_fn(
> +	struct xfs_btree_cur		*cur,
> +	const struct xfs_rmap_irec	*rec,
> +	void				*data)
> +{
> +	struct xfs_mount		*mp = cur->bc_mp;
> +	struct xfs_inode		*ip;
> +	struct address_space		*mapping;
> +	struct notify_failure_info	*notify = data;
> +	int				error = 0;
> +
> +	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> +	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> +		// TODO check and try to fix metadata
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_META);
> +		return -EFSCORRUPTED;
> +	}
> +
> +	/* Get files that incore, filter out others that are not in use. */
> +	error = xfs_iget(mp, cur->bc_tp, rec->rm_owner, XFS_IGET_INCORE,
> +			 0, &ip);

If you're going to use _INCORE then you probably want to filter out the
-ENODATA or whatever error code means "inode wasn't loaded", because
returning any nonzero value to rmap_query_range causes it to stop
iterating.

> +	if (error)
> +		return error;
> +
> +	mapping = VFS_I(ip)->i_mapping;
> +	if (IS_ENABLED(CONFIG_MEMORY_FAILURE)) {
> +		loff_t offset = xfs_notify_failure_start(mp, rec, notify);
> +		size_t size = xfs_notify_failure_size(mp, rec, notify);
> +
> +		error = mf_dax_kill_procs(mapping, offset >> PAGE_SHIFT, size,
> +					  notify->flags);
> +	}
> +	// TODO try to fix data
> +	xfs_irele(ip);
> +
> +	return error;
> +}
> +
> +static loff_t
> +xfs_dax_bdev_offset(
> +	struct xfs_mount *mp,
> +	struct dax_device *dax_dev,
> +	loff_t disk_offset)
> +{
> +	struct block_device *bdev;
> +
> +	if (mp->m_ddev_targp->bt_daxdev == dax_dev)
> +		bdev = mp->m_ddev_targp->bt_bdev;
> +	else if (mp->m_logdev_targp->bt_daxdev == dax_dev)
> +		bdev = mp->m_logdev_targp->bt_bdev;
> +	else
> +		bdev = mp->m_rtdev_targp->bt_bdev;
> +
> +	return disk_offset - (get_start_sect(bdev) << SECTOR_SHIFT);
> +}
> +
> +static int
> +xfs_dax_notify_failure(
> +	struct dax_device	*dax_dev,
> +	loff_t			offset,
> +	size_t			len,
> +	int			flags)

Are @flags supposed to contain the MF_ flags that were passed to
memory_failure()?  The variable name should probably be @mf_flags
throughout the patchset if that's the case.

> +{
> +	struct xfs_mount	*mp = fs_dax_get_holder(dax_dev);
> +	struct xfs_trans	*tp = NULL;
> +	struct xfs_btree_cur	*cur = NULL;
> +	struct xfs_buf		*agf_bp = NULL;
> +	struct xfs_rmap_irec	rmap_low, rmap_high;
> +	loff_t			bdev_offset = xfs_dax_bdev_offset(mp, dax_dev,
> +								  offset);

I don't like using loff_t to represent byte offsets into the physical
device.  loff_t should be used only for file byte offsets, and that's
not what we're storing here.

> +	xfs_fsblock_t		fsbno = XFS_B_TO_FSB(mp, bdev_offset);

I think this is still wrong, since fsblocks are segmented (nonlinear)
addresses.  Pass daddr into xfs_dax_notify_ddev_failure like I lay out
below, and then you can do:

	start_fsbno = XFS_DADDR_TO_FSB(mp, daddr);
	agno = XFS_FSB_TO_AGNO(mp, fsbno);
	notify.startblock = XFS_FSB_TO_AGBNO(mp, fsbno);
	notify.blockcount = XFS_BB_TO_FSB(mp, bblen);

(More on this below)

> +	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
> +	int			error = 0;
> +	struct notify_failure_info notify = {
> +		.startblock	= XFS_FSB_TO_AGBNO(mp, fsbno),
> +		.blockcount	= XFS_B_TO_FSB(mp, len),
> +		.flags		= flags,
> +	};
> +
> +	if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
> +		xfs_warn(mp,
> +			 "notify_failure() not supported on realtime device!");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> +	    mp->m_logdev_targp != mp->m_ddev_targp) {
> +		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_META);
> +		return -EFSCORRUPTED;
> +	}

Urk.  xfs_dax_notify_failure should be a short function to dispatch the
notification to the proper handler.  Everything from here down should be
in a separate function xfs_dax_notify_ddev_failure, so that the next
line of xfs_dax_notify_failure is just:

	return xfs_dax_notify_ddev_failure(mp, BTOBB(offset),
			BTOBB(len), flags);

And then we have

static int
xfs_dax_notify_ddev_failure(
	struct xfs_mount	*mp,
	xfs_daddr_t		daddr,
	xfs_daddr_t		bblen,
	int			mf_flags)
{
	if (!xfs_has_rmapbt(mp)) {
		xfs_warn(...);

Because otherwise xfs_dax_notify_failure gets cluttered.

> +
> +	if (!xfs_has_rmapbt(mp)) {
> +		xfs_warn(mp, "notify_failure() needs rmapbt enabled!");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	error = xfs_trans_alloc_empty(mp, &tp);
> +	if (error)
> +		return error;
> +
> +	error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
> +	if (error)
> +		goto out_cancel_tp;
> +
> +	cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agf_bp->b_pag);

What happens if the failure range spans multiple AGs?  I suppose it's
not technically possible today since we only report a single page at a
time.  But for the general case, I think we actually want (building off
the sample code above) this function to do something like this:

	start_fsbno = XFS_DADDR_TO_FSB(mp, daddr);
	agno = XFS_FSB_TO_AGNO(mp, fsbno);
	end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen);
	end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);

	for (; agno <= end_agbno; agno++) {
		struct xfs_rmap_irec	rmap_low = { };
		struct xfs_rmap_irec	rmap_high;

		notify.startblock = XFS_FSB_TO_AGBNO(mp, fsbno);
		notify.blockcount = XFS_BB_TO_FSB(mp, bblen);

		/*
		 * init transaction, read agf, init cursor...
		 */

		memset(&rmap_high, 0xFF, sizeof(rmap_high));
		rmap_low.rm_startblock = XFS_FSB_TO_AGBNO(mp, fsbno);
		if (agno == end_agbno)
			rmap_high.rm_startblock = XFS_FSB_TO_AGBNO(mp,
							end_fsbno);

		error = xfs_rmap_query_range(...);
		if (error)
			fail;

		fsbno = XFS_AGB_TO_FSB(mp, agno + 1, 0);
	}

--D

> +
> +	/* Construct a range for rmap query */
> +	memset(&rmap_low, 0, sizeof(rmap_low));
> +	memset(&rmap_high, 0xFF, sizeof(rmap_high));
> +	rmap_low.rm_startblock = rmap_high.rm_startblock = notify.startblock;
> +	rmap_low.rm_blockcount = rmap_high.rm_blockcount = notify.blockcount;
> +
> +	error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
> +				     xfs_dax_notify_failure_fn, &notify);
> +
> +	xfs_btree_del_cursor(cur, error);
> +	xfs_trans_brelse(tp, agf_bp);
> +
> +out_cancel_tp:
> +	xfs_trans_cancel(tp);
> +	return error;
> +}
> +
> +static const struct dax_holder_operations xfs_dax_holder_operations = {
> +	.notify_failure		= xfs_dax_notify_failure,
> +};
> +
>  static const struct super_operations xfs_super_operations = {
>  	.alloc_inode		= xfs_fs_alloc_inode,
>  	.destroy_inode		= xfs_fs_destroy_inode,
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 3d90becbd160..0f1fa7a4a616 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -149,6 +149,10 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
>  }
>  
>  struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
> +void fs_dax_register_holder(struct dax_device *dax_dev, void *holder,
> +		const struct dax_holder_operations *ops);
> +void fs_dax_unregister_holder(struct dax_device *dax_dev);
> +void *fs_dax_get_holder(struct dax_device *dax_dev);
>  int dax_writeback_mapping_range(struct address_space *mapping,
>  		struct dax_device *dax_dev, struct writeback_control *wbc);
>  
> @@ -179,6 +183,20 @@ static inline struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
>  	return NULL;
>  }
>  
> +static inline void fs_dax_register_holder(struct dax_device *dax_dev,
> +		void *holder, const struct dax_holder_operations *ops)
> +{
> +}
> +
> +static inline void fs_dax_unregister_holder(struct dax_device *dax_dev)
> +{
> +}
> +
> +static inline void *fs_dax_get_holder(struct dax_device *dax_dev)
> +{
> +	return NULL;
> +}
> +
>  static inline struct page *dax_layout_busy_page(struct address_space *mapping)
>  {
>  	return NULL;
> -- 
> 2.33.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 8/8] fsdax: add exception for reflinked files
  2021-09-24 13:09 ` [PATCH v7 8/8] fsdax: add exception for reflinked files Shiyang Ruan
@ 2021-10-14 19:24   ` Darrick J. Wong
  2021-10-15  6:38     ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2021-10-14 19:24 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, jane.chu

On Fri, Sep 24, 2021 at 09:09:59PM +0800, Shiyang Ruan wrote:
> For reflinked files, one dax page may be associated more than once with
> different fime mapping and index.  It will report warning.  Now, since
> we have introduced dax-RMAP for this case and also have to keep its
> functionality for other filesystems who are not support rmap, I add this
> exception here.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  fs/dax.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 2536c105ec7f..1a57211b1bc9 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -352,9 +352,10 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
>  	for_each_mapped_pfn(entry, pfn) {
>  		struct page *page = pfn_to_page(pfn);
>  
> -		WARN_ON_ONCE(page->mapping);
> -		page->mapping = mapping;
> -		page->index = index + i++;
> +		if (!page->mapping) {
> +			page->mapping = mapping;
> +			page->index = index + i++;

It feels a little dangerous to have page->mapping for shared storage
point to an actual address_space when there are really multiple
potential address_spaces out there.  If the mm or dax folks are ok with
doing this this way then I'll live with it, but it seems like you'd want
to leave /some/ kind of marker once you know that the page has multiple
owners and therefore regular mm rmap via page->mapping won't work.

--D

> +		}
>  	}
>  }
>  
> @@ -370,9 +371,10 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>  		struct page *page = pfn_to_page(pfn);
>  
>  		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> -		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> -		page->mapping = NULL;
> -		page->index = 0;
> +		if (page->mapping == mapping) {
> +			page->mapping = NULL;
> +			page->index = 0;
> +		}
>  	}
>  }
>  
> -- 
> 2.33.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 6/8] mm: Introduce mf_dax_kill_procs() for fsdax case
  2021-09-24 13:09 ` [PATCH v7 6/8] mm: Introduce mf_dax_kill_procs() for fsdax case Shiyang Ruan
@ 2021-10-14 19:32   ` Darrick J. Wong
  2021-10-20  5:47     ` Shiyang Ruan
  0 siblings, 1 reply; 26+ messages in thread
From: Darrick J. Wong @ 2021-10-14 19:32 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, jane.chu

On Fri, Sep 24, 2021 at 09:09:57PM +0800, Shiyang Ruan wrote:
> This function is called at the end of RMAP routine, i.e. filesystem
> recovery function, to collect and kill processes using a shared page of
> DAX file.  The difference between mf_generic_kill_procs() is,
> it accepts file's mapping,offset instead of struct page.  Because
> different file's mappings and offsets may share the same page in fsdax
> mode.  So, it is called when filesystem RMAP results are found.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  fs/dax.c            | 10 ------
>  include/linux/dax.h |  9 +++++
>  include/linux/mm.h  |  2 ++
>  mm/memory-failure.c | 83 ++++++++++++++++++++++++++++++++++++++++-----
>  4 files changed, 86 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 509b65e60478..2536c105ec7f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -852,16 +852,6 @@ static void *dax_insert_entry(struct xa_state *xas,
>  	return entry;
>  }
>  
> -static inline
> -unsigned long pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
> -{
> -	unsigned long address;
> -
> -	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> -	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> -	return address;
> -}
> -
>  /* Walk all mappings of a given index of a file and writeprotect them */
>  static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index,
>  		unsigned long pfn)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 65411bee4312..3d90becbd160 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -258,6 +258,15 @@ static inline bool dax_mapping(struct address_space *mapping)
>  {
>  	return mapping->host && IS_DAX(mapping->host);
>  }
> +static inline unsigned long pgoff_address(pgoff_t pgoff,
> +		struct vm_area_struct *vma)
> +{
> +	unsigned long address;
> +
> +	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> +	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> +	return address;
> +}
>  
>  #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
>  void hmem_register_device(int target_nid, struct resource *r);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 73a52aba448f..d06af0051e53 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3114,6 +3114,8 @@ enum mf_flags {
>  	MF_MUST_KILL = 1 << 2,
>  	MF_SOFT_OFFLINE = 1 << 3,
>  };
> +extern int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> +			     size_t size, int flags);
>  extern int memory_failure(unsigned long pfn, int flags);
>  extern void memory_failure_queue(unsigned long pfn, int flags);
>  extern void memory_failure_queue_kick(int cpu);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 85eab206b68f..a9d0d487d205 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -302,10 +302,9 @@ void shake_page(struct page *p)
>  }
>  EXPORT_SYMBOL_GPL(shake_page);
>  
> -static unsigned long dev_pagemap_mapping_shift(struct page *page,
> +static unsigned long dev_pagemap_mapping_shift(unsigned long address,
>  		struct vm_area_struct *vma)
>  {
> -	unsigned long address = vma_address(page, vma);
>  	pgd_t *pgd;
>  	p4d_t *p4d;
>  	pud_t *pud;
> @@ -345,7 +344,7 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
>   * Schedule a process for later kill.
>   * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
>   */
> -static void add_to_kill(struct task_struct *tsk, struct page *p,
> +static void add_to_kill(struct task_struct *tsk, struct page *p, pgoff_t pgoff,

Hm, so I guess you're passing the page and the pgoff now because
page->index is meaningless for shared dax pages?  Ok.

>  		       struct vm_area_struct *vma,
>  		       struct list_head *to_kill)
>  {
> @@ -358,9 +357,15 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
>  	}
>  
>  	tk->addr = page_address_in_vma(p, vma);
> -	if (is_zone_device_page(p))
> -		tk->size_shift = dev_pagemap_mapping_shift(p, vma);
> -	else
> +	if (is_zone_device_page(p)) {
> +		/*
> +		 * Since page->mapping is no more used for fsdax, we should
> +		 * calculate the address in a fsdax way.
> +		 */
> +		if (p->pgmap->type == MEMORY_DEVICE_FS_DAX)
> +			tk->addr = pgoff_address(pgoff, vma);
> +		tk->size_shift = dev_pagemap_mapping_shift(tk->addr, vma);
> +	} else
>  		tk->size_shift = page_shift(compound_head(p));
>  
>  	/*
> @@ -508,7 +513,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
>  			if (!page_mapped_in_vma(page, vma))
>  				continue;
>  			if (vma->vm_mm == t->mm)
> -				add_to_kill(t, page, vma, to_kill);
> +				add_to_kill(t, page, 0, vma, to_kill);
>  		}
>  	}
>  	read_unlock(&tasklist_lock);
> @@ -544,7 +549,32 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>  			 * to be informed of all such data corruptions.
>  			 */
>  			if (vma->vm_mm == t->mm)
> -				add_to_kill(t, page, vma, to_kill);
> +				add_to_kill(t, page, 0, vma, to_kill);
> +		}
> +	}
> +	read_unlock(&tasklist_lock);
> +	i_mmap_unlock_read(mapping);
> +}
> +
> +/*
> + * Collect processes when the error hit a fsdax page.
> + */
> +static void collect_procs_fsdax(struct page *page, struct address_space *mapping,
> +		pgoff_t pgoff, struct list_head *to_kill)
> +{
> +	struct vm_area_struct *vma;
> +	struct task_struct *tsk;
> +
> +	i_mmap_lock_read(mapping);
> +	read_lock(&tasklist_lock);
> +	for_each_process(tsk) {
> +		struct task_struct *t = task_early_kill(tsk, true);
> +
> +		if (!t)
> +			continue;
> +		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> +			if (vma->vm_mm == t->mm)
> +				add_to_kill(t, page, pgoff, vma, to_kill);
>  		}
>  	}
>  	read_unlock(&tasklist_lock);
> @@ -1503,6 +1533,43 @@ static int mf_generic_kill_procs(unsigned long long pfn, int flags,
>  	return 0;
>  }
>  
> +/**
> + * mf_dax_kill_procs - Collect and kill processes who are using this file range
> + * @mapping:	the file in use
> + * @index:	start offset of the range
> + * @size:	length of the range

It feels odd that one argument is in units of pgoff_t but the other is
in bytes.

> + * @flags:	memory failure flags
> + */
> +int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> +		size_t size, int flags)
> +{
> +	LIST_HEAD(to_kill);
> +	dax_entry_t cookie;
> +	struct page *page;
> +	size_t end = (index << PAGE_SHIFT) + size;
> +
> +	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;

Hm.  What flags will we be passing to the xfs_dax_notify_failure_fn?
Does XFS itself have to care about what's in the flags values, or is it
really just a magic cookie to be passed from the mm layer into the fs
and back to mf_dax_kill_procs?

--D

> +
> +	for (; (index << PAGE_SHIFT) < end; index++) {
> +		page = NULL;
> +		cookie = dax_lock_mapping_entry(mapping, index, &page);
> +		if (!cookie)
> +			return -EBUSY;
> +		if (!page)
> +			goto unlock;
> +
> +		SetPageHWPoison(page);
> +
> +		collect_procs_fsdax(page, mapping, index, &to_kill);
> +		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> +				index, flags);
> +unlock:
> +		dax_unlock_mapping_entry(mapping, index, cookie);
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
> +
>  static int memory_failure_hugetlb(unsigned long pfn, int flags)
>  {
>  	struct page *p = pfn_to_page(pfn);
> -- 
> 2.33.0
> 
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/8] dax: Use rwsem for dax_{read,write}_lock()
  2021-09-24 13:09 ` [PATCH v7 1/8] dax: Use rwsem for dax_{read,write}_lock() Shiyang Ruan
  2021-10-14 17:48   ` Darrick J. Wong
@ 2021-10-15  6:30   ` Christoph Hellwig
  1 sibling, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2021-10-15  6:30 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel, djwong,
	dan.j.williams, david, hch, jane.chu

On Fri, Sep 24, 2021 at 09:09:52PM +0800, Shiyang Ruan wrote:
> In order to introduce dax holder registration, we need a write lock for
> dax.  Because of the rarity of notification failures and the infrequency
> of registration events, it would be better to be a global lock rather
> than per-device.  So, change the current lock to rwsem and introduce a
> write lock for registration.

I don't think taking the rw_semaphore everywhere will scale, as
basically any DAX based I/O will take it (in read mode).

So at a minimum we'd need a per-device percpu_rw_semaphore if we want
to go there.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 3/8] mm: factor helpers for memory_failure_dev_pagemap
  2021-09-24 13:09 ` [PATCH v7 3/8] mm: factor helpers for memory_failure_dev_pagemap Shiyang Ruan
  2021-10-14 18:02   ` Darrick J. Wong
@ 2021-10-15  6:33   ` Christoph Hellwig
  1 sibling, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2021-10-15  6:33 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel, djwong,
	dan.j.williams, david, hch, jane.chu

On Fri, Sep 24, 2021 at 09:09:54PM +0800, Shiyang Ruan wrote:
> memory_failure_dev_pagemap code is a bit complex before introduce RMAP
> feature for fsdax.  So it is needed to factor some helper functions to
> simplify these code.
> 
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
>  mm/memory-failure.c | 140 ++++++++++++++++++++++++--------------------
>  1 file changed, 76 insertions(+), 64 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 54879c339024..8ff9b52823c0 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1430,6 +1430,79 @@ static int try_to_split_thp_page(struct page *page, const char *msg)
>  	return 0;
>  }
>  
> +static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
> +		struct address_space *mapping, pgoff_t index, int flags)
> +{
> +	struct to_kill *tk;
> +	unsigned long size = 0;
> +
> +	list_for_each_entry(tk, to_kill, nd)
> +		if (tk->size_shift)
> +			size = max(size, 1UL << tk->size_shift);
> +	if (size) {

Nit: an empty line here would be nice for readability.

> +	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> +		/*
> +		 * TODO: Handle HMM pages which may need coordination
> +		 * with device-side memory.
> +		 */
> +		return -EBUSY;

We've got rid of the HMM terminology for device private memory, so
I'd reword this update the comment to follow that while you're at it.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 4/8] pagemap,pmem: Introduce ->memory_failure()
  2021-09-24 13:09 ` [PATCH v7 4/8] pagemap,pmem: Introduce ->memory_failure() Shiyang Ruan
  2021-10-14 18:05   ` Darrick J. Wong
@ 2021-10-15  6:36   ` Christoph Hellwig
  1 sibling, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2021-10-15  6:36 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel, djwong,
	dan.j.williams, david, hch, jane.chu

Except for the error code inversion noticed by Darrick this looks fine
to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 8/8] fsdax: add exception for reflinked files
  2021-10-14 19:24   ` Darrick J. Wong
@ 2021-10-15  6:38     ` Christoph Hellwig
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2021-10-15  6:38 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Shiyang Ruan, linux-kernel, linux-xfs, nvdimm, linux-mm,
	linux-fsdevel, dan.j.williams, david, hch, jane.chu

On Thu, Oct 14, 2021 at 12:24:50PM -0700, Darrick J. Wong wrote:
> It feels a little dangerous to have page->mapping for shared storage
> point to an actual address_space when there are really multiple
> potential address_spaces out there.  If the mm or dax folks are ok with
> doing this this way then I'll live with it, but it seems like you'd want
> to leave /some/ kind of marker once you know that the page has multiple
> owners and therefore regular mm rmap via page->mapping won't work.

Yes, I thing poisoning page->mapping for the rmap enabled case seems
like a better idea.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 7/8] xfs: Implement ->notify_failure() for XFS
  2021-09-24 13:09 ` [PATCH v7 7/8] xfs: Implement ->notify_failure() for XFS Shiyang Ruan
  2021-10-14 19:21   ` Darrick J. Wong
@ 2021-10-15  6:41   ` Christoph Hellwig
  1 sibling, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2021-10-15  6:41 UTC (permalink / raw)
  To: Shiyang Ruan
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel, djwong,
	dan.j.williams, david, hch, jane.chu

On Fri, Sep 24, 2021 at 09:09:58PM +0800, Shiyang Ruan wrote:
> +void fs_dax_register_holder(struct dax_device *dax_dev, void *holder,
> +		const struct dax_holder_operations *ops)
> +{
> +	dax_set_holder(dax_dev, holder, ops);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_register_holder);
> +
> +void fs_dax_unregister_holder(struct dax_device *dax_dev)
> +{
> +	dax_set_holder(dax_dev, NULL, NULL);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_unregister_holder);
> +
> +void *fs_dax_get_holder(struct dax_device *dax_dev)
> +{
> +	return dax_get_holder(dax_dev);
> +}
> +EXPORT_SYMBOL_GPL(fs_dax_get_holder);

These should not be in a XFS patch.  But why do we even need this
wrappers?

> @@ -377,6 +385,8 @@ xfs_close_devices(
>  
>  		xfs_free_buftarg(mp->m_logdev_targp);
>  		xfs_blkdev_put(logdev);
> +		if (dax_logdev)
> +			fs_dax_unregister_holder(dax_logdev);
>  		fs_put_dax(dax_logdev);

I'd prefer to include the fs_dax_unregister_holder in the fs_put_dax
call to avoid callers failing to unregister it.

> @@ -411,6 +425,9 @@ xfs_open_devices(
>  	struct block_device	*logdev = NULL, *rtdev = NULL;
>  	int			error;
>  
> +	if (dax_ddev)
> +		fs_dax_register_holder(dax_ddev, mp,
> +				&xfs_dax_holder_operations);

I'd include the holder registration with fs_dax_get_by_bdev as well.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 1/8] dax: Use rwsem for dax_{read,write}_lock()
  2021-10-14 17:48   ` Darrick J. Wong
@ 2021-10-20  5:19     ` Shiyang Ruan
  0 siblings, 0 replies; 26+ messages in thread
From: Shiyang Ruan @ 2021-10-20  5:19 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, jane.chu



在 2021/10/15 1:48, Darrick J. Wong 写道:
> On Fri, Sep 24, 2021 at 09:09:52PM +0800, Shiyang Ruan wrote:
>> In order to introduce dax holder registration, we need a write lock for
>> dax.  Because of the rarity of notification failures and the infrequency
>> of registration events, it would be better to be a global lock rather
>> than per-device.  So, change the current lock to rwsem and introduce a
>> write lock for registration.
> 
> Urgh, I totally thought dax_read_lock was a global lock on something
> relating to the global dax_device state until I noticed this comment
> above kill_dax():
> 
> /*
>   * Note, rcu is not protecting the liveness of dax_dev, rcu is ensuring
>   * that any fault handlers or operations that might have seen
>   * dax_alive(), have completed.  Any operations that start after
>   * synchronize_srcu() has run will abort upon seeing !dax_alive().
>   */
> 
> So dax_srcu ensures stability in the dax_device's ALIVE state while any
> code that relies on that aliveness runs.  As a side effect, it'll block
> kill_dax (and I guess run_dax) while those functions run.  It doesn't
> protect any global state at all... but this isn't made obvious in the
> API by (for example) passing the dax_device into dax_read_lock.
> 
> IOWs, It's not protecting against the dax_device getting freed or
> anything resembling global state.  So that's probably why you note above
> that this /could/ be a per-device synchronization primitive, right?
> 
> If that's the case, then why shouldn't this be a per-device item?  As
> written here, any code that takes dax_write_lock() will block every dax
> device in the system while it does some work on a single dax device.
> Being an rwsem, it  will also have to wait for every other dax device
> access to complete before it can begin.  That seems excessive,
> particularly if in the future we start hooking up lots of pmem to a
> single host.
> 
> I have more to say around kill_dax() below.
> 
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> ---
>>   drivers/dax/device.c       | 11 +++++-----
>>   drivers/dax/super.c        | 43 ++++++++++++++++++++++----------------
>>   drivers/md/dm-writecache.c |  7 +++----
>>   fs/dax.c                   | 26 +++++++++++------------
>>   include/linux/dax.h        |  9 ++++----
>>   5 files changed, 49 insertions(+), 47 deletions(-)
>>
>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>> index dd8222a42808..cc7b835509f9 100644
>> --- a/drivers/dax/device.c
>> +++ b/drivers/dax/device.c
>> @@ -198,7 +198,6 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>>   	struct file *filp = vmf->vma->vm_file;
>>   	unsigned long fault_size;
>>   	vm_fault_t rc = VM_FAULT_SIGBUS;
>> -	int id;
>>   	pfn_t pfn;
>>   	struct dev_dax *dev_dax = filp->private_data;
>>   
>> @@ -206,7 +205,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>>   			(vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read",
>>   			vmf->vma->vm_start, vmf->vma->vm_end, pe_size);
>>   
>> -	id = dax_read_lock();
>> +	dax_read_lock();
>>   	switch (pe_size) {
>>   	case PE_SIZE_PTE:
>>   		fault_size = PAGE_SIZE;
>> @@ -246,7 +245,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
>>   			page->index = pgoff + i;
>>   		}
>>   	}
>> -	dax_read_unlock(id);
>> +	dax_read_unlock();
>>   
>>   	return rc;
>>   }
>> @@ -284,7 +283,7 @@ static const struct vm_operations_struct dax_vm_ops = {
>>   static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
>>   {
>>   	struct dev_dax *dev_dax = filp->private_data;
>> -	int rc, id;
>> +	int rc;
>>   
>>   	dev_dbg(&dev_dax->dev, "trace\n");
>>   
>> @@ -292,9 +291,9 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
>>   	 * We lock to check dax_dev liveness and will re-check at
>>   	 * fault time.
>>   	 */
>> -	id = dax_read_lock();
>> +	dax_read_lock();
>>   	rc = check_vma(dev_dax, vma, __func__);
>> -	dax_read_unlock(id);
>> +	dax_read_unlock();
>>   	if (rc)
>>   		return rc;
>>   
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index fc89e91beea7..48ce86501d93 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -36,7 +36,7 @@ struct dax_device {
>>   };
>>   
>>   static dev_t dax_devt;
>> -DEFINE_STATIC_SRCU(dax_srcu);
>> +static DECLARE_RWSEM(dax_rwsem);
>>   static struct vfsmount *dax_mnt;
>>   static DEFINE_IDA(dax_minor_ida);
>>   static struct kmem_cache *dax_cache __read_mostly;
>> @@ -46,18 +46,28 @@ static struct super_block *dax_superblock __read_mostly;
>>   static struct hlist_head dax_host_list[DAX_HASH_SIZE];
>>   static DEFINE_SPINLOCK(dax_host_lock);
>>   
>> -int dax_read_lock(void)
>> +void dax_read_lock(void)
>>   {
>> -	return srcu_read_lock(&dax_srcu);
>> +	down_read(&dax_rwsem);
>>   }
>>   EXPORT_SYMBOL_GPL(dax_read_lock);
>>   
>> -void dax_read_unlock(int id)
>> +void dax_read_unlock(void)
>>   {
>> -	srcu_read_unlock(&dax_srcu, id);
>> +	up_read(&dax_rwsem);
>>   }
>>   EXPORT_SYMBOL_GPL(dax_read_unlock);
>>   
>> +void dax_write_lock(void)
>> +{
>> +	down_write(&dax_rwsem);
>> +}
>> +
>> +void dax_write_unlock(void)
>> +{
>> +	up_write(&dax_rwsem);
>> +}
>> +
>>   static int dax_host_hash(const char *host)
>>   {
>>   	return hashlen_hash(hashlen_string("DAX", host)) % DAX_HASH_SIZE;
>> @@ -70,14 +80,14 @@ static int dax_host_hash(const char *host)
>>   static struct dax_device *dax_get_by_host(const char *host)
>>   {
>>   	struct dax_device *dax_dev, *found = NULL;
>> -	int hash, id;
>> +	int hash;
>>   
>>   	if (!host)
>>   		return NULL;
>>   
>>   	hash = dax_host_hash(host);
>>   
>> -	id = dax_read_lock();
>> +	dax_read_lock();
>>   	spin_lock(&dax_host_lock);
>>   	hlist_for_each_entry(dax_dev, &dax_host_list[hash], list) {
>>   		if (!dax_alive(dax_dev)
>> @@ -89,7 +99,7 @@ static struct dax_device *dax_get_by_host(const char *host)
>>   		break;
>>   	}
>>   	spin_unlock(&dax_host_lock);
>> -	dax_read_unlock(id);
>> +	dax_read_unlock();
>>   
>>   	return found;
>>   }
>> @@ -130,7 +140,7 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
>>   	pfn_t pfn, end_pfn;
>>   	sector_t last_page;
>>   	long len, len2;
>> -	int err, id;
>> +	int err;
>>   
>>   	if (blocksize != PAGE_SIZE) {
>>   		pr_info("%pg: error: unsupported blocksize for dax\n", bdev);
>> @@ -155,14 +165,14 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
>>   		return false;
>>   	}
>>   
>> -	id = dax_read_lock();
>> +	dax_read_lock();
>>   	len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
>>   	len2 = dax_direct_access(dax_dev, pgoff_end, 1, &end_kaddr, &end_pfn);
>>   
>>   	if (len < 1 || len2 < 1) {
>>   		pr_info("%pg: error: dax access failed (%ld)\n",
>>   				bdev, len < 1 ? len : len2);
>> -		dax_read_unlock(id);
>> +		dax_read_unlock();
>>   		return false;
>>   	}
>>   
>> @@ -192,7 +202,7 @@ bool generic_fsdax_supported(struct dax_device *dax_dev,
>>   		put_dev_pagemap(end_pgmap);
>>   
>>   	}
>> -	dax_read_unlock(id);
>> +	dax_read_unlock();
>>   
>>   	if (!dax_enabled) {
>>   		pr_info("%pg: error: dax support not enabled\n", bdev);
>> @@ -206,16 +216,15 @@ bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
>>   		int blocksize, sector_t start, sector_t len)
>>   {
>>   	bool ret = false;
>> -	int id;
>>   
>>   	if (!dax_dev)
>>   		return false;
>>   
>> -	id = dax_read_lock();
>> +	dax_read_lock();
>>   	if (dax_alive(dax_dev) && dax_dev->ops->dax_supported)
>>   		ret = dax_dev->ops->dax_supported(dax_dev, bdev, blocksize,
>>   						  start, len);
>> -	dax_read_unlock(id);
>> +	dax_read_unlock();
>>   	return ret;
>>   }
>>   EXPORT_SYMBOL_GPL(dax_supported);
>> @@ -410,7 +419,7 @@ EXPORT_SYMBOL_GPL(__set_dax_synchronous);
>>   
>>   bool dax_alive(struct dax_device *dax_dev)
>>   {
>> -	lockdep_assert_held(&dax_srcu);
>> +	lockdep_assert_held(&dax_rwsem);
>>   	return test_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>   }
>>   EXPORT_SYMBOL_GPL(dax_alive);
>> @@ -428,8 +437,6 @@ void kill_dax(struct dax_device *dax_dev)
>>   
>>   	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>   
>> -	synchronize_srcu(&dax_srcu);
> 
> Shouldn't you take the dax_write_lock() around the clear_bit call to
> maintain the behavior that kill_dax can't proceed until all the
> functions that depend on DAXDEV_ALIVE state have finished?

Yes, I understood now.  I'll change it to a per-device 
percpu_rw_semaphore.  The global rw_sem is not so good.


--
Thanks,
Ruan

> 
> --D
> 
>> -
>>   	spin_lock(&dax_host_lock);
>>   	hlist_del_init(&dax_dev->list);
>>   	spin_unlock(&dax_host_lock);
>> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
>> index 18320444fb0a..1067b3e98220 100644
>> --- a/drivers/md/dm-writecache.c
>> +++ b/drivers/md/dm-writecache.c
>> @@ -260,7 +260,6 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>>   	loff_t s;
>>   	long p, da;
>>   	pfn_t pfn;
>> -	int id;
>>   	struct page **pages;
>>   	sector_t offset;
>>   
>> @@ -284,7 +283,7 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>>   	}
>>   	offset >>= PAGE_SHIFT - 9;
>>   
>> -	id = dax_read_lock();
>> +	dax_read_lock();
>>   
>>   	da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn);
>>   	if (da < 0) {
>> @@ -334,7 +333,7 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>>   		wc->memory_vmapped = true;
>>   	}
>>   
>> -	dax_read_unlock(id);
>> +	dax_read_unlock();
>>   
>>   	wc->memory_map += (size_t)wc->start_sector << SECTOR_SHIFT;
>>   	wc->memory_map_size -= (size_t)wc->start_sector << SECTOR_SHIFT;
>> @@ -343,7 +342,7 @@ static int persistent_memory_claim(struct dm_writecache *wc)
>>   err3:
>>   	kvfree(pages);
>>   err2:
>> -	dax_read_unlock(id);
>> +	dax_read_unlock();
>>   err1:
>>   	return r;
>>   }
>> diff --git a/fs/dax.c b/fs/dax.c
>> index 4e3e5a283a91..798c43f09eee 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -715,22 +715,21 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
>>   	void *vto, *kaddr;
>>   	pgoff_t pgoff;
>>   	long rc;
>> -	int id;
>>   
>>   	rc = bdev_dax_pgoff(bdev, sector, PAGE_SIZE, &pgoff);
>>   	if (rc)
>>   		return rc;
>>   
>> -	id = dax_read_lock();
>> +	dax_read_lock();
>>   	rc = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL);
>>   	if (rc < 0) {
>> -		dax_read_unlock(id);
>> +		dax_read_unlock();
>>   		return rc;
>>   	}
>>   	vto = kmap_atomic(to);
>>   	copy_user_page(vto, (void __force *)kaddr, vaddr, to);
>>   	kunmap_atomic(vto);
>> -	dax_read_unlock(id);
>> +	dax_read_unlock();
>>   	return 0;
>>   }
>>   
>> @@ -1015,13 +1014,13 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
>>   {
>>   	const sector_t sector = dax_iomap_sector(iomap, pos);
>>   	pgoff_t pgoff;
>> -	int id, rc;
>> +	int rc;
>>   	long length;
>>   
>>   	rc = bdev_dax_pgoff(iomap->bdev, sector, size, &pgoff);
>>   	if (rc)
>>   		return rc;
>> -	id = dax_read_lock();
>> +	dax_read_lock();
>>   	length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
>>   				   NULL, pfnp);
>>   	if (length < 0) {
>> @@ -1038,7 +1037,7 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size,
>>   		goto out;
>>   	rc = 0;
>>   out:
>> -	dax_read_unlock(id);
>> +	dax_read_unlock();
>>   	return rc;
>>   }
>>   
>> @@ -1130,7 +1129,7 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>>   {
>>   	sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
>>   	pgoff_t pgoff;
>> -	long rc, id;
>> +	long rc;
>>   	void *kaddr;
>>   	bool page_aligned = false;
>>   	unsigned offset = offset_in_page(pos);
>> @@ -1144,14 +1143,14 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>>   	if (rc)
>>   		return rc;
>>   
>> -	id = dax_read_lock();
>> +	dax_read_lock();
>>   
>>   	if (page_aligned)
>>   		rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
>>   	else
>>   		rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
>>   	if (rc < 0) {
>> -		dax_read_unlock(id);
>> +		dax_read_unlock();
>>   		return rc;
>>   	}
>>   
>> @@ -1159,7 +1158,7 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>>   		memset(kaddr + offset, 0, size);
>>   		dax_flush(iomap->dax_dev, kaddr + offset, size);
>>   	}
>> -	dax_read_unlock(id);
>> +	dax_read_unlock();
>>   	return size;
>>   }
>>   
>> @@ -1174,7 +1173,6 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
>>   	loff_t end = pos + length, done = 0;
>>   	ssize_t ret = 0;
>>   	size_t xfer;
>> -	int id;
>>   
>>   	if (iov_iter_rw(iter) == READ) {
>>   		end = min(end, i_size_read(iomi->inode));
>> @@ -1199,7 +1197,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
>>   					      (end - 1) >> PAGE_SHIFT);
>>   	}
>>   
>> -	id = dax_read_lock();
>> +	dax_read_lock();
>>   	while (pos < end) {
>>   		unsigned offset = pos & (PAGE_SIZE - 1);
>>   		const size_t size = ALIGN(length + offset, PAGE_SIZE);
>> @@ -1251,7 +1249,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
>>   		if (xfer < map_len)
>>   			break;
>>   	}
>> -	dax_read_unlock(id);
>> +	dax_read_unlock();
>>   
>>   	return done ? done : ret;
>>   }
>> diff --git a/include/linux/dax.h b/include/linux/dax.h
>> index 2619d94c308d..097b3304f9b9 100644
>> --- a/include/linux/dax.h
>> +++ b/include/linux/dax.h
>> @@ -177,15 +177,14 @@ static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
>>   #endif
>>   
>>   #if IS_ENABLED(CONFIG_DAX)
>> -int dax_read_lock(void);
>> -void dax_read_unlock(int id);
>> +void dax_read_lock(void);
>> +void dax_read_unlock(void);
>>   #else
>> -static inline int dax_read_lock(void)
>> +static inline void dax_read_lock(void)
>>   {
>> -	return 0;
>>   }
>>   
>> -static inline void dax_read_unlock(int id)
>> +static inline void dax_read_unlock(void)
>>   {
>>   }
>>   #endif /* CONFIG_DAX */
>> -- 
>> 2.33.0
>>
>>
>>



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 4/8] pagemap,pmem: Introduce ->memory_failure()
  2021-10-14 18:05   ` Darrick J. Wong
@ 2021-10-20  5:25     ` Shiyang Ruan
  0 siblings, 0 replies; 26+ messages in thread
From: Shiyang Ruan @ 2021-10-20  5:25 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, jane.chu



在 2021/10/15 2:05, Darrick J. Wong 写道:
> On Fri, Sep 24, 2021 at 09:09:55PM +0800, Shiyang Ruan wrote:
>> When memory-failure occurs, we call this function which is implemented
>> by each kind of devices.  For the fsdax case, pmem device driver
>> implements it.  Pmem device driver will find out the filesystem in which
>> the corrupted page located in.
>>
>> With dax_holder notify support, we are able to notify the memory failure
>> from pmem driver to upper layers.  If there is something not support in
>> the notify routine, memory_failure will fall back to the generic hanlder.
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> ---
>>   drivers/nvdimm/pmem.c    | 11 +++++++++++
>>   include/linux/memremap.h |  9 +++++++++
>>   mm/memory-failure.c      | 14 ++++++++++++++
>>   3 files changed, 34 insertions(+)
>>
>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>> index 72de88ff0d30..0dfafad8fcc5 100644
>> --- a/drivers/nvdimm/pmem.c
>> +++ b/drivers/nvdimm/pmem.c
>> @@ -362,9 +362,20 @@ static void pmem_release_disk(void *__pmem)
>>   	del_gendisk(pmem->disk);
>>   }
>>   
>> +static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
>> +		unsigned long pfn, size_t size, int flags)
>> +{
>> +	struct pmem_device *pmem =
>> +			container_of(pgmap, struct pmem_device, pgmap);
>> +	loff_t offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;
>> +
>> +	return dax_holder_notify_failure(pmem->dax_dev, offset, size, flags);
>> +}
>> +
>>   static const struct dev_pagemap_ops fsdax_pagemap_ops = {
>>   	.kill			= pmem_pagemap_kill,
>>   	.cleanup		= pmem_pagemap_cleanup,
>> +	.memory_failure		= pmem_pagemap_memory_failure,
>>   };
>>   
>>   static int pmem_attach_disk(struct device *dev,
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index c0e9d35889e8..36d47bacd46d 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -87,6 +87,15 @@ struct dev_pagemap_ops {
>>   	 * the page back to a CPU accessible page.
>>   	 */
>>   	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
>> +
>> +	/*
>> +	 * Handle the memory failure happens on a range of pfns.  Notify the
>> +	 * processes who are using these pfns, and try to recover the data on
>> +	 * them if necessary.  The flag is finally passed to the recover
>> +	 * function through the whole notify routine.
>> +	 */
>> +	int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
>> +			      size_t size, int flags);
>>   };
>>   
>>   #define PGMAP_ALTMAP_VALID	(1 << 0)
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 8ff9b52823c0..85eab206b68f 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1605,6 +1605,20 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>>   	if (!pgmap_pfn_valid(pgmap, pfn))
>>   		goto out;
>>   
>> +	/*
>> +	 * Call driver's implementation to handle the memory failure, otherwise
>> +	 * fall back to generic handler.
>> +	 */
>> +	if (pgmap->ops->memory_failure) {
>> +		rc = pgmap->ops->memory_failure(pgmap, pfn, PAGE_SIZE, flags);
>> +		/*
>> +		 * Fall back to generic handler too if operation is not
>> +		 * supported inside the driver/device/filesystem.
>> +		 */
>> +		if (rc != EOPNOTSUPP)
> 
> -EOPNOTSUPP?  (negative errno)

Yes, my mistake. Thanks for pointing out.


--
Thanks,
Ruan.

> 
> --D
> 
>> +			goto out;
>> +	}
>> +
>>   	rc = mf_generic_kill_procs(pfn, flags, pgmap);
>>   out:
>>   	/* drop pgmap ref acquired in caller */
>> -- 
>> 2.33.0
>>
>>
>>



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 6/8] mm: Introduce mf_dax_kill_procs() for fsdax case
  2021-10-14 19:32   ` Darrick J. Wong
@ 2021-10-20  5:47     ` Shiyang Ruan
  0 siblings, 0 replies; 26+ messages in thread
From: Shiyang Ruan @ 2021-10-20  5:47 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, jane.chu



在 2021/10/15 3:32, Darrick J. Wong 写道:
> On Fri, Sep 24, 2021 at 09:09:57PM +0800, Shiyang Ruan wrote:
>> This function is called at the end of RMAP routine, i.e. filesystem
>> recovery function, to collect and kill processes using a shared page of
>> DAX file.  The difference between mf_generic_kill_procs() is,
>> it accepts file's mapping,offset instead of struct page.  Because
>> different file's mappings and offsets may share the same page in fsdax
>> mode.  So, it is called when filesystem RMAP results are found.
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> ---
>>   fs/dax.c            | 10 ------
>>   include/linux/dax.h |  9 +++++
>>   include/linux/mm.h  |  2 ++
>>   mm/memory-failure.c | 83 ++++++++++++++++++++++++++++++++++++++++-----
>>   4 files changed, 86 insertions(+), 18 deletions(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index 509b65e60478..2536c105ec7f 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -852,16 +852,6 @@ static void *dax_insert_entry(struct xa_state *xas,
>>   	return entry;
>>   }
>>   
>> -static inline
>> -unsigned long pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
>> -{
>> -	unsigned long address;
>> -
>> -	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>> -	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
>> -	return address;
>> -}
>> -
>>   /* Walk all mappings of a given index of a file and writeprotect them */
>>   static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index,
>>   		unsigned long pfn)
>> diff --git a/include/linux/dax.h b/include/linux/dax.h
>> index 65411bee4312..3d90becbd160 100644
>> --- a/include/linux/dax.h
>> +++ b/include/linux/dax.h
>> @@ -258,6 +258,15 @@ static inline bool dax_mapping(struct address_space *mapping)
>>   {
>>   	return mapping->host && IS_DAX(mapping->host);
>>   }
>> +static inline unsigned long pgoff_address(pgoff_t pgoff,
>> +		struct vm_area_struct *vma)
>> +{
>> +	unsigned long address;
>> +
>> +	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>> +	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
>> +	return address;
>> +}
>>   
>>   #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
>>   void hmem_register_device(int target_nid, struct resource *r);
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 73a52aba448f..d06af0051e53 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3114,6 +3114,8 @@ enum mf_flags {
>>   	MF_MUST_KILL = 1 << 2,
>>   	MF_SOFT_OFFLINE = 1 << 3,
>>   };
>> +extern int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>> +			     size_t size, int flags);
>>   extern int memory_failure(unsigned long pfn, int flags);
>>   extern void memory_failure_queue(unsigned long pfn, int flags);
>>   extern void memory_failure_queue_kick(int cpu);
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 85eab206b68f..a9d0d487d205 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -302,10 +302,9 @@ void shake_page(struct page *p)
>>   }
>>   EXPORT_SYMBOL_GPL(shake_page);
>>   
>> -static unsigned long dev_pagemap_mapping_shift(struct page *page,
>> +static unsigned long dev_pagemap_mapping_shift(unsigned long address,
>>   		struct vm_area_struct *vma)
>>   {
>> -	unsigned long address = vma_address(page, vma);
>>   	pgd_t *pgd;
>>   	p4d_t *p4d;
>>   	pud_t *pud;
>> @@ -345,7 +344,7 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
>>    * Schedule a process for later kill.
>>    * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
>>    */
>> -static void add_to_kill(struct task_struct *tsk, struct page *p,
>> +static void add_to_kill(struct task_struct *tsk, struct page *p, pgoff_t pgoff,
> 
> Hm, so I guess you're passing the page and the pgoff now because
> page->index is meaningless for shared dax pages?  Ok.

Yes, it is for that case.

> 
>>   		       struct vm_area_struct *vma,
>>   		       struct list_head *to_kill)
>>   {
>> @@ -358,9 +357,15 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
>>   	}
>>   
>>   	tk->addr = page_address_in_vma(p, vma);
>> -	if (is_zone_device_page(p))
>> -		tk->size_shift = dev_pagemap_mapping_shift(p, vma);
>> -	else
>> +	if (is_zone_device_page(p)) {
>> +		/*
>> +		 * Since page->mapping is no more used for fsdax, we should
>> +		 * calculate the address in a fsdax way.
>> +		 */
>> +		if (p->pgmap->type == MEMORY_DEVICE_FS_DAX)
>> +			tk->addr = pgoff_address(pgoff, vma);
>> +		tk->size_shift = dev_pagemap_mapping_shift(tk->addr, vma);
>> +	} else
>>   		tk->size_shift = page_shift(compound_head(p));
>>   
>>   	/*
>> @@ -508,7 +513,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
>>   			if (!page_mapped_in_vma(page, vma))
>>   				continue;
>>   			if (vma->vm_mm == t->mm)
>> -				add_to_kill(t, page, vma, to_kill);
>> +				add_to_kill(t, page, 0, vma, to_kill);
>>   		}
>>   	}
>>   	read_unlock(&tasklist_lock);
>> @@ -544,7 +549,32 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>>   			 * to be informed of all such data corruptions.
>>   			 */
>>   			if (vma->vm_mm == t->mm)
>> -				add_to_kill(t, page, vma, to_kill);
>> +				add_to_kill(t, page, 0, vma, to_kill);
>> +		}
>> +	}
>> +	read_unlock(&tasklist_lock);
>> +	i_mmap_unlock_read(mapping);
>> +}
>> +
>> +/*
>> + * Collect processes when the error hit a fsdax page.
>> + */
>> +static void collect_procs_fsdax(struct page *page, struct address_space *mapping,
>> +		pgoff_t pgoff, struct list_head *to_kill)
>> +{
>> +	struct vm_area_struct *vma;
>> +	struct task_struct *tsk;
>> +
>> +	i_mmap_lock_read(mapping);
>> +	read_lock(&tasklist_lock);
>> +	for_each_process(tsk) {
>> +		struct task_struct *t = task_early_kill(tsk, true);
>> +
>> +		if (!t)
>> +			continue;
>> +		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>> +			if (vma->vm_mm == t->mm)
>> +				add_to_kill(t, page, pgoff, vma, to_kill);
>>   		}
>>   	}
>>   	read_unlock(&tasklist_lock);
>> @@ -1503,6 +1533,43 @@ static int mf_generic_kill_procs(unsigned long long pfn, int flags,
>>   	return 0;
>>   }
>>   
>> +/**
>> + * mf_dax_kill_procs - Collect and kill processes who are using this file range
>> + * @mapping:	the file in use
>> + * @index:	start offset of the range
>> + * @size:	length of the range
> 
> It feels odd that one argument is in units of pgoff_t but the other is
> in bytes.

The index is page aligned but @size may not be.  I will explain it in 
detail in the comments.

> 
>> + * @flags:	memory failure flags
>> + */
>> +int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>> +		size_t size, int flags)
>> +{
>> +	LIST_HEAD(to_kill);
>> +	dax_entry_t cookie;
>> +	struct page *page;
>> +	size_t end = (index << PAGE_SHIFT) + size;
>> +
>> +	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> 
> Hm.  What flags will we be passing to the xfs_dax_notify_failure_fn?
> Does XFS itself have to care about what's in the flags values, or is it
> really just a magic cookie to be passed from the mm layer into the fs
> and back to mf_dax_kill_procs?
> 

Just to pass the flag from mm layer to mf_dax_kill_procs().  No one 
inside this RMAP progress will care about or change it.  As you 
mentioned in the next patch, I think this should be named with a "mf_" 
prefix to make it easier to understand.


--
Thanks,
Ruan.

> --D
> 
>> +
>> +	for (; (index << PAGE_SHIFT) < end; index++) {
>> +		page = NULL;
>> +		cookie = dax_lock_mapping_entry(mapping, index, &page);
>> +		if (!cookie)
>> +			return -EBUSY;
>> +		if (!page)
>> +			goto unlock;
>> +
>> +		SetPageHWPoison(page);
>> +
>> +		collect_procs_fsdax(page, mapping, index, &to_kill);
>> +		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>> +				index, flags);
>> +unlock:
>> +		dax_unlock_mapping_entry(mapping, index, cookie);
>> +	}
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
>> +
>>   static int memory_failure_hugetlb(unsigned long pfn, int flags)
>>   {
>>   	struct page *p = pfn_to_page(pfn);
>> -- 
>> 2.33.0
>>
>>
>>



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v7 2/8] dax: Introduce holder for dax_device
  2021-10-14 18:00   ` Darrick J. Wong
@ 2021-10-20  6:58     ` Shiyang Ruan
  0 siblings, 0 replies; 26+ messages in thread
From: Shiyang Ruan @ 2021-10-20  6:58 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, linux-xfs, nvdimm, linux-mm, linux-fsdevel,
	dan.j.williams, david, hch, jane.chu



在 2021/10/15 2:00, Darrick J. Wong 写道:
> On Fri, Sep 24, 2021 at 09:09:53PM +0800, Shiyang Ruan wrote:
>> To easily track filesystem from a pmem device, we introduce a holder for
>> dax_device structure, and also its operation.  This holder is used to
>> remember who is using this dax_device:
>>   - When it is the backend of a filesystem, the holder will be the
>>     superblock of this filesystem.
>>   - When this pmem device is one of the targets in a mapped device, the
>>     holder will be this mapped device.  In this case, the mapped device
>>     has its own dax_device and it will follow the first rule.  So that we
>>     can finally track to the filesystem we needed.
>>
>> The holder and holder_ops will be set when filesystem is being mounted,
>> or an target device is being activated.
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> ---
>>   drivers/dax/super.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
>>   include/linux/dax.h | 29 ++++++++++++++++++++++
>>   2 files changed, 88 insertions(+)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index 48ce86501d93..7d4a11dcba90 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -23,7 +23,10 @@
>>    * @cdev: optional character interface for "device dax"
>>    * @host: optional name for lookups where the device path is not available
>>    * @private: dax driver private data
>> + * @holder_data: holder of a dax_device: could be filesystem or mapped device
>>    * @flags: state and boolean properties
>> + * @ops: operations for dax_device
>> + * @holder_ops: operations for the inner holder
>>    */
>>   struct dax_device {
>>   	struct hlist_node list;
>> @@ -31,8 +34,10 @@ struct dax_device {
>>   	struct cdev cdev;
>>   	const char *host;
>>   	void *private;
>> +	void *holder_data;
>>   	unsigned long flags;
>>   	const struct dax_operations *ops;
>> +	const struct dax_holder_operations *holder_ops;
>>   };
>>   
>>   static dev_t dax_devt;
>> @@ -374,6 +379,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
>>   }
>>   EXPORT_SYMBOL_GPL(dax_zero_page_range);
>>   
>> +int dax_holder_notify_failure(struct dax_device *dax_dev, loff_t offset,
>> +			      size_t size, int flags)
>> +{
>> +	int rc;
>> +
>> +	dax_read_lock();
>> +	if (!dax_alive(dax_dev)) {
>> +		rc = -ENXIO;
>> +		goto out;
>> +	}
>> +
>> +	if (!dax_dev->holder_data) {
>> +		rc = -EOPNOTSUPP;
>> +		goto out;
>> +	}
>> +
>> +	rc = dax_dev->holder_ops->notify_failure(dax_dev, offset, size, flags);
> 
> Shouldn't this check if dax_dev->holder_ops != NULL before dereferencing
> it for the function call?  Imagine an implementation that wants to
> attach a ->notify_failure function to a dax_device, maintains its own
> lookup table, and decides that it doesn't need to set holder_data.
> 
> (Or, imagine someone who writes a garbage into holder_data and *boom*)

My mistake. I should check @holder_ops instead of @holder_data.

> 
> How does the locking work here?  If there's a media failure, we'll take
> dax_rwsem and call ->notify_failure.  If the ->notify_failure function
> wants to access the pmem to handle the error by calling back into the
> dax code, will that cause nested locking on dax_rwsem?

Won't for now.  I have tested it with my simple testcases.
> 
> Jumping ahead a bit, I think the rmap btree accesses that the xfs
> implementation performs can cause xfs_buf(fer) cache IO, which would
> trigger that if the buffers aren't already in memory, if I'm reading
> this correctly?

I didn't think of this case.  But I think this uses read lock too.  It 
won't be blocked.  Only dax_set_holder() takes write lock.

> 
>> +out:
>> +	dax_read_unlock();
>> +	return rc;
>> +}
>> +EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
>> +
>>   #ifdef CONFIG_ARCH_HAS_PMEM_API
>>   void arch_wb_cache_pmem(void *addr, size_t size);
>>   void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
>> @@ -618,6 +646,37 @@ void put_dax(struct dax_device *dax_dev)
>>   }
>>   EXPORT_SYMBOL_GPL(put_dax);
>>   
>> +void dax_set_holder(struct dax_device *dax_dev, void *holder,
>> +		const struct dax_holder_operations *ops)
>> +{
>> +	dax_write_lock();
>> +	if (!dax_alive(dax_dev)) {
>> +		dax_write_unlock();
>> +		return;
>> +	}
>> +
>> +	dax_dev->holder_data = holder;
>> +	dax_dev->holder_ops = ops;
>> +	dax_write_unlock();
> 
> I guess this means that the holder has to detach itself before anyone
> calls kill_dax, or else a dead dax device ends up with a dangling
> reference to the holder?

Yes.

> 
>> +}
>> +EXPORT_SYMBOL_GPL(dax_set_holder);
>> +
>> +void *dax_get_holder(struct dax_device *dax_dev)
>> +{
>> +	void *holder;
>> +
>> +	dax_read_lock();
>> +	if (!dax_alive(dax_dev)) {
>> +		dax_read_unlock();
>> +		return NULL;
>> +	}
>> +
>> +	holder = dax_dev->holder_data;
>> +	dax_read_unlock();
>> +	return holder;
>> +}
>> +EXPORT_SYMBOL_GPL(dax_get_holder);
>> +
>>   /**
>>    * inode_dax: convert a public inode into its dax_dev
>>    * @inode: An inode with i_cdev pointing to a dax_dev
>> diff --git a/include/linux/dax.h b/include/linux/dax.h
>> index 097b3304f9b9..d273d59723cd 100644
>> --- a/include/linux/dax.h
>> +++ b/include/linux/dax.h
>> @@ -38,9 +38,24 @@ struct dax_operations {
>>   	int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
>>   };
>>   
>> +struct dax_holder_operations {
>> +	/*
>> +	 * notify_failure - notify memory failure into inner holder device
>> +	 * @dax_dev: the dax device which contains the holder
>> +	 * @offset: offset on this dax device where memory failure occurs
>> +	 * @size: length of this memory failure event
>> +	 * @flags: action flags for memory failure handler
>> +	 */
>> +	int (*notify_failure)(struct dax_device *dax_dev, loff_t offset,
>> +			size_t size, int flags);
> 
> Shouldn't size be u64 or something?  Let's say that 8GB of your pmem go
> bad, wouldn't you want a single call?  Though I guess the current
> implementation only goes a single page at a time, doesn't it?

Right.

> 
>> +};
>> +
>>   extern struct attribute_group dax_attribute_group;
>>   
>>   #if IS_ENABLED(CONFIG_DAX)
>> +void dax_set_holder(struct dax_device *dax_dev, void *holder,
>> +		const struct dax_holder_operations *ops);
>> +void *dax_get_holder(struct dax_device *dax_dev);
>>   struct dax_device *alloc_dax(void *private, const char *host,
>>   		const struct dax_operations *ops, unsigned long flags);
>>   void put_dax(struct dax_device *dax_dev);
>> @@ -70,6 +85,18 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
>>   	return dax_synchronous(dax_dev);
>>   }
>>   #else
>> +static inline struct dax_device *dax_get_by_host(const char *host)
> 
> Not sure why this is being added here?  AFAICT none of the patches call
> this function...?

It's mistake when I rebase my code to the latest.  These lines were 
deleted but I didn't notice.  Will fix it.


--
Thanks,
Ruan.

> 
> --D
> 
>> +{
>> +	return NULL;
>> +}
>> +static inline void dax_set_holder(struct dax_device *dax_dev, void *holder,
>> +		const struct dax_holder_operations *ops)
>> +{
>> +}
>> +static inline void *dax_get_holder(struct dax_device *dax_dev)
>> +{
>> +	return NULL;
>> +}
>>   static inline struct dax_device *alloc_dax(void *private, const char *host,
>>   		const struct dax_operations *ops, unsigned long flags)
>>   {
>> @@ -198,6 +225,8 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
>>   		size_t bytes, struct iov_iter *i);
>>   int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
>>   			size_t nr_pages);
>> +int dax_holder_notify_failure(struct dax_device *dax_dev, loff_t offset,
>> +		size_t size, int flags);
>>   void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
>>   
>>   ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>> -- 
>> 2.33.0
>>
>>
>>



^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-10-20  6:58 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-24 13:09 [PATCH v7 0/8] [PATCH v7 0/8] fsdax: introduce fs query to support reflink Shiyang Ruan
2021-09-24 13:09 ` [PATCH v7 1/8] dax: Use rwsem for dax_{read,write}_lock() Shiyang Ruan
2021-10-14 17:48   ` Darrick J. Wong
2021-10-20  5:19     ` Shiyang Ruan
2021-10-15  6:30   ` Christoph Hellwig
2021-09-24 13:09 ` [PATCH v7 2/8] dax: Introduce holder for dax_device Shiyang Ruan
2021-10-14 18:00   ` Darrick J. Wong
2021-10-20  6:58     ` Shiyang Ruan
2021-09-24 13:09 ` [PATCH v7 3/8] mm: factor helpers for memory_failure_dev_pagemap Shiyang Ruan
2021-10-14 18:02   ` Darrick J. Wong
2021-10-15  6:33   ` Christoph Hellwig
2021-09-24 13:09 ` [PATCH v7 4/8] pagemap,pmem: Introduce ->memory_failure() Shiyang Ruan
2021-10-14 18:05   ` Darrick J. Wong
2021-10-20  5:25     ` Shiyang Ruan
2021-10-15  6:36   ` Christoph Hellwig
2021-09-24 13:09 ` [PATCH v7 5/8] fsdax: Introduce dax_lock_mapping_entry() Shiyang Ruan
2021-10-14 18:17   ` Darrick J. Wong
2021-09-24 13:09 ` [PATCH v7 6/8] mm: Introduce mf_dax_kill_procs() for fsdax case Shiyang Ruan
2021-10-14 19:32   ` Darrick J. Wong
2021-10-20  5:47     ` Shiyang Ruan
2021-09-24 13:09 ` [PATCH v7 7/8] xfs: Implement ->notify_failure() for XFS Shiyang Ruan
2021-10-14 19:21   ` Darrick J. Wong
2021-10-15  6:41   ` Christoph Hellwig
2021-09-24 13:09 ` [PATCH v7 8/8] fsdax: add exception for reflinked files Shiyang Ruan
2021-10-14 19:24   ` Darrick J. Wong
2021-10-15  6:38     ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).