linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: djwong@kernel.org, jane.chu@oracle.com
Cc: linux-xfs@vger.kernel.org, hch@infradead.org,
	dan.j.williams@intel.com, linux-fsdevel@vger.kernel.org
Subject: [PATCH 1/5] dax: prepare pmem for use by zero-initializing contents and clearing poisons
Date: Fri, 17 Sep 2021 18:30:50 -0700	[thread overview]
Message-ID: <163192865031.417973.8372869475521627214.stgit@magnolia> (raw)
In-Reply-To: <163192864476.417973.143014658064006895.stgit@magnolia>

From: Darrick J. Wong <djwong@kernel.org>

Our current "advice" to people using persistent memory and FSDAX who
wish to recover upon receipt of a media error (aka 'hwpoison') event
from ACPI is to punch-hole that part of the file and then pwrite it,
which will magically cause the pmem to be reinitialized and the poison
to be cleared.

Punching doesn't make any sense at all -- the (re)allocation on pwrite
does not permit the caller to specify where to find blocks, which means
that we might not get the same pmem back.  This pushes the user farther
away from the goal of reinitializing poisoned memory and leads to
complaints about unnecessary file fragmentation.

AFAICT, the only reason why the "punch and write" dance works at all is
that the XFS and ext4 currently call blkdev_issue_zeroout when
allocating pmem ahead of a write call.  Even a regular overwrite won't
clear the poison, because dax_direct_access is smart enough to bail out
on poisoned pmem, but not smart enough to clear it.  To be fair, that
function maps pages and has no idea what kinds of reads and writes the
caller might want to perform.

Therefore, create a dax_zeroinit_range function that filesystems can to
reset the pmem contents to zero and clear hardware media error flags.
This uses the dax page zeroing helper function, which should ensure that
subsequent accesses will not trip over any pre-existing media errors.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/dax.c            |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h |    7 ++++
 2 files changed, 100 insertions(+)


diff --git a/fs/dax.c b/fs/dax.c
index 4e3e5a283a91..765b80d08605 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1714,3 +1714,96 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
 	return dax_insert_pfn_mkwrite(vmf, pfn, order);
 }
 EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
+
+static loff_t
+dax_zeroinit_iter(struct iomap_iter *iter)
+{
+	struct iomap *iomap = &iter->iomap;
+	const struct iomap *srcmap = iomap_iter_srcmap(iter);
+	const u64 start = iomap->addr + iter->pos - iomap->offset;
+	const u64 nr_bytes = iomap_length(iter);
+	u64 start_page = start >> PAGE_SHIFT;
+	u64 nr_pages = nr_bytes >> PAGE_SHIFT;
+	int ret;
+
+	if (!iomap->dax_dev)
+		return -ECANCELED;
+
+	/*
+	 * The physical extent must be page aligned because that's what the dax
+	 * function requires.
+	 */
+	if (!PAGE_ALIGNED(start | nr_bytes))
+		return -ECANCELED;
+
+	/*
+	 * The dax function, by using pgoff_t, is stuck with unsigned long, so
+	 * we must check for overflows.
+	 */
+	if (start_page >= ULONG_MAX || start_page + nr_pages > ULONG_MAX)
+		return -ECANCELED;
+
+	/* Must be able to zero storage directly without fs intervention. */
+	if (iomap->flags & IOMAP_F_SHARED)
+		return -ECANCELED;
+	if (srcmap != iomap)
+		return -ECANCELED;
+
+	switch (iomap->type) {
+	case IOMAP_MAPPED:
+		while (nr_pages > 0) {
+			/* XXX function only supports one page at a time?! */
+			ret = dax_zero_page_range(iomap->dax_dev, start_page,
+					1);
+			if (ret)
+				return ret;
+			start_page++;
+			nr_pages--;
+		}
+
+		fallthrough;
+	case IOMAP_UNWRITTEN:
+		return nr_bytes;
+	}
+
+	/* Reject holes, inline data, or delalloc extents. */
+	return -ECANCELED;
+}
+
+/*
+ * Initialize storage mapped to a DAX-mode file to a known value and ensure the
+ * media are ready to accept read and write commands.  This requires the use of
+ * the dax layer's zero page range function to write zeroes to a pmem region
+ * and to reset any hardware media error state.
+ *
+ * The physical extents must be aligned to page size.  The file must be backed
+ * by a pmem device.  The extents returned must not require copy on write (or
+ * any other mapping interventions from the filesystem) and must be contiguous.
+ * @done will be set to true if the reset succeeded.
+ *
+ * Returns 0 if the zero initialization succeeded, -ECANCELED if the storage
+ * mappings do not support zero initialization, -EOPNOTSUPP if the device does
+ * not support it, or the usual negative errno.
+ */
+int
+dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
+		   const struct iomap_ops *ops)
+{
+	struct iomap_iter iter = {
+		.inode		= inode,
+		.pos		= pos,
+		.len		= len,
+		.flags		= IOMAP_REPORT,
+	};
+	int ret;
+
+	if (!IS_DAX(inode))
+		return -EINVAL;
+	if (pos + len > i_size_read(inode))
+		return -EINVAL;
+
+	while ((ret = iomap_iter(&iter, ops)) > 0)
+		iter.processed = dax_zeroinit_iter(&iter);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(dax_zeroinit_range);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 2619d94c308d..3c873f7c35ba 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -129,6 +129,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping);
 struct page *dax_layout_busy_page_range(struct address_space *mapping, loff_t start, loff_t end);
 dax_entry_t dax_lock_page(struct page *page);
 void dax_unlock_page(struct page *page, dax_entry_t cookie);
+int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
+			const struct iomap_ops *ops);
 #else
 #define generic_fsdax_supported		NULL
 
@@ -174,6 +176,11 @@ static inline dax_entry_t dax_lock_page(struct page *page)
 static inline void dax_unlock_page(struct page *page, dax_entry_t cookie)
 {
 }
+static inline int dax_zeroinit_range(struct inode *inode, loff_t pos, u64 len,
+		const struct iomap_ops *ops)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
 #if IS_ENABLED(CONFIG_DAX)


  reply	other threads:[~2021-09-18  1:30 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-18  1:30 [PATCHSET RFC v2 jane 0/5] vfs: enable userspace to reset damaged file storage Darrick J. Wong
2021-09-18  1:30 ` Darrick J. Wong [this message]
2021-09-18 16:54   ` [PATCH 1/5] dax: prepare pmem for use by zero-initializing contents and clearing poisons riteshh
2021-09-20 17:22     ` Darrick J. Wong
2021-09-21  4:07       ` riteshh
2021-09-22 18:26         ` Darrick J. Wong
2021-09-22 19:47           ` riteshh
2021-09-22 20:26           ` Dan Williams
2021-09-21  8:34   ` Christoph Hellwig
2021-09-22 18:10     ` Darrick J. Wong
2021-09-18  1:30 ` [PATCH 2/5] iomap: use accelerated zeroing on a block device to zero a file range Darrick J. Wong
2021-09-18 16:55   ` riteshh
2021-09-21  8:29   ` Christoph Hellwig
2021-09-22 18:53     ` Darrick J. Wong
2021-09-21 22:33   ` Dave Chinner
2021-09-22 18:54     ` Darrick J. Wong
2021-09-18  1:31 ` [PATCH 3/5] vfs: add a zero-initialization mode to fallocate Darrick J. Wong
2021-09-18 16:58   ` riteshh
2021-09-20 17:52   ` Eric Biggers
2021-09-20 18:06     ` Darrick J. Wong
2021-09-21  0:44   ` Dave Chinner
2021-09-21  8:31     ` Christoph Hellwig
2021-09-22  2:16       ` Dan Williams
2021-09-22  2:38         ` Darrick J. Wong
2021-09-22  3:59           ` Dave Chinner
2021-09-22  4:13             ` Darrick J. Wong
2021-09-22  5:49               ` Dave Chinner
2021-09-22 21:27                 ` Darrick J. Wong
2021-09-23  0:02                   ` Darrick J. Wong
2021-09-23  0:44                     ` Darrick J. Wong
2021-09-23  1:42                     ` Dave Chinner
2021-09-23  2:43                       ` Dan Williams
2021-09-23  5:42                         ` Dan Williams
2021-09-23 22:54                           ` Dave Chinner
2021-09-24  1:18                             ` Dan Williams
2021-09-24  1:21                               ` Jane Chu
2021-09-24  1:35                                 ` Darrick J. Wong
2021-09-27 21:07                                   ` Dave Chinner
2021-09-27 21:57                                     ` Jane Chu
2021-09-28  0:08                                       ` Dan Williams
2021-09-22  5:28     ` riteshh
2021-09-18  1:31 ` [PATCH 4/5] xfs: implement FALLOC_FL_ZEROINIT_RANGE Darrick J. Wong
2021-09-18  1:31 ` [PATCH 5/5] ext4: " Darrick J. Wong
2021-09-18 17:07   ` riteshh
2021-09-20 18:11     ` Darrick J. Wong
2021-09-21  6:10       ` riteshh
2021-09-18 18:05 ` [PATCHSET RFC v2 jane 0/5] vfs: enable userspace to reset damaged file storage Dan Williams
2021-09-23  0:51 ` Darrick J. Wong
2021-09-23  1:17   ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=163192865031.417973.8372869475521627214.stgit@magnolia \
    --to=djwong@kernel.org \
    --cc=dan.j.williams@intel.com \
    --cc=hch@infradead.org \
    --cc=jane.chu@oracle.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).