From: "Darrick J. Wong" <>
Subject: Re: [PATCHSET 0/2] dax: fix broken pmem poison narrative
Date: Mon, 16 Aug 2021 14:14:34 -0700
Message-ID: <20210816211434.GB12640@magnolia>
In-Reply-To: <162914791879.197065.12619905059952917229.stgit@magnolia>

On Mon, Aug 16, 2021 at 02:05:18PM -0700, Darrick J. Wong wrote:
> Hi all,
>
> Our current "advice" to people using persistent memory and FSDAX who
> wish to recover upon receipt of a media error (aka 'hwpoison') event
> from ACPI is to punch-hole that part of the file and then pwrite it,
> which will magically cause the pmem to be reinitialized and the poison
> to be cleared.
>
> Punching doesn't make any sense at all -- we don't allow userspace to
> allocate from specific parts of the storage, and another writer could
> grab the poisoned range in the meantime.  In other words, the advice is
> seriously overfitted to incidental xfs and ext4 behavior and can
> completely fail.  Worse yet, that concurrent writer now has to deal with
> poison that it didn't know about and that someone else is trying to fix.
>
> AFAICT, the only reason why the "punch and write" dance works at all is
> that XFS and ext4 currently call blkdev_issue_zeroout when
> allocating pmem as part of a pwrite call.  A pwrite without the punch
> won't clear the poison, because pwrite on a DAX file calls
> dax_direct_access to access the memory directly, and dax_direct_access
> is only smart enough to bail out on poisoned pmem.  It does not know how
> to clear it.  Userspace could solve the problem by calling FIEMAP and
> issuing a BLKZEROOUT, but that requires rawio capabilities.
>
> The whole pmem poison recovery story is wrong and needs to be
> corrected ASAP before everyone else starts doing this.  Therefore,
> create a dax_zeroinit_range function that filesystems can call to reset
> the contents of the pmem to a known value and clear any state associated
> with the media error.  Then, connect FALLOC_FL_ZERO_RANGE to this new
> function (for DAX files) so that unprivileged userspace has a safe way
> to reset the pmem and clear media errors.
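
For reference, the privileged FIEMAP + BLKZEROOUT route mentioned above
might look roughly like the sketch below.  This is an illustration only,
not code from the patchset: extent_to_zeroout_range() and
zeroout_file_range() are made-up names, and the sketch assumes that the
filesystem sits directly on the block device opened as bdev_fd, that the
range lies inside a single extent, and that it is aligned to the
device's logical block size.

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FS_IOC_FIEMAP, BLKZEROOUT */
#include <linux/fiemap.h>	/* struct fiemap, struct fiemap_extent */

/*
 * Hypothetical helper: translate the file range [off, off + len) into a
 * { start, length } pair in device bytes using one FIEMAP extent.
 * Returns 0 on success, -1 if the extent does not cover the whole range
 * or has no stable physical location.
 */
int extent_to_zeroout_range(const struct fiemap_extent *fe,
			    uint64_t off, uint64_t len, uint64_t range[2])
{
	uint64_t delta;

	if (fe->fe_flags & (FIEMAP_EXTENT_UNKNOWN | FIEMAP_EXTENT_DELALLOC))
		return -1;
	if (len == 0 || off < fe->fe_logical)
		return -1;

	delta = off - fe->fe_logical;
	if (delta > fe->fe_length || len > fe->fe_length - delta)
		return -1;

	range[0] = fe->fe_physical + delta;
	range[1] = len;
	return 0;
}

/*
 * Hypothetical wrapper: look up the extent backing the file range and
 * zero it on the block device.  The BLKZEROOUT ioctl on bdev_fd is the
 * step that needs raw I/O privileges.
 */
int zeroout_file_range(int fd, int bdev_fd, uint64_t off, uint64_t len)
{
	struct fiemap *fm;
	uint64_t range[2];
	int ret = -1;

	/* Room for the header plus exactly one extent record. */
	fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
	if (!fm)
		return -1;

	fm->fm_start = off;
	fm->fm_length = len;
	fm->fm_flags = FIEMAP_FLAG_SYNC;
	fm->fm_extent_count = 1;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 &&
	    fm->fm_mapped_extents == 1 &&
	    extent_to_zeroout_range(&fm->fm_extents[0], off, len, range) == 0)
		ret = ioctl(bdev_fd, BLKZEROOUT, range);

	free(fm);
	return ret;
}
```

Nothing in this sketch stops the filesystem from remapping the range
between the FIEMAP call and the BLKZEROOUT call, which is one more
reason to prefer a ZERO_RANGE call that the filesystem handles itself.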

This is a sample copy of a SIGBUS handler that will dump out the siginfo
data, call ZERO_RANGE to clear the poison, and then simulate being
fortunate enough to be able to reconstruct the file contents.

Note that I haven't tested this even with simulated pmem because I
cannot figure out how to inject a poison error into the pmem in such a
way that the nvdimm driver records it in the badblocks table.
madvise(HWPOISON) calls the SIGBUS handler, but that code path never
goes outside of the memory manager.

int fd = open(...);
char *data = mmap(fd, ... MAP_SYNC);

static void handle_sigbus(int signal, siginfo_t *info, void *dontcare)
{
	char *buf;
	loff_t err_offset = (char *)info->si_addr - data;
	loff_t err_len = (1ULL << info->si_addr_lsb);
	ssize_t ret;

	printf("    signal %d\n", info->si_signo);
	printf("    errno %d\n", info->si_errno);
	printf("    addr %p\n", info->si_addr);
	printf("    addr_lsb %d\n", info->si_addr_lsb);

	if (info->si_signo != SIGBUS) {
		printf("    code 0x%x\n", info->si_code);
		return;
	}

	switch (info->si_code) {
	case BUS_ADRALN:
		printf("    code: BUS_ADRALN\n");
		break;
	case BUS_ADRERR:
		printf("    code: BUS_ADRERR\n");
		break;
	case BUS_OBJERR:
		printf("    code: BUS_OBJERR\n");
		break;
	case BUS_MCEERR_AR:
		printf("    code: BUS_MCEERR_AR\n");
		break;
	case BUS_MCEERR_AO:
		printf("    code: BUS_MCEERR_AO\n");
		break;
	default:
		printf("    code 0x%x\n", info->si_code);
		break;
	}

	printf("    err_offset %lld\n", (long long)err_offset);
	printf("    err_len %lld\n", (long long)err_len);

	/* only "action required" memory errors are handled here */
	if (info->si_code != BUS_MCEERR_AR)
		return;

	/* clear poison and reset pmem to initial value */
	ret = fallocate(fd, FALLOC_FL_ZERO_RANGE, err_offset, err_len);
	if (ret) {
		perror("fallocate");
		exit(1);
	}

	/* simulate being lucky enough to be able to reconstruct the data */
	buf = malloc(err_len);
	if (!buf) {
		perror("malloc pwrite buf");
		exit(1);
	}

	memset(buf, 0x59, err_len);

	ret = pwrite(fd, buf, err_len, err_offset);
	if (ret < 0) {
		perror("pwrite");
		exit(1);
	}
	if (ret != err_len) {
		fprintf(stderr, "short write %zd bytes, wanted %lld\n",
				ret, (long long)err_len);
		exit(1);
	}

	free(buf);
}

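The offset and length arithmetic at the top of the handler assumes that
the mapping starts at file offset 0 and that si_addr already sits on a
(1 << si_addr_lsb) boundary.  A slightly more defensive version of that
arithmetic is sketched below.  poison_to_file_range() is a hypothetical
helper, and rounding the address down to the granule is my assumption
about how si_addr_lsb should be interpreted, not something I have
verified against real poison events.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: turn a faulting address and its si_addr_lsb into
 * a file byte range.  map_base/map_len describe the mapping and map_off
 * is the file offset that map_base corresponds to.  The result is
 * clamped to the mapping.  Returns 0 on success, -1 if the address is
 * outside the mapping or lsb is out of range.
 */
int poison_to_file_range(const void *map_base, size_t map_len,
			 uint64_t map_off, const void *addr,
			 unsigned int lsb, uint64_t *off, uint64_t *len)
{
	uint64_t base = (uintptr_t)map_base;
	uint64_t a = (uintptr_t)addr;
	uint64_t gran, start, end;

	if (lsb >= 64 || a < base || a - base >= map_len)
		return -1;

	/* Round the address down to the start of the poisoned granule. */
	gran = 1ULL << lsb;
	start = a & ~(gran - 1);
	end = (gran > UINT64_MAX - start) ? UINT64_MAX : start + gran;

	/* Never report bytes that lie outside the mapping. */
	if (start < base)
		start = base;
	if (end > base + map_len)
		end = base + map_len;

	*off = map_off + (start - base);
	*len = end - start;
	return 0;
}
```

The resulting off and len could then be passed to the fallocate and
pwrite calls in place of err_offset and err_len.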


> If you're going to start using this mess, you probably ought to just
> pull from my git trees, which are linked below.
> This is an extraordinary way to destroy everything.  Enjoy!
> Comments and questions are, as always, welcome.
> --D
> kernel git tree:
> ---
>  fs/dax.c            |   72 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/ext4/extents.c   |   19 +++++++++++++
>  fs/xfs/xfs_file.c   |   20 ++++++++++++++
>  include/linux/dax.h |    7 +++++
>  4 files changed, 118 insertions(+)
