linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>,
	Vivek Goyal <vgoyal@redhat.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-nvdimm <linux-nvdimm@lists.01.org>,
	Christoph Hellwig <hch@infradead.org>,
	device-mapper development <dm-devel@redhat.com>
Subject: Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
Date: Tue, 25 Feb 2020 13:52:38 -0800	[thread overview]
Message-ID: <CAPcyv4jBCCYXQRjOLHa79rW3qB9kisGRN6yTvcXYs6JJU1-p7g@mail.gmail.com> (raw)
In-Reply-To: <x494kveh0gm.fsf@segfault.boston.devel.redhat.com>

On Tue, Feb 25, 2020 at 12:33 PM Jeff Moyer <jmoyer@redhat.com> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer <jmoyer@redhat.com> wrote:
> >>
> >> Dan Williams <dan.j.williams@intel.com> writes:
> >>
> >> >> Let's just focus on reporting errors when we know we have them.
> >> >
> >> > That's the problem in my eyes. If software needs to contend with
> >> > latent error reporting then it should always contend otherwise
> >> > software has multiple error models to wrangle.
> >>
> >> The only way for an application to know that the data has been written
> >> successfully would be to issue a read after every write.  That's not a
> >> performance hit most applications are willing to take.  And, of course,
> >> the media can still go bad at a later time, so it only guarantees the
> >> data is accessible immediately after having been written.
> >>
> >> What I'm suggesting is that we should not complete a write successfully
> >> if we know that the data will not be retrievable.  I wouldn't call this
> >> adding an extra error model to contend with.  Applications should
> >> already be checking for errors on write.
> >>
> >> Does that make sense? Are we talking past each other?
> >
> > The badblock list is late to update in both directions, late to add
> > entries that the scrub needs to find and late to delete entries that
> > were inadvertently cleared by cache-line writes that did not first
> > ingest the poison for a read-modify-write.
>
> We aren't looking for perfection.  If the badblocks list is populated,
> then report the error instead of letting the user write data that we
> know they won't be able to access later.
>
> You have confused me, though, since I thought that stores to bad media
> would not clear errors.  Perhaps you are talking about some future
> hardware implementation that doesn't yet exist?

No, today cacheline aligned DMA, and cpu cachelines that are fully
dirtied without a demand fill can end up clearing poison. The
movdir64b instruction provides a way to force this behavior from the
cpu where it was previously luck.

> > So I see the above as being wishful in using the error list as the
> > hard source of truth and unfortunate to up-level all sub-sector error
> > entries into full PAGE_SIZE data offline events.
>
> The page size granularity is only an issue for mmap().  If you are using
> read/write, then 512 byte granularity can be achieved.  Even with mmap,
> if you encounter an error on a 4k page, you can query the status of each
> sector in that page to isolate the error.  So I'm not quite sure I
> understand what you're getting at.

I'm getting at the fact that we don't allow mmap on PAGE_SIZE
granularity in the presence of errors, and don't allow dax I/O to the
page when an error is present. Only non-dax direct-I/O can read /
write at sub-page granularity in the presence of errors.

The interface to query the status is not coordinated with the
filesystem, but that's a whole other discussion. Yes, we're getting a
bit off in the weeds.

> > I'm hoping we can find a way to make the error handling more fine
> > grained over time, but for the current patch, managing the blast
> > radius as PAGE_SIZE granularity at least matches the zero path with
> > the write path.
>
> I think the write path can do 512 byte granularity, not page size.
> Anyway, I think we've gone far enough off into the weeds that more
> patches will have to be posted for debate.  :)
>

It can't, see dax_iomap_actor(). That call to dax_direct_access() will
fail if it hits a known badblock.

> >> > Setting that aside we can start with just treating zeroing the same as
> >> > the copy_from_iter() case and fail the I/O at the dax_direct_access()
> >> > step.
> >>
> >> OK.
> >>
> >> > I'd rather have a separate op that filesystems can use to clear errors
> >> > at block allocation time that can be enforced to have the correct
> >> > alignment.
> >>
> >> So would file systems always call that routine instead of zeroing, or
> >> would they first check to see if there are badblocks?
> >
> > The proposal is that filesystems distinguish zeroing from free-block
> > allocation/initialization such that the fsdax implementation directs
> > initialization to a driver callback. This "initialization op" would
> > take care to check for poison and clear it. All other dax paths would
> > not consult the badblocks list.
>
> What do you mean by "all other dax paths?"  Would that include
> mmap/direct_access?  Because that definitely should still consult the
> badblocks list.

dax_direct_access() consults the badblock list,
dax_copy_{to,from}_iter() do not, and badblocks discovered after a
page is mapped do not invalidate the page unless the poison is
consumed.

> I'm not against having a separate operation for clearing errors, but I
> guess I'm not convinced it's cleaner, either.

The idea with a separate op is to have an explicit ABI to clear errors
like page-aligned-hole-punch (FALLOC_FL_PUNCH_ERROR?) vs some implicit
side effect of direct-I/O writes.

I also feel like we're heading in a direction that tries to avoid
relying on the cpu machine-check recovery path at all costs. That's
the kernel's prerogative, of course, but it seems like at a minimum
we're not quite on the same page about what role machine-check
recovery plays in the error handling model.

  reply	other threads:[~2020-02-25 21:52 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-18 21:48 [PATCH v5 0/8] dax/pmem: Provide a dax operation to zero range of memory Vivek Goyal
2020-02-18 21:48 ` [PATCH v5 1/8] pmem: Add functions for reading/writing page to/from pmem Vivek Goyal
2020-02-18 21:48 ` [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len Vivek Goyal
2020-02-20 16:17   ` Christoph Hellwig
2020-02-20 21:35   ` Jeff Moyer
2020-02-20 21:57     ` Vivek Goyal
2020-02-21 18:32       ` Jeff Moyer
2020-02-21 20:17         ` Vivek Goyal
2020-02-21 21:00           ` Dan Williams
2020-02-21 21:24             ` Vivek Goyal
2020-02-21 21:30               ` Dan Williams
2020-02-21 21:33                 ` Jeff Moyer
2020-02-23 23:03           ` Dave Chinner
2020-02-24  0:40             ` Dan Williams
2020-02-24 13:50               ` Jeff Moyer
2020-02-24 20:48                 ` Dan Williams
2020-02-24 21:53                   ` Jeff Moyer
2020-02-25  0:26                     ` Dan Williams
2020-02-25 20:32                       ` Jeff Moyer
2020-02-25 21:52                         ` Dan Williams [this message]
2020-02-25 23:26                       ` Jane Chu
2020-02-24 15:38             ` Vivek Goyal
2020-02-27  3:02               ` Dave Chinner
2020-02-27  4:19                 ` Dan Williams
2020-02-28  1:30                   ` Dave Chinner
2020-02-28  3:28                     ` Dan Williams
2020-02-28 14:05                       ` Christoph Hellwig
2020-02-28 16:26                         ` Dan Williams
2020-02-24 20:13             ` Vivek Goyal
2020-02-24 20:52               ` Dan Williams
2020-02-24 21:15                 ` Vivek Goyal
2020-02-24 21:32                   ` Dan Williams
2020-02-25 13:36                     ` Vivek Goyal
2020-02-25 16:25                       ` Dan Williams
2020-02-25 20:08                         ` Vivek Goyal
2020-02-25 22:49                           ` Dan Williams
2020-02-26 13:51                             ` Vivek Goyal
2020-02-26 16:57                             ` Vivek Goyal
2020-02-27  3:11                               ` Dave Chinner
2020-02-27 15:25                                 ` Vivek Goyal
2020-02-28  1:50                                   ` Dave Chinner
2020-02-18 21:48 ` [PATCH v5 3/8] pmem: Enable pmem_do_write() to deal with arbitrary ranges Vivek Goyal
2020-02-20 16:17   ` Christoph Hellwig
2020-02-18 21:48 ` [PATCH v5 4/8] dax, pmem: Add a dax operation zero_page_range Vivek Goyal
2020-03-31 19:38   ` Dan Williams
2020-04-01 13:15     ` Vivek Goyal
2020-04-01 16:14     ` Vivek Goyal
2020-02-18 21:48 ` [PATCH v5 5/8] s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver Vivek Goyal
2020-02-18 21:48 ` [PATCH v5 6/8] dm,dax: Add dax zero_page_range operation Vivek Goyal
2020-02-18 21:48 ` [PATCH v5 7/8] dax,iomap: Start using dax native zero_page_range() Vivek Goyal
2020-02-18 21:48 ` [PATCH v5 8/8] dax,iomap: Add helper dax_iomap_zero() to zero a range Vivek Goyal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPcyv4jBCCYXQRjOLHa79rW3qB9kisGRN6yTvcXYs6JJU1-p7g@mail.gmail.com \
    --to=dan.j.williams@intel.com \
    --cc=david@fromorbit.com \
    --cc=dm-devel@redhat.com \
    --cc=hch@infradead.org \
    --cc=jmoyer@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=vgoyal@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).