From: Dan Williams <dan.j.williams@intel.com>
To: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-nvdimm <linux-nvdimm@lists.01.org>,
	Christoph Hellwig <hch@infradead.org>,
	device-mapper development <dm-devel@redhat.com>
Subject: Re: [PATCH v5 2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
Date: Tue, 25 Feb 2020 13:52:38 -0800
Message-ID: <CAPcyv4jBCCYXQRjOLHa79rW3qB9kisGRN6yTvcXYs6JJU1-p7g@mail.gmail.com>
In-Reply-To: <x494kveh0gm.fsf@segfault.boston.devel.redhat.com>

On Tue, Feb 25, 2020 at 12:33 PM Jeff Moyer <jmoyer@redhat.com> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer <jmoyer@redhat.com> wrote:
> >>
> >> Dan Williams <dan.j.williams@intel.com> writes:
> >>
> >> >> Let's just focus on reporting errors when we know we have them.
> >> >
> >> > That's the problem in my eyes. If software needs to contend with
> >> > latent error reporting then it should always contend otherwise
> >> > software has multiple error models to wrangle.
> >>
> >> The only way for an application to know that the data has been written
> >> successfully would be to issue a read after every write. That's not a
> >> performance hit most applications are willing to take. And, of course,
> >> the media can still go bad at a later time, so it only guarantees the
> >> data is accessible immediately after having been written.
> >>
> >> What I'm suggesting is that we should not complete a write successfully
> >> if we know that the data will not be retrievable. I wouldn't call this
> >> adding an extra error model to contend with. Applications should
> >> already be checking for errors on write.
> >>
> >> Does that make sense? Are we talking past each other?
> >
> > The badblock list is late to update in both directions, late to add
> > entries that the scrub needs to find and late to delete entries that
> > were inadvertently cleared by cache-line writes that did not first
> > ingest the poison for a read-modify-write.
>
> We aren't looking for perfection. If the badblocks list is populated,
> then report the error instead of letting the user write data that we
> know they won't be able to access later.
>
> You have confused me, though, since I thought that stores to bad media
> would not clear errors. Perhaps you are talking about some future
> hardware implementation that doesn't yet exist?

No, today cacheline aligned DMA, and cpu cachelines that are fully
dirtied without a demand fill can end up clearing poison. The
movdir64b instruction provides a way to force this behavior from the
cpu where it was previously luck.

> > So I see the above as being wishful in using the error list as the
> > hard source of truth and unfortunate to up-level all sub-sector error
> > entries into full PAGE_SIZE data offline events.
>
> The page size granularity is only an issue for mmap(). If you are using
> read/write, then 512 byte granularity can be achieved. Even with mmap,
> if you encounter an error on a 4k page, you can query the status of each
> sector in that page to isolate the error. So I'm not quite sure I
> understand what you're getting at.

I'm getting at the fact that we don't allow mmap on PAGE_SIZE
granularity in the presence of errors, and don't allow dax I/O to the
page when an error is present. Only non-dax direct-I/O can read /
write at sub-page granularity in the presence of errors.
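
[ For illustration, a rough sketch of the badblocks consultation being
  referred to here: how a pmem-style direct_access path could refuse a
  range that overlaps the driver's badblocks list, so dax I/O fails
  instead of touching poisoned media. The structure and function names
  below are simplified assumptions, not the exact upstream pmem code. ]

#include <linux/types.h>
#include <linux/errno.h>
#include <linux/badblocks.h>

/* Hypothetical driver-private state; the real driver tracks more. */
struct example_pmem_device {
	struct badblocks bb;	/* media error list maintained by the driver */
	void *virt_addr;	/* kernel mapping of the persistent memory */
};

/*
 * Refuse to grant direct access to a range that overlaps a known bad
 * extent. badblocks_check() returns non-zero when the requested range
 * intersects an entry in the list.
 */
static long example_direct_access(struct example_pmem_device *pmem,
				  sector_t sector, unsigned int nr_sectors,
				  void **kaddr)
{
	sector_t first_bad;
	int num_bad;

	if (badblocks_check(&pmem->bb, sector, nr_sectors,
			    &first_bad, &num_bad))
		return -EIO;	/* caller sees the dax I/O fail, not poison */

	*kaddr = pmem->virt_addr + (sector << 9);	/* 512-byte sectors */
	return nr_sectors;
}
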
The interface to query the status is not coordinated with the
filesystem, but that's a whole other discussion. Yes, we're getting a
bit off in the weeds.

> > I'm hoping we can find a way to make the error handling more fine
> > grained over time, but for the current patch, managing the blast
> > radius as PAGE_SIZE granularity at least matches the zero path with
> > the write path.
>
> I think the write path can do 512 byte granularity, not page size.
> Anyway, I think we've gone far enough off into the weeds that more
> patches will have to be posted for debate. :)
>

It can't, see dax_iomap_actor(). That call to dax_direct_access() will
fail if it hits a known badblock.

> >> > Setting that aside we can start with just treating zeroing the same as
> >> > the copy_from_iter() case and fail the I/O at the dax_direct_access()
> >> > step.
> >>
> >> OK.
> >>
> >> > I'd rather have a separate op that filesystems can use to clear errors
> >> > at block allocation time that can be enforced to have the correct
> >> > alignment.
> >>
> >> So would file systems always call that routine instead of zeroing, or
> >> would they first check to see if there are badblocks?
> >
> > The proposal is that filesystems distinguish zeroing from free-block
> > allocation/initialization such that the fsdax implementation directs
> > initialization to a driver callback. This "initialization op" would
> > take care to check for poison and clear it. All other dax paths would
> > not consult the badblocks list.
>
> What do you mean by "all other dax paths?"  Would that include
> mmap/direct_access?  Because that definitely should still consult the
> badblocks list.

dax_direct_access() consults the badblock list,
dax_copy_{to,from}_iter() do not, and badblocks discovered after a
page is mapped do not invalidate the page unless the poison is
consumed.

> I'm not against having a separate operation for clearing errors, but I
> guess I'm not convinced it's cleaner, either.

The idea with a separate op is to have an explicit ABI to clear
errors like page-aligned-hole-punch (FALLOC_FL_PUNCH_ERROR?) vs some
implicit side effect of direct-I/O writes.

I also feel like we're heading in a direction that tries to avoid
relying on the cpu machine-check recovery path at all costs. That's
the kernel's prerogative, of course, but it seems like at a minimum
we're not quite on the same page about what role machine-check
recovery plays in the error handling model.
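
[ To make the "separate op" idea above concrete, a minimal sketch of one
  possible shape for a driver-level initialization callback that zeroes a
  page-aligned range and clears poison while doing so. The names and the
  signature are illustrative assumptions only, not necessarily the ABI
  this series settles on. ]

#include <linux/types.h>
#include <linux/errno.h>

struct example_dax_device;

/*
 * Hypothetical dax operation: zero nr_pages starting at pgoff and clear
 * any badblocks entries covered by the range. A filesystem would call it
 * at free-block allocation/initialization time instead of doing a plain
 * memset through the dax mapping, which would trip over poison.
 */
struct example_dax_operations {
	int (*zero_page_range)(struct example_dax_device *dax_dev,
			       pgoff_t pgoff, size_t nr_pages);
};

/* Filesystem-side caller at block initialization time. */
static int example_fs_init_extent(struct example_dax_device *dax_dev,
				  const struct example_dax_operations *ops,
				  pgoff_t pgoff, size_t nr_pages)
{
	if (!ops->zero_page_range)
		return -EOPNOTSUPP;
	return ops->zero_page_range(dax_dev, pgoff, nr_pages);
}
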