From: Dan Williams <dan.j.williams@intel.com>
To: Christoph Hellwig <hch@infradead.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
	Jane Chu <jane.chu@oracle.com>,
	 "david@fromorbit.com" <david@fromorbit.com>,
	"vishal.l.verma@intel.com" <vishal.l.verma@intel.com>,
	 "dave.jiang@intel.com" <dave.jiang@intel.com>,
	"agk@redhat.com" <agk@redhat.com>,
	 "snitzer@redhat.com" <snitzer@redhat.com>,
	"dm-devel@redhat.com" <dm-devel@redhat.com>,
	 "ira.weiny@intel.com" <ira.weiny@intel.com>,
	"willy@infradead.org" <willy@infradead.org>,
	 "vgoyal@redhat.com" <vgoyal@redhat.com>,
	 "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	 "nvdimm@lists.linux.dev" <nvdimm@lists.linux.dev>,
	 "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	 "linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>
Subject: Re: [dm-devel] [PATCH 0/6] dax poison recovery with RWF_RECOVERY_DATA flag
Date: Tue, 2 Nov 2021 12:57:10 -0700
Message-ID: <CAPcyv4j8snuGpy=z6BAXogQkP5HmTbqzd6e22qyERoNBvFKROw@mail.gmail.com>
In-Reply-To: <YYDYUCCiEPXhZEw0@infradead.org>

On Mon, Nov 1, 2021 at 11:19 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote:
> > ...so would you happen to know if anyone's working on solving this
> > problem for us by putting the memory controller in charge of dealing
> > with media errors?
>
> The only one who could know is Intel..
>
> > The trouble is, we really /do/ want to be able to (re)write the failed
> > area, and we probably want to try to read whatever we can.  Those are
> > reads and writes, not {pre,f}allocation activities.  This is where Dave
> > and I arrived at a month ago.
> >
> > Unless you'd be ok with a second IO path for recovery where we're
> > allowed to be slow?  That would probably have the same user interface
> > flag, just a different path into the pmem driver.
>
> Which is fine with me.  If you look at the API here we do have the
> RWF_ API, which then maps to the IOMAP API, which maps to the DAX_
> API which then gets special casing over three methods.
>
> And while Pavel pointed out that he and Jens are now optimizing for
> single branches like this, I think this actually is silly and it is
> not my point.
>
> The point is that the DAX in-kernel API is a mess, and before we make
> it even worse we need to sort it first.  What is directly relevant
> here is that the copy_from_iter and copy_to_iter APIs do not make
> sense.  Most of the DAX API is based around getting a memory mapping
> using ->direct_access, it is just the read/write path which is a slow
> path that actually uses this.  I have a very WIP patch series to try
> to sort this out here:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize
>
> But back to this series.  The basic DAX model is that the caller gets a
> memory mapping and just works on that, maybe calling a sync after a write
> in a few cases.  So any kind of recovery really needs to be able to
> work with that model as going forward the copy_to/from_iter path will
> be used less and less.  i.e. file systems can and should use
> direct_access directly instead of using the block layer implementation
> in the pmem driver.  As an example the dm-writecache driver, the pending
> bcache nvdimm support and the (horrible and out of tree) nova file system
> won't even use this path.  We need to find a way to support recovery
> for them.  And overloading it over the read/write path which is not
> the main path for DAX, but the absolutely fast path for 99% of the
> kernel users is a horrible idea.
>
> So how can we work around the horrible nvdimm design for data recovery
> in a way that:
>
>    a) actually works with the intended direct memory map use case
>    b) doesn't really affect the normal kernel too much
>
> ?

Ok, now I see where you are going, but I don't see a line of sight to
something better than RWF_RECOVERY_DATA.

This goes back to one of the original DAX concerns of wanting a kernel
library for coordinating PMEM mmap I/O vs leaving userspace to wrap
PMEM semantics on top of a DAX mapping. The problem is that mmap-I/O
has this error-handling-API issue whether it is a DAX mapping or not.
I.e. a memory failure in page cache is going to signal the process the
same way and it will need to fall back to something other than mmap
I/O to make forward progress. This is not a PMEM, Intel or even x86
problem, it's a generic CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE problem.

CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE implies that processes will
receive SIGBUS + BUS_MCEERR_A{R,O} when memory failure is signalled
and then rely on readv(2)/writev(2) to recover. Do you see a readily
available way to improve upon that model without CPU instruction
changes? Even with CPU instruction changes, do you think it could
improve much upon the model of interrupting the process when a load
instruction aborts?
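
For readers less familiar with that model, here is a minimal userspace
sketch of the signal side of it. It is an illustration only, not code
from this series: struct poison_report, recover_point, recover_armed
and the helper names are hypothetical, while SIGBUS, BUS_MCEERR_AR,
BUS_MCEERR_AO, si_addr and si_addr_lsb are the existing Linux
interfaces. The handler records the poisoned range and, for the
synchronous "action required" case, jumps back to a recovery point
instead of returning into the faulting load.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record of the last memory-failure SIGBUS; not a kernel API. */
struct poison_report {
	int    code;	/* BUS_MCEERR_AR or BUS_MCEERR_AO */
	void  *addr;	/* start of the affected range (si_addr) */
	size_t len;	/* size of the affected range, 1 << si_addr_lsb */
};

static struct poison_report last_report;
static volatile sig_atomic_t report_valid;
static volatile sig_atomic_t recover_armed;	/* set after sigsetjmp() */
static sigjmp_buf recover_point;

/* True when si_code marks a memory failure rather than another bus error. */
static int is_memory_failure(int si_code)
{
	return si_code == BUS_MCEERR_AR || si_code == BUS_MCEERR_AO;
}

/* si_addr_lsb is the least significant valid bit of si_addr. */
static size_t poison_span(unsigned int addr_lsb)
{
	return (size_t)1 << addr_lsb;
}

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	(void)sig;
	(void)ctx;

	if (!is_memory_failure(info->si_code)) {
		/* Not ours: restore the default so the re-fault terminates. */
		signal(SIGBUS, SIG_DFL);
		return;
	}

	last_report.code = info->si_code;
	last_report.addr = info->si_addr;
	last_report.len = poison_span((unsigned int)info->si_addr_lsb);
	report_valid = 1;

	if (info->si_code == BUS_MCEERR_AR) {
		/*
		 * Synchronous case: returning would re-execute the load
		 * that consumed poison, so leave via the recovery point.
		 */
		if (recover_armed)
			siglongjmp(recover_point, 1);
		signal(SIGBUS, SIG_DFL);
	}
	/* BUS_MCEERR_AO is asynchronous, so it is safe to just return. */
}

/* Install the handler with SA_SIGINFO so si_code and si_addr are filled in. */
static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

A caller would do sigsetjmp(recover_point, 1), set recover_armed, and
then touch the mapping. If sigsetjmp() returns nonzero, it stops using
the mapping for the range in last_report and falls back to the
readv(2)/writev(2) family for that part of the file, which is the slow
path being discussed here.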

I do agree with you that DAX needs to separate itself from block, but
I don't think it follows that DAX also needs to separate itself from
readv/writev for when a kernel slow-path needs to get involved because
mmap I/O (just CPU instructions) does not have the proper semantics.
Even if you got one of the ARCH_SUPPORTS_MEMORY_FAILURE architectures to
implement those semantics in new / augmented CPU instructions you will
likely not get all of them to move and certainly not in any near term
timeframe, so the kernel path will be around indefinitely.

Meanwhile, I think RWF_RECOVERY_DATA is generically useful for other
storage besides PMEM and helps storage-drivers do better than large
blast radius "I/O error" completions with no other recourse.
