From: Jane Chu <jane.chu@oracle.com>
To: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
	Christoph Hellwig <hch@infradead.org>,
	david <david@fromorbit.com>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Vishal L Verma <vishal.l.verma@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Alasdair Kergon <agk@redhat.com>,
	device-mapper development <dm-devel@redhat.com>,
	"Weiny, Ira" <ira.weiny@intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Vivek Goyal <vgoyal@redhat.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux NVDIMM <nvdimm@lists.linux.dev>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>
Subject: Re: [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery
Date: Fri, 12 Nov 2021 18:00:02 +0000
Message-ID: <5ca628b6-d5b6-f16a-480d-ea34dfc53aef@oracle.com>
In-Reply-To: <YY6J/mdSmrfK8moV@redhat.com>

On 11/12/2021 7:36 AM, Mike Snitzer wrote:
> On Wed, Nov 10 2021 at 1:26P -0500,
> Jane Chu <jane.chu@oracle.com> wrote:
>
>> On 11/9/2021 1:02 PM, Dan Williams wrote:
>>> On Tue, Nov 9, 2021 at 11:59 AM Jane Chu <jane.chu@oracle.com> wrote:
>>>>
>>>> On 11/9/2021 10:48 AM, Dan Williams wrote:
>>>>> On Mon, Nov 8, 2021 at 11:27 PM Christoph Hellwig <hch@infradead.org> wrote:
>>>>>>
>>>>>> On Fri, Nov 05, 2021 at 07:16:38PM -0600, Jane Chu wrote:
>>>>>>>    static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>>>>>>>                  void *addr, size_t bytes, struct iov_iter *i, int mode)
>>>>>>>    {
>>>>>>> +        phys_addr_t pmem_off;
>>>>>>> +        size_t len, lead_off;
>>>>>>> +        struct pmem_device *pmem = dax_get_private(dax_dev);
>>>>>>> +        struct device *dev = pmem->bb.dev;
>>>>>>> +
>>>>>>> +        if (unlikely(mode == DAX_OP_RECOVERY)) {
>>>>>>> +                lead_off = (unsigned long)addr & ~PAGE_MASK;
>>>>>>> +                len = PFN_PHYS(PFN_UP(lead_off + bytes));
>>>>>>> +                if (is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, len)) {
>>>>>>> +                        if (lead_off || !(PAGE_ALIGNED(bytes))) {
>>>>>>> +                                dev_warn(dev, "Found poison, but addr(%p) and/or bytes(%#lx) not page aligned\n",
>>>>>>> +                                        addr, bytes);
>>>>>>> +                                return (size_t) -EIO;
>>>>>>> +                        }
>>>>>>> +                        pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
>>>>>>> +                        if (pmem_clear_poison(pmem, pmem_off, bytes) !=
>>>>>>> +                                                BLK_STS_OK)
>>>>>>> +                                return (size_t) -EIO;
>>>>>>> +                }
>>>>>>> +        }
>>>>>>
>>>>>> This is in the wrong spot.  As seen in my WIP series individual drivers
>>>>>> really should not hook into copying to and from the iter, because it
>>>>>> really is just one way to write to a nvdimm.  How would dm-writecache
>>>>>> clear the errors with this scheme?
>>>>>>
>>>>>> So IMHO going back to the separate recovery method as in your previous
>>>>>> patch really is the way to go.  If/when the 64-bit store happens we
>>>>>> need to figure out a good way to clear the bad block list for that.
>>>>>
>>>>> I think we just make error management a first class citizen of a
>>>>> dax-device and stop abstracting it behind a driver callback. That way
>>>>> the driver that registers the dax-device can optionally register error
>>>>> management as well. Then fsdax path can do:
>>>>>
>>>>>            rc = dax_direct_access(..., &kaddr, ...);
>>>>>            if (unlikely(rc)) {
>>>>>                    kaddr = dax_mk_recovery(kaddr);
>>>>
>>>> Sorry, what does dax_mk_recovery(kaddr) do?
>>>
>>> I was thinking this just does the hackery to set a flag bit in the
>>> pointer, something like:
>>>
>>> return (void *) ((unsigned long) kaddr | DAX_RECOVERY)
>>
>> Okay, how about call it dax_prep_recovery()?
>>
>>>
>>>>
>>>>>                    dax_direct_access(..., &kaddr, ...);
>>>>>                    return dax_recovery_{read,write}(..., kaddr, ...);
>>>>>            }
>>>>>            return copy_{mc_to_iter,from_iter_flushcache}(...);
>>>>>
>>>>> Where, the recovery version of dax_direct_access() has the opportunity
>>>>> to change the page permissions / use an alias mapping for the access,
>>>>
>>>> again, sorry, what 'page permissions'?  memory_failure_dev_pagemap()
>>>> changes the poisoned page mem_type from 'rw' to 'uc-' (should be NP?),
>>>> do you mean to reverse the change?
>>>
>>> Right, the result of the conversation with Boris is that
>>> memory_failure() should mark the page as NP in call cases, so
>>> dax_direct_access() needs to create a UC mapping and
>>> dax_recover_{read,write}() would sink that operation and either return
>>> the page to NP after the access completes, or convert it to WB if the
>>> operation cleared the error.
>>
>> Okay,  will add a patch to fix set_mce_nospec().
>>
>> How about moving set_memory_uc() and set_memory_np() down to
>> dax_recovery_read(), so that we don't split the set_memory_X calls
>> over different APIs, because we can't enforce what follows
>> dax_direct_access()?
>>
>>>
>>>>> dax_recovery_read() allows reading the good cachelines out of a
>>>>> poisoned page, and dax_recovery_write() coordinates error list
>>>>> management and returning a poison page to full write-back caching
>>>>> operation when no more poisoned cacheline are detected in the page.
>>>>>
>>>>
>>>> How about to introduce 3 dax_recover_ APIs:
>>>>     dax_recover_direct_access(): similar to dax_direct_access except
>>>>        it ignores error list and return the kaddr, and hence is also
>>>>        optional, exported by device driver that has the ability to
>>>>        detect error;
>>>>     dax_recovery_read(): optional, supported by pmem driver only,
>>>>        reads as much data as possible up to the poisoned page;
>>>
>>> It wouldn't be a property of the pmem driver, I expect it would be a
>>> flag on the dax device whether to attempt recovery or not. I.e. get
>>> away from this being a pmem callback and make this a native capability
>>> of a dax device.
>>>
>>>>     dax_recovery_write(): optional, supported by pmem driver only,
>>>>        first clear-poison, then write.
>>>>
>>>> Should we worry about the dm targets?
>>>
>>> The dm targets after Christoph's conversion should be able to do all
>>> the translation at direct access time and then dax_recovery_X can be
>>> done on the resulting already translated kaddr.
>>
>> I'm thinking about the mixed device dm where some provides
>> dax_recovery_X, others don't, in which case we don't allow
>> dax recovery because that causes confusion? or should we still
>> allow recovery for part of the mixed devices?
>
> I really don't like the all or nothing approach if it can be avoided.
> I would imagine that if recovery possible it best to support it even
> if the DM device happens to span a mix of devices with varying support
> for recovery.

Got it!

thanks!
-jane

>
> Thanks,
> Mike
>
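[Annotation, not part of the original message.] Two bits of arithmetic in the thread above are easy to misread: the "flag bit in the pointer" idea behind dax_mk_recovery(), and the lead_off/len computation in the quoted pmem_copy_from_iter() hunk. The following is a minimal user-space sketch of both, not kernel code. Every sketch_* name, the 4 KiB page size, and the choice of bit 0 as the flag are assumptions made for illustration; the thread does not define DAX_RECOVERY's value, and the sketch relies on kaddr being aligned enough that the low bit is free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel definitions; the values are assumptions. */
#define SKETCH_PAGE_SIZE    ((uintptr_t)4096)
#define SKETCH_PAGE_MASK    (~(SKETCH_PAGE_SIZE - 1))
#define SKETCH_DAX_RECOVERY ((uintptr_t)1)  /* assumed: bit 0 of kaddr is unused */

/* Tag a kernel address as "recovery mode" by setting a low flag bit. */
void *sketch_mk_recovery(void *kaddr)
{
	return (void *)((uintptr_t)kaddr | SKETCH_DAX_RECOVERY);
}

/* Test whether the flag bit is set. */
int sketch_is_recovery(const void *kaddr)
{
	return ((uintptr_t)kaddr & SKETCH_DAX_RECOVERY) != 0;
}

/* Recover the real address; the tagged pointer must not be dereferenced. */
void *sketch_strip_recovery(void *kaddr)
{
	return (void *)((uintptr_t)kaddr & ~SKETCH_DAX_RECOVERY);
}

/* Offset of addr within its page: (unsigned long)addr & ~PAGE_MASK. */
size_t sketch_lead_off(const void *addr)
{
	return (size_t)((uintptr_t)addr & ~SKETCH_PAGE_MASK);
}

/* Byte span rounded up to whole pages: PFN_PHYS(PFN_UP(lead_off + bytes)). */
size_t sketch_span_len(const void *addr, size_t bytes)
{
	uintptr_t end = (uintptr_t)sketch_lead_off(addr) + bytes;

	return (size_t)((end + SKETCH_PAGE_SIZE - 1) & SKETCH_PAGE_MASK);
}

/* The quoted hunk only clears poison when addr and bytes are page aligned. */
int sketch_can_clear(const void *addr, size_t bytes)
{
	return sketch_lead_off(addr) == 0 && (bytes % SKETCH_PAGE_SIZE) == 0;
}

/* Small self-check exercising the helpers above. */
void sketch_self_check(void)
{
	void *kaddr = (void *)(uintptr_t)0x10000000UL;   /* page aligned */
	void *mid = (void *)(uintptr_t)0x10000200UL;     /* 512 bytes into a page */
	void *tagged = sketch_mk_recovery(kaddr);

	assert(sketch_is_recovery(tagged) && !sketch_is_recovery(kaddr));
	assert(sketch_strip_recovery(tagged) == kaddr);
	assert(sketch_span_len(mid, 4096) == 8192);      /* straddles two pages */
	assert(!sketch_can_clear(mid, 4096));
	assert(sketch_can_clear(kaddr, 8192));
}
```

With these assumptions, a 4096-byte write starting 512 bytes into a page covers two pages for the bad-block check (len is 8192), yet the hunk refuses to clear poison for it because the start is not page aligned. Whatever consumes a tagged kaddr would have to strip the flag before touching memory, which is one reason the thread calls the approach "hackery".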