From: Jane Chu <jane.chu@oracle.com>
To: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
	Christoph Hellwig <hch@infradead.org>,
	david <david@fromorbit.com>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Vishal L Verma <vishal.l.verma@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	Alasdair Kergon <agk@redhat.com>,
	device-mapper development <dm-devel@redhat.com>,
	"Weiny, Ira" <ira.weiny@intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Vivek Goyal <vgoyal@redhat.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux NVDIMM <nvdimm@lists.linux.dev>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>
Subject: Re: [PATCH v2 2/2] dax,pmem: Implement pmem based dax data recovery
Date: Fri, 12 Nov 2021 18:00:02 +0000
Message-ID: <5ca628b6-d5b6-f16a-480d-ea34dfc53aef@oracle.com>
In-Reply-To: <YY6J/mdSmrfK8moV@redhat.com>

On 11/12/2021 7:36 AM, Mike Snitzer wrote:
> On Wed, Nov 10 2021 at  1:26P -0500,
> Jane Chu <jane.chu@oracle.com> wrote:
> 
>> On 11/9/2021 1:02 PM, Dan Williams wrote:
>>> On Tue, Nov 9, 2021 at 11:59 AM Jane Chu <jane.chu@oracle.com> wrote:
>>>>
>>>> On 11/9/2021 10:48 AM, Dan Williams wrote:
>>>>> On Mon, Nov 8, 2021 at 11:27 PM Christoph Hellwig <hch@infradead.org> wrote:
>>>>>>
>>>>>> On Fri, Nov 05, 2021 at 07:16:38PM -0600, Jane Chu wrote:
>>>>>>>     static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>>>>>>>                  void *addr, size_t bytes, struct iov_iter *i, int mode)
>>>>>>>     {
>>>>>>> +     phys_addr_t pmem_off;
>>>>>>> +     size_t len, lead_off;
>>>>>>> +     struct pmem_device *pmem = dax_get_private(dax_dev);
>>>>>>> +     struct device *dev = pmem->bb.dev;
>>>>>>> +
>>>>>>> +     if (unlikely(mode == DAX_OP_RECOVERY)) {
>>>>>>> +             lead_off = (unsigned long)addr & ~PAGE_MASK;
>>>>>>> +             len = PFN_PHYS(PFN_UP(lead_off + bytes));
>>>>>>> +             if (is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, len)) {
>>>>>>> +                     if (lead_off || !(PAGE_ALIGNED(bytes))) {
>>>>>>> +                             dev_warn(dev, "Found poison, but addr(%p) and/or bytes(%#lx) not page aligned\n",
>>>>>>> +                                     addr, bytes);
>>>>>>> +                             return (size_t) -EIO;
>>>>>>> +                     }
>>>>>>> +                     pmem_off = PFN_PHYS(pgoff) + pmem->data_offset;
>>>>>>> +                     if (pmem_clear_poison(pmem, pmem_off, bytes) !=
>>>>>>> +                                             BLK_STS_OK)
>>>>>>> +                             return (size_t) -EIO;
>>>>>>> +             }
>>>>>>> +     }
>>>>>>
>>>>>> This is in the wrong spot.  As seen in my WIP series individual drivers
>>>>>> really should not hook into copying to and from the iter, because it
>>>>>> really is just one way to write to a nvdimm.  How would dm-writecache
>>>>>> clear the errors with this scheme?
>>>>>>
>>>>>> So IMHO going back to the separate recovery method as in your previous
>>>>>> patch really is the way to go.  If/when the 64-bit store happens we
>>>>>> need to figure out a good way to clear the bad block list for that.
>>>>>
>>>>> I think we just make error management a first class citizen of a
>>>>> dax-device and stop abstracting it behind a driver callback. That way
>>>>> the driver that registers the dax-device can optionally register error
>>>>> management as well. Then fsdax path can do:
>>>>>
>>>>>            rc = dax_direct_access(..., &kaddr, ...);
>>>>>            if (unlikely(rc)) {
>>>>>                    kaddr = dax_mk_recovery(kaddr);
>>>>
>>>> Sorry, what does dax_mk_recovery(kaddr) do?
>>>
>>> I was thinking this just does the hackery to set a flag bit in the
>>> pointer, something like:
>>>
>>> return (void *) ((unsigned long) kaddr | DAX_RECOVERY)
>>
>> Okay, how about calling it dax_prep_recovery()?
>>
>>>
>>>>
>>>>>                    dax_direct_access(..., &kaddr, ...);
>>>>>                    return dax_recovery_{read,write}(..., kaddr, ...);
>>>>>            }
>>>>>            return copy_{mc_to_iter,from_iter_flushcache}(...);
>>>>>
>>>>> Where, the recovery version of dax_direct_access() has the opportunity
>>>>> to change the page permissions / use an alias mapping for the access,
>>>>
>>>> again, sorry, what 'page permissions'?  memory_failure_dev_pagemap()
>>>> changes the poisoned page mem_type from 'rw' to 'uc-' (should be NP?),
>>>> do you mean to reverse the change?
>>>
>>> Right, the result of the conversation with Boris is that
>>> memory_failure() should mark the page as NP in all cases, so
>>> dax_direct_access() needs to create a UC mapping and
>>> dax_recovery_{read,write}() would sink that operation and either return
>>> the page to NP after the access completes, or convert it to WB if the
>>> operation cleared the error.
>>
>> Okay,  will add a patch to fix set_mce_nospec().
>>
>> How about moving set_memory_uc() and set_memory_np() down to
>> dax_recovery_read(), so that we don't split the set_memory_X calls
>> over different APIs, because we can't enforce what follows
>> dax_direct_access()?
>>
>>>
>>>>> dax_recovery_read() allows reading the good cachelines out of a
>>>>> poisoned page, and dax_recovery_write() coordinates error list
>>>>> management and returning a poisoned page to full write-back caching
>>>>> operation when no more poisoned cachelines are detected in the page.
>>>>>
>>>>
>>>> How about introducing three dax_recover_ APIs:
>>>>      dax_recover_direct_access(): similar to dax_direct_access() except
>>>>         that it ignores the error list and returns the kaddr; it is also
>>>>         optional, exported by device drivers that have the ability to
>>>>         detect errors;
>>>>      dax_recovery_read(): optional, supported by pmem driver only,
>>>>         reads as much data as possible up to the poisoned page;
>>>
>>> It wouldn't be a property of the pmem driver, I expect it would be a
>>> flag on the dax device whether to attempt recovery or not. I.e. get
>>> away from this being a pmem callback and make this a native capability
>>> of a dax device.
>>>
>>>>      dax_recovery_write(): optional, supported by pmem driver only,
>>>>         first clear-poison, then write.
>>>>
>>>> Should we worry about the dm targets?
>>>
>>> The dm targets after Christoph's conversion should be able to do all
>>> the translation at direct access time and then dax_recovery_X can be
>>> done on the resulting already translated kaddr.
>>
>> I'm thinking about a mixed-device dm where some devices provide
>> dax_recovery_X and others don't. Should we disallow dax recovery
>> in that case because it causes confusion, or should we still
>> allow recovery for part of the mixed devices?
> 
> I really don't like the all-or-nothing approach if it can be avoided.
> I would imagine that if recovery is possible it is best to support it even
> if the DM device happens to span a mix of devices with varying support
> for recovery.

Got it!

thanks!
-jane

> 
> Thanks,
> Mike
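
For readers following the alignment check in the quoted pmem_copy_from_iter() hunk, here is a minimal userspace sketch of the same arithmetic. It is not kernel code: the sk_ names are hypothetical stand-ins for PAGE_MASK, PFN_UP() and PFN_PHYS(), and a 4 KiB page size is assumed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel macros; 4 KiB pages are assumed. */
#define SK_PAGE_SHIFT 12
#define SK_PAGE_SIZE  ((uintptr_t)1 << SK_PAGE_SHIFT)
#define SK_PAGE_MASK  (~(SK_PAGE_SIZE - 1))

/* Offset of addr within its page (the hunk's lead_off). */
static size_t sk_lead_off(uintptr_t addr)
{
	return (size_t)(addr & ~SK_PAGE_MASK);
}

/* Length in bytes after rounding lead_off + bytes up to whole pages (the hunk's len). */
static size_t sk_span_len(size_t lead_off, size_t bytes)
{
	size_t pages = (lead_off + bytes + SK_PAGE_SIZE - 1) >> SK_PAGE_SHIFT;

	return pages << SK_PAGE_SHIFT;
}

/* The hunk only clears poison when both the address and the length are page aligned. */
static bool sk_can_clear(uintptr_t addr, size_t bytes)
{
	return sk_lead_off(addr) == 0 && (bytes & (SK_PAGE_SIZE - 1)) == 0;
}
```

With these helpers, a 4 KiB write starting 0x234 bytes into a page spans two pages, and the quoted patch would return -EIO for it if poison is found in that span.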
> 
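
The "flag bit in the pointer" idea quoted above can be sketched in plain C as follows. This is only an illustration: the thread gives no value for DAX_RECOVERY, so the bit chosen here is an assumption, and the sk_ helper names (including the test and strip helpers) are hypothetical. The trick relies on kaddr being aligned so that the low bit is otherwise always zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag value: bit 0 is free only if kaddr is at least 2-byte aligned. */
#define SK_DAX_RECOVERY ((uintptr_t)1)

/* Tag a kernel address as "recovery mode" by setting the flag bit. */
static void *sk_dax_mk_recovery(void *kaddr)
{
	return (void *)((uintptr_t)kaddr | SK_DAX_RECOVERY);
}

/* Check whether the flag bit is set on a (possibly tagged) address. */
static bool sk_dax_is_recovery(const void *kaddr)
{
	return ((uintptr_t)kaddr & SK_DAX_RECOVERY) != 0;
}

/* Recover the usable address; a tagged pointer must not be dereferenced directly. */
static void *sk_dax_strip_recovery(void *kaddr)
{
	return (void *)((uintptr_t)kaddr & ~SK_DAX_RECOVERY);
}
```

The design cost is that every consumer of the pointer has to strip the flag before use, which is presumably why it was described as "hackery" above.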

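
One way to picture the "not all or nothing" policy for a DM device spanning mixed devices is a per-range capability lookup. The sketch below is a toy model, not the device-mapper API: struct sk_target and both functions are hypothetical, and it only shows the decision being made per backing target rather than per DM device.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one target inside a DM device. */
struct sk_target {
	size_t start;       /* first page offset covered by this target */
	size_t npages;      /* number of pages covered */
	bool can_recover;   /* does the underlying device offer recovery? */
};

/* Find the target that maps pgoff, or NULL if no target covers it. */
static const struct sk_target *sk_lookup(const struct sk_target *t, size_t n,
					 size_t pgoff)
{
	for (size_t i = 0; i < n; i++)
		if (pgoff >= t[i].start && pgoff - t[i].start < t[i].npages)
			return &t[i];
	return NULL;
}

/* Per-range policy: allow recovery wherever the backing target supports it. */
static bool sk_recovery_allowed(const struct sk_target *t, size_t n,
				size_t pgoff)
{
	const struct sk_target *hit = sk_lookup(t, n, pgoff);

	return hit != NULL && hit->can_recover;
}
```

Under this model a poisoned page on a recovery-capable leg can still be repaired even when another leg of the same DM device cannot do recovery at all.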