linux-nvdimm.lists.01.org archive mirror
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: linux-nvdimm <linux-nvdimm@lists.01.org>,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Subject: Re: [PATCH v2] libnvdimm: Update persistence domain value for of_pmem and papr_scm device
Date: Tue, 11 Feb 2020 08:38:14 -0800	[thread overview]
Message-ID: <CAPcyv4iFP6_jkocoyv-6zd0Y8FEYFA3Pk6brH5+_XQ9+U896wQ@mail.gmail.com> (raw)
In-Reply-To: <25eabdd9-410f-e4c3-6b0e-41a5e6daba10@linux.ibm.com>

On Tue, Feb 11, 2020 at 6:57 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 2/10/20 11:48 PM, Dan Williams wrote:
> > On Mon, Feb 10, 2020 at 6:20 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> Dan Williams <dan.j.williams@intel.com> writes:
> >>
> >>> On Tue, Feb 4, 2020 at 9:21 PM Aneesh Kumar K.V
> >>> <aneesh.kumar@linux.ibm.com> wrote:
> >>>>
> >>>> Currently, kernel shows the below values
> >>>>          "persistence_domain":"cpu_cache"
> >>>>          "persistence_domain":"memory_controller"
> >>>>          "persistence_domain":"unknown"
> >>>>
> >>>> "cpu_cache" indicates no extra instructions is needed to ensure the persistence
> >>>> of data in the pmem media on power failure.
> >>>>
> >>>> "memory_controller" indicates platform provided instructions need to be issued
> >>>
> >>> No, it does not. The only requirement implied by "memory_controller"
> >>> is global visibility outside the cpu cache. If there are special
> >>> instructions beyond that then it isn't persistent memory, at least not
> >>> pmem that is safe for dax. virtio-pmem is an example of pmem-like
> >>> memory that is not enabled for userspace flushing (MAP_SYNC disabled).
> >>>
> >>
> >> Can you explain this more? The way I was expecting the application to
> >> interpret the value was, a regular store instruction doesn't guarantee
> >> persistence if you find the "memory_controller" value for
> >> persistence_domain. Instead, we need to make sure we flush data to the
> >> controller at which point the platform will take care of the persistence in
> >> case of power loss. How we flush data to the controller will also be
> >> defined by the platform.
> >
> > If the platform requires any flush mechanism outside of the base cpu
> > ISA of cache flushes and memory barriers then MAP_SYNC needs to be
> > explicitly disabled to force the application to call fsync()/msync().
> > Then those platform specific mechanisms need to be triggered through a
> > platform-aware driver.
> >
>
>
> Agreed. I was thinking we mark the persistence_domain: "Unknown" in that
> case. virtio-pmem mark it that way.

I would say the driver requirement case is persistence_domain "None",
not "Unknown". I.e. the platform provides no mechanism to flush data
to the persistence domain on power loss, it's back to typical storage
semantics.

>
>
> >>
> >>
> >>>> as per documented sequence to make sure data get flushed so that it is
> >>>> guaranteed to be on pmem media in case of system power loss.
> >>>>
> >>>> Based on the above use memory_controller for non volatile regions on ppc64.
> >>>>
> >>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> >>>> ---
> >>>>   arch/powerpc/platforms/pseries/papr_scm.c | 7 ++++++-
> >>>>   drivers/nvdimm/of_pmem.c                  | 4 +++-
> >>>>   include/linux/libnvdimm.h                 | 1 -
> >>>>   3 files changed, 9 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> >>>> index 7525635a8536..ffcd0d7a867c 100644
> >>>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> >>>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> >>>> @@ -359,8 +359,13 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
> >>>>
> >>>>          if (p->is_volatile)
> >>>>                  p->region = nvdimm_volatile_region_create(p->bus, &ndr_desc);
> >>>> -       else
> >>>> +       else {
> >>>> +               /*
> >>>> +                * We need to flush things correctly to guarantee persistance
> >>>> +                */
> >>>
> >>> There are never guarantees. If you're going to comment what does
> >>> software need to flush, and how?
> >>
> >> Can you explain why you say there are never guarantees? If you follow the platform
> >> recommended instruction sequence to flush data, we can be sure of data
> >> persistence in the pmem media.
> >
> > Because storage can always fail. You can reduce risk, but never
> > eliminate it. This is similar to SSDs that use latent capacitance to
> > flush their write caches on driver power loss. Even if the application
> > successfully flushes its writes to buffers that are protected by that
> > capacitance that power source can still (and in practice does) fail.
> >
>
> ok guarantee is not the right term there. Can we say
>
> /* We need to flush tings correctly to ensure persistence */

The definition of the "memory_controller" persistence domain is: "the
platform takes care to flush writes to media once they are globally
visible outside the cache".

>
>
> What I was trying to understand/clarify was the detail an application
> can infer looking at the value of persistence_domain ?
>
> Do you agree that below can be inferred from the "memory_controller"
> value of persistence_domain
>
> 1) Application needs to use cache flush instructions and that ensures
> data is persistent across power failure.
>
>
> Or are you suggesting that application should not infer any of those
> details looking at persistence_domain value? If so what is the purpose
> of exporting that attribute?

The way the patch was worded I thought it was referring to an explicit
mechanism outside cpu cache flushes, i.e. a mechanism that required a
driver call.

>
>
> >>
> >>
> >>>
> >>>> +               set_bit(ND_REGION_PERSIST_MEMCTRL, &ndr_desc.flags);
> >>>>                  p->region = nvdimm_pmem_region_create(p->bus, &ndr_desc);
> >>>> +       }
> >>>>          if (!p->region) {
> >>>>                  dev_err(dev, "Error registering region %pR from %pOF\n",
> >>>>                                  ndr_desc.res, p->dn);
> >>>> diff --git a/drivers/nvdimm/of_pmem.c b/drivers/nvdimm/of_pmem.c
> >>>> index 8224d1431ea9..6826a274a1f1 100644
> >>>> --- a/drivers/nvdimm/of_pmem.c
> >>>> +++ b/drivers/nvdimm/of_pmem.c
> >>>> @@ -62,8 +62,10 @@ static int of_pmem_region_probe(struct platform_device *pdev)
> >>>>
> >>>>                  if (is_volatile)
> >>>>                          region = nvdimm_volatile_region_create(bus, &ndr_desc);
> >>>> -               else
> >>>> +               else {
> >>>> +                       set_bit(ND_REGION_PERSIST_MEMCTRL, &ndr_desc.flags);
> >>>>                          region = nvdimm_pmem_region_create(bus, &ndr_desc);
> >>>> +               }
> >>>>
> >>>>                  if (!region)
> >>>>                          dev_warn(&pdev->dev, "Unable to register region %pR from %pOF\n",
> >>>> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
> >>>> index 0f366706b0aa..771d888a5ed7 100644
> >>>> --- a/include/linux/libnvdimm.h
> >>>> +++ b/include/linux/libnvdimm.h
> >>>> @@ -54,7 +54,6 @@ enum {
> >>>>          /*
> >>>>           * Platform provides mechanisms to automatically flush outstanding
> >>>>           * write data from memory controler to pmem on system power loss.
> >>>> -        * (ADR)
> >>>
> >>> I'd rather not delete critical terminology for a developer / platform
> >>> owner to be able to consult documentation, or their vendor. Can you
> >>> instead add the PowerPC equivalent term for this capability? I.e. list
> >>> (x86: ADR PowerPC: foo ...).
> >>
> >> Power ISA doesn't clearly call out what mechanism will be used to ensure
> >> that a load following power loss will return the previously flushed
> >> data. Hence there is no description of details like Asynchronous DRAM
> >> Refresh. Only details specified is with respect to flush sequence that ensures
> >> that a load following power loss will return the value stored.
> >
> > What is this "flush sequence"?
> >
>
> cpu cache flush instructions "dcbf; hwsync"

Looks good, as long as the flush mechanism is defined by the cpu ISA
then MAP_SYNC is viable.
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

  reply	other threads:[~2020-02-11 16:38 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-05  5:20 [PATCH v2] libnvdimm: Update persistence domain value for of_pmem and papr_scm device Aneesh Kumar K.V
2020-02-09 16:12 ` Dan Williams
2020-02-10 14:20   ` Aneesh Kumar K.V
2020-02-10 18:18     ` Dan Williams
2020-02-11 14:55       ` Aneesh Kumar K.V
2020-02-11 16:38         ` Dan Williams [this message]
2020-03-20  9:25           ` Aneesh Kumar K.V
2020-03-24  1:21             ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPcyv4iFP6_jkocoyv-6zd0Y8FEYFA3Pk6brH5+_XQ9+U896wQ@mail.gmail.com \
    --to=dan.j.williams@intel.com \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=linux-nvdimm@lists.01.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).