nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Justin He <Justin.He@arm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>,
	Dave Jiang <dave.jiang@intel.com>,
	 David Hildenbrand <david@redhat.com>,
	Linux NVDIMM <nvdimm@lists.linux.dev>,
	 Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	nd <nd@arm.com>
Subject: Re: [PATCH v2] device-dax: use fallback nid when numa node is invalid
Date: Tue, 14 Sep 2021 22:15:54 -0700	[thread overview]
Message-ID: <CAPcyv4gyCHTcXUSLcsgnX8o0JUfpSNf8B=7zbfcZcvWFCUSCvw@mail.gmail.com> (raw)
In-Reply-To: <AM6PR08MB4376FC35158104629C603197F7DA9@AM6PR08MB4376.eurprd08.prod.outlook.com>

On Mon, Sep 13, 2021 at 7:06 PM Justin He <Justin.He@arm.com> wrote:
>
> Hi Dan,
>
> > -----Original Message-----
> > From: Dan Williams <dan.j.williams@intel.com>
> > Sent: Friday, September 10, 2021 11:42 PM
> > To: Justin He <Justin.He@arm.com>
> > Cc: Vishal Verma <vishal.l.verma@intel.com>; Dave Jiang
> > <dave.jiang@intel.com>; David Hildenbrand <david@redhat.com>; Linux NVDIMM
> > <nvdimm@lists.linux.dev>; Linux Kernel Mailing List <linux-
> > kernel@vger.kernel.org>
> > Subject: Re: [PATCH v2] device-dax: use fallback nid when numa node is
> > invalid
> >
> > On Fri, Sep 10, 2021 at 5:46 AM Jia He <justin.he@arm.com> wrote:
> > >
> > > Previously, numa_off was set unconditionally in dummy_numa_init()
> > > even with a fake numa node. Then ACPI sets node id as NUMA_NO_NODE(-1)
> > > after acpi_map_pxm_to_node() because it regards numa_off as turning
> > > off the numa node. Hence dev_dax->target_node is NUMA_NO_NODE on
> > > arm64 with fake numa case.
> > >
> > > Without this patch, pmem can't be probed as RAM devices on arm64 if
> > > SRAT table isn't present:
> > >   $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g
> > -a 64K
> > >   kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with
> > invalid node: -1
> > >   kmem: probe of dax0.0 failed with error -22
> > >
> > > This fixes it by using fallback memory_add_physaddr_to_nid() as nid.
> > >
> > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > Signed-off-by: Jia He <justin.he@arm.com>
> > > ---
> > > v2: - rebase it based on David's "memory group" patch.
> > >     - drop the changes in dev_dax_kmem_remove() since nid had been
> > >       removed in remove_memory().
> > >  drivers/dax/kmem.c | 31 +++++++++++++++++--------------
> > >  1 file changed, 17 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> > > index a37622060fff..e4836eb7539e 100644
> > > --- a/drivers/dax/kmem.c
> > > +++ b/drivers/dax/kmem.c
> > > @@ -47,20 +47,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> > >         unsigned long total_len = 0;
> > >         struct dax_kmem_data *data;
> > >         int i, rc, mapped = 0;
> > > -       int numa_node;
> > > -
> > > -       /*
> > > -        * Ensure good NUMA information for the persistent memory.
> > > -        * Without this check, there is a risk that slow memory
> > > -        * could be mixed in a node with faster memory, causing
> > > -        * unavoidable performance issues.
> > > -        */
> > > -       numa_node = dev_dax->target_node;
> > > -       if (numa_node < 0) {
> > > -               dev_warn(dev, "rejecting DAX region with invalid
> > node: %d\n",
> > > -                               numa_node);
> > > -               return -EINVAL;
> > > -       }
> > > +       int numa_node = dev_dax->target_node;
> > >
> > >         for (i = 0; i < dev_dax->nr_range; i++) {
> > >                 struct range range;
> > > @@ -71,6 +58,22 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> > >                                         i, range.start, range.end);
> > >                         continue;
> > >                 }
> > > +
> > > +               /*
> > > +                * Ensure good NUMA information for the persistent
> > memory.
> > > +                * Without this check, there is a risk but not fatal
> > that slow
> > > +                * memory could be mixed in a node with faster memory,
> > causing
> > > +                * unavoidable performance issues. Warn this and use
> > fallback
> > > +                * node id.
> > > +                */
> > > +               if (numa_node < 0) {
> > > +                       int new_node =
> > memory_add_physaddr_to_nid(range.start);
> > > +
> > > +                       dev_info(dev, "changing nid from %d to %d for
> > DAX region [%#llx-%#llx]\n",
> > > +                                numa_node, new_node, range.start,
> > range.end);
> > > +                       numa_node = new_node;
> > > +               }
> > > +
> > >                 total_len += range_len(&range);
> >
> > This fallback change belongs where the parent region for the namespace
> > adopts its target_node, because it's not clear
> > memory_add_physaddr_to_nid() is the right fallback in all situations.
> > Here is where this setting is happening currently:
> >
> > drivers/acpi/nfit/core.c:3004:          ndr_desc->target_node =
> > pxm_to_node(spa->proximity_domain);
> On my local arm64 guest('virt' machine type), the target_node is
> set to -1 at this line.
> That is:
> The condition "spa->flags & ACPI_NFIT_PROXIMITY_VALID" is hit.
>
> > drivers/acpi/nfit/core.c:3007:          ndr_desc->target_node =
> > NUMA_NO_NODE;
> > drivers/nvdimm/e820.c:29:       ndr_desc.target_node = nid;
> > drivers/nvdimm/of_pmem.c:58:            ndr_desc.target_node =
> > ndr_desc.numa_node;
> > drivers/nvdimm/region_devs.c:1127:      nd_region->target_node =
> > ndr_desc->target_node;
>
>
> Sorry,Dan. I thought I missed your previous mail:
>
> =========================================
> Looks like it is the NFIT driver, thanks.
>
> If you're getting NUMA_NO_NODE in dax_kmem from the NFIT driver in
> means your ACPI NFIT table is failing to populate correct numa
> information. You could try the following to fix it up, but I think the
> real problem is that your platform BIOS needs to add the proper numa
> data.
>
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index fb775b967c52..d3a0cec635b1 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -3005,15 +3005,8 @@ static int acpi_nfit_register_region(struct
> acpi_nfit_desc *acpi_desc,
>         ndr_desc->res = &res;
>         ndr_desc->provider_data = nfit_spa;
>         ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
> -       if (spa->flags & ACPI_NFIT_PROXIMITY_VALID) {
> -               ndr_desc->numa_node = acpi_map_pxm_to_online_node(
> -                                               spa->proximity_domain);
> -               ndr_desc->target_node = acpi_map_pxm_to_node(
> -                               spa->proximity_domain);
> -       } else {
> -               ndr_desc->numa_node = NUMA_NO_NODE;
> -               ndr_desc->target_node = NUMA_NO_NODE;
> -       }
> +       ndr_desc->numa_node = memory_add_physaddr_to_nid(spa->address);
> +       ndr_desc->target_node = phys_to_target_node(spa->address);
>
>         /*
>          * Persistence domain bits are hierarchical, if
> ===================================================
>
> Do you still suggest fixing like this?

Are you saying that ACPI_NFIT_PROXIMITY_VALID is not set on your
platform, or that pxm_to_node() returns NUMA_NO_NODE?

I would expect something like this:

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index a3ef6cce644c..95de7dc18ed8 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -3007,6 +3007,15 @@ static int acpi_nfit_register_region(struct
acpi_nfit_desc *acpi_desc,
                ndr_desc->target_node = NUMA_NO_NODE;
        }

+       /*
+        * Fallback to address based numa information if node lookup
+        * failed
+        */
+       if (ndr_desc->numa_node == NUMA_NO_NODE)
+               ndr_desc->numa_node = memory_add_physaddr_to_nid(spa->address);
+       if (ndr_desc->target_node == NUMA_NO_NODE)
+               phys_to_target_node(spa->address);
+
        /*
         * Persistence domain bits are hierarchical, if
         * ACPI_NFIT_CAPABILITY_CACHE_FLUSH is set then

  reply	other threads:[~2021-09-15  5:16 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-10 12:46 [PATCH v2] device-dax: use fallback nid when numa node is invalid Jia He
2021-09-10 15:42 ` Dan Williams
2021-09-14  2:06   ` Justin He
2021-09-15  5:15     ` Dan Williams [this message]
2021-09-15  6:50       ` Justin He
2021-09-15 16:22         ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAPcyv4gyCHTcXUSLcsgnX8o0JUfpSNf8B=7zbfcZcvWFCUSCvw@mail.gmail.com' \
    --to=dan.j.williams@intel.com \
    --cc=Justin.He@arm.com \
    --cc=dave.jiang@intel.com \
    --cc=david@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nd@arm.com \
    --cc=nvdimm@lists.linux.dev \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).