* DAX numa_attribute vs SubNUMA clusters
@ 2019-04-04 19:47 Brice Goglin
  2019-04-05  3:29 ` Elliott, Robert (Servers)
  2019-04-08  4:26 ` Dan Williams
  0 siblings, 2 replies; 10+ messages in thread
From: Brice Goglin @ 2019-04-04 19:47 UTC (permalink / raw)
  To: linux-nvdimm

Hello

I am trying to understand the locality of the DAX devices with
respect to processors with SubNUMA clustering enabled. The machine
I am using has 6 proximity domains: #0-3 are the SNCs of both
processors, #4-5 are the prox domains for each socket's set of NVDIMMs.

SLIT says the topology looks like this, which seems OK to me:

  Package 0 ---------- Package 1
  NVregion0            NVregion1
   |     |              |     |
SNC 0   SNC 1        SNC 2   SNC 3
node0   node1        node2   node3

However each DAX "numa_node" attribute contains a single node ID,
which leads to this topology instead:

  Package 0 ---------- Package 1
   |     |              |     |
SNC 0   SNC 1        SNC 2   SNC 3
node0   node1        node2   node3
   |                   |
dax0.0               dax1.0

It looks like this is caused by acpi_map_pxm_to_online_node()
only returning the first closest node found in the SLIT.
However, even if we change it to return multiple local nodes,
the DAX "numa_node" attribute cannot expose multiple nodes.
Should we rather expose Keith's HMAT attributes for DAX devices?
Maybe there's even a way to share them between DAX devices
and Dave's KMEM hotplugged NUMA nodes?

By the way, I am not sure if my above configuration is what
we should expect on SNC-enabled production machines.
Is the NFIT table supposed to expose one SPA Range per SNC,
or one per socket? Should it depend on the SNC config in
the BIOS?

If we had one SPA range per SNC, would it still be possible
to interleave NVDIMMs of both SNC to create a single region
for each socket?

If I don't interleave NVDIMMs, I get the same result even if
some regions should be only local to node1 (or node3). Maybe
because they are still in the same SPA range, and thus still
get the entire range locality?

Brice


_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: DAX numa_attribute vs SubNUMA clusters
  2019-04-04 19:47 DAX numa_attribute vs SubNUMA clusters Brice Goglin
@ 2019-04-05  3:29 ` Elliott, Robert (Servers)
  2019-04-08  4:26 ` Dan Williams
  1 sibling, 0 replies; 10+ messages in thread
From: Elliott, Robert (Servers) @ 2019-04-05  3:29 UTC (permalink / raw)
  To: Brice Goglin, linux-nvdimm



> -----Original Message-----
> From: Linux-nvdimm <linux-nvdimm-bounces@lists.01.org> On Behalf Of
> Brice Goglin
> Sent: Thursday, April 04, 2019 2:48 PM
> To: linux-nvdimm@lists.01.org
> Subject: DAX numa_attribute vs SubNUMA clusters
> 
> Hello
> 
> I am trying to understand the locality of the DAX devices with
> respect to processors with SubNUMA clustering enabled. The machine
> I am using has 6 proximity domains: #0-3 are the SNCs of both
> processors, #4-5 are the prox domains for each socket's set of NVDIMMs.
> 
> SLIT says the topology looks like this, which seems OK to me:
> 
>   Package 0 ---------- Package 1
>   NVregion0            NVregion1
>    |     |              |     |
> SNC 0   SNC 1        SNC 2   SNC 3
> node0   node1        node2   node3
> 
> However each DAX "numa_node" attribute contains a single node ID,
> which leads to this topology instead:
> 
>   Package 0 ---------- Package 1
>    |     |              |     |
> SNC 0   SNC 1        SNC 2   SNC 3
> node0   node1        node2   node3
>    |                   |
> dax0.0               dax1.0
> 
> It looks like this is caused by acpi_map_pxm_to_online_node()
> only returning the first closest node found in the SLIT.
> However, even if we change it to return multiple local nodes,
> the DAX "numa_node" attribute cannot expose multiple nodes.
> Should we rather expose Keith's HMAT attributes for DAX devices?
> Maybe there's even a way to share them between DAX devices
> and Dave's KMEM hotplugged NUMA nodes?
> 
> By the way, I am not sure if my above configuration is what
> we should expect on SNC-enabled production machines.
> Is the NFIT table supposed to expose one SPA Range per SNC,
> or one per socket? Should it depend on the SNC config in
> the BIOS?
> 
> If we had one SPA range per SNC, would it still be possible
> to interleave NVDIMMs of both SNC to create a single region
> for each socket?

There is one SPA range for each interleave set. All of these 
are possible (but maybe not supported in a particular system):
* interleave across channels within the SNC
* interleave across SNCs within the package
* interleave across packages

The latter is sometimes called "node interleaving"; with SNC,
there is another level of nodes inside the package, and interleaving
across those could also be called "node interleaving."

In that case, the memory is not really part of one node; it's
equidistant from both nodes.  So, it would most accurately
be described as a separate node with a "distance" of half that
between the nodes. However, the kernel has struggled in the past
handling CPU nodes (especially CPU 0) with no local memory, so
rounding down to node0 and node2 isn't surprising.


> If I don't interleave NVDIMMs, I get the same result even if
> some regions should be only local to node1 (or node3). Maybe
> because they are still in the same SPA range, and thus still
> get the entire range locality?

That sounds like a bug.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: DAX numa_attribute vs SubNUMA clusters
  2019-04-04 19:47 DAX numa_attribute vs SubNUMA clusters Brice Goglin
  2019-04-05  3:29 ` Elliott, Robert (Servers)
@ 2019-04-08  4:26 ` Dan Williams
  2019-04-08  8:13   ` Brice Goglin
  1 sibling, 1 reply; 10+ messages in thread
From: Dan Williams @ 2019-04-08  4:26 UTC (permalink / raw)
  To: Brice Goglin; +Cc: linux-nvdimm

On Thu, Apr 4, 2019 at 12:48 PM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
> Hello
>
> I am trying to understand the locality of the DAX devices with
> respect to processors with SubNUMA clustering enabled. The machine
> I am using has 6 proximity domains: #0-3 are the SNCs of both
> processors, #4-5 are the prox domains for each socket's set of NVDIMMs.
>
> SLIT says the topology looks like this, which seems OK to me:
>
>   Package 0 ---------- Package 1
>   NVregion0            NVregion1
>    |     |              |     |
> SNC 0   SNC 1        SNC 2   SNC 3
> node0   node1        node2   node3
>
> However each DAX "numa_node" attribute contains a single node ID,
> which leads to this topology instead:
>
>   Package 0 ---------- Package 1
>    |     |              |     |
> SNC 0   SNC 1        SNC 2   SNC 3
> node0   node1        node2   node3
>    |                   |
> dax0.0               dax1.0
>
> It looks like this is caused by acpi_map_pxm_to_online_node()
> only returning the first closest node found in the SLIT.
> However, even if we change it to return multiple local nodes,
> the DAX "numa_node" attribute cannot expose multiple nodes.
> Should we rather expose Keith's HMAT attributes for DAX devices?

If I understand the suggestion correctly, you're referring to the
"target_node", i.e. the unique node number that gets assigned when the
memory is transitioned online. I struggle to see the incremental
benefit relative to what we would lose in compatibility with the
"traditional" numa_node interpretation, where the attribute indicates
which cpus are close to the given device. I think the bulk of the
problem is solved with the next suggestion below.

> Maybe there's even a way to share them between DAX devices
> and Dave's KMEM hotplugged NUMA nodes?

In this instance, where the expectation is that the NVDIMM range is
equidistant from both SNC nodes on a package, I would teach the numactl
tool and other tooling to return a list of local nodes rather than the
single attribute. Effectively an operation like "numactl --preferred
block:pmem0" would return a node-mask that includes nodes 0 and 1.

> By the way, I am not sure if my above configuration is what
> we should expect on SNC-enabled production machines.
> Is the NFIT table supposed to expose one SPA Range per SNC,
> or one per socket? Should it depend on the SNC config in
> the BIOS?

The NFIT is "supposed" to expose the interleave boundaries, and in
this case it seems to be saying that System RAM is interleaved
differently than the PMEM. Whether that is correct or not is for the
platform BIOS developer to validate. The OS is only equipped to trust
the SLIT.

> If we had one SPA range per SNC, would it still be possible
> to interleave NVDIMMs of both SNC to create a single region
> for each socket?

I don't follow the question, if the SPA range is split you want the
SLIT to lie and say it isn't?

> If I don't interleave NVDIMMs, I get the same result even if
> some regions should be only local to node1 (or node3). Maybe
> because they are still in the same SPA range, and thus still
> get the entire range locality?

...or the SLIT is incorrect for that config.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: DAX numa_attribute vs SubNUMA clusters
  2019-04-08  4:26 ` Dan Williams
@ 2019-04-08  8:13   ` Brice Goglin
  2019-04-08 14:56     ` Dan Williams
  0 siblings, 1 reply; 10+ messages in thread
From: Brice Goglin @ 2019-04-08  8:13 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm

Le 08/04/2019 à 06:26, Dan Williams a écrit :
> On Thu, Apr 4, 2019 at 12:48 PM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>> Hello
>>
>> I am trying to understand the locality of the DAX devices with
>> respect to processors with SubNUMA clustering enabled. The machine
>> I am using has 6 proximity domains: #0-3 are the SNCs of both
>> processors, #4-5 are the prox domains for each socket's set of NVDIMMs.
>>
>> SLIT says the topology looks like this, which seems OK to me:
>>
>>   Package 0 ---------- Package 1
>>   NVregion0            NVregion1
>>    |     |              |     |
>> SNC 0   SNC 1        SNC 2   SNC 3
>> node0   node1        node2   node3
>>
>> However each DAX "numa_node" attribute contains a single node ID,
>> which leads to this topology instead:
>>
>>   Package 0 ---------- Package 1
>>    |     |              |     |
>> SNC 0   SNC 1        SNC 2   SNC 3
>> node0   node1        node2   node3
>>    |                   |
>> dax0.0               dax1.0
>>
>> It looks like this is caused by acpi_map_pxm_to_online_node()
>> only returning the first closest node found in the SLIT.
>> However, even if we change it to return multiple local nodes,
>> the DAX "numa_node" attribute cannot expose multiple nodes.
>> Should we rather expose Keith's HMAT attributes for DAX devices?
> If I understand the suggestion correctly, you're referring to the
> "target_node", i.e. the unique node number that gets assigned when the
> memory is transitioned online. I struggle to see the incremental
> benefit relative to what we would lose in compatibility with the
> "traditional" numa_node interpretation, where the attribute indicates
> which cpus are close to the given device. I think the bulk of the
> problem is solved with the next suggestion below.


Hello Dan,

Not sure why you're talking about "target_node" here. That attribute is
correct:

$ cat /sys/bus/dax/devices/dax0.0/target_node
4

My issue is with "numa_node" which fails to return enough information here:

$ cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/dax0.0/numa_node
0

(instead of 0+1 but I don't want to change the semantics of that file,
see below)


>
>> Maybe there's even a way to share them between DAX devices
>> and Dave's KMEM hotplugged NUMA nodes?
> In this instance, where the expectation is that the NVDIMM range is
> equidistant from both SNC nodes on a package, I would teach the numactl
> tool and other tooling to return a list of local nodes rather than the
> single attribute. Effectively an operation like "numactl --preferred
> block:pmem0" would return a node-mask that includes nodes 0 and 1.


Teaching these tools is exactly what I want to address here (I was
talking about dax0.0 rather than pmem0, but it doesn't matter much). There are
usually two ways to find the locality of a device from userspace:

* Reading a "local_cpus" sysfs attribute. Works well for finding local
CPUs. Doesn't always work for finding local memory when some CPUs are
offline: if all CPUs of the local node are offline, you lose the
information about the local memory being close to your device (Intel
people from "mOS" heavily rely of this).

* Reading a "numa_node" sysfs attribute, but it points to a single node.


Keith's HMAT patches are essentially a third way that has neither of these
issues: you just read "access0/initiators/node*":

* If you want local CPUs, you read the "cpumap" of the initiator nodes.

* If you want the list of "close" memory nodes, you have the list of
initiator "nodes", or their targets.

It would work very well for describing the topology of my machine once I
hotplug node4 and node5 using Dave's "kmem" driver: I get node0 and
node1 in node4/access0/initiators/.
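
For illustration, with the sysfs layout from Keith's patches (paths
assumed as in that series), both lookups become trivial:

$ ls /sys/devices/system/node/node4/access0/initiators/
node0  node1  ...
$ cat /sys/devices/system/node/node0/cpumap

(the "..." being the bandwidth/latency attributes exposed next to the
initiator links)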


I know HMAT attributes don't appear in hotplugged node sysfs directories
yet, but it would also be nice to have a way to get that information for
dax devices before hotplug, since the dax device and hotplugged nodes
are the same thing.


In a crazy world, maybe we could have something like this:

* before hotplug with kmem driver, unregistered nodes appear in a
special directory such as
/sys/devices/system/node/unregistered_hmat/nodeX together with their
HMAT attributes. If I want to find the locality of a DAX device, I read
its target_node, and go to the corresponding unregistered_hmat/nodeX and
read cpumap, initiators, etc.

* at hotplug, the node is moved out of unregistered_hmat/
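
Purely hypothetical of course, but the lookup would then look like:

$ cat /sys/bus/dax/devices/dax0.0/target_node
4
$ ls /sys/devices/system/node/unregistered_hmat/node4/access0/initiators/
node0  node1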


>> If we had one SPA range per SNC, would it still be possible
>> to interleave NVDIMMs of both SNC to create a single region
>> for each socket?
> I don't follow the question, if the SPA range is split you want the
> SLIT to lie and say it isn't?


Sorry, these questions about NFIT were not related to my specific config
but rather to understand what configs are possible. Elliott's answer and
yours clarified things, thanks.

Brice


_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: DAX numa_attribute vs SubNUMA clusters
  2019-04-08  8:13   ` Brice Goglin
@ 2019-04-08 14:56     ` Dan Williams
  2019-04-08 19:55       ` Brice Goglin
  0 siblings, 1 reply; 10+ messages in thread
From: Dan Williams @ 2019-04-08 14:56 UTC (permalink / raw)
  To: Brice Goglin; +Cc: linux-nvdimm

On Mon, Apr 8, 2019 at 1:13 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
> Le 08/04/2019 à 06:26, Dan Williams a écrit :
> > On Thu, Apr 4, 2019 at 12:48 PM Brice Goglin <Brice.Goglin@inria.fr> wrote:
> >> Hello
> >>
> >> I am trying to understand the locality of the DAX devices with
> >> respect to processors with SubNUMA clustering enabled. The machine
> >> I am using has 6 proximity domains: #0-3 are the SNCs of both
> >> processors, #4-5 are the prox domains for each socket's set of NVDIMMs.
> >>
> >> SLIT says the topology looks like this, which seems OK to me:
> >>
> >>   Package 0 ---------- Package 1
> >>   NVregion0            NVregion1
> >>    |     |              |     |
> >> SNC 0   SNC 1        SNC 2   SNC 3
> >> node0   node1        node2   node3
> >>
> >> However each DAX "numa_node" attribute contains a single node ID,
> >> which leads to this topology instead:
> >>
> >>   Package 0 ---------- Package 1
> >>    |     |              |     |
> >> SNC 0   SNC 1        SNC 2   SNC 3
> >> node0   node1        node2   node3
> >>    |                   |
> >> dax0.0               dax1.0
> >>
> >> It looks like this is caused by acpi_map_pxm_to_online_node()
> >> only returning the first closest node found in the SLIT.
> >> However, even if we change it to return multiple local nodes,
> >> the DAX "numa_node" attribute cannot expose multiple nodes.
> >> Should we rather expose Keith's HMAT attributes for DAX devices?
> > If I understand the suggestion correctly, you're referring to the
> > "target_node", i.e. the unique node number that gets assigned when the
> > memory is transitioned online. I struggle to see the incremental
> > benefit relative to what we would lose in compatibility with the
> > "traditional" numa_node interpretation, where the attribute indicates
> > which cpus are close to the given device. I think the bulk of the
> > problem is solved with the next suggestion below.
>
>
> Hello Dan,
>
> Not sure why you're talking about "target_node" here. That attribute is
> correct:
>
> $ cat /sys/bus/dax/devices/dax0.0/target_node
> 4
>
> My issue is with "numa_node" which fails to return enough information here:
>
> $ cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/dax0.0/numa_node
> 0
>
> (instead of 0+1 but I don't want to change the semantics of that file,
> see below)
>
>
> >
> >> Maybe there's even a way to share them between DAX devices
> >> and Dave's KMEM hotplugged NUMA nodes?
> > In this instance, where the expectation is that the NVDIMM range is
> > equidistant from both SNC nodes on a package, I would teach the numactl
> > tool and other tooling to return a list of local nodes rather than the
> > single attribute. Effectively an operation like "numactl --preferred
> > block:pmem0" would return a node-mask that includes nodes 0 and 1.
>
>
> Teaching these tools is exactly what I want to address here (I was
> talking about dax0.0 rather than pmem0, but it doesn't matter much). There are
> usually two ways to find the locality of a device from userspace:
>
> * Reading a "local_cpus" sysfs attribute. Works well for finding local
> CPUs. Doesn't always work for finding local memory when some CPUs are
> offline: if all CPUs of the local node are offline, you lose the
> information about the local memory being close to your device (Intel
> people from "mOS" heavily rely of this).
>
> * Reading a "numa_node" sysfs attribute, but it points to a single node.
>
>
> Keith's HMAT patches are essentially a third way that has neither of these
> issues: you just read "access0/initiators/node*":
>
> * If you want local CPUs, you read the "cpumap" of the initiator nodes.
>
> * If you want the list of "close" memory nodes, you have the list of
> initiator "nodes", or their targets.
>
> It would work very well for describing the topology of my machine once I
> hotplug node4 and node5 using Dave's "kmem" driver: I get node0 and
> node1 in node4/access0/initiators/.

Yes, I agree with all of the above, but I think we need a way to fix
this independent of the HMAT data being present. The SLIT already
tells the kernel enough to let tooling figure out equidistant "local"
nodes. While the numa_node attribute will remain a singleton the
tooling needs to handle this case and can't assume the HMAT data will
be present.

> I know HMAT attributes don't appear in hotplugged node sysfs directories
> yet, but it would also be nice to have a way to get that information for
> dax devices before hotplug, since the dax device and hotplugged nodes
> are the same thing.
>
>
> In a crazy world, maybe we could have something like this:
>
> * before hotplug with kmem driver, unregistered nodes appear in a
> special directory such as
> /sys/devices/system/node/unregistered_hmat/nodeX together with their
> HMAT attributes. If I want to find the locality of a DAX device, I read
> its target_node, and go to the corresponding unregistered_hmat/nodeX and
> read cpumap, initiators, etc.
>
> * at hotplug, the node is moved out of unregistered_hmat/

Some sort of offline target_node data makes sense, but seems secondary
to teaching tools to supplement the 'numa_node' attribute.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: DAX numa_attribute vs SubNUMA clusters
  2019-04-08 14:56     ` Dan Williams
@ 2019-04-08 19:55       ` Brice Goglin
  2019-04-16 15:31         ` Brice Goglin
  0 siblings, 1 reply; 10+ messages in thread
From: Brice Goglin @ 2019-04-08 19:55 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm

Le 08/04/2019 à 16:56, Dan Williams a écrit :
> Yes, I agree with all of the above, but I think we need a way to fix
> this independent of the HMAT data being present. The SLIT already
> tells the kernel enough to let tooling figure out equidistant "local"
> nodes. While the numa_node attribute will remain a singleton the
> tooling needs to handle this case and can't assume the HMAT data will
> be present.

So you want to export the part of SLIT that is currently hidden from
userspace because the corresponding nodes aren't registered?

With the patch below, I get 17 17 28 28 in dax0.0/node_distance which
means it's close to node0 and node1.

The code is pretty much a duplicate of read_node_distance() in
drivers/base/node.c. Not sure it's worth factoring out such small functions?

The name "node_distance" (instead of "distance" for NUMA nodes) is also
subject to discussion.

Brice


commit 6488b6a5c942a972b97c2f28d566d89b9917ef1d
Author: Brice Goglin <Brice.Goglin@inria.fr>
Date:   Mon Apr 8 21:44:30 2019 +0200

    device-dax: Add a 'node_distance' attribute
    
    This attribute is identical to the 'distance' attribute
    inside NUMA node directories. It lists the distance
    from the DAX device to each online NUMA node.
    
    Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 2109cfe80219..35a80c852e0d 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -295,6 +295,28 @@ static ssize_t target_node_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(target_node);
 
+static ssize_t node_distance_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	int nid = dev_dax_target_node(dev_dax);
+	int len = 0;
+	int i;
+
+	/*
+	 * buf is currently PAGE_SIZE in length and each node needs 4 chars
+	 * at the most (distance + space or newline).
+	 */
+	BUILD_BUG_ON(MAX_NUMNODES * 4 > PAGE_SIZE);
+
+	for_each_online_node(i)
+		len += sprintf(buf + len, "%s%d", i ? " " : "", node_distance(nid, i));
+
+	len += sprintf(buf + len, "\n");
+	return len;
+}
+static DEVICE_ATTR(node_distance, S_IRUGO, node_distance_show, NULL);
+
 static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
 		char *buf)
 {
@@ -320,6 +342,7 @@ static struct attribute *dev_dax_attributes[] = {
 	&dev_attr_modalias.attr,
 	&dev_attr_size.attr,
 	&dev_attr_target_node.attr,
+	&dev_attr_node_distance.attr,
 	NULL,
 };
 


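On the consuming side, tooling could simply pick the columns at the
minimum distance (a sketch; it assumes online node IDs are contiguous
so that column i maps to node i):

$ cat /sys/bus/dax/devices/dax0.0/node_distance
17 17 28 28
$ awk '{m=$1; for(i=2;i<=NF;i++) if($i<m) m=$i;
        for(i=1;i<=NF;i++) if($i==m) printf "%d ", i-1; print ""}' \
    /sys/bus/dax/devices/dax0.0/node_distance
0 1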

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: DAX numa_attribute vs SubNUMA clusters
  2019-04-08 19:55       ` Brice Goglin
@ 2019-04-16 15:31         ` Brice Goglin
  2019-04-17 21:35           ` Dan Williams
  0 siblings, 1 reply; 10+ messages in thread
From: Brice Goglin @ 2019-04-16 15:31 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm


Le 08/04/2019 à 21:55, Brice Goglin a écrit :
> Le 08/04/2019 à 16:56, Dan Williams a écrit :
>> Yes, I agree with all of the above, but I think we need a way to fix
>> this independent of the HMAT data being present. The SLIT already
>> tells the kernel enough to let tooling figure out equidistant "local"
>> nodes. While the numa_node attribute will remain a singleton the
>> tooling needs to handle this case and can't assume the HMAT data will
>> be present.
> So you want to export the part of SLIT that is currently hidden from
> userspace because the corresponding nodes aren't registered?
>
> With the patch below, I get 17 17 28 28 in dax0.0/node_distance which
> means it's close to node0 and node1.
>
> The code is pretty much a duplicate of read_node_distance() in
> drivers/base/node.c. Not sure it's worth factoring out such small functions?
>
> The name "node_distance" (instead of "distance" for NUMA nodes) is also
> subject to discussion.

Here's a better patch that exports the existing routine for showing
node distances, and reuses it in dax/bus.c and nvdimm/pfn_devs.c:

# cat /sys/class/block/pmem1/device/node_distance 
28 28 17 17
# cat /sys/bus/dax/devices/dax0.0/node_distance 
17 17 28 28

By the way, it also handles the case where the nd_region has no
valid target_node (idea stolen from kmem.c).

Are there other places where it'd be useful to export that attribute?

Ideally we could just export it in the region sysfs directory,
but I can't find backlinks going from daxX.Y or pmemZ to that
region directory :/

Signed-off-by: Brice Goglin <brice.goglin@inria.fr>

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 8598fcb..5c6bce1 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -509,10 +509,9 @@ static ssize_t node_read_vmstat(struct device *dev,
 }
 static DEVICE_ATTR(vmstat, S_IRUGO, node_read_vmstat, NULL);
 
-static ssize_t node_read_distance(struct device *dev,
+ssize_t generic_node_distance_show(int nid,
 			struct device_attribute *attr, char *buf)
 {
-	int nid = dev->id;
 	int len = 0;
 	int i;
 
@@ -528,6 +527,13 @@ static ssize_t node_read_distance(struct device *dev,
 	len += sprintf(buf + len, "\n");
 	return len;
 }
+EXPORT_SYMBOL_GPL(generic_node_distance_show);
+
+static ssize_t node_read_distance(struct device *dev,
+			struct device_attribute *attr, char *buf)
+{
+	return generic_node_distance_show(dev->id, attr, buf);
+}
 static DEVICE_ATTR(distance, S_IRUGO, node_read_distance, NULL);
 
 static struct attribute *node_dev_attrs[] = {
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 2109cfe..37d750e 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -6,6 +6,7 @@
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/dax.h>
+#include <linux/node.h>
 #include "dax-private.h"
 #include "bus.h"
 
@@ -295,6 +296,18 @@ static ssize_t target_node_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(target_node);
 
+static ssize_t node_distance_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	int nid = dev_dax_target_node(dev_dax);
+
+	if (nid < 0)
+		return 0;
+	return generic_node_distance_show(nid, attr, buf);
+}
+static DEVICE_ATTR(node_distance, S_IRUGO, node_distance_show, NULL);
+
 static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
 		char *buf)
 {
@@ -320,6 +333,7 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
 	&dev_attr_modalias.attr,
 	&dev_attr_size.attr,
 	&dev_attr_target_node.attr,
+	&dev_attr_node_distance.attr,
 	NULL,
 };
 
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index d271bd73..5a0f55c 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -18,6 +18,7 @@
 #include <linux/slab.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/node.h>
 #include "nd-core.h"
 #include "pfn.h"
 #include "nd.h"
@@ -271,6 +272,18 @@ static ssize_t supported_alignments_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(supported_alignments);
 
+static ssize_t node_distance_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	int nid = nd_region->target_node;
+
+	if (nid < 0)
+		return 0;
+	return generic_node_distance_show(nid, attr, buf);
+}
+static DEVICE_ATTR(node_distance, S_IRUGO, node_distance_show, NULL);
+
 static struct attribute *nd_pfn_attributes[] = {
 	&dev_attr_mode.attr,
 	&dev_attr_namespace.attr,
@@ -279,6 +292,7 @@ static ssize_t supported_alignments_show(struct device *dev,
 	&dev_attr_resource.attr,
 	&dev_attr_size.attr,
 	&dev_attr_supported_alignments.attr,
+	&dev_attr_node_distance.attr,
 	NULL,
 };
 
diff --git a/include/linux/node.h b/include/linux/node.h
index 1a557c5..949e7ed 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -150,6 +150,11 @@ extern int register_memory_node_under_compute_node(unsigned int mem_nid,
 extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
 					 node_registration_func_t unregister);
 #endif
+
+extern ssize_t generic_node_distance_show(int nid,
+					  struct device_attribute *attr,
+					  char *buf);
+
 #else
 static inline int __register_one_node(int nid)
 {
@@ -186,6 +191,13 @@ static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
 						node_registration_func_t unreg)
 {
 }
+
+static inline ssize_t generic_node_distance_show(int nid,
+						 struct device_attribute *attr,
+						 char *buf)
+{
+	return 0;
+}
 #endif
 
 #define to_node(device) container_of(device, struct node, dev)


_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: DAX numa_attribute vs SubNUMA clusters
  2019-04-16 15:31         ` Brice Goglin
@ 2019-04-17 21:35           ` Dan Williams
  2019-04-17 21:46             ` Brice Goglin
  0 siblings, 1 reply; 10+ messages in thread
From: Dan Williams @ 2019-04-17 21:35 UTC (permalink / raw)
  To: Brice Goglin; +Cc: linux-nvdimm

On Tue, Apr 16, 2019 at 8:31 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
>
> Le 08/04/2019 à 21:55, Brice Goglin a écrit :
>
> Le 08/04/2019 à 16:56, Dan Williams a écrit :
>
> Yes, I agree with all of the above, but I think we need a way to fix
> this independent of the HMAT data being present. The SLIT already
> tells the kernel enough to let tooling figure out equidistant "local"
> nodes. While the numa_node attribute will remain a singleton the
> tooling needs to handle this case and can't assume the HMAT data will
> be present.
>
> So you want to export the part of SLIT that is currently hidden from
> userspace because the corresponding nodes aren't registered?
>
> With the patch below, I get 17 17 28 28 in dax0.0/node_distance which
> means it's close to node0 and node1.
>
> The code is pretty much a duplicate of read_node_distance() in
> drivers/base/node.c. Not sure it's worth factoring out such small functions?
>
> The name "node_distance" (instead of "distance" for NUMA nodes) is also
> subject to discussion.
>
> Here's a better patch that exports the existing routine for showing
> node distances, and reuses it in dax/bus.c and nvdimm/pfn_devs.c:
>
> # cat /sys/class/block/pmem1/device/node_distance
> 28 28 17 17
> # cat /sys/bus/dax/devices/dax0.0/node_distance
> 17 17 28 28
>
> By the way, it also handles the case where the nd_region has no
> valid target_node (idea stolen from kmem.c).
>
> Are there other places where it'd be useful to export that attribute?
>
> Ideally we could just export it in the region sysfs directory,
> but I can't find backlinks going from daxX.Y or pmemZ to that
> region directory :/

I understand where you're trying to go, but this is too dax-device
specific. What about a storage controller in the topology that is
equidistant from multiple cpu nodes? I'd rather solve this from the
tooling perspective, by looking up cpu nodes that are equidistant to the
device's "numa_node".

I'd rather not teach the kernel to export this extra node_distance
information; instead, I'd teach numactl to consider equidistant cpu
nodes in its default node masks.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: DAX numa_attribute vs SubNUMA clusters
  2019-04-17 21:35           ` Dan Williams
@ 2019-04-17 21:46             ` Brice Goglin
  2019-04-18  0:13               ` Dan Williams
  0 siblings, 1 reply; 10+ messages in thread
From: Brice Goglin @ 2019-04-17 21:46 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-nvdimm


Le 17/04/2019 à 23:35, Dan Williams a écrit :
> On Tue, Apr 16, 2019 at 8:31 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>>
>> Le 08/04/2019 à 21:55, Brice Goglin a écrit :
>>
>> Le 08/04/2019 à 16:56, Dan Williams a écrit :
>>
>> Yes, I agree with all of the above, but I think we need a way to fix
>> this independent of the HMAT data being present. The SLIT already
>> tells the kernel enough to let tooling figure out equidistant "local"
>> nodes. While the numa_node attribute will remain a singleton the
>> tooling needs to handle this case and can't assume the HMAT data will
>> be present.
>>
>> So you want to export the part of SLIT that is currently hidden from
>> userspace because the corresponding nodes aren't registered?
>>
>> With the patch below, I get 17 17 28 28 in dax0.0/node_distance which
>> means it's close to node0 and node1.
>>
>> The code is pretty much a duplicate of read_node_distance() in
>> drivers/base/node.c. Not sure it's worth factoring out such small functions?
>>
>> The name "node_distance" (instead of "distance" for NUMA nodes) is also
>> subject to discussion.
>>
>> Here's a better patch that exports the existing routine for showing
>> node distances, and reuses it in dax/bus.c and nvdimm/pfn_devs.c:
>>
>> # cat /sys/class/block/pmem1/device/node_distance
>> 28 28 17 17
>> # cat /sys/bus/dax/devices/dax0.0/node_distance
>> 17 17 28 28
>>
>> By the way, it also handles the case where the nd_region has no
>> valid target_node (idea stolen from kmem.c).
>>
>> Are there other places where it'd be useful to export that attribute?
>>
>> Ideally we could just export it in the region sysfs directory,
>> but I can't find backlinks going from daxX.Y or pmemZ to that
>> region directory :/
> I understand where you're trying to go, but this is too dax-device
> specific. What about a storage controller in the topology that is
> equidistant from multiple cpu nodes? I'd rather solve this from the
> tooling perspective, by looking up cpu nodes that are equidistant to the
> device's "numa_node".


I don't see how you're going to look up those equidistant nodes. In the
above case, pmem1 numa_node is 2. Where do you want tools to find the
information that pmem1 is actually close to node2 AND node3?

That information is hidden in SLIT node5<->node2 and node5<->node3 but
these are not exposed to userspace tools since node5 isn't registered.

Brice


_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: DAX numa_attribute vs SubNUMA clusters
  2019-04-17 21:46             ` Brice Goglin
@ 2019-04-18  0:13               ` Dan Williams
  0 siblings, 0 replies; 10+ messages in thread
From: Dan Williams @ 2019-04-18  0:13 UTC (permalink / raw)
  To: Brice Goglin; +Cc: Dave Hansen, linux-nvdimm

[ add Keith and Dave for their thoughts ]

On Wed, Apr 17, 2019 at 2:46 PM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>
>
> Le 17/04/2019 à 23:35, Dan Williams a écrit :
> > On Tue, Apr 16, 2019 at 8:31 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
> >>
> >> Le 08/04/2019 à 21:55, Brice Goglin a écrit :
> >>
> >> Le 08/04/2019 à 16:56, Dan Williams a écrit :
> >>
> >> Yes, I agree with all of the above, but I think we need a way to fix
> >> this independent of the HMAT data being present. The SLIT already
> >> tells the kernel enough to let tooling figure out equidistant "local"
> >> nodes. While the numa_node attribute will remain a singleton the
> >> tooling needs to handle this case and can't assume the HMAT data will
> >> be present.
> >>
> >> So you want to export the part of SLIT that is currently hidden from
> >> userspace because the corresponding nodes aren't registered?
> >>
> >> With the patch below, I get 17 17 28 28 in dax0.0/node_distance which
> >> means it's close to node0 and node1.
> >>
> >> The code is pretty much a duplicate of read_node_distance() in
> >> drivers/base/node.c. Not sure it's worth factoring out such small functions?
> >>
> >> The name "node_distance" (instead of "distance" for NUMA nodes) is also
> >> subject to discussion.
> >>
> >> Here's a better patch that exports the existing routine for showing
> >> node distances, and reuses it in dax/bus.c and nvdimm/pfn_devs.c:
> >>
> >> # cat /sys/class/block/pmem1/device/node_distance
> >> 28 28 17 17
> >> # cat /sys/bus/dax/devices/dax0.0/node_distance
> >> 17 17 28 28
> >>
> >> By the way, it also handles the case where the nd_region has no
> >> valid target_node (idea stolen from kmem.c).
> >>
> >> Are there other places where it'd be useful to export that attribute?
> >>
> >> Ideally we could just export it in the region sysfs directory,
> >> but I can't find backlinks going from daxX.Y or pmemZ to that
> >> region directory :/
> > I understand where you're trying to go, but this is too dax-device
> > specific. What about a storage controller in the topology that is
> > equidistant from multiple cpu nodes? I'd rather solve this from the
> > tooling perspective, by looking up cpu nodes that are equidistant to the
> > device's "numa_node".
>
>
> I don't see how you're going to look up those equidistant nodes. In the
> above case, pmem1 numa_node is 2. Where do you want tools to find the
> information that pmem1 is actually close to node2 AND node3?

Yeah, I was indeed confusing proximity-domain and numa-node in my
thought process of what information userspace tools have readily
available, but I think a generic solution is still salvageable.

> That information is hidden in SLIT node5<->node2 and node5<->node3 but
> these are not exposed to userspace tools since node5 isn't registered.

I think the root problem is that the kernel allocates numa-nodes in
arch-specific code at the beginning of time, and the proximity-domain
information is not kept readily available, the expectation being that the
Linux numa node is sufficient.

Your node_distance attribute proposal solves this, but I find SLIT
data to be a bit magical and poorly specified, especially across
architectures.

What about just exporting the proximity domain information via an
opaque firmware-implementation-specific 'node_handle' attribute? The
node_handle could then be used to answer questions like: which numa-nodes
is this handle local to, beyond what the 'numa_node' attribute
indicates? What is the effective target-node for this node-handle? It
would also allow interrogating the next level of detail beyond what
CONFIG_HMEM_REPORTING exposes.
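
Purely hypothetical (no such attribute exists today), but on your box
that could look like:

$ cat /sys/bus/dax/devices/dax0.0/node_handle
4

...with "4" being the NVDIMM proximity domain rather than a Linux node
number.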
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-04-18  0:13 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-04 19:47 DAX numa_attribute vs SubNUMA clusters Brice Goglin
2019-04-05  3:29 ` Elliott, Robert (Servers)
2019-04-08  4:26 ` Dan Williams
2019-04-08  8:13   ` Brice Goglin
2019-04-08 14:56     ` Dan Williams
2019-04-08 19:55       ` Brice Goglin
2019-04-16 15:31         ` Brice Goglin
2019-04-17 21:35           ` Dan Williams
2019-04-17 21:46             ` Brice Goglin
2019-04-18  0:13               ` Dan Williams
