nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/1] fix pmem RAM device when nid is NUMA_NO_NODE
@ 2021-07-28  8:22 Jia He
  2021-07-28  8:22 ` [PATCH] device-dax: use fallback nid when numa_node is invalid Jia He
  0 siblings, 1 reply; 7+ messages in thread
From: Jia He @ 2021-07-28  8:22 UTC (permalink / raw)
  To: Dan Williams, Vishal Verma, Dave Jiang; +Cc: nvdimm, linux-kernel, nd, Jia He

Background
==========
I once sent out the preparatory patches [1] but I dropped the last patch
of using fallback nid with memory_add_physaddr_to_nid() due to some
dependency.

After phys_addr_to_target_node() and memory_add_physaddr_to_nid() are
stable now, it's time to fix the original bug on arm64 now.

Compared with the last version [2], this version rebases the patch to
latest v5.14-rc3 (s/kmem_start/range.start)

Test
====
Tested on ThunderX2 host/qemu "-M virt" guest with a nvdimm device. The 
memblocks from the dax pmem device can be either hot-added or hot-removed
on arm64 guest. Also passed the compilation test on x86.

[1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2228771.html
[2] https://lkml.org/lkml/2020/7/8/1546

Jia He (1):
  device-dax: use fallback nid when numa_node is invalid

 drivers/dax/kmem.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] device-dax: use fallback nid when numa_node is invalid
  2021-07-28  8:22 [PATCH 0/1] fix pmem RAM device when nid is NUMA_NO_NODE Jia He
@ 2021-07-28  8:22 ` Jia He
  2021-07-28 20:17   ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: Jia He @ 2021-07-28  8:22 UTC (permalink / raw)
  To: Dan Williams, Vishal Verma, Dave Jiang; +Cc: nvdimm, linux-kernel, nd, Jia He

Previously, numa_off was set unconditionally in dummy_numa_init()
even with a fake numa node. Then ACPI set node id as NUMA_NO_NODE(-1)
after acpi_map_pxm_to_node() because it regards numa_off as turning
off the numa node. Hence dev_dax->target_node is NUMA_NO_NODE on
arm64 with fake numa.

Without this patch, pmem can't be probed as a RAM device on arm64 if
SRAT table isn't present:
  $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
  kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with invalid node: -1
  kmem: probe of dax0.0 failed with error -22

This fixes it by using fallback memory_add_physaddr_to_nid() as nid.

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Jia He <justin.he@arm.com>
---
 drivers/dax/kmem.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index ac231cc36359..749674909e51 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -46,20 +46,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	struct dax_kmem_data *data;
 	int rc = -ENOMEM;
 	int i, mapped = 0;
-	int numa_node;
-
-	/*
-	 * Ensure good NUMA information for the persistent memory.
-	 * Without this check, there is a risk that slow memory
-	 * could be mixed in a node with faster memory, causing
-	 * unavoidable performance issues.
-	 */
-	numa_node = dev_dax->target_node;
-	if (numa_node < 0) {
-		dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
-				numa_node);
-		return -EINVAL;
-	}
+	int numa_node = dev_dax->target_node, new_node;
 
 	data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
 	if (!data)
@@ -104,6 +91,20 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		 */
 		res->flags = IORESOURCE_SYSTEM_RAM;
 
+		/*
+		 * Ensure good NUMA information for the persistent memory.
+		 * Without this check, there is a risk but not fatal that slow
+		 * memory could be mixed in a node with faster memory, causing
+		 * unavoidable performance issues. Furthermore, fallback node
+		 * id can be used when numa_node is invalid.
+		 */
+		if (numa_node < 0) {
+			new_node = memory_add_physaddr_to_nid(range.start);
+			dev_info(dev, "changing nid from %d to %d for DAX region %pR\n",
+				numa_node, new_node, res);
+			numa_node = new_node;
+		}
+
 		/*
 		 * Ensure that future kexec'd kernels will not treat
 		 * this as RAM automatically.
@@ -141,6 +142,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 	int i, success = 0;
 	struct device *dev = &dev_dax->dev;
 	struct dax_kmem_data *data = dev_get_drvdata(dev);
+	int numa_node = dev_dax->target_node;
 
 	/*
 	 * We have one shot for removing memory, if some memory blocks were not
@@ -156,8 +158,10 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 		if (rc)
 			continue;
 
-		rc = remove_memory(dev_dax->target_node, range.start,
-				range_len(&range));
+		if (numa_node < 0)
+			numa_node = memory_add_physaddr_to_nid(range.start);
+
+		rc = remove_memory(numa_node, range.start, range_len(&range));
 		if (rc == 0) {
 			release_resource(data->res[i]);
 			kfree(data->res[i]);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] device-dax: use fallback nid when numa_node is invalid
  2021-07-28  8:22 ` [PATCH] device-dax: use fallback nid when numa_node is invalid Jia He
@ 2021-07-28 20:17   ` David Hildenbrand
  2021-07-29  0:20     ` Justin He
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2021-07-28 20:17 UTC (permalink / raw)
  To: Jia He, Dan Williams, Vishal Verma, Dave Jiang; +Cc: nvdimm, linux-kernel, nd

On 28.07.21 10:22, Jia He wrote:
> Previously, numa_off was set unconditionally in dummy_numa_init()
> even with a fake numa node. Then ACPI set node id as NUMA_NO_NODE(-1)
> after acpi_map_pxm_to_node() because it regards numa_off as turning
> off the numa node. Hence dev_dax->target_node is NUMA_NO_NODE on
> arm64 with fake numa.
> 
> Without this patch, pmem can't be probed as a RAM device on arm64 if
> SRAT table isn't present:
>    $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
>    kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with invalid node: -1
>    kmem: probe of dax0.0 failed with error -22
> 
> This fixes it by using fallback memory_add_physaddr_to_nid() as nid.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Jia He <justin.he@arm.com>
> ---
>   drivers/dax/kmem.c | 36 ++++++++++++++++++++----------------
>   1 file changed, 20 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index ac231cc36359..749674909e51 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -46,20 +46,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>   	struct dax_kmem_data *data;
>   	int rc = -ENOMEM;
>   	int i, mapped = 0;
> -	int numa_node;
> -
> -	/*
> -	 * Ensure good NUMA information for the persistent memory.
> -	 * Without this check, there is a risk that slow memory
> -	 * could be mixed in a node with faster memory, causing
> -	 * unavoidable performance issues.
> -	 */
> -	numa_node = dev_dax->target_node;
> -	if (numa_node < 0) {
> -		dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
> -				numa_node);
> -		return -EINVAL;
> -	}
> +	int numa_node = dev_dax->target_node, new_node;
>   
>   	data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
>   	if (!data)
> @@ -104,6 +91,20 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>   		 */
>   		res->flags = IORESOURCE_SYSTEM_RAM;
>   
> +		/*
> +		 * Ensure good NUMA information for the persistent memory.
> +		 * Without this check, there is a risk but not fatal that slow
> +		 * memory could be mixed in a node with faster memory, causing
> +		 * unavoidable performance issues. Furthermore, fallback node
> +		 * id can be used when numa_node is invalid.
> +		 */
> +		if (numa_node < 0) {
> +			new_node = memory_add_physaddr_to_nid(range.start);
> +			dev_info(dev, "changing nid from %d to %d for DAX region %pR\n",
> +				numa_node, new_node, res);
> +			numa_node = new_node;
> +		}
> +
>   		/*
>   		 * Ensure that future kexec'd kernels will not treat
>   		 * this as RAM automatically.
> @@ -141,6 +142,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
>   	int i, success = 0;
>   	struct device *dev = &dev_dax->dev;
>   	struct dax_kmem_data *data = dev_get_drvdata(dev);
> +	int numa_node = dev_dax->target_node;
>   
>   	/*
>   	 * We have one shot for removing memory, if some memory blocks were not
> @@ -156,8 +158,10 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
>   		if (rc)
>   			continue;
>   
> -		rc = remove_memory(dev_dax->target_node, range.start,
> -				range_len(&range));
> +		if (numa_node < 0)
> +			numa_node = memory_add_physaddr_to_nid(range.start);
> +
> +		rc = remove_memory(numa_node, range.start, range_len(&range));
>   		if (rc == 0) {
>   			release_resource(data->res[i]);
>   			kfree(data->res[i]);
> 

Note that this patch conflicts with:

https://lkml.kernel.org/r/20210723125210.29987-7-david@redhat.com

But nothing fundamental. Determining a single NID is similar to how I'm 
handling it for ACPI:

https://lkml.kernel.org/r/20210723125210.29987-6-david@redhat.com

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] device-dax: use fallback nid when numa_node is invalid
  2021-07-28 20:17   ` David Hildenbrand
@ 2021-07-29  0:20     ` Justin He
  2021-07-29  7:59       ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: Justin He @ 2021-07-29  0:20 UTC (permalink / raw)
  To: David Hildenbrand, Dan Williams, Vishal Verma, Dave Jiang
  Cc: nvdimm, linux-kernel, nd

Hi David

> -----Original Message-----
> From: David Hildenbrand <david@redhat.com>
> Sent: Thursday, July 29, 2021 4:17 AM
> To: Justin He <Justin.He@arm.com>; Dan Williams <dan.j.williams@intel.com>;
> Vishal Verma <vishal.l.verma@intel.com>; Dave Jiang <dave.jiang@intel.com>
> Cc: nvdimm@lists.linux.dev; linux-kernel@vger.kernel.org; nd <nd@arm.com>
> Subject: Re: [PATCH] device-dax: use fallback nid when numa_node is invalid
> 
> On 28.07.21 10:22, Jia He wrote:
> > Previously, numa_off was set unconditionally in dummy_numa_init()
> > even with a fake numa node. Then ACPI set node id as NUMA_NO_NODE(-1)
> > after acpi_map_pxm_to_node() because it regards numa_off as turning
> > off the numa node. Hence dev_dax->target_node is NUMA_NO_NODE on
> > arm64 with fake numa.
> >
> > Without this patch, pmem can't be probed as a RAM device on arm64 if
> > SRAT table isn't present:
> >    $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g
> -a 64K
> >    kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with
> invalid node: -1
> >    kmem: probe of dax0.0 failed with error -22
> >
> > This fixes it by using fallback memory_add_physaddr_to_nid() as nid.
> >
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Jia He <justin.he@arm.com>
> > ---
> >   drivers/dax/kmem.c | 36 ++++++++++++++++++++----------------
> >   1 file changed, 20 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> > index ac231cc36359..749674909e51 100644
> > --- a/drivers/dax/kmem.c
> > +++ b/drivers/dax/kmem.c
> > @@ -46,20 +46,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> >   	struct dax_kmem_data *data;
> >   	int rc = -ENOMEM;
> >   	int i, mapped = 0;
> > -	int numa_node;
> > -
> > -	/*
> > -	 * Ensure good NUMA information for the persistent memory.
> > -	 * Without this check, there is a risk that slow memory
> > -	 * could be mixed in a node with faster memory, causing
> > -	 * unavoidable performance issues.
> > -	 */
> > -	numa_node = dev_dax->target_node;
> > -	if (numa_node < 0) {
> > -		dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
> > -				numa_node);
> > -		return -EINVAL;
> > -	}
> > +	int numa_node = dev_dax->target_node, new_node;
> >
> >   	data = kzalloc(struct_size(data, res, dev_dax->nr_range),
> GFP_KERNEL);
> >   	if (!data)
> > @@ -104,6 +91,20 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> >   		 */
> >   		res->flags = IORESOURCE_SYSTEM_RAM;
> >
> > +		/*
> > +		 * Ensure good NUMA information for the persistent memory.
> > +		 * Without this check, there is a risk but not fatal that slow
> > +		 * memory could be mixed in a node with faster memory, causing
> > +		 * unavoidable performance issues. Furthermore, fallback node
> > +		 * id can be used when numa_node is invalid.
> > +		 */
> > +		if (numa_node < 0) {
> > +			new_node = memory_add_physaddr_to_nid(range.start);
> > +			dev_info(dev, "changing nid from %d to %d for DAX
> region %pR\n",
> > +				numa_node, new_node, res);
> > +			numa_node = new_node;
> > +		}
> > +
> >   		/*
> >   		 * Ensure that future kexec'd kernels will not treat
> >   		 * this as RAM automatically.
> > @@ -141,6 +142,7 @@ static void dev_dax_kmem_remove(struct dev_dax
> *dev_dax)
> >   	int i, success = 0;
> >   	struct device *dev = &dev_dax->dev;
> >   	struct dax_kmem_data *data = dev_get_drvdata(dev);
> > +	int numa_node = dev_dax->target_node;
> >
> >   	/*
> >   	 * We have one shot for removing memory, if some memory blocks were
> not
> > @@ -156,8 +158,10 @@ static void dev_dax_kmem_remove(struct dev_dax
> *dev_dax)
> >   		if (rc)
> >   			continue;
> >
> > -		rc = remove_memory(dev_dax->target_node, range.start,
> > -				range_len(&range));
> > +		if (numa_node < 0)
> > +			numa_node = memory_add_physaddr_to_nid(range.start);
> > +
> > +		rc = remove_memory(numa_node, range.start, range_len(&range));
> >   		if (rc == 0) {
> >   			release_resource(data->res[i]);
> >   			kfree(data->res[i]);
> >
> 
> Note that this patch conflicts with:
> 
> https://lkml.kernel.org/r/20210723125210.29987-7-david@redhat.com
> 
> But nothing fundamental. Determining a single NID is similar to how I'm
> handling it for ACPI:
> 
> https://lkml.kernel.org/r/20210723125210.29987-6-david@redhat.com
> 

Okay, got it. Thanks for the reminder.
Seems my patch is not useful after your patch.


--
Cheers,
Justin (Jia He)



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] device-dax: use fallback nid when numa_node is invalid
  2021-07-29  0:20     ` Justin He
@ 2021-07-29  7:59       ` David Hildenbrand
  2021-07-29 14:44         ` Justin He
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2021-07-29  7:59 UTC (permalink / raw)
  To: Justin He, Dan Williams, Vishal Verma, Dave Jiang
  Cc: nvdimm, linux-kernel, nd

Hi Justin,

>>>
>>
>> Note that this patch conflicts with:
>>
>> https://lkml.kernel.org/r/20210723125210.29987-7-david@redhat.com
>>
>> But nothing fundamental. Determining a single NID is similar to how I'm
>> handling it for ACPI:
>>
>> https://lkml.kernel.org/r/20210723125210.29987-6-david@redhat.com
>>
> 
> Okay, got it. Thanks for the reminder.
> Seems my patch is not useful after your patch.
> 

I think your patch still makes sense. With

https://lore.kernel.org/linux-acpi/20210723125210.29987-7-david@redhat.com/

We'd have to detect the node id in the first loop instead.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] device-dax: use fallback nid when numa_node is invalid
  2021-07-29  7:59       ` David Hildenbrand
@ 2021-07-29 14:44         ` Justin He
  2021-08-02 15:47           ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: Justin He @ 2021-07-29 14:44 UTC (permalink / raw)
  To: David Hildenbrand, Dan Williams, Vishal Verma, Dave Jiang
  Cc: nvdimm, linux-kernel, nd

Hi David

> -----Original Message-----
> From: David Hildenbrand <david@redhat.com>
> Sent: Thursday, July 29, 2021 3:59 PM
> To: Justin He <Justin.He@arm.com>; Dan Williams <dan.j.williams@intel.com>;
> Vishal Verma <vishal.l.verma@intel.com>; Dave Jiang <dave.jiang@intel.com>
> Cc: nvdimm@lists.linux.dev; linux-kernel@vger.kernel.org; nd <nd@arm.com>
> Subject: Re: [PATCH] device-dax: use fallback nid when numa_node is
> invalid
> 
> Hi Justin,
> 
> >>>
> >>
> >> Note that this patch conflicts with:
> >>
> >> https://lkml.kernel.org/r/20210723125210.29987-7-david@redhat.com
> >>
> >> But nothing fundamental. Determining a single NID is similar to how I'm
> >> handling it for ACPI:
> >>
> >> https://lkml.kernel.org/r/20210723125210.29987-6-david@redhat.com
> >>
> >
> > Okay, got it. Thanks for the reminder.
> > Seems my patch is not useful after your patch.
> >
> 
> I think your patch still makes sense. With
> 
> https://lore.kernel.org/linux-acpi/20210723125210.29987-7-
> david@redhat.com/
> 
> We'd have to detect the node id in the first loop instead.

Ok, I got your point. I will do that in v2.

Btw, sorry for commenting there about your patch 06 since I didn't
subscribe lkml via this mailbox.

+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct range range;
+
+		rc = dax_kmem_range(dev_dax, i, &range);
+		if (rc) {
+			dev_info(dev, "mapping%d: %#llx-%#llx too small after alignment\n",
+					i, range.start, range.end);
+			continue;
+		}
+		total_len += range_len(&range);
+	}
You add an additional loop to get the total_len.
I wonder is it independent on 2nd loop?
If yes, why not merge the 2 loops into one?
Sorry if this question is too simple, I don't know too much
about the background of your patch.


--
Cheers,
Justin (Jia He)



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] device-dax: use fallback nid when numa_node is invalid
  2021-07-29 14:44         ` Justin He
@ 2021-08-02 15:47           ` David Hildenbrand
  0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand @ 2021-08-02 15:47 UTC (permalink / raw)
  To: Justin He, Dan Williams, Vishal Verma, Dave Jiang
  Cc: nvdimm, linux-kernel, nd

On 29.07.21 16:44, Justin He wrote:
> Hi David
> 
>> -----Original Message-----
>> From: David Hildenbrand <david@redhat.com>
>> Sent: Thursday, July 29, 2021 3:59 PM
>> To: Justin He <Justin.He@arm.com>; Dan Williams <dan.j.williams@intel.com>;
>> Vishal Verma <vishal.l.verma@intel.com>; Dave Jiang <dave.jiang@intel.com>
>> Cc: nvdimm@lists.linux.dev; linux-kernel@vger.kernel.org; nd <nd@arm.com>
>> Subject: Re: [PATCH] device-dax: use fallback nid when numa_node is
>> invalid
>>
>> Hi Justin,
>>
>>>>>
>>>>
>>>> Note that this patch conflicts with:
>>>>
>>>> https://lkml.kernel.org/r/20210723125210.29987-7-david@redhat.com
>>>>
>>>> But nothing fundamental. Determining a single NID is similar to how I'm
>>>> handling it for ACPI:
>>>>
>>>> https://lkml.kernel.org/r/20210723125210.29987-6-david@redhat.com
>>>>
>>>
>>> Okay, got it. Thanks for the reminder.
>>> Seems my patch is not useful after your patch.
>>>
>>
>> I think your patch still makes sense. With
>>
>> https://lore.kernel.org/linux-acpi/20210723125210.29987-7-
>> david@redhat.com/
>>
>> We'd have to detect the node id in the first loop instead.
> 
> Ok, I got your point. I will do that in v2.
> 
> Btw, sorry for commenting there about your patch 06 since I didn't
> subscribe lkml via this mailbox.

Sure, you really should subscribe :)

> > +	for (i = 0; i < dev_dax->nr_range; i++) {
> +		struct range range;
> +
> +		rc = dax_kmem_range(dev_dax, i, &range);
> +		if (rc) {
> +			dev_info(dev, "mapping%d: %#llx-%#llx too small after alignment\n",
> +					i, range.start, range.end);
> +			continue;
> +		}
> +		total_len += range_len(&range);
> +	}
> You add an additional loop to get the total_len.
> I wonder is it independent on 2nd loop?
> If yes, why not merge the 2 loops into one?
> Sorry if this question is too simple, I don't know too much
> about the background of your patch.

We need total_len to register the memory group. We need the memory group 
to add memory. Therefore, we need a second loop to calculate total_len 
upfront.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-08-02 15:47 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-28  8:22 [PATCH 0/1] fix pmem RAM device when nid is NUMA_NO_NODE Jia He
2021-07-28  8:22 ` [PATCH] device-dax: use fallback nid when numa_node is invalid Jia He
2021-07-28 20:17   ` David Hildenbrand
2021-07-29  0:20     ` Justin He
2021-07-29  7:59       ` David Hildenbrand
2021-07-29 14:44         ` Justin He
2021-08-02 15:47           ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).