linux-mm.kvack.org archive mirror
* [PATCH v3] mm/memory_hotplug: refrain from adding memory into an impossible node
@ 2020-04-14 23:58 Vishal Verma
  2020-04-15  7:39 ` David Hildenbrand
  2020-04-15 10:43 ` Michal Hocko
  0 siblings, 2 replies; 9+ messages in thread
From: Vishal Verma @ 2020-04-14 23:58 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-nvdimm, Vishal Verma, David Hildenbrand, Dan Williams, Dave Hansen

A misbehaving qemu created a situation where the ACPI SRAT table
advertised one fewer proximity domains than intended. The NFIT table did
describe all the expected proximity domains. This caused the device dax
driver to assign an impossible target_node to the device, and when
hotplugged as system memory, this would fail with the following
signature:

  [  +0.001627] BUG: kernel NULL pointer dereference, address: 0000000000000088
  [  +0.001331] #PF: supervisor read access in kernel mode
  [  +0.000975] #PF: error_code(0x0000) - not-present page
  [  +0.000976] PGD 80000001767d4067 P4D 80000001767d4067 PUD 10e0c4067 PMD 0
  [  +0.001338] Oops: 0000 [#1] SMP PTI
  [  +0.000676] CPU: 4 PID: 22737 Comm: kswapd3 Tainted: G           O      5.6.0-rc5 #9
  [  +0.001457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  [  +0.001990] RIP: 0010:prepare_kswapd_sleep+0x7c/0xc0
  [  +0.000780] Code: 89 df e8 87 fd ff ff 89 c2 31 c0 84 d2 74 e6 0f 1f 44
                      00 00 48 8b 05 fb af 7a 01 48 63 93 88 1d 01 00 48 8b
		      84 d0 20 0f 00 00 <48> 3b 98 88 00 00 00 75 28 f0 80 a0
		      80 00 00 00 fe f0 80 a3 38 20
  [  +0.002877] RSP: 0018:ffffc900017a3e78 EFLAGS: 00010202
  [  +0.000805] RAX: 0000000000000000 RBX: ffff8881209e0000 RCX: 0000000000000000
  [  +0.001115] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881209e0e80
  [  +0.001098] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000008000
  [  +0.001092] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000003
  [  +0.001092] R13: 0000000000000003 R14: 0000000000000000 R15: ffffc900017a3ec8
  [  +0.001091] FS:  0000000000000000(0000) GS:ffff888318c00000(0000) knlGS:0000000000000000
  [  +0.001275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  +0.000882] CR2: 0000000000000088 CR3: 0000000120b50002 CR4: 00000000001606e0
  [  +0.001095] Call Trace:
  [  +0.000388]  kswapd+0x103/0x520
  [  +0.000494]  ? finish_wait+0x80/0x80
  [  +0.000547]  ? balance_pgdat+0x5a0/0x5a0
  [  +0.000607]  kthread+0x120/0x140
  [  +0.000508]  ? kthread_create_on_node+0x60/0x60
  [  +0.000706]  ret_from_fork+0x3a/0x50

Add a check in the add_memory path to ensure that the node to which we
are adding memory is in the node_possible_map.

Cc: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 mm/memory_hotplug.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

v2:
- Centralize the check in the add_memory path (David)
- Instead of failing, add the memory to a nearby node, while warning
  (and tainting) to call out attention to the firmware bug (Dan)

v3:
- Fix the CONFIG_NUMA=n case, and use node 0 as the final fallback (Dan)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0a54ffac8c68..536a809d6ebb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -980,6 +980,30 @@ static int check_hotplug_memory_range(u64 start, u64 size)
 	return 0;
 }
 
+/*
+ * Check that the node provided for adding memory was valid.
+ * If not, find the nearest valid node and add the memory there while
+ * tainting the kernel and displaying a warning to bring attention to the
+ * underlying firmware problem.
+ * Return nid if valid, or an adjusted node number that can be used instead
+ * if the original nid was not valid
+ */
+static int check_hotplug_node(int nid)
+{
+	int alt_nid;
+
+	if (node_possible(nid))
+		return nid;
+
+	alt_nid = numa_map_to_online_node(nid);
+	if (alt_nid == NUMA_NO_NODE)
+		alt_nid = first_online_node;
+	WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
+		   "node %d expected, but was absent from the node_possible_map, using %d instead\n",
+		   nid, alt_nid);
+	return alt_nid;
+}
+
 static int online_memory_block(struct memory_block *mem, void *arg)
 {
 	return device_online(&mem->dev);
@@ -1005,6 +1029,10 @@ int __ref add_memory_resource(int nid, struct resource *res)
 	if (ret)
 		return ret;
 
+	nid = check_hotplug_node(nid);
+	if (nid < 0)
+		return -ENXIO;
+
 	mem_hotplug_begin();
 
 	/*
-- 
2.21.1
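
Editorial sketch: the fallback order in check_hotplug_node() above can be
tried out in user space. This is a minimal model, not kernel code. Every
model_* name, the two node masks, and the use of node-id difference as a
"distance" are assumptions made for illustration; the kernel helpers
themselves are not reproduced here.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

#define MODEL_MAX_NODES 8
#define MODEL_NO_NODE   (-1)	/* stands in for NUMA_NO_NODE */

/* Hypothetical masks: nodes 0-2 are possible, only 0 and 2 are online. */
static const bool model_possible[MODEL_MAX_NODES] = { true, true, true };
static const bool model_online[MODEL_MAX_NODES]   = { true, false, true };

static bool model_node_possible(int nid)
{
	return nid >= 0 && nid < MODEL_MAX_NODES && model_possible[nid];
}

/* Nearest online node by node-id difference (an assumed distance metric). */
static int model_map_to_online_node(int nid)
{
	int best = MODEL_NO_NODE, best_dist = INT_MAX;

	if (nid < 0)
		return MODEL_NO_NODE;
	for (int n = 0; n < MODEL_MAX_NODES; n++) {
		int dist = nid > n ? nid - n : n - nid;

		if (model_online[n] && dist < best_dist) {
			best = n;
			best_dist = dist;
		}
	}
	return best;
}

/* Lowest-numbered online node; 0 is the model's last-resort answer. */
static int model_first_online_node(void)
{
	for (int n = 0; n < MODEL_MAX_NODES; n++)
		if (model_online[n])
			return n;
	return 0;
}

/* Same order of checks as the patch: possible, then nearest, then first. */
static int model_check_hotplug_node(int nid)
{
	int alt_nid;

	if (model_node_possible(nid))
		return nid;

	alt_nid = model_map_to_online_node(nid);
	if (alt_nid == MODEL_NO_NODE)
		alt_nid = model_first_online_node();
	fprintf(stderr, "node %d not possible, using %d instead\n",
		nid, alt_nid);
	return alt_nid;
}
```

With these masks, a possible node is returned unchanged even when it is
offline, an impossible node 3 is remapped to node 2, and a negative nid
falls through to the first online node.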




* Re: [PATCH v3] mm/memory_hotplug: refrain from adding memory into an impossible node
  2020-04-14 23:58 [PATCH v3] mm/memory_hotplug: refrain from adding memory into an impossible node Vishal Verma
@ 2020-04-15  7:39 ` David Hildenbrand
  2020-04-15  7:44   ` David Hildenbrand
  2020-04-15 10:43 ` Michal Hocko
  1 sibling, 1 reply; 9+ messages in thread
From: David Hildenbrand @ 2020-04-15  7:39 UTC (permalink / raw)
  To: Vishal Verma, linux-mm; +Cc: linux-nvdimm, Dan Williams, Dave Hansen

On 15.04.20 01:58, Vishal Verma wrote:
> A misbehaving qemu created a situation where the ACPI SRAT table
> advertised one fewer proximity domains than intended. The NFIT table did
> describe all the expected proximity domains. This caused the device dax
> driver to assign an impossible target_node to the device, and when
> hotplugged as system memory, this would fail with the following
> signature:
> 
>   [  +0.001627] BUG: kernel NULL pointer dereference, address: 0000000000000088
>   [  +0.001331] #PF: supervisor read access in kernel mode
>   [  +0.000975] #PF: error_code(0x0000) - not-present page
>   [  +0.000976] PGD 80000001767d4067 P4D 80000001767d4067 PUD 10e0c4067 PMD 0
>   [  +0.001338] Oops: 0000 [#1] SMP PTI
>   [  +0.000676] CPU: 4 PID: 22737 Comm: kswapd3 Tainted: G           O      5.6.0-rc5 #9
>   [  +0.001457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>       BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
>   [  +0.001990] RIP: 0010:prepare_kswapd_sleep+0x7c/0xc0
>   [  +0.000780] Code: 89 df e8 87 fd ff ff 89 c2 31 c0 84 d2 74 e6 0f 1f 44
>                       00 00 48 8b 05 fb af 7a 01 48 63 93 88 1d 01 00 48 8b
> 		      84 d0 20 0f 00 00 <48> 3b 98 88 00 00 00 75 28 f0 80 a0
> 		      80 00 00 00 fe f0 80 a3 38 20
>   [  +0.002877] RSP: 0018:ffffc900017a3e78 EFLAGS: 00010202
>   [  +0.000805] RAX: 0000000000000000 RBX: ffff8881209e0000 RCX: 0000000000000000
>   [  +0.001115] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881209e0e80
>   [  +0.001098] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000008000
>   [  +0.001092] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000003
>   [  +0.001092] R13: 0000000000000003 R14: 0000000000000000 R15: ffffc900017a3ec8
>   [  +0.001091] FS:  0000000000000000(0000) GS:ffff888318c00000(0000) knlGS:0000000000000000
>   [  +0.001275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   [  +0.000882] CR2: 0000000000000088 CR3: 0000000120b50002 CR4: 00000000001606e0
>   [  +0.001095] Call Trace:
>   [  +0.000388]  kswapd+0x103/0x520
>   [  +0.000494]  ? finish_wait+0x80/0x80
>   [  +0.000547]  ? balance_pgdat+0x5a0/0x5a0
>   [  +0.000607]  kthread+0x120/0x140
>   [  +0.000508]  ? kthread_create_on_node+0x60/0x60
>   [  +0.000706]  ret_from_fork+0x3a/0x50
> 
> Add a check in the add_memory path to ensure that the node to which we
> are adding memory is in the node_possible_map
> 
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  mm/memory_hotplug.c | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> v2:
> - Centralize the check in the add_memory path (David)
> - Instead of failing, add the memory to a nearby node, while warning
>   (and tainting) to call out attention to the firmware bug (Dan)
> 
> v3:
> - Fix the CONFIG_NUMA=n case, and use node 0 as the final fallback (Dan)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0a54ffac8c68..536a809d6ebb 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -980,6 +980,30 @@ static int check_hotplug_memory_range(u64 start, u64 size)
>  	return 0;
>  }
>  
> +/*
> + * Check that the node provided for adding memory was valid.
> + * If not, find the nearest valid node and add the memory there while
> + * tainting the kernel and displaying a warning to bring attention to the
> + * underlying firmware problem.
> + * Return nid if valid, or an adjusted node number that can be used instead
> + * if the original nid was not valid
> + */

-ETOOMUCHDOCUMENTATION

"If the given node cannot be used (!node_possible()), return the nearest
possible node and WARN_TAINT() about firmware issues."

> +static int check_hotplug_node(int nid)
> +{
> +	int alt_nid;
> +
> +	if (node_possible(nid))
> +		return nid;
> +
> +	alt_nid = numa_map_to_online_node(nid);
> +	if (alt_nid == NUMA_NO_NODE)
> +		alt_nid = first_online_node;
> +	WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
> +		   "node %d expected, but was absent from the node_possible_map, using %d instead\n",
> +		   nid, alt_nid);
> +	return alt_nid;
> +}
> +
>  static int online_memory_block(struct memory_block *mem, void *arg)
>  {
>  	return device_online(&mem->dev);
> @@ -1005,6 +1029,10 @@ int __ref add_memory_resource(int nid, struct resource *res)
>  	if (ret)
>  		return ret;
>  
> +	nid = check_hotplug_node(nid);
> +	if (nid < 0)
> +		return -ENXIO;
> +
>  	mem_hotplug_begin();
>  
>  	/*
> 

You should do the same on the memory removal path.

-- 
Thanks,

David / dhildenb




* Re: [PATCH v3] mm/memory_hotplug: refrain from adding memory into an impossible node
  2020-04-15  7:39 ` David Hildenbrand
@ 2020-04-15  7:44   ` David Hildenbrand
  0 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand @ 2020-04-15  7:44 UTC (permalink / raw)
  To: Vishal Verma, linux-mm; +Cc: linux-nvdimm, Dan Williams, Dave Hansen

On 15.04.20 09:39, David Hildenbrand wrote:
> On 15.04.20 01:58, Vishal Verma wrote:
>> A misbehaving qemu created a situation where the ACPI SRAT table
>> advertised one fewer proximity domains than intended. The NFIT table did
>> describe all the expected proximity domains. This caused the device dax
>> driver to assign an impossible target_node to the device, and when
>> hotplugged as system memory, this would fail with the following
>> signature:
>>
>>   [  +0.001627] BUG: kernel NULL pointer dereference, address: 0000000000000088
>>   [  +0.001331] #PF: supervisor read access in kernel mode
>>   [  +0.000975] #PF: error_code(0x0000) - not-present page
>>   [  +0.000976] PGD 80000001767d4067 P4D 80000001767d4067 PUD 10e0c4067 PMD 0
>>   [  +0.001338] Oops: 0000 [#1] SMP PTI
>>   [  +0.000676] CPU: 4 PID: 22737 Comm: kswapd3 Tainted: G           O      5.6.0-rc5 #9
>>   [  +0.001457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>       BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
>>   [  +0.001990] RIP: 0010:prepare_kswapd_sleep+0x7c/0xc0
>>   [  +0.000780] Code: 89 df e8 87 fd ff ff 89 c2 31 c0 84 d2 74 e6 0f 1f 44
>>                       00 00 48 8b 05 fb af 7a 01 48 63 93 88 1d 01 00 48 8b
>> 		      84 d0 20 0f 00 00 <48> 3b 98 88 00 00 00 75 28 f0 80 a0
>> 		      80 00 00 00 fe f0 80 a3 38 20
>>   [  +0.002877] RSP: 0018:ffffc900017a3e78 EFLAGS: 00010202
>>   [  +0.000805] RAX: 0000000000000000 RBX: ffff8881209e0000 RCX: 0000000000000000
>>   [  +0.001115] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881209e0e80
>>   [  +0.001098] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000008000
>>   [  +0.001092] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000003
>>   [  +0.001092] R13: 0000000000000003 R14: 0000000000000000 R15: ffffc900017a3ec8
>>   [  +0.001091] FS:  0000000000000000(0000) GS:ffff888318c00000(0000) knlGS:0000000000000000
>>   [  +0.001275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   [  +0.000882] CR2: 0000000000000088 CR3: 0000000120b50002 CR4: 00000000001606e0
>>   [  +0.001095] Call Trace:
>>   [  +0.000388]  kswapd+0x103/0x520
>>   [  +0.000494]  ? finish_wait+0x80/0x80
>>   [  +0.000547]  ? balance_pgdat+0x5a0/0x5a0
>>   [  +0.000607]  kthread+0x120/0x140
>>   [  +0.000508]  ? kthread_create_on_node+0x60/0x60
>>   [  +0.000706]  ret_from_fork+0x3a/0x50
>>
>> Add a check in the add_memory path to ensure that the node to which we
>> are adding memory is in the node_possible_map
>>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
>> ---
>>  mm/memory_hotplug.c | 28 ++++++++++++++++++++++++++++
>>  1 file changed, 28 insertions(+)
>>
>> v2:
>> - Centralize the check in the add_memory path (David)
>> - Instead of failing, add the memory to a nearby node, while warning
>>   (and tainting) to call out attention to the firmware bug (Dan)
>>
>> v3:
>> - Fix the CONFIG_NUMA=n case, and use node 0 as the final fallback (Dan)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 0a54ffac8c68..536a809d6ebb 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -980,6 +980,30 @@ static int check_hotplug_memory_range(u64 start, u64 size)
>>  	return 0;
>>  }
>>  
>> +/*
>> + * Check that the node provided for adding memory was valid.
>> + * If not, find the nearest valid node and add the memory there while
>> + * tainting the kernel and displaying a warning to bring attention to the
>> + * underlying firmware problem.
>> + * Return nid if valid, or an adjusted node number that can be used instead
>> + * if the original nid was not valid
>> + */
> 
> -ETOOMUCHDOCUMENTATION
> 
> "If the given node cannot be used (!node_possible()), return the nearest
> possible node and WARN_TAINT() about firmware issues."
> 
>> +static int check_hotplug_node(int nid)
>> +{
>> +	int alt_nid;
>> +
>> +	if (node_possible(nid))
>> +		return nid;
>> +
>> +	alt_nid = numa_map_to_online_node(nid);
>> +	if (alt_nid == NUMA_NO_NODE)
>> +		alt_nid = first_online_node;
>> +	WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
>> +		   "node %d expected, but was absent from the node_possible_map, using %d instead\n",
>> +		   nid, alt_nid);
>> +	return alt_nid;
>> +}
>> +
>>  static int online_memory_block(struct memory_block *mem, void *arg)
>>  {
>>  	return device_online(&mem->dev);
>> @@ -1005,6 +1029,10 @@ int __ref add_memory_resource(int nid, struct resource *res)
>>  	if (ret)
>>  		return ret;
>>  
>> +	nid = check_hotplug_node(nid);
>> +	if (nid < 0)
>> +		return -ENXIO;
>> +
>>  	mem_hotplug_begin();
>>  
>>  	/*
>>
> 
> You should do the same on the memory removal path.

(I do wonder whether the result of
numa_map_to_online_node()/first_online_node could be different on the
removal path, and whether we should bail out when removing instead. That
sounds better to me; adding memory is more important in this case.)
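
Editorial sketch of the concern above: a "nearest online node" lookup
depends on which nodes are online when it runs, so the same impossible
nid can map to different nodes at add time and at removal time. The
model_* helpers, the masks, and the node-id-difference distance are
hypothetical and only illustrate the idea.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

#define MODEL_MAX_NODES 4

/* Nearest online node by node-id difference; -1 if none is online. */
static int model_nearest_online(int nid, const bool online[MODEL_MAX_NODES])
{
	int best = -1, best_dist = INT_MAX;

	for (int n = 0; n < MODEL_MAX_NODES; n++) {
		int dist = nid > n ? nid - n : n - nid;

		if (online[n] && dist < best_dist) {
			best = n;
			best_dist = dist;
		}
	}
	return best;
}

/* Removal-side check that refuses instead of guessing a replacement. */
static int model_remove_check(int nid, const bool possible[MODEL_MAX_NODES])
{
	if (nid < 0 || nid >= MODEL_MAX_NODES || !possible[nid])
		return -ENXIO;
	return nid;
}
```

If only node 0 is online when the memory is added, nid 3 maps to node 0.
If node 2 comes online before removal, the same lookup now yields node 2,
which is not where the memory went. Bailing out on removal avoids
depending on that lookup being stable.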

-- 
Thanks,

David / dhildenb




* Re: [PATCH v3] mm/memory_hotplug: refrain from adding memory into an impossible node
  2020-04-14 23:58 [PATCH v3] mm/memory_hotplug: refrain from adding memory into an impossible node Vishal Verma
  2020-04-15  7:39 ` David Hildenbrand
@ 2020-04-15 10:43 ` Michal Hocko
  2020-04-15 20:32   ` Verma, Vishal L
  1 sibling, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2020-04-15 10:43 UTC (permalink / raw)
  To: Vishal Verma
  Cc: linux-mm, linux-nvdimm, David Hildenbrand, Dan Williams, Dave Hansen

On Tue 14-04-20 17:58:12, Vishal Verma wrote:
[...]
> +static int check_hotplug_node(int nid)
> +{
> +	int alt_nid;
> +
> +	if (node_possible(nid))
> +		return nid;
> +
> +	alt_nid = numa_map_to_online_node(nid);
> +	if (alt_nid == NUMA_NO_NODE)
> +		alt_nid = first_online_node;
> +	WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
> +		   "node %d expected, but was absent from the node_possible_map, using %d instead\n",
> +		   nid, alt_nid);

I really do not like this. Why should we try to be clever and change the
node id requested by the caller? I would just stick with node_possible
check and be done with this.

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v3] mm/memory_hotplug: refrain from adding memory into an impossible node
  2020-04-15 10:43 ` Michal Hocko
@ 2020-04-15 20:32   ` Verma, Vishal L
  2020-04-16  6:19     ` Michal Hocko
  0 siblings, 1 reply; 9+ messages in thread
From: Verma, Vishal L @ 2020-04-15 20:32 UTC (permalink / raw)
  To: mhocko; +Cc: Williams, Dan J, linux-mm, linux-nvdimm, david, dave.hansen

On Wed, 2020-04-15 at 12:43 +0200, Michal Hocko wrote:
> On Tue 14-04-20 17:58:12, Vishal Verma wrote:
> [...]
> > +static int check_hotplug_node(int nid)
> > +{
> > +	int alt_nid;
> > +
> > +	if (node_possible(nid))
> > +		return nid;
> > +
> > +	alt_nid = numa_map_to_online_node(nid);
> > +	if (alt_nid == NUMA_NO_NODE)
> > +		alt_nid = first_online_node;
> > +	WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
> > +		   "node %d expected, but was absent from the node_possible_map, using %d instead\n",
> > +		   nid, alt_nid);
> 
> I really do not like this. Why should we try to be clever and change the
> node id requested by the caller? I would just stick with node_possible
> check and be done with this.

Hi Michal,

Being clever allows us to still use the memory even if it is in a
non-optimal configuration. Failing here leaves the user no path to add
this memory until the firmware is fixed. It is a tradeoff between
usability and how loud we want to be about the failure.


* Re: [PATCH v3] mm/memory_hotplug: refrain from adding memory into an impossible node
  2020-04-15 20:32   ` Verma, Vishal L
@ 2020-04-16  6:19     ` Michal Hocko
  2020-04-16 16:13       ` Verma, Vishal L
  0 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2020-04-16  6:19 UTC (permalink / raw)
  To: Verma, Vishal L
  Cc: Williams, Dan J, linux-mm, linux-nvdimm, david, dave.hansen

On Wed 15-04-20 20:32:00, Verma, Vishal L wrote:
> On Wed, 2020-04-15 at 12:43 +0200, Michal Hocko wrote:
> > On Tue 14-04-20 17:58:12, Vishal Verma wrote:
> > [...]
> > > +static int check_hotplug_node(int nid)
> > > +{
> > > +	int alt_nid;
> > > +
> > > +	if (node_possible(nid))
> > > +		return nid;
> > > +
> > > +	alt_nid = numa_map_to_online_node(nid);
> > > +	if (alt_nid == NUMA_NO_NODE)
> > > +		alt_nid = first_online_node;
> > > +	WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
> > > +		   "node %d expected, but was absent from the node_possible_map, using %d instead\n",
> > > +		   nid, alt_nid);
> > 
> > I really do not like this. Why should we try to be clever and change the
> > node id requested by the caller? I would just stick with node_possible
> > check and be done with this.
> 
> Hi Michal,
> 
> Being clever allows us to still use the memory even if it is in a non-
> optimal configuration. Failing here leaves the user no path to add this
> memory until the firmware is fixed. It is the tradeoff between some
> usability vs. how loud we want to be for the failure.

Doing that papers over something that is clearly a FW issue and turns
it into a "my performance is suboptimal, deal with it" OS problem. Really,
is this something we have to care about? Your changelog talks about a Qemu
misconfiguration which is trivial to fix. Has this ever been observed
with real HW?

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v3] mm/memory_hotplug: refrain from adding memory into an impossible node
  2020-04-16  6:19     ` Michal Hocko
@ 2020-04-16 16:13       ` Verma, Vishal L
  2020-04-16 16:16         ` David Hildenbrand
  0 siblings, 1 reply; 9+ messages in thread
From: Verma, Vishal L @ 2020-04-16 16:13 UTC (permalink / raw)
  To: mhocko; +Cc: Williams, Dan J, linux-mm, linux-nvdimm, david, dave.hansen

On Thu, 2020-04-16 at 08:19 +0200, Michal Hocko wrote:
> On Wed 15-04-20 20:32:00, Verma, Vishal L wrote:
> > > 
> > > I really do not like this. Why should we try to be clever and change the
> > > node id requested by the caller? I would just stick with node_possible
> > > check and be done with this.
> > 
> > Hi Michal,
> > 
> > Being clever allows us to still use the memory even if it is in a non-
> > optimal configuration. Failing here leaves the user no path to add this
> > memory until the firmware is fixed. It is the tradeoff between some
> > usability vs. how loud we want to be for the failure.
> 
> Doing that papers over something that is clearly a FW issue and turns
> it into a "my performance is suboptimal, deal with it" OS problem. Really,
> is this something we have to care about? Your changelog talks about a Qemu
> misconfiguration which is trivial to fix. Has this ever been observed
> with real HW?
> 
Well - more of a qemu bug I think - I can share the details, but it just
looked like it was producing a bogus SRAT. I think it is plausible that
such a firmware bug can happen out in the wild. The NFIT tables would
just need to reference a 'proximity domain' that the SRAT hasn't
previously described, and hotplug will happily go add memory from the
NFIT and the backing node related data structures would be missing.
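
Editorial sketch of the failure mode described above: if per-node data is
only set up for nodes the SRAT declared, a lookup for an undeclared node
finds nothing, and any unchecked field access through that result faults.
The oops in the changelog shows RAX as 0, RDX as 3 and a faulting address
of 0x88, which is consistent with that picture; which structure and field
are involved is an assumption here. All model_* names and the page counts
are hypothetical.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_NODES 4

/* Stand-in for the per-node data the kernel keeps for each node. */
struct model_pgdat {
	int node_id;
	unsigned long present_pages;
};

/* Only nodes 0-2 were "declared", so only they get a structure. */
static struct model_pgdat model_node0 = { 0, 262144 };
static struct model_pgdat model_node1 = { 1, 262144 };
static struct model_pgdat model_node2 = { 2, 262144 };

static struct model_pgdat *model_node_data[MODEL_MAX_NODES] = {
	&model_node0, &model_node1, &model_node2,	/* slot 3 stays NULL */
};

/* Guarded lookup: report -ENXIO instead of dereferencing a NULL entry. */
static int model_present_pages(int nid, unsigned long *pages)
{
	struct model_pgdat *pgdat;

	if (nid < 0 || nid >= MODEL_MAX_NODES)
		return -ENXIO;
	pgdat = model_node_data[nid];
	if (!pgdat)
		return -ENXIO;	/* an unchecked read here would be the crash */
	*pages = pgdat->present_pages;
	return 0;
}
```

In this model, asking about node 3 returns -ENXIO. Without the NULL check
the read would go through a NULL pointer plus a field offset, which is
the shape of the fault in the trace.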

I'm not too opposed to erroring out, so long as we are ok with the fact
that we will leave some memory stranded until there's a firmware fix.


* Re: [PATCH v3] mm/memory_hotplug: refrain from adding memory into an impossible node
  2020-04-16 16:13       ` Verma, Vishal L
@ 2020-04-16 16:16         ` David Hildenbrand
  2020-04-16 16:18           ` Verma, Vishal L
  0 siblings, 1 reply; 9+ messages in thread
From: David Hildenbrand @ 2020-04-16 16:16 UTC (permalink / raw)
  To: Verma, Vishal L, mhocko
  Cc: Williams, Dan J, linux-mm, linux-nvdimm, dave.hansen

On 16.04.20 18:13, Verma, Vishal L wrote:
> On Thu, 2020-04-16 at 08:19 +0200, Michal Hocko wrote:
>> On Wed 15-04-20 20:32:00, Verma, Vishal L wrote:
>>>>
>>>> I really do not like this. Why should we try to be clever and change the
>>>> node id requested by the caller? I would just stick with node_possible
>>>> check and be done with this.
>>>
>>> Hi Michal,
>>>
>>> Being clever allows us to still use the memory even if it is in a non-
>>> optimal configuration. Failing here leaves the user no path to add this
>>> memory until the firmware is fixed. It is the tradeoff between some
>>> usability vs. how loud we want to be for the failure.
>>
>> Doing that papers over something that is clearly a FW issue and turns
>> it into a "my performance is suboptimal, deal with it" OS problem. Really,
>> is this something we have to care about? Your changelog talks about a Qemu
>> misconfiguration which is trivial to fix. Has this ever been observed
>> with real HW?
>>
> Well - more of a qemu bug I think - I can share the details, but it just
> looked like it was producing a bogus SRAT. I think it is plausible that
> such a firmware bug can happen out in the wild. The NFIT tables would
> just need to reference a 'proximity domain' that the SRAT hasn't
> previously described, and hotplug will happily go add memory from the
> NFIT and the backing node related data structures would be missing.
> 
> I'm not too opposed to erroring out, so long as we are ok with the fact
> that we will leave some memory stranded until there's a firmware fix.

So let's reject it and print a warning, so we know it's a thing. If this
actually shows up often in real life, we will have good evidence that we
should tolerate buggy firmware instead of warning/rejecting.

(rejecting from inside add_memory() still makes sense IMHO)
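
Editorial sketch of the "reject and warn" alternative discussed above, as
a user-space model. It is not the follow-up patch; the model_* names, the
possible-node mask, and the warning text are assumptions, and only the
-ENXIO return value is taken from the v3 diff.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define MODEL_MAX_NODES 4

/* Hypothetical possible-node mask: nodes 0-2 declared, node 3 not. */
static const bool model_possible[MODEL_MAX_NODES] = { true, true, true };

static bool model_node_possible(int nid)
{
	return nid >= 0 && nid < MODEL_MAX_NODES && model_possible[nid];
}

/*
 * Model of the add path: refuse an impossible node up front and say so,
 * rather than silently picking another node for the caller.
 */
static int model_add_memory(int nid, unsigned long long start,
			    unsigned long long size)
{
	if (!model_node_possible(nid)) {
		fprintf(stderr,
			"node %d is not a possible node, not adding %#llx-%#llx\n",
			nid, start, start + size - 1);
		return -ENXIO;
	}

	/* The real path would go on to add and online the memory here. */
	return 0;
}
```

The caller's nid is never rewritten: a valid node proceeds, an impossible
one gets a warning and an error, and the memory stays unused until the
firmware tables are fixed.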

-- 
Thanks,

David / dhildenb




* Re: [PATCH v3] mm/memory_hotplug: refrain from adding memory into an impossible node
  2020-04-16 16:16         ` David Hildenbrand
@ 2020-04-16 16:18           ` Verma, Vishal L
  0 siblings, 0 replies; 9+ messages in thread
From: Verma, Vishal L @ 2020-04-16 16:18 UTC (permalink / raw)
  To: david, mhocko; +Cc: Williams, Dan J, linux-mm, linux-nvdimm, dave.hansen

On Thu, 2020-04-16 at 18:16 +0200, David Hildenbrand wrote:
> > > 
> > > Doing that papers over something that is clearly a FW issue and turns
> > > it into a "my performance is suboptimal, deal with it" OS problem. Really,
> > > is this something we have to care about? Your changelog talks about a Qemu
> > > misconfiguration which is trivial to fix. Has this ever been observed
> > > with real HW?
> > > 
> > Well - more of a qemu bug I think - I can share the details, but it just
> > looked like it was producing a bogus SRAT. I think it is plausible that
> > such a firmware bug can happen out in the wild. The NFIT tables would
> > just need to reference a 'proximity domain' that the SRAT hasn't
> > previously described, and hotplug will happily go add memory from the
> > NFIT and the backing node related data structures would be missing.
> > 
> > I'm not too opposed to erroring out, so long as we are ok with the fact
> > that we will leave some memory stranded until there's a firmware fix.
> 
> > So let's reject it and print a warning, so we know it's a thing. If this
> > actually shows up often in real life, we will have good evidence that we
> > should tolerate buggy firmware instead of warning/rejecting.
> 
> (rejecting from inside add_memory() still makes sense IMHO)
> 
Sounds good, I'll send a v4.

