linux-nvdimm.lists.01.org archive mirror
* [PATCH v5] mm/memory_hotplug: refrain from adding memory into an impossible node
@ 2020-04-16 22:54 Vishal Verma
  2020-04-17  6:38 ` Michal Hocko
  0 siblings, 1 reply; 4+ messages in thread
From: Vishal Verma @ 2020-04-16 22:54 UTC
  To: linux-mm; +Cc: linux-nvdimm, Michal Hocko, David Hildenbrand, Dave Hansen

A misbehaving qemu created a situation where the ACPI SRAT table
advertised one fewer proximity domains than intended. The NFIT table did
describe all the expected proximity domains. This caused the device dax
driver to assign an impossible target_node to the device, and when
hotplugged as system memory, this would fail with the following
signature:

  [  +0.001627] BUG: kernel NULL pointer dereference, address: 0000000000000088
  [  +0.001331] #PF: supervisor read access in kernel mode
  [  +0.000975] #PF: error_code(0x0000) - not-present page
  [  +0.000976] PGD 80000001767d4067 P4D 80000001767d4067 PUD 10e0c4067 PMD 0
  [  +0.001338] Oops: 0000 [#1] SMP PTI
  [  +0.000676] CPU: 4 PID: 22737 Comm: kswapd3 Tainted: G           O      5.6.0-rc5 #9
  [  +0.001457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
      BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  [  +0.001990] RIP: 0010:prepare_kswapd_sleep+0x7c/0xc0
  [  +0.000780] Code: 89 df e8 87 fd ff ff 89 c2 31 c0 84 d2 74 e6 0f 1f 44
                      00 00 48 8b 05 fb af 7a 01 48 63 93 88 1d 01 00 48 8b
		      84 d0 20 0f 00 00 <48> 3b 98 88 00 00 00 75 28 f0 80 a0
		      80 00 00 00 fe f0 80 a3 38 20
  [  +0.002877] RSP: 0018:ffffc900017a3e78 EFLAGS: 00010202
  [  +0.000805] RAX: 0000000000000000 RBX: ffff8881209e0000 RCX: 0000000000000000
  [  +0.001115] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881209e0e80
  [  +0.001098] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000008000
  [  +0.001092] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000003
  [  +0.001092] R13: 0000000000000003 R14: 0000000000000000 R15: ffffc900017a3ec8
  [  +0.001091] FS:  0000000000000000(0000) GS:ffff888318c00000(0000) knlGS:0000000000000000
  [  +0.001275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  +0.000882] CR2: 0000000000000088 CR3: 0000000120b50002 CR4: 00000000001606e0
  [  +0.001095] Call Trace:
  [  +0.000388]  kswapd+0x103/0x520
  [  +0.000494]  ? finish_wait+0x80/0x80
  [  +0.000547]  ? balance_pgdat+0x5a0/0x5a0
  [  +0.000607]  kthread+0x120/0x140
  [  +0.000508]  ? kthread_create_on_node+0x60/0x60
  [  +0.000706]  ret_from_fork+0x3a/0x50

Add a check in the add_memory path to fail if the node to which we
are adding memory is not in the node_possible_map.

Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 mm/memory_hotplug.c | 5 +++++
 1 file changed, 5 insertions(+)

v2:
- Centralize the check in the add_memory path (David)
- Instead of failing, add the memory to a nearby node, while warning
  (and tainting) to call out attention to the firmware bug (Dan)

v3:
- Fix the CONFIG_NUMA=n case, and use node 0 as the final fallback (Dan)

v4:
- Error out instead of being smart about picking a node that wasn't
  asked for (Michal)

v5:
- Change the return code to -EINVAL (David)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0a54ffac8c68..e07b80d149db 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1005,6 +1005,11 @@ int __ref add_memory_resource(int nid, struct resource *res)
 	if (ret)
 		return ret;
 
+	if (!node_possible(nid)) {
+		WARN(1, "node %d was absent from the node_possible_map\n", nid);
+		return -EINVAL;
+	}
+
 	mem_hotplug_begin();
 
 	/*
-- 
2.21.1
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org
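
For readers following the diff, the added check can be modelled in user
space. The sketch below is a minimal illustration, not kernel code:
model_possible_map, MODEL_MAX_NODES, model_node_possible() and
model_add_memory() are hypothetical stand-ins for node_possible_map,
MAX_NUMNODES, node_possible() and add_memory_resource(), and treating
nodes 0-2 as the only possible nodes is an assumption chosen to mirror
the kswapd3 oops in the patch description.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for MAX_NUMNODES. */
#define MODEL_MAX_NODES 8

/* Hypothetical stand-in for node_possible_map: bits 0..2 set (assumption). */
static const unsigned long model_possible_map = 0x7;

/* Returns 1 if nid is marked possible in the model map, 0 otherwise. */
static int model_node_possible(int nid)
{
	if (nid < 0 || nid >= MODEL_MAX_NODES)
		return 0;
	return (int)((model_possible_map >> nid) & 1UL);
}

/* Mirrors the shape of the check the patch adds; everything else is elided. */
static int model_add_memory(int nid)
{
	if (!model_node_possible(nid)) {
		fprintf(stderr,
			"node %d was absent from the node_possible_map\n", nid);
		return -EINVAL;
	}

	/* The real function would go on to mem_hotplug_begin() here. */
	return 0;
}
```

With this map, model_add_memory(2) returns 0, while model_add_memory(3)
prints the warning and returns -EINVAL. That is the behaviour the patch
aims for: refuse the request early instead of running into a NULL
dereference later.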


* Re: [PATCH v5] mm/memory_hotplug: refrain from adding memory into an impossible node
  2020-04-16 22:54 [PATCH v5] mm/memory_hotplug: refrain from adding memory into an impossible node Vishal Verma
@ 2020-04-17  6:38 ` Michal Hocko
  2020-04-21  0:14   ` Verma, Vishal L
  0 siblings, 1 reply; 4+ messages in thread
From: Michal Hocko @ 2020-04-17  6:38 UTC
  To: Vishal Verma; +Cc: linux-mm, linux-nvdimm, David Hildenbrand, Dave Hansen

On Thu 16-04-20 16:54:38, Vishal Verma wrote:
> A misbehaving qemu created a situation where the ACPI SRAT table
> advertised one fewer proximity domains than intended. The NFIT table did
> describe all the expected proximity domains. This caused the device dax
> driver to assign an impossible target_node to the device, and when
> hotplugged as system memory, this would fail with the following
> signature:
> 
>   [  +0.001627] BUG: kernel NULL pointer dereference, address: 0000000000000088
>   [  +0.001331] #PF: supervisor read access in kernel mode
>   [  +0.000975] #PF: error_code(0x0000) - not-present page
>   [  +0.000976] PGD 80000001767d4067 P4D 80000001767d4067 PUD 10e0c4067 PMD 0
>   [  +0.001338] Oops: 0000 [#1] SMP PTI
>   [  +0.000676] CPU: 4 PID: 22737 Comm: kswapd3 Tainted: G           O      5.6.0-rc5 #9
>   [  +0.001457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>       BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
>   [  +0.001990] RIP: 0010:prepare_kswapd_sleep+0x7c/0xc0
>   [  +0.000780] Code: 89 df e8 87 fd ff ff 89 c2 31 c0 84 d2 74 e6 0f 1f 44
>                       00 00 48 8b 05 fb af 7a 01 48 63 93 88 1d 01 00 48 8b
> 		      84 d0 20 0f 00 00 <48> 3b 98 88 00 00 00 75 28 f0 80 a0
> 		      80 00 00 00 fe f0 80 a3 38 20
>   [  +0.002877] RSP: 0018:ffffc900017a3e78 EFLAGS: 00010202
>   [  +0.000805] RAX: 0000000000000000 RBX: ffff8881209e0000 RCX: 0000000000000000
>   [  +0.001115] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881209e0e80
>   [  +0.001098] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000008000
>   [  +0.001092] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000003
>   [  +0.001092] R13: 0000000000000003 R14: 0000000000000000 R15: ffffc900017a3ec8
>   [  +0.001091] FS:  0000000000000000(0000) GS:ffff888318c00000(0000) knlGS:0000000000000000
>   [  +0.001275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   [  +0.000882] CR2: 0000000000000088 CR3: 0000000120b50002 CR4: 00000000001606e0
>   [  +0.001095] Call Trace:
>   [  +0.000388]  kswapd+0x103/0x520
>   [  +0.000494]  ? finish_wait+0x80/0x80
>   [  +0.000547]  ? balance_pgdat+0x5a0/0x5a0
>   [  +0.000607]  kthread+0x120/0x140
>   [  +0.000508]  ? kthread_create_on_node+0x60/0x60
>   [  +0.000706]  ret_from_fork+0x3a/0x50
> 
> Add a check in the add_memory path to fail if the node to which we
> are adding memory is not in the node_possible_map.
> 
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>

Acked-by: Michal Hocko <mhocko@suse.com>

We can start thinking about how to handle such a misconfiguration
more gracefully once we see this hit in the real world and learn more
about why it happens. E.g. if the FW/BIOS is not fixable then we can
implement some fallback strategy, but this should be a good start.

Thanks!

> ---
>  mm/memory_hotplug.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> v2:
> - Centralize the check in the add_memory path (David)
> - Instead of failing, add the memory to a nearby node, while warning
>   (and tainting) to call out attention to the firmware bug (Dan)
> 
> v3:
> - Fix the CONFIG_NUMA=n case, and use node 0 as the final fallback (Dan)
> 
> v4:
> - Error out instead of being smart about picking a node that wasn't
>   asked for (Michal)
> 
> v5:
> - Change the return code to -EINVAL (David)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0a54ffac8c68..e07b80d149db 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1005,6 +1005,11 @@ int __ref add_memory_resource(int nid, struct resource *res)
>  	if (ret)
>  		return ret;
>  
> +	if (!node_possible(nid)) {
> +		WARN(1, "node %d was absent from the node_possible_map\n", nid);
> +		return -EINVAL;
> +	}
> +
>  	mem_hotplug_begin();
>  
>  	/*
> -- 
> 2.21.1

-- 
Michal Hocko
SUSE Labs
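
As an illustration of the kind of fallback strategy mentioned above, and
of the "nearby node, then node 0" idea from the v2 and v3 changelog
entries, here is one hypothetical shape it could take. The earlier
revisions are not shown in this thread, so this is not their code:
every name below is invented for the sketch, "nearby" is assumed to mean
numerically closest node id, and the possible-node map (nodes 0-2) is an
assumption.

```c
#include <stdio.h>

/* Hypothetical upper bound on node ids for this sketch. */
#define SKETCH_MAX_NODES 8

/* Hypothetical possible-node bitmap: nodes 0, 1 and 2 (assumption). */
static const unsigned long sketch_possible_map = 0x7;

/* Returns 1 if nid is marked possible in the sketch map, 0 otherwise. */
static int sketch_node_possible(int nid)
{
	if (nid < 0 || nid >= SKETCH_MAX_NODES)
		return 0;
	return (int)((sketch_possible_map >> nid) & 1UL);
}

/*
 * Return nid if it is possible. Otherwise warn and return the numerically
 * closest possible node, or node 0 if none is found within range.
 */
static int sketch_fallback_node(int nid)
{
	int delta;
	int pick = 0; /* final fallback */

	if (sketch_node_possible(nid))
		return nid;

	for (delta = 1; delta < SKETCH_MAX_NODES; delta++) {
		if (sketch_node_possible(nid - delta)) {
			pick = nid - delta;
			break;
		}
		if (sketch_node_possible(nid + delta)) {
			pick = nid + delta;
			break;
		}
	}

	fprintf(stderr, "node %d is not possible, using node %d instead\n",
		nid, pick);
	return pick;
}
```

With this map, sketch_fallback_node(3) picks node 2 and an id far out of
range such as 100 ends up on node 0. The patch as posted deliberately
does none of this and simply returns -EINVAL.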

* Re: [PATCH v5] mm/memory_hotplug: refrain from adding memory into an impossible node
  2020-04-17  6:38 ` Michal Hocko
@ 2020-04-21  0:14   ` Verma, Vishal L
  2020-04-21  6:44     ` Michal Hocko
  0 siblings, 1 reply; 4+ messages in thread
From: Verma, Vishal L @ 2020-04-21  0:14 UTC
  To: mhocko; +Cc: linux-mm, linux-nvdimm, david, dave.hansen, akpm

On Fri, 2020-04-17 at 08:38 +0200, Michal Hocko wrote:
> On Thu 16-04-20 16:54:38, Vishal Verma wrote:
> > A misbehaving qemu created a situation where the ACPI SRAT table
> > advertised one fewer proximity domains than intended. The NFIT table did
> > describe all the expected proximity domains. This caused the device dax
> > driver to assign an impossible target_node to the device, and when
> > hotplugged as system memory, this would fail with the following
> > signature:
> > 
> >   [  +0.001627] BUG: kernel NULL pointer dereference, address: 0000000000000088
> >   [  +0.001331] #PF: supervisor read access in kernel mode
> >   [  +0.000975] #PF: error_code(0x0000) - not-present page
> >   [  +0.000976] PGD 80000001767d4067 P4D 80000001767d4067 PUD 10e0c4067 PMD 0
> >   [  +0.001338] Oops: 0000 [#1] SMP PTI
> >   [  +0.000676] CPU: 4 PID: 22737 Comm: kswapd3 Tainted: G           O      5.6.0-rc5 #9
> >   [  +0.001457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> >       BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> >   [  +0.001990] RIP: 0010:prepare_kswapd_sleep+0x7c/0xc0
> >   [  +0.000780] Code: 89 df e8 87 fd ff ff 89 c2 31 c0 84 d2 74 e6 0f 1f 44
> >                       00 00 48 8b 05 fb af 7a 01 48 63 93 88 1d 01 00 48 8b
> > 		      84 d0 20 0f 00 00 <48> 3b 98 88 00 00 00 75 28 f0 80 a0
> > 		      80 00 00 00 fe f0 80 a3 38 20
> >   [  +0.002877] RSP: 0018:ffffc900017a3e78 EFLAGS: 00010202
> >   [  +0.000805] RAX: 0000000000000000 RBX: ffff8881209e0000 RCX: 0000000000000000
> >   [  +0.001115] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881209e0e80
> >   [  +0.001098] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000008000
> >   [  +0.001092] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000003
> >   [  +0.001092] R13: 0000000000000003 R14: 0000000000000000 R15: ffffc900017a3ec8
> >   [  +0.001091] FS:  0000000000000000(0000) GS:ffff888318c00000(0000) knlGS:0000000000000000
> >   [  +0.001275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >   [  +0.000882] CR2: 0000000000000088 CR3: 0000000120b50002 CR4: 00000000001606e0
> >   [  +0.001095] Call Trace:
> >   [  +0.000388]  kswapd+0x103/0x520
> >   [  +0.000494]  ? finish_wait+0x80/0x80
> >   [  +0.000547]  ? balance_pgdat+0x5a0/0x5a0
> >   [  +0.000607]  kthread+0x120/0x140
> >   [  +0.000508]  ? kthread_create_on_node+0x60/0x60
> >   [  +0.000706]  ret_from_fork+0x3a/0x50
> > 
> > Add a check in the add_memory path to fail if the node to which we
> > are adding memory is not in the node_possible_map.
> > 
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> We can start thinking about how to handle such a misconfiguration
> more gracefully once we see this hit in the real world and learn more
> about why it happens. E.g. if the FW/BIOS is not fixable then we can
> implement some fallback strategy, but this should be a good start.
> 
> Thanks!

Thank you for the review Michal.

Should this go via Andrew and the mm tree?

> 
> > ---
> >  mm/memory_hotplug.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > v2:
> > - Centralize the check in the add_memory path (David)
> > - Instead of failing, add the memory to a nearby node, while warning
> >   (and tainting) to call out attention to the firmware bug (Dan)
> > 
> > v3:
> > - Fix the CONFIG_NUMA=n case, and use node 0 as the final fallback (Dan)
> > 
> > v4:
> > - Error out instead of being smart about picking a node that wasn't
> >   asked for (Michal)
> > 
> > v5:
> > - Change the return code to -EINVAL (David)
> > 
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 0a54ffac8c68..e07b80d149db 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -1005,6 +1005,11 @@ int __ref add_memory_resource(int nid, struct resource *res)
> >  	if (ret)
> >  		return ret;
> >  
> > +	if (!node_possible(nid)) {
> > +		WARN(1, "node %d was absent from the node_possible_map\n", nid);
> > +		return -EINVAL;
> > +	}
> > +
> >  	mem_hotplug_begin();
> >  
> >  	/*
> > -- 
> > 2.21.1

* Re: [PATCH v5] mm/memory_hotplug: refrain from adding memory into an impossible node
  2020-04-21  0:14   ` Verma, Vishal L
@ 2020-04-21  6:44     ` Michal Hocko
  0 siblings, 0 replies; 4+ messages in thread
From: Michal Hocko @ 2020-04-21  6:44 UTC
  To: Verma, Vishal L; +Cc: linux-mm, linux-nvdimm, david, dave.hansen, akpm

On Tue 21-04-20 00:14:43, Verma, Vishal L wrote:
> On Fri, 2020-04-17 at 08:38 +0200, Michal Hocko wrote:
> > On Thu 16-04-20 16:54:38, Vishal Verma wrote:
> > > A misbehaving qemu created a situation where the ACPI SRAT table
> > > advertised one fewer proximity domains than intended. The NFIT table did
> > > describe all the expected proximity domains. This caused the device dax
> > > driver to assign an impossible target_node to the device, and when
> > > hotplugged as system memory, this would fail with the following
> > > signature:
> > > 
> > >   [  +0.001627] BUG: kernel NULL pointer dereference, address: 0000000000000088
> > >   [  +0.001331] #PF: supervisor read access in kernel mode
> > >   [  +0.000975] #PF: error_code(0x0000) - not-present page
> > >   [  +0.000976] PGD 80000001767d4067 P4D 80000001767d4067 PUD 10e0c4067 PMD 0
> > >   [  +0.001338] Oops: 0000 [#1] SMP PTI
> > >   [  +0.000676] CPU: 4 PID: 22737 Comm: kswapd3 Tainted: G           O      5.6.0-rc5 #9
> > >   [  +0.001457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > >       BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> > >   [  +0.001990] RIP: 0010:prepare_kswapd_sleep+0x7c/0xc0
> > >   [  +0.000780] Code: 89 df e8 87 fd ff ff 89 c2 31 c0 84 d2 74 e6 0f 1f 44
> > >                       00 00 48 8b 05 fb af 7a 01 48 63 93 88 1d 01 00 48 8b
> > > 		      84 d0 20 0f 00 00 <48> 3b 98 88 00 00 00 75 28 f0 80 a0
> > > 		      80 00 00 00 fe f0 80 a3 38 20
> > >   [  +0.002877] RSP: 0018:ffffc900017a3e78 EFLAGS: 00010202
> > >   [  +0.000805] RAX: 0000000000000000 RBX: ffff8881209e0000 RCX: 0000000000000000
> > >   [  +0.001115] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881209e0e80
> > >   [  +0.001098] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000008000
> > >   [  +0.001092] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000003
> > >   [  +0.001092] R13: 0000000000000003 R14: 0000000000000000 R15: ffffc900017a3ec8
> > >   [  +0.001091] FS:  0000000000000000(0000) GS:ffff888318c00000(0000) knlGS:0000000000000000
> > >   [  +0.001275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >   [  +0.000882] CR2: 0000000000000088 CR3: 0000000120b50002 CR4: 00000000001606e0
> > >   [  +0.001095] Call Trace:
> > >   [  +0.000388]  kswapd+0x103/0x520
> > >   [  +0.000494]  ? finish_wait+0x80/0x80
> > >   [  +0.000547]  ? balance_pgdat+0x5a0/0x5a0
> > >   [  +0.000607]  kthread+0x120/0x140
> > >   [  +0.000508]  ? kthread_create_on_node+0x60/0x60
> > >   [  +0.000706]  ret_from_fork+0x3a/0x50
> > > 
> > > Add a check in the add_memory path to fail if the node to which we
> > > are adding memory is not in the node_possible_map.
> > > 
> > > Cc: Michal Hocko <mhocko@kernel.org>
> > > Cc: David Hildenbrand <david@redhat.com>
> > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > Cc: Dave Hansen <dave.hansen@linux.intel.com>
> > > Acked-by: David Hildenbrand <david@redhat.com>
> > > Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> > 
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > 
> > We can start thinking about how to handle such a misconfiguration
> > more gracefully once we see this hit in the real world and learn more
> > about why it happens. E.g. if the FW/BIOS is not fixable then we can
> > implement some fallback strategy, but this should be a good start.
> > 
> > Thanks!
> 
> Thank you for the review Michal.
> 
> Should this go via Andrew and the mm tree?

Yes, this is the usual route for memory hotplug patches.
-- 
Michal Hocko
SUSE Labs
