From: David Hildenbrand <david@redhat.com>
To: Vishal Verma <vishal.l.verma@intel.com>, linux-mm@kvack.org
Cc: linux-nvdimm@lists.01.org,
	Dan Williams <dan.j.williams@intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>
Subject: Re: [PATCH v3] mm/memory_hotplug: refrain from adding memory into an impossible node
Date: Wed, 15 Apr 2020 09:44:17 +0200
Message-ID: <584d7831-ff99-a0e1-9c7c-f82486ced0a3@redhat.com>
In-Reply-To: <8e1b55a3-0403-3fa5-ae5b-7c3b20f883dc@redhat.com>

On 15.04.20 09:39, David Hildenbrand wrote:
> On 15.04.20 01:58, Vishal Verma wrote:
>> A misbehaving qemu created a situation where the ACPI SRAT table
>> advertised one fewer proximity domains than intended. The NFIT table did
>> describe all the expected proximity domains. This caused the device dax
>> driver to assign an impossible target_node to the device, and when
>> hotplugged as system memory, this would fail with the following
>> signature:
>>
>>   [  +0.001627] BUG: kernel NULL pointer dereference, address: 0000000000000088
>>   [  +0.001331] #PF: supervisor read access in kernel mode
>>   [  +0.000975] #PF: error_code(0x0000) - not-present page
>>   [  +0.000976] PGD 80000001767d4067 P4D 80000001767d4067 PUD 10e0c4067 PMD 0
>>   [  +0.001338] Oops: 0000 [#1] SMP PTI
>>   [  +0.000676] CPU: 4 PID: 22737 Comm: kswapd3 Tainted: G           O      5.6.0-rc5 #9
>>   [  +0.001457] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>>       BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
>>   [  +0.001990] RIP: 0010:prepare_kswapd_sleep+0x7c/0xc0
>>   [  +0.000780] Code: 89 df e8 87 fd ff ff 89 c2 31 c0 84 d2 74 e6 0f 1f 44
>>                       00 00 48 8b 05 fb af 7a 01 48 63 93 88 1d 01 00 48 8b
>> 		      84 d0 20 0f 00 00 <48> 3b 98 88 00 00 00 75 28 f0 80 a0
>> 		      80 00 00 00 fe f0 80 a3 38 20
>>   [  +0.002877] RSP: 0018:ffffc900017a3e78 EFLAGS: 00010202
>>   [  +0.000805] RAX: 0000000000000000 RBX: ffff8881209e0000 RCX: 0000000000000000
>>   [  +0.001115] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8881209e0e80
>>   [  +0.001098] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000008000
>>   [  +0.001092] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000003
>>   [  +0.001092] R13: 0000000000000003 R14: 0000000000000000 R15: ffffc900017a3ec8
>>   [  +0.001091] FS:  0000000000000000(0000) GS:ffff888318c00000(0000) knlGS:0000000000000000
>>   [  +0.001275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   [  +0.000882] CR2: 0000000000000088 CR3: 0000000120b50002 CR4: 00000000001606e0
>>   [  +0.001095] Call Trace:
>>   [  +0.000388]  kswapd+0x103/0x520
>>   [  +0.000494]  ? finish_wait+0x80/0x80
>>   [  +0.000547]  ? balance_pgdat+0x5a0/0x5a0
>>   [  +0.000607]  kthread+0x120/0x140
>>   [  +0.000508]  ? kthread_create_on_node+0x60/0x60
>>   [  +0.000706]  ret_from_fork+0x3a/0x50
>>
>> Add a check in the add_memory path to ensure that the node to which we
>> are adding memory is in the node_possible_map
>>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
>> ---
>>  mm/memory_hotplug.c | 28 ++++++++++++++++++++++++++++
>>  1 file changed, 28 insertions(+)
>>
>> v2:
>> - Centralize the check in the add_memory path (David)
>> - Instead of failing, add the memory to a nearby node, while warning
>>   (and tainting) to call out attention to the firmware bug (Dan)
>>
>> v3:
>> - Fix the CONFIG_NUMA=n case, and use node 0 as the final fallback (Dan)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 0a54ffac8c68..536a809d6ebb 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -980,6 +980,30 @@ static int check_hotplug_memory_range(u64 start, u64 size)
>>  	return 0;
>>  }
>>  
>> +/*
>> + * Check that the node provided for adding memory was valid.
>> + * If not, find the nearest valid node and add the memory there while
>> + * tainting the kernel and displaying a warning to bring attention to the
>> + * underlying firmware problem.
>> + * Return nid if valid, or an adjusted node number that can be used instead
>> + * if the original nid was not valid
>> + */
> 
> -ETOOMUCHDOCUMENTATION
> 
> "If the given node cannot be used (!node_possible()), return the nearest
> possible node and WARN_TAINT() about firmware issues."
> 
>> +static int check_hotplug_node(int nid)
>> +{
>> +	int alt_nid;
>> +
>> +	if (node_possible(nid))
>> +		return nid;
>> +
>> +	alt_nid = numa_map_to_online_node(nid);
>> +	if (alt_nid == NUMA_NO_NODE)
>> +		alt_nid = first_online_node;
>> +	WARN_TAINT(1, TAINT_FIRMWARE_WORKAROUND,
>> +		   "node %d expected, but was absent from the node_possible_map, using %d instead\n",
>> +		   nid, alt_nid);
>> +	return alt_nid;
>> +}
>> +
>>  static int online_memory_block(struct memory_block *mem, void *arg)
>>  {
>>  	return device_online(&mem->dev);
>> @@ -1005,6 +1029,10 @@ int __ref add_memory_resource(int nid, struct resource *res)
>>  	if (ret)
>>  		return ret;
>>  
>> +	nid = check_hotplug_node(nid);
>> +	if (nid < 0)
>> +		return -ENXIO;
>> +
>>  	mem_hotplug_begin();
>>  
>>  	/*
>>
> 
> You should do the same on the memory removal path.

I do wonder whether the result (numa_map_to_online_node() /
first_online_node) could be different on the removal path, and whether
we should bail out when removing instead. Bailing out sounds better to
me; adding memory is the more important case here.
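
To make the asymmetry concrete, here is a rough userspace sketch, not
kernel code. Every mock_* name, the two bitmasks and NO_NODE are
hypothetical stand-ins for node_possible(), the node maps and
NUMA_NO_NODE. The fallback simply picks the first online node, whereas
numa_map_to_online_node() looks for the nearest one. Returning -ENXIO on
the removal path is an assumption borrowed from the add path in the
patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 8	/* hypothetical stand-in for MAX_NUMNODES */
#define NO_NODE   (-1)	/* hypothetical stand-in for NUMA_NO_NODE */

/* Bitmasks standing in for node_possible_map and node_online_map. */
static unsigned int possible_mask = 0x7;	/* nodes 0..2 are possible */
static unsigned int online_mask   = 0x3;	/* nodes 0..1 are online */

static bool mock_node_possible(int nid)
{
	return nid >= 0 && nid < MAX_NODES && (possible_mask & (1u << nid));
}

/* Simplified fallback: first online node, ignoring NUMA distance. */
static int mock_first_online_node(void)
{
	for (int n = 0; n < MAX_NODES; n++)
		if (online_mask & (1u << n))
			return n;
	return NO_NODE;
}

/* Add path: keep a possible nid, otherwise warn and fall back. */
static int mock_check_add_node(int nid)
{
	int alt_nid;

	if (mock_node_possible(nid))
		return nid;

	alt_nid = mock_first_online_node();
	fprintf(stderr, "node %d is not possible, using %d instead\n",
		nid, alt_nid);
	return alt_nid;
}

/* Removal path: refuse instead of guessing where the memory went. */
static int mock_check_remove_node(int nid)
{
	if (mock_node_possible(nid))
		return nid;
	return -ENXIO;
}
```

With these masks, mock_check_add_node(3) falls back to node 0, while
mock_check_remove_node(3) returns -ENXIO. The point of bailing out is
that a fallback recomputed at removal time may not match the node chosen
at add time if the set of online nodes changed in between.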

-- 
Thanks,

David / dhildenb
