driver core: ensure a device has valid node id in device_add()
diff mbox series

Message ID 1568009063-77714-1-git-send-email-linyunsheng@huawei.com
State New
Headers show
Series
  • driver core: ensure a device has valid node id in device_add()
Related show

Commit Message

Yunsheng Lin Sept. 9, 2019, 6:04 a.m. UTC
Currently a device does not belong to any of the numa nodes
(dev->numa_node is NUMA_NO_NODE) when the node id is neither
specified by fw nor by virtual device layer and the device has
no parent device.

According to discussion in [1]:
Even if a device's numa node is not specified, the device really
does belong to a node.

This patch sets the device node to node 0 in device_add() if the
device's node id is not specified and it either has no parent
device, or the parent device also does not have a valid node id.

[1] https://lkml.org/lkml/2019/9/2/466

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
Changelog RFC -> v1:
1. Drop log error message and use a "if" instead of "? :".
2. Drop the RFC tag.
---
 drivers/base/core.c  | 10 +++++++---
 include/linux/numa.h |  2 ++
 2 files changed, 9 insertions(+), 3 deletions(-)

Comments

Greg Kroah-Hartman Sept. 9, 2019, 9:53 a.m. UTC | #1
On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote:
> Currently a device does not belong to any of the numa nodes
> (dev->numa_node is NUMA_NO_NODE) when the node id is neither
> specified by fw nor by virtual device layer and the device has
> no parent device.

Is this really a problem?

> According to discussion in [1]:
> Even if a device's numa node is not specified, the device really
> does belong to a node.

But as we do not know the node, can we cause more harm by randomly
picking one (i.e. putting it all in node 0)?

> This patch sets the device node to node 0 in device_add() if the
> device's node id is not specified and it either has no parent
> device, or the parent device also does not have a valid node id.
> 
> [1] https://lkml.org/lkml/2019/9/2/466
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> Changelog RFC -> v1:
> 1. Drop log error message and use a "if" instead of "? :".
> 2. Drop the RFC tag.
> ---
>  drivers/base/core.c  | 10 +++++++---
>  include/linux/numa.h |  2 ++
>  2 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 1669d41..f79ad20 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -2107,9 +2107,13 @@ int device_add(struct device *dev)
>  	if (kobj)
>  		dev->kobj.parent = kobj;
>  
> -	/* use parent numa_node */
> -	if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
> -		set_dev_node(dev, dev_to_node(parent));
> +	/* use parent numa_node or default node 0 */
> +	if (!numa_node_valid(dev_to_node(dev))) {
> +		if (parent && numa_node_valid(dev_to_node(parent)))
> +			set_dev_node(dev, dev_to_node(parent));
> +		else
> +			set_dev_node(dev, 0);
> +	}

Again, is this going to cause more harm than good?  What happens if we
leave it as "unknown", isn't that better than thinking we "know" it is
in node 0?

thanks,

greg k-h
Michal Hocko Sept. 9, 2019, 6:50 p.m. UTC | #2
On Mon 09-09-19 14:04:23, Yunsheng Lin wrote:
> Currently a device does not belong to any of the numa nodes
> (dev->numa_node is NUMA_NO_NODE) when the node id is neither
> specified by fw nor by virtual device layer and the device has
> no parent device.
> 
> According to discussion in [1]:

Please do not reference important parts of the justification via a link.
Just quote the relevant part to the changelog. It is just too easy that
external links die - not to mention lkml.org.

> Even if a device's numa node is not specified, the device really
> does belong to a node.

What does this mean?

> This patch sets the device node to node 0 in device_add() if the
> device's node id is not specified and it either has no parent
> device, or the parent device also does not have a valid node id.

Why is node 0 special? I have seen platforms with node 0 missing or
being memory less. The changelog also lacks an actual problem
descripton. Why do we even care about NUMA_NO_NODE? E.g. the page
allocator interprets NUMA_NO_NODE as the closest node with a memory.
And by closest it really means to the CPU which is performing the
allocation.
 
> [1] https://lkml.org/lkml/2019/9/2/466
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> Changelog RFC -> v1:
> 1. Drop log error message and use a "if" instead of "? :".
> 2. Drop the RFC tag.
> ---
>  drivers/base/core.c  | 10 +++++++---
>  include/linux/numa.h |  2 ++
>  2 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 1669d41..f79ad20 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -2107,9 +2107,13 @@ int device_add(struct device *dev)
>  	if (kobj)
>  		dev->kobj.parent = kobj;
>  
> -	/* use parent numa_node */
> -	if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
> -		set_dev_node(dev, dev_to_node(parent));
> +	/* use parent numa_node or default node 0 */
> +	if (!numa_node_valid(dev_to_node(dev))) {
> +		if (parent && numa_node_valid(dev_to_node(parent)))
> +			set_dev_node(dev, dev_to_node(parent));
> +		else
> +			set_dev_node(dev, 0);
> +	}
>  
>  	/* first, register with generic layer. */
>  	/* we require the name to be set before, and pass NULL */
> diff --git a/include/linux/numa.h b/include/linux/numa.h
> index 110b0e5..eccc757 100644
> --- a/include/linux/numa.h
> +++ b/include/linux/numa.h
> @@ -13,4 +13,6 @@
>  
>  #define	NUMA_NO_NODE	(-1)
>  
> +#define numa_node_valid(node)	((unsigned int)(node) < nr_node_ids)
> +
>  #endif /* _LINUX_NUMA_H */
> -- 
> 2.8.1
Yunsheng Lin Sept. 10, 2019, 6:43 a.m. UTC | #3
On 2019/9/9 17:53, Greg KH wrote:
> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote:
>> Currently a device does not belong to any of the numa nodes
>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither
>> specified by fw nor by virtual device layer and the device has
>> no parent device.
> 
> Is this really a problem?

Not really.
Someone need to guess the node id when it is not specified, right?
This patch chooses to guess the node id in the driver core.

> 
>> According to discussion in [1]:
>> Even if a device's numa node is not specified, the device really
>> does belong to a node.
> 
> But as we do not know the node, can we cause more harm by randomly
> picking one (i.e. putting it all in node 0)?
If we do not pick node 0 for device with invalid node, then caller need
to check the node id and pick one, and currently different callers
does a different checking:

1) some does " < 0" check;
2) some does "== NUMA_NO_NODE" check;
3) some does ">= MAX_NUMNODES" check;
4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check.

and caller of dev_to_node() may pick one node based on below if the
dev_to_node() return a invalid node based on above checking:
1) based on numa_mem_id().
2) pick a random one like in workqueue_select_cpu_near().

If we pick node 0 for device with invalid node in device_add(), we
may avoid the above different checking and picking for caller, but we
may lose some caller context info, for example, user may use node of the
cpu on which the process is using the device to allocate the resource
close to the process, or user may pick a random one if they know what
they are doing.

It seems there is trade off here, as I can see, we can guess and pick the
node at different stage when it is not specified.
1. guess and pick node 0 at device_add(), it has the advantage of ensure
   all devices will have a valid node at very begin of device creation,
   so the user does not have to check and guess one, but user might lose
   the opportunity to do their own guessing and picking.

2. Maybe provide a dev_to_valid_node() to always return a valid node id,
   for example return numa_mem_id() if dev->numa_node is not valid.
   User know what they are doing can still use dev_to_node().

3. Caller of dev_to_node() do their own checking and picking, which
   might lead to adding more different and reduplicate checking as above.
Yunsheng Lin Sept. 10, 2019, 7:08 a.m. UTC | #4
On 2019/9/10 2:50, Michal Hocko wrote:
> On Mon 09-09-19 14:04:23, Yunsheng Lin wrote:
>> Currently a device does not belong to any of the numa nodes
>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither
>> specified by fw nor by virtual device layer and the device has
>> no parent device.
>>
>> According to discussion in [1]:
> 
> Please do not reference important parts of the justification via a link.
> Just quote the relevant part to the changelog. It is just too easy that
> external links die - not to mention lkml.org.

Ok

> 
>> Even if a device's numa node is not specified, the device really
>> does belong to a node.
> 
> What does this mean?

It means some one need to guess the node id if the node is not
specified.

> 
>> This patch sets the device node to node 0 in device_add() if the
>> device's node id is not specified and it either has no parent
>> device, or the parent device also does not have a valid node id.
> 
> Why is node 0 special? I have seen platforms with node 0 missing or
> being memory less. The changelog also lacks an actual problem

by node 0 missing, how do we know if node 0 is missing?
by node_online(0)?

> descripton. Why do we even care about NUMA_NO_NODE? E.g. the page
> allocator interprets NUMA_NO_NODE as the closest node with a memory.
> And by closest it really means to the CPU which is performing the
> allocation.

Yes, I should have mentioned that in the commit log.

I mentioned the below in the RFC, but somehow deleted when sending
V1:
"There may be explicit handling out there relying on NUMA_NO_NODE,
like in nvme_probe()."

>  
>> [1] https://lkml.org/lkml/2019/9/2/466
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>> Changelog RFC -> v1:
>> 1. Drop log error message and use a "if" instead of "? :".
>> 2. Drop the RFC tag.
>> ---
>>  drivers/base/core.c  | 10 +++++++---
>>  include/linux/numa.h |  2 ++
>>  2 files changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/base/core.c b/drivers/base/core.c
>> index 1669d41..f79ad20 100644
>> --- a/drivers/base/core.c
>> +++ b/drivers/base/core.c
>> @@ -2107,9 +2107,13 @@ int device_add(struct device *dev)
>>  	if (kobj)
>>  		dev->kobj.parent = kobj;
>>  
>> -	/* use parent numa_node */
>> -	if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
>> -		set_dev_node(dev, dev_to_node(parent));
>> +	/* use parent numa_node or default node 0 */
>> +	if (!numa_node_valid(dev_to_node(dev))) {
>> +		if (parent && numa_node_valid(dev_to_node(parent)))
>> +			set_dev_node(dev, dev_to_node(parent));
>> +		else
>> +			set_dev_node(dev, 0);
>> +	}
>>  
>>  	/* first, register with generic layer. */
>>  	/* we require the name to be set before, and pass NULL */
>> diff --git a/include/linux/numa.h b/include/linux/numa.h
>> index 110b0e5..eccc757 100644
>> --- a/include/linux/numa.h
>> +++ b/include/linux/numa.h
>> @@ -13,4 +13,6 @@
>>  
>>  #define	NUMA_NO_NODE	(-1)
>>  
>> +#define numa_node_valid(node)	((unsigned int)(node) < nr_node_ids)
>> +
>>  #endif /* _LINUX_NUMA_H */
>> -- 
>> 2.8.1
>
Michal Hocko Sept. 10, 2019, 7:13 a.m. UTC | #5
On Tue 10-09-19 14:43:32, Yunsheng Lin wrote:
> On 2019/9/9 17:53, Greg KH wrote:
[...]
> > But as we do not know the node, can we cause more harm by randomly
> > picking one (i.e. putting it all in node 0)?
> If we do not pick node 0 for device with invalid node, then caller need
> to check the node id and pick one, and currently different callers
> does a different checking:

Could you be more specific about who those callers are why do they need
the node id for? Because the page allocator can handle that as already
pointed out in my other email. Who else does really have to know about
the node apart the allocator?
Michal Hocko Sept. 10, 2019, 7:24 a.m. UTC | #6
Our emails crossed, sorry about that.

On Tue 10-09-19 15:08:20, Yunsheng Lin wrote:
> On 2019/9/10 2:50, Michal Hocko wrote:
> > On Mon 09-09-19 14:04:23, Yunsheng Lin wrote:
[...]
> >> Even if a device's numa node is not specified, the device really
> >> does belong to a node.
> > 
> > What does this mean?
> 
> It means some one need to guess the node id if the node is not
> specified.

I have asked about this in other email so let's not cross the
communication even more.
 
> >> This patch sets the device node to node 0 in device_add() if the
> >> device's node id is not specified and it either has no parent
> >> device, or the parent device also does not have a valid node id.
> > 
> > Why is node 0 special? I have seen platforms with node 0 missing or
> > being memory less. The changelog also lacks an actual problem
> 
> by node 0 missing, how do we know if node 0 is missing?
> by node_online(0)?

No, this is a dynamic situation. Node might get offline via hotremove.
In most cases it wouldn't because there will likely be some kernel
memory on node0 but you cannot really make any assumptions here. Besides
that nothing should really care.

> > descripton. Why do we even care about NUMA_NO_NODE? E.g. the page
> > allocator interprets NUMA_NO_NODE as the closest node with a memory.
> > And by closest it really means to the CPU which is performing the
> > allocation.
> 
> Yes, I should have mentioned that in the commit log.
> 
> I mentioned the below in the RFC, but somehow deleted when sending
> V1:
> "There may be explicit handling out there relying on NUMA_NO_NODE,
> like in nvme_probe()."

This code, and other doing similar things, is very likely bogus. Just
look at what the code does. It takes the node affinity from the dev and
uses it for an allocation. So far so good. But it tries to be clever
and special cases NUMA_NO_NODE to be first_node. So now the allocator
has used a proper fallback to the nearest node with memory for the
current CPU that is executing the code while dev will point to a
first_node which might be a completely different one. See the
discrepancy?
Greg Kroah-Hartman Sept. 10, 2019, 9:31 a.m. UTC | #7
On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote:
> On 2019/9/9 17:53, Greg KH wrote:
> > On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote:
> >> Currently a device does not belong to any of the numa nodes
> >> (dev->numa_node is NUMA_NO_NODE) when the node id is neither
> >> specified by fw nor by virtual device layer and the device has
> >> no parent device.
> > 
> > Is this really a problem?
> 
> Not really.
> Someone need to guess the node id when it is not specified, right?

No, why?  Guessing guarantees you will get it wrong on some systems.

Are you seeing real problems because the id is not being set?  What
problem is this fixing that you can actually observe?

thanks,

greg k-h
Yunsheng Lin Sept. 10, 2019, 10:40 a.m. UTC | #8
On 2019/9/10 15:24, Michal Hocko wrote:
> Our emails crossed, sorry about that.
> 
> On Tue 10-09-19 15:08:20, Yunsheng Lin wrote:
>> On 2019/9/10 2:50, Michal Hocko wrote:
>>> On Mon 09-09-19 14:04:23, Yunsheng Lin wrote:
> [...]
>>>> Even if a device's numa node is not specified, the device really
>>>> does belong to a node.
>>>
>>> What does this mean?
>>
>> It means some one need to guess the node id if the node is not
>> specified.
> 
> I have asked about this in other email so let's not cross the
> communication even more.

I may just answer your question here.

Besides the page allocator, cpu allocator or scheduler may need to
know about the node id to figure out which cpu to run is more
optimized, like in workqueue_select_cpu_near().


>  
>>>> This patch sets the device node to node 0 in device_add() if the
>>>> device's node id is not specified and it either has no parent
>>>> device, or the parent device also does not have a valid node id.
>>>
>>> Why is node 0 special? I have seen platforms with node 0 missing or
>>> being memory less. The changelog also lacks an actual problem
>>
>> by node 0 missing, how do we know if node 0 is missing?
>> by node_online(0)?
> 
> No, this is a dynamic situation. Node might get offline via hotremove.
> In most cases it wouldn't because there will likely be some kernel
> memory on node0 but you cannot really make any assumptions here. Besides
> that nothing should really care.

From the node checking:
'(unsigned)node_id >= nr_node_ids'

If the nr_node_ids > 0, then node 0 is a valid node according to
the above checking, is there a function to check if a node is
missing?

Also, I am not sure if I understand "nothing should really care".
Does it means a device still can be a numa that is missing, just
have some performance degradation?

> 
>>> descripton. Why do we even care about NUMA_NO_NODE? E.g. the page
>>> allocator interprets NUMA_NO_NODE as the closest node with a memory.
>>> And by closest it really means to the CPU which is performing the
>>> allocation.
>>
>> Yes, I should have mentioned that in the commit log.
>>
>> I mentioned the below in the RFC, but somehow deleted when sending
>> V1:
>> "There may be explicit handling out there relying on NUMA_NO_NODE,
>> like in nvme_probe()."
> 
> This code, and other doing similar things, is very likely bogus. Just
> look at what the code does. It takes the node affinity from the dev and
> uses it for an allocation. So far so good. But it tries to be clever
> and special cases NUMA_NO_NODE to be first_node. So now the allocator
> has used a proper fallback to the nearest node with memory for the
> current CPU that is executing the code while dev will point to a
> first_node which might be a completely different one. See the
> discrepancy?

Do you mean let kzalloc_node handle the NUMA_NO_NODE case, if node
id is NUMA_NO_NODE, kzalloc_node handles it as numa_mem_id().

If yes, above makes more sense.

>
Yunsheng Lin Sept. 10, 2019, 10:58 a.m. UTC | #9
On 2019/9/10 17:31, Greg KH wrote:
> On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote:
>> On 2019/9/9 17:53, Greg KH wrote:
>>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote:
>>>> Currently a device does not belong to any of the numa nodes
>>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither
>>>> specified by fw nor by virtual device layer and the device has
>>>> no parent device.
>>>
>>> Is this really a problem?
>>
>> Not really.
>> Someone need to guess the node id when it is not specified, right?
> 
> No, why?  Guessing guarantees you will get it wrong on some systems.
> 
> Are you seeing real problems because the id is not being set?  What
> problem is this fixing that you can actually observe?

When passing the return value of dev_to_node() to cpumask_of_node()
without checking the node id if the node id is not valid, there is
global-out-of-bounds detected by KASAN as below:

there are different checking to return value of dev_to_node(), I though
it is better to consistently do checking in cpumask_of_node(), then
discussion [1] [2] led to do the checking in device_add().

[   42.970381] ==================================================================
[   42.977595] BUG: KASAN: global-out-of-bounds in __bitmap_weight+0x48/0xb0
[   42.984370] Read of size 8 at addr ffff20008cdf8790 by task kworker/0:1/13
[   42.991230]
[   42.992712] CPU: 0 PID: 13 Comm: kworker/0:1 Tainted: G           O      5.2.0-rc4-g8bde06a-dirty #3
[   43.001830] Hardware name: Huawei TaiShan 2280 V2/BC82AMDA, BIOS TA BIOS 2280-A CS V2.B050.01 08/08/2019
[   43.011298] Workqueue: events work_for_cpu_fn
[   43.015643] Call trace:
[   43.018078]  dump_backtrace+0x0/0x1e8
[   43.021727]  show_stack+0x14/0x20
[   43.025031]  dump_stack+0xc4/0xfc
[   43.028335]  print_address_description+0x178/0x270
[   43.033113]  __kasan_report+0x164/0x1b8
[   43.036936]  kasan_report+0xc/0x18
[   43.040325]  __asan_load8+0x84/0xa8
[   43.043801]  __bitmap_weight+0x48/0xb0
[   43.047552]  hclge_init_ae_dev+0x988/0x1e78 [hclge]
[   43.052418]  hnae3_register_ae_dev+0xcc/0x278 [hnae3]
[   43.057467]  hns3_probe+0xe0/0x120 [hns3]
[   43.061464]  local_pci_probe+0x74/0xf0
[   43.065200]  work_for_cpu_fn+0x2c/0x48
[   43.068937]  process_one_work+0x3c0/0x878
[   43.072934]  worker_thread+0x400/0x670
[   43.076670]  kthread+0x1b0/0x1b8
[   43.079885]  ret_from_fork+0x10/0x18
[   43.083446]
[   43.084925] The buggy address belongs to the variable:
[   43.090052]  numa_distance+0x30/0x40
[   43.093613]
[   43.095091] Memory state around the buggy address:
[   43.099870]  ffff20008cdf8680: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 fa fa
[   43.107078]  ffff20008cdf8700: fa fa fa fa 04 fa fa fa fa fa fa fa 00 fa fa fa
[   43.114286] >ffff20008cdf8780: fa fa fa fa 00 00 00 00 00 00 00 00 fa fa fa fa
[   43.121494]                          ^
[   43.125230]  ffff20008cdf8800: 01 fa fa fa fa fa fa fa 04 fa fa fa fa fa fa fa
[   43.132439]  ffff20008cdf8880: fa fa fa fa fa fa fa fa 00 00 fa fa fa fa fa fa
[   43.139646] ==================================================================


[1] https://lore.kernel.org/patchwork/patch/1122081/
[2] https://lore.kernel.org/patchwork/patch/1122516/
> 
> thanks,
> 
> greg k-h
> 
> .
>
Michal Hocko Sept. 10, 2019, 11:01 a.m. UTC | #10
On Tue 10-09-19 18:40:12, Yunsheng Lin wrote:
> On 2019/9/10 15:24, Michal Hocko wrote:
> > Our emails crossed, sorry about that.
> > 
> > On Tue 10-09-19 15:08:20, Yunsheng Lin wrote:
> >> On 2019/9/10 2:50, Michal Hocko wrote:
> >>> On Mon 09-09-19 14:04:23, Yunsheng Lin wrote:
> > [...]
> >>>> Even if a device's numa node is not specified, the device really
> >>>> does belong to a node.
> >>>
> >>> What does this mean?
> >>
> >> It means some one need to guess the node id if the node is not
> >> specified.
> > 
> > I have asked about this in other email so let's not cross the
> > communication even more.
> 
> I may just answer your question here.
> 
> Besides the page allocator, cpu allocator or scheduler may need to
> know about the node id to figure out which cpu to run is more
> optimized, like in workqueue_select_cpu_near().

OK, I can see some point there. Which of them do not handle this case
properly? At least workqueue_select_cpu_near seems to handle it
reasonably by not making any numa topology assumptions.

> >>>> This patch sets the device node to node 0 in device_add() if the
> >>>> device's node id is not specified and it either has no parent
> >>>> device, or the parent device also does not have a valid node id.
> >>>
> >>> Why is node 0 special? I have seen platforms with node 0 missing or
> >>> being memory less. The changelog also lacks an actual problem
> >>
> >> by node 0 missing, how do we know if node 0 is missing?
> >> by node_online(0)?
> > 
> > No, this is a dynamic situation. Node might get offline via hotremove.
> > In most cases it wouldn't because there will likely be some kernel
> > memory on node0 but you cannot really make any assumptions here. Besides
> > that nothing should really care.
> 
> >From the node checking:
> '(unsigned)node_id >= nr_node_ids'
> 
> If the nr_node_ids > 0, then node 0 is a valid node according to
> the above checking, is there a function to check if a node is
> missing?

What does that mean, really?

> Also, I am not sure if I understand "nothing should really care".
> Does it means a device still can be a numa that is missing, just
> have some performance degradation?

It really depends on what that numa affinity really represent. If the
affinity is to a non existent NUMA node then it likely has to hop over
some interconnect and pay some performance penalty. If there is no
affinity then it likely makes no difference on which node it sits and
which memory it gets.
 
> >>> descripton. Why do we even care about NUMA_NO_NODE? E.g. the page
> >>> allocator interprets NUMA_NO_NODE as the closest node with a memory.
> >>> And by closest it really means to the CPU which is performing the
> >>> allocation.
> >>
> >> Yes, I should have mentioned that in the commit log.
> >>
> >> I mentioned the below in the RFC, but somehow deleted when sending
> >> V1:
> >> "There may be explicit handling out there relying on NUMA_NO_NODE,
> >> like in nvme_probe()."
> > 
> > This code, and other doing similar things, is very likely bogus. Just
> > look at what the code does. It takes the node affinity from the dev and
> > uses it for an allocation. So far so good. But it tries to be clever
> > and special cases NUMA_NO_NODE to be first_node. So now the allocator
> > has used a proper fallback to the nearest node with memory for the
> > current CPU that is executing the code while dev will point to a
> > first_node which might be a completely different one. See the
> > discrepancy?
> 
> Do you mean let kzalloc_node handle the NUMA_NO_NODE case, if node
> id is NUMA_NO_NODE, kzalloc_node handles it as numa_mem_id().

Which is what the page allocator does and makes sense IMHO. So what is
the actual problem?
Michal Hocko Sept. 10, 2019, 11:04 a.m. UTC | #11
On Tue 10-09-19 18:58:05, Yunsheng Lin wrote:
> On 2019/9/10 17:31, Greg KH wrote:
> > On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote:
> >> On 2019/9/9 17:53, Greg KH wrote:
> >>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote:
> >>>> Currently a device does not belong to any of the numa nodes
> >>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither
> >>>> specified by fw nor by virtual device layer and the device has
> >>>> no parent device.
> >>>
> >>> Is this really a problem?
> >>
> >> Not really.
> >> Someone need to guess the node id when it is not specified, right?
> > 
> > No, why?  Guessing guarantees you will get it wrong on some systems.
> > 
> > Are you seeing real problems because the id is not being set?  What
> > problem is this fixing that you can actually observe?
> 
> When passing the return value of dev_to_node() to cpumask_of_node()
> without checking the node id if the node id is not valid, there is
> global-out-of-bounds detected by KASAN as below:

OK, I seem to remember this being brought up already. And now when I
think about it, we really want to make cpumask_of_node NUMA_NO_NODE
aware. That means using the same trick the allocator does for this
special case.
Greg Kroah-Hartman Sept. 10, 2019, 11:12 a.m. UTC | #12
On Tue, Sep 10, 2019 at 01:04:51PM +0200, Michal Hocko wrote:
> On Tue 10-09-19 18:58:05, Yunsheng Lin wrote:
> > On 2019/9/10 17:31, Greg KH wrote:
> > > On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote:
> > >> On 2019/9/9 17:53, Greg KH wrote:
> > >>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote:
> > >>>> Currently a device does not belong to any of the numa nodes
> > >>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither
> > >>>> specified by fw nor by virtual device layer and the device has
> > >>>> no parent device.
> > >>>
> > >>> Is this really a problem?
> > >>
> > >> Not really.
> > >> Someone need to guess the node id when it is not specified, right?
> > > 
> > > No, why?  Guessing guarantees you will get it wrong on some systems.
> > > 
> > > Are you seeing real problems because the id is not being set?  What
> > > problem is this fixing that you can actually observe?
> > 
> > When passing the return value of dev_to_node() to cpumask_of_node()
> > without checking the node id if the node id is not valid, there is
> > global-out-of-bounds detected by KASAN as below:
> 
> OK, I seem to remember this being brought up already. And now when I
> think about it, we really want to make cpumask_of_node NUMA_NO_NODE
> aware. That means using the same trick the allocator does for this
> special case.

That seems reasonable to me, and much more "obvious" as to what is going
on.

thanks,

greg k-h
Yunsheng Lin Sept. 10, 2019, 12:47 p.m. UTC | #13
On 2019/9/10 19:12, Greg KH wrote:
> On Tue, Sep 10, 2019 at 01:04:51PM +0200, Michal Hocko wrote:
>> On Tue 10-09-19 18:58:05, Yunsheng Lin wrote:
>>> On 2019/9/10 17:31, Greg KH wrote:
>>>> On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote:
>>>>> On 2019/9/9 17:53, Greg KH wrote:
>>>>>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote:
>>>>>>> Currently a device does not belong to any of the numa nodes
>>>>>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither
>>>>>>> specified by fw nor by virtual device layer and the device has
>>>>>>> no parent device.
>>>>>>
>>>>>> Is this really a problem?
>>>>>
>>>>> Not really.
>>>>> Someone need to guess the node id when it is not specified, right?
>>>>
>>>> No, why?  Guessing guarantees you will get it wrong on some systems.
>>>>
>>>> Are you seeing real problems because the id is not being set?  What
>>>> problem is this fixing that you can actually observe?
>>>
>>> When passing the return value of dev_to_node() to cpumask_of_node()
>>> without checking the node id if the node id is not valid, there is
>>> global-out-of-bounds detected by KASAN as below:
>>
>> OK, I seem to remember this being brought up already. And now when I
>> think about it, we really want to make cpumask_of_node NUMA_NO_NODE
>> aware. That means using the same trick the allocator does for this
>> special case.
> 
> That seems reasonable to me, and much more "obvious" as to what is going
> on.
> 

Ok, thanks for the suggestion.

For arm64 and x86, there are two versions of cpumask_of_node().

when CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node()
   in arch/x86/mm/numa.c is used, which does partial node id checking:

const struct cpumask *cpumask_of_node(int node)
{
        if (node >= nr_node_ids) {
                printk(KERN_WARNING
                        "cpumask_of_node(%d): node > nr_node_ids(%u)\n",
                        node, nr_node_ids);
                dump_stack();
                return cpu_none_mask;
        }
        if (node_to_cpumask_map[node] == NULL) {
                printk(KERN_WARNING
                        "cpumask_of_node(%d): no node_to_cpumask_map!\n",
                        node);
                dump_stack();
                return cpu_online_mask;
        }
        return node_to_cpumask_map[node];
}

when CONFIG_DEBUG_PER_CPU_MAPS is undefined, the cpumask_of_node()
   in arch/x86/include/asm/topology.h is used:

static inline const struct cpumask *cpumask_of_node(int node)
{
        return node_to_cpumask_map[node];
}

As discussion in [1], adding the checking in cpumask_of_node() with
CONFIG_DEBUG_PER_CPU_MAPS not defined increases overhead for everyone,
and it is already true that cpumask_of_node() requires a valid node_id.

So maybe the overhead is worth it?

Hi, Peter
	Does the argument in this thread about making cpumask_of_node()
NUMA_NO_NODE aware make sense to you?

[1] https://lore.kernel.org/patchwork/patch/1122516/
Michal Hocko Sept. 10, 2019, 12:53 p.m. UTC | #14
On Tue 10-09-19 20:47:40, Yunsheng Lin wrote:
> On 2019/9/10 19:12, Greg KH wrote:
> > On Tue, Sep 10, 2019 at 01:04:51PM +0200, Michal Hocko wrote:
> >> On Tue 10-09-19 18:58:05, Yunsheng Lin wrote:
> >>> On 2019/9/10 17:31, Greg KH wrote:
> >>>> On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote:
> >>>>> On 2019/9/9 17:53, Greg KH wrote:
> >>>>>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote:
> >>>>>>> Currently a device does not belong to any of the numa nodes
> >>>>>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither
> >>>>>>> specified by fw nor by virtual device layer and the device has
> >>>>>>> no parent device.
> >>>>>>
> >>>>>> Is this really a problem?
> >>>>>
> >>>>> Not really.
> >>>>> Someone need to guess the node id when it is not specified, right?
> >>>>
> >>>> No, why?  Guessing guarantees you will get it wrong on some systems.
> >>>>
> >>>> Are you seeing real problems because the id is not being set?  What
> >>>> problem is this fixing that you can actually observe?
> >>>
> >>> When passing the return value of dev_to_node() to cpumask_of_node()
> >>> without checking the node id if the node id is not valid, there is
> >>> global-out-of-bounds detected by KASAN as below:
> >>
> >> OK, I seem to remember this being brought up already. And now when I
> >> think about it, we really want to make cpumask_of_node NUMA_NO_NODE
> >> aware. That means using the same trick the allocator does for this
> >> special case.
> > 
> > That seems reasonable to me, and much more "obvious" as to what is going
> > on.
> > 
> 
> Ok, thanks for the suggestion.
> 
> For arm64 and x86, there are two versions of cpumask_of_node().
> 
> when CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node()
>    in arch/x86/mm/numa.c is used, which does partial node id checking:
> 
> const struct cpumask *cpumask_of_node(int node)
> {
>         if (node >= nr_node_ids) {
>                 printk(KERN_WARNING
>                         "cpumask_of_node(%d): node > nr_node_ids(%u)\n",
>                         node, nr_node_ids);
>                 dump_stack();
>                 return cpu_none_mask;
>         }
>         if (node_to_cpumask_map[node] == NULL) {
>                 printk(KERN_WARNING
>                         "cpumask_of_node(%d): no node_to_cpumask_map!\n",
>                         node);
>                 dump_stack();
>                 return cpu_online_mask;
>         }
>         return node_to_cpumask_map[node];
> }
> 
> when CONFIG_DEBUG_PER_CPU_MAPS is undefined, the cpumask_of_node()
>    in arch/x86/include/asm/topology.h is used:
> 
> static inline const struct cpumask *cpumask_of_node(int node)
> {
>         return node_to_cpumask_map[node];
> }

I would simply go with. There shouldn't be any need for heavy weight
checks that CONFIG_DEBUG_PER_CPU_MAPS has.

static inline const struct cpumask *cpumask_of_node(int node)
{
	/* A nice comment goes here */
	if (node == NUMA_NO_NODE)
		return node_to_cpumask_map[numa_mem_id()];
        return node_to_cpumask_map[node];
}
Michal Hocko Sept. 11, 2019, 5:33 a.m. UTC | #15
On Tue 10-09-19 14:53:39, Michal Hocko wrote:
> On Tue 10-09-19 20:47:40, Yunsheng Lin wrote:
> > On 2019/9/10 19:12, Greg KH wrote:
> > > On Tue, Sep 10, 2019 at 01:04:51PM +0200, Michal Hocko wrote:
> > >> On Tue 10-09-19 18:58:05, Yunsheng Lin wrote:
> > >>> On 2019/9/10 17:31, Greg KH wrote:
> > >>>> On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote:
> > >>>>> On 2019/9/9 17:53, Greg KH wrote:
> > >>>>>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote:
> > >>>>>>> Currently a device does not belong to any of the numa nodes
> > >>>>>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither
> > >>>>>>> specified by fw nor by virtual device layer and the device has
> > >>>>>>> no parent device.
> > >>>>>>
> > >>>>>> Is this really a problem?
> > >>>>>
> > >>>>> Not really.
> > >>>>> Someone need to guess the node id when it is not specified, right?
> > >>>>
> > >>>> No, why?  Guessing guarantees you will get it wrong on some systems.
> > >>>>
> > >>>> Are you seeing real problems because the id is not being set?  What
> > >>>> problem is this fixing that you can actually observe?
> > >>>
> > >>> When passing the return value of dev_to_node() to cpumask_of_node()
> > >>> without checking the node id if the node id is not valid, there is
> > >>> global-out-of-bounds detected by KASAN as below:
> > >>
> > >> OK, I seem to remember this being brought up already. And now when I
> > >> think about it, we really want to make cpumask_of_node NUMA_NO_NODE
> > >> aware. That means using the same trick the allocator does for this
> > >> special case.
> > > 
> > > That seems reasonable to me, and much more "obvious" as to what is going
> > > on.
> > > 
> > 
> > Ok, thanks for the suggestion.
> > 
> > For arm64 and x86, there are two versions of cpumask_of_node().
> > 
> > when CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node()
> >    in arch/x86/mm/numa.c is used, which does partial node id checking:
> > 
> > const struct cpumask *cpumask_of_node(int node)
> > {
> >         if (node >= nr_node_ids) {
> >                 printk(KERN_WARNING
> >                         "cpumask_of_node(%d): node > nr_node_ids(%u)\n",
> >                         node, nr_node_ids);
> >                 dump_stack();
> >                 return cpu_none_mask;
> >         }
> >         if (node_to_cpumask_map[node] == NULL) {
> >                 printk(KERN_WARNING
> >                         "cpumask_of_node(%d): no node_to_cpumask_map!\n",
> >                         node);
> >                 dump_stack();
> >                 return cpu_online_mask;
> >         }
> >         return node_to_cpumask_map[node];
> > }
> > 
> > when CONFIG_DEBUG_PER_CPU_MAPS is undefined, the cpumask_of_node()
> >    in arch/x86/include/asm/topology.h is used:
> > 
> > static inline const struct cpumask *cpumask_of_node(int node)
> > {
> >         return node_to_cpumask_map[node];
> > }
> 
> I would simply go with. There shouldn't be any need for heavy weight
> checks that CONFIG_DEBUG_PER_CPU_MAPS has.
> 
> static inline const struct cpumask *cpumask_of_node(int node)
> {
> 	/* A nice comment goes here */
> 	if (node == NUMA_NO_NODE)
> 		return node_to_cpumask_map[numa_mem_id()];
>         return node_to_cpumask_map[node];
> }

Sleeping over this and thinking more about the actual semantic the above
is wrong. We cannot really copy the page allocator logic. Why? Simply
because the page allocator doesn't enforce the near node affinity. It
just picks it up as a preferred node but then it is free to fallback to
any other numa node. This is not the case here and node_to_cpumask_map will
only restrict to the particular node's cpus which would have really non
deterministic behavior depending on where the code is executed. So in
fact we really want to return cpu_online_mask for NUMA_NO_NODE.

Sorry about the confusion.
> -- 
> Michal Hocko
> SUSE Labs
Yunsheng Lin Sept. 11, 2019, 6:15 a.m. UTC | #16
On 2019/9/11 13:33, Michal Hocko wrote:
> On Tue 10-09-19 14:53:39, Michal Hocko wrote:
>> On Tue 10-09-19 20:47:40, Yunsheng Lin wrote:
>>> On 2019/9/10 19:12, Greg KH wrote:
>>>> On Tue, Sep 10, 2019 at 01:04:51PM +0200, Michal Hocko wrote:
>>>>> On Tue 10-09-19 18:58:05, Yunsheng Lin wrote:
>>>>>> On 2019/9/10 17:31, Greg KH wrote:
>>>>>>> On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote:
>>>>>>>> On 2019/9/9 17:53, Greg KH wrote:
>>>>>>>>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote:
>>>>>>>>>> Currently a device does not belong to any of the numa nodes
>>>>>>>>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither
>>>>>>>>>> specified by fw nor by virtual device layer and the device has
>>>>>>>>>> no parent device.
>>>>>>>>>
>>>>>>>>> Is this really a problem?
>>>>>>>>
>>>>>>>> Not really.
>>>>>>>> Someone need to guess the node id when it is not specified, right?
>>>>>>>
>>>>>>> No, why?  Guessing guarantees you will get it wrong on some systems.
>>>>>>>
>>>>>>> Are you seeing real problems because the id is not being set?  What
>>>>>>> problem is this fixing that you can actually observe?
>>>>>>
>>>>>> When passing the return value of dev_to_node() to cpumask_of_node()
>>>>>> without checking the node id if the node id is not valid, there is
>>>>>> global-out-of-bounds detected by KASAN as below:
>>>>>
>>>>> OK, I seem to remember this being brought up already. And now when I
>>>>> think about it, we really want to make cpumask_of_node NUMA_NO_NODE
>>>>> aware. That means using the same trick the allocator does for this
>>>>> special case.
>>>>
>>>> That seems reasonable to me, and much more "obvious" as to what is going
>>>> on.
>>>>
>>>
>>> Ok, thanks for the suggestion.
>>>
>>> For arm64 and x86, there are two versions of cpumask_of_node().
>>>
>>> when CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node()
>>>    in arch/x86/mm/numa.c is used, which does partial node id checking:
>>>
>>> const struct cpumask *cpumask_of_node(int node)
>>> {
>>>         if (node >= nr_node_ids) {
>>>                 printk(KERN_WARNING
>>>                         "cpumask_of_node(%d): node > nr_node_ids(%u)\n",
>>>                         node, nr_node_ids);
>>>                 dump_stack();
>>>                 return cpu_none_mask;
>>>         }
>>>         if (node_to_cpumask_map[node] == NULL) {
>>>                 printk(KERN_WARNING
>>>                         "cpumask_of_node(%d): no node_to_cpumask_map!\n",
>>>                         node);
>>>                 dump_stack();
>>>                 return cpu_online_mask;
>>>         }
>>>         return node_to_cpumask_map[node];
>>> }
>>>
>>> when CONFIG_DEBUG_PER_CPU_MAPS is undefined, the cpumask_of_node()
>>>    in arch/x86/include/asm/topology.h is used:
>>>
>>> static inline const struct cpumask *cpumask_of_node(int node)
>>> {
>>>         return node_to_cpumask_map[node];
>>> }
>>
>> I would simply go with. There shouldn't be any need for heavy weight
>> checks that CONFIG_DEBUG_PER_CPU_MAPS has.
>>
>> static inline const struct cpumask *cpumask_of_node(int node)
>> {
>> 	/* A nice comment goes here */
>> 	if (node == NUMA_NO_NODE)

How about "(unsigned int)node >= nr_node_ids", this is suggested
by Peter, it checks the case where the node id set by fw is bigger
or equal than nr_node_ids, and still handle the < 0 case, which
includes NUMA_NO_NODE.

Maybe define a macro like below to do that in order to do
the node checking consistently through kernel:

#define numa_node_valid(node)	((unsigned int)(node) < nr_node_ids)


>> 		return node_to_cpumask_map[numa_mem_id()];
>>         return node_to_cpumask_map[node];
>> }
> 
> Sleeping over this and thinking more about the actual semantic the above
> is wrong. We cannot really copy the page allocator logic. Why? Simply
> because the page allocator doesn't enforce the near node affinity. It
> just picks it up as a preferred node but then it is free to fallback to
> any other numa node. This is not the case here and node_to_cpumask_map will
> only restrict to the particular node's cpus which would have really non
> deterministic behavior depending on where the code is executed. So in
> fact we really want to return cpu_online_mask for NUMA_NO_NODE.

From below, if the __GFP_THISNODE is set, the fallback is not performed.
For node_to_cpumask_map() case, maybe we can return the cpumask that is
on the node of cpu_to_node(raw_smp_processor_id()) for NUMA_NO_NODE,
because the current cpu does belong to a node, and the node does have at
least one cpu, which is the cpu is calling the node_to_cpumask_map().

Make any sense?

/*
 * Allocate pages, preferring the node given as nid. The node must be valid and
 * online. For more general interface, see alloc_pages_node().
 */
static inline struct page *
__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
{
	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));

	return __alloc_pages(gfp_mask, order, nid);
}

> 
> Sorry about the confusion.
>> -- 
>> Michal Hocko
>> SUSE Labs
>
Michal Hocko Sept. 11, 2019, 6:49 a.m. UTC | #17
On Wed 11-09-19 14:15:51, Yunsheng Lin wrote:
> On 2019/9/11 13:33, Michal Hocko wrote:
> > On Tue 10-09-19 14:53:39, Michal Hocko wrote:
> >> On Tue 10-09-19 20:47:40, Yunsheng Lin wrote:
> >>> On 2019/9/10 19:12, Greg KH wrote:
> >>>> On Tue, Sep 10, 2019 at 01:04:51PM +0200, Michal Hocko wrote:
> >>>>> On Tue 10-09-19 18:58:05, Yunsheng Lin wrote:
> >>>>>> On 2019/9/10 17:31, Greg KH wrote:
> >>>>>>> On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote:
> >>>>>>>> On 2019/9/9 17:53, Greg KH wrote:
> >>>>>>>>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote:
> >>>>>>>>>> Currently a device does not belong to any of the numa nodes
> >>>>>>>>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither
> >>>>>>>>>> specified by fw nor by virtual device layer and the device has
> >>>>>>>>>> no parent device.
> >>>>>>>>>
> >>>>>>>>> Is this really a problem?
> >>>>>>>>
> >>>>>>>> Not really.
> >>>>>>>> Someone need to guess the node id when it is not specified, right?
> >>>>>>>
> >>>>>>> No, why?  Guessing guarantees you will get it wrong on some systems.
> >>>>>>>
> >>>>>>> Are you seeing real problems because the id is not being set?  What
> >>>>>>> problem is this fixing that you can actually observe?
> >>>>>>
> >>>>>> When passing the return value of dev_to_node() to cpumask_of_node()
> >>>>>> without checking the node id if the node id is not valid, there is
> >>>>>> global-out-of-bounds detected by KASAN as below:
> >>>>>
> >>>>> OK, I seem to remember this being brought up already. And now when I
> >>>>> think about it, we really want to make cpumask_of_node NUMA_NO_NODE
> >>>>> aware. That means using the same trick the allocator does for this
> >>>>> special case.
> >>>>
> >>>> That seems reasonable to me, and much more "obvious" as to what is going
> >>>> on.
> >>>>
> >>>
> >>> Ok, thanks for the suggestion.
> >>>
> >>> For arm64 and x86, there are two versions of cpumask_of_node().
> >>>
> >>> when CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node()
> >>>    in arch/x86/mm/numa.c is used, which does partial node id checking:
> >>>
> >>> const struct cpumask *cpumask_of_node(int node)
> >>> {
> >>>         if (node >= nr_node_ids) {
> >>>                 printk(KERN_WARNING
> >>>                         "cpumask_of_node(%d): node > nr_node_ids(%u)\n",
> >>>                         node, nr_node_ids);
> >>>                 dump_stack();
> >>>                 return cpu_none_mask;
> >>>         }
> >>>         if (node_to_cpumask_map[node] == NULL) {
> >>>                 printk(KERN_WARNING
> >>>                         "cpumask_of_node(%d): no node_to_cpumask_map!\n",
> >>>                         node);
> >>>                 dump_stack();
> >>>                 return cpu_online_mask;
> >>>         }
> >>>         return node_to_cpumask_map[node];
> >>> }
> >>>
> >>> when CONFIG_DEBUG_PER_CPU_MAPS is undefined, the cpumask_of_node()
> >>>    in arch/x86/include/asm/topology.h is used:
> >>>
> >>> static inline const struct cpumask *cpumask_of_node(int node)
> >>> {
> >>>         return node_to_cpumask_map[node];
> >>> }
> >>
> >> I would simply go with. There shouldn't be any need for heavy weight
> >> checks that CONFIG_DEBUG_PER_CPU_MAPS has.
> >>
> >> static inline const struct cpumask *cpumask_of_node(int node)
> >> {
> >> 	/* A nice comment goes here */
> >> 	if (node == NUMA_NO_NODE)
> 
> How about "(unsigned int)node >= nr_node_ids", this is suggested
> by Peter, it checks the case where the node id set by fw is bigger
> or equal than nr_node_ids, and still handle the < 0 case, which
> includes NUMA_NO_NODE.

Isn't that a plain bug? Is something like that really happening?

> Maybe define a macro like below to do that in order to do
> the node checking consistently through kernel:
> 
> #define numa_node_valid(node)	((unsigned int)(node) < nr_node_ids)
> 
> 
> >> 		return node_to_cpumask_map[numa_mem_id()];
> >>         return node_to_cpumask_map[node];
> >> }
> > 
> > Sleeping over this and thinking more about the actual semantic the above
> > is wrong. We cannot really copy the page allocator logic. Why? Simply
> > because the page allocator doesn't enforce the near node affinity. It
> > just picks it up as a preferred node but then it is free to fallback to
> > any other numa node. This is not the case here and node_to_cpumask_map will
> > only restrict to the particular node's cpus which would have really non
> > deterministic behavior depending on where the code is executed. So in
> > fact we really want to return cpu_online_mask for NUMA_NO_NODE.
> 
> From below, if the __GFP_THISNODE is set, the fallback is not performed.

__GFP_THISNODE is a very specific situation when the caller knows which
node should be used for the allocation. NUMA_NO_NODE && __GFP_THISNODE
is a bug and I would dare to call it an undefined behavior.

> For node_to_cpumask_map() case, maybe we can return the cpumask that is
> on the node of cpu_to_node(raw_smp_processor_id()) for NUMA_NO_NODE,
> because the current cpu does belong to a node, and the node does have at
> least one cpu, which is the cpu is calling the node_to_cpumask_map().
> 
> Make any sense?

No. Please read the above paragraph again. NUMA_NO_NODE really means no
node affinity. So all cpus should be usable. Making any assumptions
about a local context is just wrong.
Yunsheng Lin Sept. 11, 2019, 7:22 a.m. UTC | #18
On 2019/9/11 14:49, Michal Hocko wrote:
> On Wed 11-09-19 14:15:51, Yunsheng Lin wrote:
>> On 2019/9/11 13:33, Michal Hocko wrote:
>>> On Tue 10-09-19 14:53:39, Michal Hocko wrote:
>>>> On Tue 10-09-19 20:47:40, Yunsheng Lin wrote:
>>>>> On 2019/9/10 19:12, Greg KH wrote:
>>>>>> On Tue, Sep 10, 2019 at 01:04:51PM +0200, Michal Hocko wrote:
>>>>>>> On Tue 10-09-19 18:58:05, Yunsheng Lin wrote:
>>>>>>>> On 2019/9/10 17:31, Greg KH wrote:
>>>>>>>>> On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote:
>>>>>>>>>> On 2019/9/9 17:53, Greg KH wrote:
>>>>>>>>>>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote:
>>>>>>>>>>>> Currently a device does not belong to any of the numa nodes
>>>>>>>>>>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither
>>>>>>>>>>>> specified by fw nor by virtual device layer and the device has
>>>>>>>>>>>> no parent device.
>>>>>>>>>>>
>>>>>>>>>>> Is this really a problem?
>>>>>>>>>>
>>>>>>>>>> Not really.
>>>>>>>>>> Someone need to guess the node id when it is not specified, right?
>>>>>>>>>
>>>>>>>>> No, why?  Guessing guarantees you will get it wrong on some systems.
>>>>>>>>>
>>>>>>>>> Are you seeing real problems because the id is not being set?  What
>>>>>>>>> problem is this fixing that you can actually observe?
>>>>>>>>
>>>>>>>> When passing the return value of dev_to_node() to cpumask_of_node()
>>>>>>>> without checking the node id if the node id is not valid, there is
>>>>>>>> global-out-of-bounds detected by KASAN as below:
>>>>>>>
>>>>>>> OK, I seem to remember this being brought up already. And now when I
>>>>>>> think about it, we really want to make cpumask_of_node NUMA_NO_NODE
>>>>>>> aware. That means using the same trick the allocator does for this
>>>>>>> special case.
>>>>>>
>>>>>> That seems reasonable to me, and much more "obvious" as to what is going
>>>>>> on.
>>>>>>
>>>>>
>>>>> Ok, thanks for the suggestion.
>>>>>
>>>>> For arm64 and x86, there are two versions of cpumask_of_node().
>>>>>
>>>>> when CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node()
>>>>>    in arch/x86/mm/numa.c is used, which does partial node id checking:
>>>>>
>>>>> const struct cpumask *cpumask_of_node(int node)
>>>>> {
>>>>>         if (node >= nr_node_ids) {
>>>>>                 printk(KERN_WARNING
>>>>>                         "cpumask_of_node(%d): node > nr_node_ids(%u)\n",
>>>>>                         node, nr_node_ids);
>>>>>                 dump_stack();
>>>>>                 return cpu_none_mask;
>>>>>         }
>>>>>         if (node_to_cpumask_map[node] == NULL) {
>>>>>                 printk(KERN_WARNING
>>>>>                         "cpumask_of_node(%d): no node_to_cpumask_map!\n",
>>>>>                         node);
>>>>>                 dump_stack();
>>>>>                 return cpu_online_mask;
>>>>>         }
>>>>>         return node_to_cpumask_map[node];
>>>>> }
>>>>>
>>>>> when CONFIG_DEBUG_PER_CPU_MAPS is undefined, the cpumask_of_node()
>>>>>    in arch/x86/include/asm/topology.h is used:
>>>>>
>>>>> static inline const struct cpumask *cpumask_of_node(int node)
>>>>> {
>>>>>         return node_to_cpumask_map[node];
>>>>> }
>>>>
>>>> I would simply go with. There shouldn't be any need for heavy weight
>>>> checks that CONFIG_DEBUG_PER_CPU_MAPS has.
>>>>
>>>> static inline const struct cpumask *cpumask_of_node(int node)
>>>> {
>>>> 	/* A nice comment goes here */
>>>> 	if (node == NUMA_NO_NODE)
>>
>> How about "(unsigned int)node >= nr_node_ids", this is suggested
>> by Peter, it checks the case where the node id set by fw is bigger
>> or equal than nr_node_ids, and still handle the < 0 case, which
>> includes NUMA_NO_NODE.
> 
> Isn't that a plain bug? Is something like that really happening?

I have not seen one happened before except the NUMA_NO_NODE case.
Even with NUMA_NO_NODE case, we did not see it until we turn on
the KASAN detection.

It seems that there is no protection that prevent setting the node
of device to an invalid node.
And the kernel does have a few different check now:
1) some does " < 0" check;
2) some does "== NUMA_NO_NODE" check;
3) some does ">= MAX_NUMNODES" check;
4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check.

We need to be consistent about the checking, right?
Michal Hocko Sept. 11, 2019, 7:34 a.m. UTC | #19
On Wed 11-09-19 15:22:30, Yunsheng Lin wrote:
[...]
> It seems that there is no protection that prevent setting the node
> of device to an invalid node.
> And the kernel does have a few different check now:
> 1) some does " < 0" check;
> 2) some does "== NUMA_NO_NODE" check;
> 3) some does ">= MAX_NUMNODES" check;
> 4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check.
> 
> We need to be consistent about the checking, right?

You can try and chase each of them and see what to do with them. I
suspect they are a result of random attempts to fortify the code in many
cases. Consistency is certainly good but spreading more checks all over
the place just adds more cargo cult. Each check should be reasonably
justified.
Yunsheng Lin Sept. 11, 2019, 11:03 a.m. UTC | #20
On 2019/9/11 15:34, Michal Hocko wrote:
> On Wed 11-09-19 15:22:30, Yunsheng Lin wrote:
> [...]
>> It seems that there is no protection that prevent setting the node
>> of device to an invalid node.
>> And the kernel does have a few different check now:
>> 1) some does " < 0" check;
>> 2) some does "== NUMA_NO_NODE" check;
>> 3) some does ">= MAX_NUMNODES" check;
>> 4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check.
>>
>> We need to be consistent about the checking, right?
> 
> You can try and chase each of them and see what to do with them. I
> suspect they are a result of random attempts to fortify the code in many
> cases. Consistency is certainly good but spreading more checks all over
> the place just adds more cargo cult. Each check should be reasonably
> justified.

Ok, Let me focus on making the node_to_cpumask_map() NUMA_NO_NODE aware
by only checking "node == NUMA_NO_NODE" first.

>
Yunsheng Lin Sept. 11, 2019, 11:41 a.m. UTC | #21
On 2019/9/11 19:03, Yunsheng Lin wrote:
> On 2019/9/11 15:34, Michal Hocko wrote:
>> On Wed 11-09-19 15:22:30, Yunsheng Lin wrote:
>> [...]
>>> It seems that there is no protection that prevent setting the node
>>> of device to an invalid node.
>>> And the kernel does have a few different check now:
>>> 1) some does " < 0" check;
>>> 2) some does "== NUMA_NO_NODE" check;
>>> 3) some does ">= MAX_NUMNODES" check;
>>> 4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check.
>>>
>>> We need to be consistent about the checking, right?
>>
>> You can try and chase each of them and see what to do with them. I
>> suspect they are a result of random attempts to fortify the code in many
>> cases. Consistency is certainly good but spreading more checks all over
>> the place just adds more cargo cult. Each check should be reasonably
>> justified.
> 
> Ok, Let me focus on making the node_to_cpumask_map() NUMA_NO_NODE aware
> by only checking "node == NUMA_NO_NODE" first.

Hi, Michal
    It that ok for me to add your name to "Suggested-by" tag, since I am
going to quote some of your words on the commit log.

> 
>>
> 
> 
> .
>
Michal Hocko Sept. 11, 2019, 12:02 p.m. UTC | #22
On Wed 11-09-19 19:41:44, Yunsheng Lin wrote:
> On 2019/9/11 19:03, Yunsheng Lin wrote:
> > On 2019/9/11 15:34, Michal Hocko wrote:
> >> On Wed 11-09-19 15:22:30, Yunsheng Lin wrote:
> >> [...]
> >>> It seems that there is no protection that prevent setting the node
> >>> of device to an invalid node.
> >>> And the kernel does have a few different check now:
> >>> 1) some does " < 0" check;
> >>> 2) some does "== NUMA_NO_NODE" check;
> >>> 3) some does ">= MAX_NUMNODES" check;
> >>> 4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check.
> >>>
> >>> We need to be consistent about the checking, right?
> >>
> >> You can try and chase each of them and see what to do with them. I
> >> suspect they are a result of random attempts to fortify the code in many
> >> cases. Consistency is certainly good but spreading more checks all over
> >> the place just adds more cargo cult. Each check should be reasonably
> >> justified.
> > 
> > Ok, Let me focus on making the node_to_cpumask_map() NUMA_NO_NODE aware
> > by only checking "node == NUMA_NO_NODE" first.
> 
> Hi, Michal
>     It that ok for me to add your name to "Suggested-by" tag, since I am
> going to quote some of your words on the commit log.

Sure, no problem. Thanks!
Peter Zijlstra Sept. 23, 2019, 3:09 p.m. UTC | #23
On Wed, Sep 11, 2019 at 08:49:26AM +0200, Michal Hocko wrote:
> On Wed 11-09-19 14:15:51, Yunsheng Lin wrote:

> > >>>>>> When passing the return value of dev_to_node() to cpumask_of_node()
> > >>>>>> without checking the node id if the node id is not valid, there is
> > >>>>>> global-out-of-bounds detected by KASAN as below:
> > >>>>>
> > >>>>> OK, I seem to remember this being brought up already. And now when I
> > >>>>> think about it, we really want to make cpumask_of_node NUMA_NO_NODE
> > >>>>> aware. That means using the same trick the allocator does for this
> > >>>>> special case.

> No. Please read the above paragraph again. NUMA_NO_NODE really means no
> node affinity. So all cpus should be usable. Making any assumptions
> about a local context is just wrong.

So none of this makes sense to me. How can a device have NUMA_NO_NODE on
a NUMA machine. It needs to have a physical presence _somwhere_; and
that cannot be equidistant from all CPUs.

Please explain how this makes physical sense.

Patch
diff mbox series

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 1669d41..f79ad20 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -2107,9 +2107,13 @@  int device_add(struct device *dev)
 	if (kobj)
 		dev->kobj.parent = kobj;
 
-	/* use parent numa_node */
-	if (parent && (dev_to_node(dev) == NUMA_NO_NODE))
-		set_dev_node(dev, dev_to_node(parent));
+	/* use parent numa_node or default node 0 */
+	if (!numa_node_valid(dev_to_node(dev))) {
+		if (parent && numa_node_valid(dev_to_node(parent)))
+			set_dev_node(dev, dev_to_node(parent));
+		else
+			set_dev_node(dev, 0);
+	}
 
 	/* first, register with generic layer. */
 	/* we require the name to be set before, and pass NULL */
diff --git a/include/linux/numa.h b/include/linux/numa.h
index 110b0e5..eccc757 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -13,4 +13,6 @@ 
 
 #define	NUMA_NO_NODE	(-1)
 
+#define numa_node_valid(node)	((unsigned int)(node) < nr_node_ids)
+
 #endif /* _LINUX_NUMA_H */