* [PATCH] driver core: ensure a device has valid node id in device_add() @ 2019-09-09 6:04 Yunsheng Lin 2019-09-09 9:53 ` Greg KH 2019-09-09 18:50 ` Michal Hocko 0 siblings, 2 replies; 24+ messages in thread From: Yunsheng Lin @ 2019-09-09 6:04 UTC (permalink / raw) To: gregkh, rafael; +Cc: linux-kernel, peterz, mingo, mhocko, linuxarm Currently a device does not belong to any of the numa nodes (dev->numa_node is NUMA_NO_NODE) when the node id is neither specified by fw nor by virtual device layer and the device has no parent device. According to discussion in [1]: Even if a device's numa node is not specified, the device really does belong to a node. This patch sets the device node to node 0 in device_add() if the device's node id is not specified and it either has no parent device, or the parent device also does not have a valid node id. [1] https://lkml.org/lkml/2019/9/2/466 Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> --- Changelog RFC -> v1: 1. Drop log error message and use a "if" instead of "? :". 2. Drop the RFC tag. --- drivers/base/core.c | 10 +++++++--- include/linux/numa.h | 2 ++ 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index 1669d41..f79ad20 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -2107,9 +2107,13 @@ int device_add(struct device *dev) if (kobj) dev->kobj.parent = kobj; - /* use parent numa_node */ - if (parent && (dev_to_node(dev) == NUMA_NO_NODE)) - set_dev_node(dev, dev_to_node(parent)); + /* use parent numa_node or default node 0 */ + if (!numa_node_valid(dev_to_node(dev))) { + if (parent && numa_node_valid(dev_to_node(parent))) + set_dev_node(dev, dev_to_node(parent)); + else + set_dev_node(dev, 0); + } /* first, register with generic layer. */ /* we require the name to be set before, and pass NULL */ diff --git a/include/linux/numa.h b/include/linux/numa.h index 110b0e5..eccc757 100644 --- a/include/linux/numa.h +++ b/include/linux/numa.h @@ -13,4 +13,6 @@ #define NUMA_NO_NODE (-1) +#define numa_node_valid(node) ((unsigned int)(node) < nr_node_ids) + #endif /* _LINUX_NUMA_H */ -- 2.8.1 ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-09 6:04 [PATCH] driver core: ensure a device has valid node id in device_add() Yunsheng Lin @ 2019-09-09 9:53 ` Greg KH 2019-09-10 6:43 ` Yunsheng Lin 2019-09-09 18:50 ` Michal Hocko 1 sibling, 1 reply; 24+ messages in thread From: Greg KH @ 2019-09-09 9:53 UTC (permalink / raw) To: Yunsheng Lin; +Cc: rafael, linux-kernel, peterz, mingo, mhocko, linuxarm On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote: > Currently a device does not belong to any of the numa nodes > (dev->numa_node is NUMA_NO_NODE) when the node id is neither > specified by fw nor by virtual device layer and the device has > no parent device. Is this really a problem? > According to discussion in [1]: > Even if a device's numa node is not specified, the device really > does belong to a node. But as we do not know the node, can we cause more harm by randomly picking one (i.e. putting it all in node 0)? > This patch sets the device node to node 0 in device_add() if the > device's node id is not specified and it either has no parent > device, or the parent device also does not have a valid node id. > > [1] https://lkml.org/lkml/2019/9/2/466 > > Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> > --- > Changelog RFC -> v1: > 1. Drop log error message and use a "if" instead of "? :". > 2. Drop the RFC tag. > --- > drivers/base/core.c | 10 +++++++--- > include/linux/numa.h | 2 ++ > 2 files changed, 9 insertions(+), 3 deletions(-) > > diff --git a/drivers/base/core.c b/drivers/base/core.c > index 1669d41..f79ad20 100644 > --- a/drivers/base/core.c > +++ b/drivers/base/core.c > @@ -2107,9 +2107,13 @@ int device_add(struct device *dev) > if (kobj) > dev->kobj.parent = kobj; > > - /* use parent numa_node */ > - if (parent && (dev_to_node(dev) == NUMA_NO_NODE)) > - set_dev_node(dev, dev_to_node(parent)); > + /* use parent numa_node or default node 0 */ > + if (!numa_node_valid(dev_to_node(dev))) { > + if (parent && numa_node_valid(dev_to_node(parent))) > + set_dev_node(dev, dev_to_node(parent)); > + else > + set_dev_node(dev, 0); > + } Again, is this going to cause more harm than good? What happens if we leave it as "unknown", isn't that better than thinking we "know" it is in node 0? thanks, greg k-h ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-09 9:53 ` Greg KH @ 2019-09-10 6:43 ` Yunsheng Lin 2019-09-10 7:13 ` Michal Hocko 2019-09-10 9:31 ` Greg KH 0 siblings, 2 replies; 24+ messages in thread From: Yunsheng Lin @ 2019-09-10 6:43 UTC (permalink / raw) To: Greg KH; +Cc: rafael, linux-kernel, peterz, mingo, mhocko, linuxarm On 2019/9/9 17:53, Greg KH wrote: > On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote: >> Currently a device does not belong to any of the numa nodes >> (dev->numa_node is NUMA_NO_NODE) when the node id is neither >> specified by fw nor by virtual device layer and the device has >> no parent device. > > Is this really a problem? Not really. Someone need to guess the node id when it is not specified, right? This patch chooses to guess the node id in the driver core. > >> According to discussion in [1]: >> Even if a device's numa node is not specified, the device really >> does belong to a node. > > But as we do not know the node, can we cause more harm by randomly > picking one (i.e. putting it all in node 0)? If we do not pick node 0 for device with invalid node, then caller need to check the node id and pick one, and currently different callers does a different checking: 1) some does " < 0" check; 2) some does "== NUMA_NO_NODE" check; 3) some does ">= MAX_NUMNODES" check; 4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check. and caller of dev_to_node() may pick one node based on below if the dev_to_node() return a invalid node based on above checking: 1) based on numa_mem_id(). 2) pick a random one like in workqueue_select_cpu_near(). If we pick node 0 for device with invalid node in device_add(), we may avoid the above different checking and picking for caller, but we may lose some caller context info, for example, user may use node of the cpu on which the process is using the device to allocate the resource close to the process, or user may pick a random one if they know what they are doing. It seems there is trade off here, as I can see, we can guess and pick the node at different stage when it is not specified. 1. guess and pick node 0 at device_add(), it has the advantage of ensure all devices will have a valid node at very begin of device creation, so the user does not have to check and guess one, but user might lose the opportunity to do their own guessing and picking. 2. Maybe provide a dev_to_valid_node() to always return a valid node id, for example return numa_mem_id() if dev->numa_node is not valid. User know what they are doing can still use dev_to_node(). 3. Caller of dev_to_node() do their own checking and picking, which might lead to adding more different and reduplicate checking as above. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-10 6:43 ` Yunsheng Lin @ 2019-09-10 7:13 ` Michal Hocko 2019-09-10 9:31 ` Greg KH 1 sibling, 0 replies; 24+ messages in thread From: Michal Hocko @ 2019-09-10 7:13 UTC (permalink / raw) To: Yunsheng Lin; +Cc: Greg KH, rafael, linux-kernel, peterz, mingo, linuxarm On Tue 10-09-19 14:43:32, Yunsheng Lin wrote: > On 2019/9/9 17:53, Greg KH wrote: [...] > > But as we do not know the node, can we cause more harm by randomly > > picking one (i.e. putting it all in node 0)? > If we do not pick node 0 for device with invalid node, then caller need > to check the node id and pick one, and currently different callers > does a different checking: Could you be more specific about who those callers are why do they need the node id for? Because the page allocator can handle that as already pointed out in my other email. Who else does really have to know about the node apart the allocator? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-10 6:43 ` Yunsheng Lin 2019-09-10 7:13 ` Michal Hocko @ 2019-09-10 9:31 ` Greg KH 2019-09-10 10:58 ` Yunsheng Lin 1 sibling, 1 reply; 24+ messages in thread From: Greg KH @ 2019-09-10 9:31 UTC (permalink / raw) To: Yunsheng Lin; +Cc: rafael, linux-kernel, peterz, mingo, mhocko, linuxarm On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote: > On 2019/9/9 17:53, Greg KH wrote: > > On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote: > >> Currently a device does not belong to any of the numa nodes > >> (dev->numa_node is NUMA_NO_NODE) when the node id is neither > >> specified by fw nor by virtual device layer and the device has > >> no parent device. > > > > Is this really a problem? > > Not really. > Someone need to guess the node id when it is not specified, right? No, why? Guessing guarantees you will get it wrong on some systems. Are you seeing real problems because the id is not being set? What problem is this fixing that you can actually observe? thanks, greg k-h ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-10 9:31 ` Greg KH @ 2019-09-10 10:58 ` Yunsheng Lin 2019-09-10 11:04 ` Michal Hocko 0 siblings, 1 reply; 24+ messages in thread From: Yunsheng Lin @ 2019-09-10 10:58 UTC (permalink / raw) To: Greg KH; +Cc: rafael, linux-kernel, peterz, mingo, mhocko, linuxarm On 2019/9/10 17:31, Greg KH wrote: > On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote: >> On 2019/9/9 17:53, Greg KH wrote: >>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote: >>>> Currently a device does not belong to any of the numa nodes >>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither >>>> specified by fw nor by virtual device layer and the device has >>>> no parent device. >>> >>> Is this really a problem? >> >> Not really. >> Someone need to guess the node id when it is not specified, right? > > No, why? Guessing guarantees you will get it wrong on some systems. > > Are you seeing real problems because the id is not being set? What > problem is this fixing that you can actually observe? When passing the return value of dev_to_node() to cpumask_of_node() without checking the node id if the node id is not valid, there is global-out-of-bounds detected by KASAN as below: there are different checking to return value of dev_to_node(), I though it is better to consistently do checking in cpumask_of_node(), then discussion [1] [2] led to do the checking in device_add(). [ 42.970381] ================================================================== [ 42.977595] BUG: KASAN: global-out-of-bounds in __bitmap_weight+0x48/0xb0 [ 42.984370] Read of size 8 at addr ffff20008cdf8790 by task kworker/0:1/13 [ 42.991230] [ 42.992712] CPU: 0 PID: 13 Comm: kworker/0:1 Tainted: G O 5.2.0-rc4-g8bde06a-dirty #3 [ 43.001830] Hardware name: Huawei TaiShan 2280 V2/BC82AMDA, BIOS TA BIOS 2280-A CS V2.B050.01 08/08/2019 [ 43.011298] Workqueue: events work_for_cpu_fn [ 43.015643] Call trace: [ 43.018078] dump_backtrace+0x0/0x1e8 [ 43.021727] show_stack+0x14/0x20 [ 43.025031] dump_stack+0xc4/0xfc [ 43.028335] print_address_description+0x178/0x270 [ 43.033113] __kasan_report+0x164/0x1b8 [ 43.036936] kasan_report+0xc/0x18 [ 43.040325] __asan_load8+0x84/0xa8 [ 43.043801] __bitmap_weight+0x48/0xb0 [ 43.047552] hclge_init_ae_dev+0x988/0x1e78 [hclge] [ 43.052418] hnae3_register_ae_dev+0xcc/0x278 [hnae3] [ 43.057467] hns3_probe+0xe0/0x120 [hns3] [ 43.061464] local_pci_probe+0x74/0xf0 [ 43.065200] work_for_cpu_fn+0x2c/0x48 [ 43.068937] process_one_work+0x3c0/0x878 [ 43.072934] worker_thread+0x400/0x670 [ 43.076670] kthread+0x1b0/0x1b8 [ 43.079885] ret_from_fork+0x10/0x18 [ 43.083446] [ 43.084925] The buggy address belongs to the variable: [ 43.090052] numa_distance+0x30/0x40 [ 43.093613] [ 43.095091] Memory state around the buggy address: [ 43.099870] ffff20008cdf8680: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 fa fa [ 43.107078] ffff20008cdf8700: fa fa fa fa 04 fa fa fa fa fa fa fa 00 fa fa fa [ 43.114286] >ffff20008cdf8780: fa fa fa fa 00 00 00 00 00 00 00 00 fa fa fa fa [ 43.121494] ^ [ 43.125230] ffff20008cdf8800: 01 fa fa fa fa fa fa fa 04 fa fa fa fa fa fa fa [ 43.132439] ffff20008cdf8880: fa fa fa fa fa fa fa fa 00 00 fa fa fa fa fa fa [ 43.139646] ================================================================== [1] https://lore.kernel.org/patchwork/patch/1122081/ [2] https://lore.kernel.org/patchwork/patch/1122516/ > > thanks, > > greg k-h > > . > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-10 10:58 ` Yunsheng Lin @ 2019-09-10 11:04 ` Michal Hocko 2019-09-10 11:12 ` Greg KH 0 siblings, 1 reply; 24+ messages in thread From: Michal Hocko @ 2019-09-10 11:04 UTC (permalink / raw) To: Yunsheng Lin; +Cc: Greg KH, rafael, linux-kernel, peterz, mingo, linuxarm On Tue 10-09-19 18:58:05, Yunsheng Lin wrote: > On 2019/9/10 17:31, Greg KH wrote: > > On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote: > >> On 2019/9/9 17:53, Greg KH wrote: > >>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote: > >>>> Currently a device does not belong to any of the numa nodes > >>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither > >>>> specified by fw nor by virtual device layer and the device has > >>>> no parent device. > >>> > >>> Is this really a problem? > >> > >> Not really. > >> Someone need to guess the node id when it is not specified, right? > > > > No, why? Guessing guarantees you will get it wrong on some systems. > > > > Are you seeing real problems because the id is not being set? What > > problem is this fixing that you can actually observe? > > When passing the return value of dev_to_node() to cpumask_of_node() > without checking the node id if the node id is not valid, there is > global-out-of-bounds detected by KASAN as below: OK, I seem to remember this being brought up already. And now when I think about it, we really want to make cpumask_of_node NUMA_NO_NODE aware. That means using the same trick the allocator does for this special case. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-10 11:04 ` Michal Hocko @ 2019-09-10 11:12 ` Greg KH 2019-09-10 12:47 ` Yunsheng Lin 0 siblings, 1 reply; 24+ messages in thread From: Greg KH @ 2019-09-10 11:12 UTC (permalink / raw) To: Michal Hocko; +Cc: Yunsheng Lin, rafael, linux-kernel, peterz, mingo, linuxarm On Tue, Sep 10, 2019 at 01:04:51PM +0200, Michal Hocko wrote: > On Tue 10-09-19 18:58:05, Yunsheng Lin wrote: > > On 2019/9/10 17:31, Greg KH wrote: > > > On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote: > > >> On 2019/9/9 17:53, Greg KH wrote: > > >>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote: > > >>>> Currently a device does not belong to any of the numa nodes > > >>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither > > >>>> specified by fw nor by virtual device layer and the device has > > >>>> no parent device. > > >>> > > >>> Is this really a problem? > > >> > > >> Not really. > > >> Someone need to guess the node id when it is not specified, right? > > > > > > No, why? Guessing guarantees you will get it wrong on some systems. > > > > > > Are you seeing real problems because the id is not being set? What > > > problem is this fixing that you can actually observe? > > > > When passing the return value of dev_to_node() to cpumask_of_node() > > without checking the node id if the node id is not valid, there is > > global-out-of-bounds detected by KASAN as below: > > OK, I seem to remember this being brought up already. And now when I > think about it, we really want to make cpumask_of_node NUMA_NO_NODE > aware. That means using the same trick the allocator does for this > special case. That seems reasonable to me, and much more "obvious" as to what is going on. thanks, greg k-h ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-10 11:12 ` Greg KH @ 2019-09-10 12:47 ` Yunsheng Lin 2019-09-10 12:53 ` Michal Hocko 0 siblings, 1 reply; 24+ messages in thread From: Yunsheng Lin @ 2019-09-10 12:47 UTC (permalink / raw) To: Greg KH, Michal Hocko; +Cc: rafael, linux-kernel, peterz, mingo, linuxarm On 2019/9/10 19:12, Greg KH wrote: > On Tue, Sep 10, 2019 at 01:04:51PM +0200, Michal Hocko wrote: >> On Tue 10-09-19 18:58:05, Yunsheng Lin wrote: >>> On 2019/9/10 17:31, Greg KH wrote: >>>> On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote: >>>>> On 2019/9/9 17:53, Greg KH wrote: >>>>>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote: >>>>>>> Currently a device does not belong to any of the numa nodes >>>>>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither >>>>>>> specified by fw nor by virtual device layer and the device has >>>>>>> no parent device. >>>>>> >>>>>> Is this really a problem? >>>>> >>>>> Not really. >>>>> Someone need to guess the node id when it is not specified, right? >>>> >>>> No, why? Guessing guarantees you will get it wrong on some systems. >>>> >>>> Are you seeing real problems because the id is not being set? What >>>> problem is this fixing that you can actually observe? >>> >>> When passing the return value of dev_to_node() to cpumask_of_node() >>> without checking the node id if the node id is not valid, there is >>> global-out-of-bounds detected by KASAN as below: >> >> OK, I seem to remember this being brought up already. And now when I >> think about it, we really want to make cpumask_of_node NUMA_NO_NODE >> aware. That means using the same trick the allocator does for this >> special case. > > That seems reasonable to me, and much more "obvious" as to what is going > on. > Ok, thanks for the suggestion. For arm64 and x86, there are two versions of cpumask_of_node(). when CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node() in arch/x86/mm/numa.c is used, which does partial node id checking: const struct cpumask *cpumask_of_node(int node) { if (node >= nr_node_ids) { printk(KERN_WARNING "cpumask_of_node(%d): node > nr_node_ids(%u)\n", node, nr_node_ids); dump_stack(); return cpu_none_mask; } if (node_to_cpumask_map[node] == NULL) { printk(KERN_WARNING "cpumask_of_node(%d): no node_to_cpumask_map!\n", node); dump_stack(); return cpu_online_mask; } return node_to_cpumask_map[node]; } when CONFIG_DEBUG_PER_CPU_MAPS is undefined, the cpumask_of_node() in arch/x86/include/asm/topology.h is used: static inline const struct cpumask *cpumask_of_node(int node) { return node_to_cpumask_map[node]; } As discussion in [1], adding the checking in cpumask_of_node() with CONFIG_DEBUG_PER_CPU_MAPS not defined increases overhead for everyone, and it is already true that cpumask_of_node() requires a valid node_id. So maybe the overhead is worth it? Hi, Peter Does the argument in this thread about making cpumask_of_node() NUMA_NO_NODE aware make sense to you? [1] https://lore.kernel.org/patchwork/patch/1122516/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-10 12:47 ` Yunsheng Lin @ 2019-09-10 12:53 ` Michal Hocko 2019-09-11 5:33 ` Michal Hocko 0 siblings, 1 reply; 24+ messages in thread From: Michal Hocko @ 2019-09-10 12:53 UTC (permalink / raw) To: Yunsheng Lin; +Cc: Greg KH, rafael, linux-kernel, peterz, mingo, linuxarm On Tue 10-09-19 20:47:40, Yunsheng Lin wrote: > On 2019/9/10 19:12, Greg KH wrote: > > On Tue, Sep 10, 2019 at 01:04:51PM +0200, Michal Hocko wrote: > >> On Tue 10-09-19 18:58:05, Yunsheng Lin wrote: > >>> On 2019/9/10 17:31, Greg KH wrote: > >>>> On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote: > >>>>> On 2019/9/9 17:53, Greg KH wrote: > >>>>>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote: > >>>>>>> Currently a device does not belong to any of the numa nodes > >>>>>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither > >>>>>>> specified by fw nor by virtual device layer and the device has > >>>>>>> no parent device. > >>>>>> > >>>>>> Is this really a problem? > >>>>> > >>>>> Not really. > >>>>> Someone need to guess the node id when it is not specified, right? > >>>> > >>>> No, why? Guessing guarantees you will get it wrong on some systems. > >>>> > >>>> Are you seeing real problems because the id is not being set? What > >>>> problem is this fixing that you can actually observe? > >>> > >>> When passing the return value of dev_to_node() to cpumask_of_node() > >>> without checking the node id if the node id is not valid, there is > >>> global-out-of-bounds detected by KASAN as below: > >> > >> OK, I seem to remember this being brought up already. And now when I > >> think about it, we really want to make cpumask_of_node NUMA_NO_NODE > >> aware. That means using the same trick the allocator does for this > >> special case. > > > > That seems reasonable to me, and much more "obvious" as to what is going > > on. > > > > Ok, thanks for the suggestion. > > For arm64 and x86, there are two versions of cpumask_of_node(). > > when CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node() > in arch/x86/mm/numa.c is used, which does partial node id checking: > > const struct cpumask *cpumask_of_node(int node) > { > if (node >= nr_node_ids) { > printk(KERN_WARNING > "cpumask_of_node(%d): node > nr_node_ids(%u)\n", > node, nr_node_ids); > dump_stack(); > return cpu_none_mask; > } > if (node_to_cpumask_map[node] == NULL) { > printk(KERN_WARNING > "cpumask_of_node(%d): no node_to_cpumask_map!\n", > node); > dump_stack(); > return cpu_online_mask; > } > return node_to_cpumask_map[node]; > } > > when CONFIG_DEBUG_PER_CPU_MAPS is undefined, the cpumask_of_node() > in arch/x86/include/asm/topology.h is used: > > static inline const struct cpumask *cpumask_of_node(int node) > { > return node_to_cpumask_map[node]; > } I would simply go with. There shouldn't be any need for heavy weight checks that CONFIG_DEBUG_PER_CPU_MAPS has. static inline const struct cpumask *cpumask_of_node(int node) { /* A nice comment goes here */ if (node == NUMA_NO_NODE) return node_to_cpumask_map[numa_mem_id()]; return node_to_cpumask_map[node]; } -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-10 12:53 ` Michal Hocko @ 2019-09-11 5:33 ` Michal Hocko 2019-09-11 6:15 ` Yunsheng Lin 0 siblings, 1 reply; 24+ messages in thread From: Michal Hocko @ 2019-09-11 5:33 UTC (permalink / raw) To: Yunsheng Lin; +Cc: Greg KH, rafael, linux-kernel, peterz, mingo, linuxarm On Tue 10-09-19 14:53:39, Michal Hocko wrote: > On Tue 10-09-19 20:47:40, Yunsheng Lin wrote: > > On 2019/9/10 19:12, Greg KH wrote: > > > On Tue, Sep 10, 2019 at 01:04:51PM +0200, Michal Hocko wrote: > > >> On Tue 10-09-19 18:58:05, Yunsheng Lin wrote: > > >>> On 2019/9/10 17:31, Greg KH wrote: > > >>>> On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote: > > >>>>> On 2019/9/9 17:53, Greg KH wrote: > > >>>>>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote: > > >>>>>>> Currently a device does not belong to any of the numa nodes > > >>>>>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither > > >>>>>>> specified by fw nor by virtual device layer and the device has > > >>>>>>> no parent device. > > >>>>>> > > >>>>>> Is this really a problem? > > >>>>> > > >>>>> Not really. > > >>>>> Someone need to guess the node id when it is not specified, right? > > >>>> > > >>>> No, why? Guessing guarantees you will get it wrong on some systems. > > >>>> > > >>>> Are you seeing real problems because the id is not being set? What > > >>>> problem is this fixing that you can actually observe? > > >>> > > >>> When passing the return value of dev_to_node() to cpumask_of_node() > > >>> without checking the node id if the node id is not valid, there is > > >>> global-out-of-bounds detected by KASAN as below: > > >> > > >> OK, I seem to remember this being brought up already. And now when I > > >> think about it, we really want to make cpumask_of_node NUMA_NO_NODE > > >> aware. That means using the same trick the allocator does for this > > >> special case. > > > > > > That seems reasonable to me, and much more "obvious" as to what is going > > > on. > > > > > > > Ok, thanks for the suggestion. > > > > For arm64 and x86, there are two versions of cpumask_of_node(). > > > > when CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node() > > in arch/x86/mm/numa.c is used, which does partial node id checking: > > > > const struct cpumask *cpumask_of_node(int node) > > { > > if (node >= nr_node_ids) { > > printk(KERN_WARNING > > "cpumask_of_node(%d): node > nr_node_ids(%u)\n", > > node, nr_node_ids); > > dump_stack(); > > return cpu_none_mask; > > } > > if (node_to_cpumask_map[node] == NULL) { > > printk(KERN_WARNING > > "cpumask_of_node(%d): no node_to_cpumask_map!\n", > > node); > > dump_stack(); > > return cpu_online_mask; > > } > > return node_to_cpumask_map[node]; > > } > > > > when CONFIG_DEBUG_PER_CPU_MAPS is undefined, the cpumask_of_node() > > in arch/x86/include/asm/topology.h is used: > > > > static inline const struct cpumask *cpumask_of_node(int node) > > { > > return node_to_cpumask_map[node]; > > } > > I would simply go with. There shouldn't be any need for heavy weight > checks that CONFIG_DEBUG_PER_CPU_MAPS has. > > static inline const struct cpumask *cpumask_of_node(int node) > { > /* A nice comment goes here */ > if (node == NUMA_NO_NODE) > return node_to_cpumask_map[numa_mem_id()]; > return node_to_cpumask_map[node]; > } Sleeping over this and thinking more about the actual semantic the above is wrong. We cannot really copy the page allocator logic. Why? Simply because the page allocator doesn't enforce the near node affinity. It just picks it up as a preferred node but then it is free to fallback to any other numa node. This is not the case here and node_to_cpumask_map will only restrict to the particular node's cpus which would have really non deterministic behavior depending on where the code is executed. So in fact we really want to return cpu_online_mask for NUMA_NO_NODE. Sorry about the confusion. > -- > Michal Hocko > SUSE Labs -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-11 5:33 ` Michal Hocko @ 2019-09-11 6:15 ` Yunsheng Lin 2019-09-11 6:49 ` Michal Hocko 0 siblings, 1 reply; 24+ messages in thread From: Yunsheng Lin @ 2019-09-11 6:15 UTC (permalink / raw) To: Michal Hocko; +Cc: Greg KH, rafael, linux-kernel, peterz, mingo, linuxarm On 2019/9/11 13:33, Michal Hocko wrote: > On Tue 10-09-19 14:53:39, Michal Hocko wrote: >> On Tue 10-09-19 20:47:40, Yunsheng Lin wrote: >>> On 2019/9/10 19:12, Greg KH wrote: >>>> On Tue, Sep 10, 2019 at 01:04:51PM +0200, Michal Hocko wrote: >>>>> On Tue 10-09-19 18:58:05, Yunsheng Lin wrote: >>>>>> On 2019/9/10 17:31, Greg KH wrote: >>>>>>> On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote: >>>>>>>> On 2019/9/9 17:53, Greg KH wrote: >>>>>>>>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote: >>>>>>>>>> Currently a device does not belong to any of the numa nodes >>>>>>>>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither >>>>>>>>>> specified by fw nor by virtual device layer and the device has >>>>>>>>>> no parent device. >>>>>>>>> >>>>>>>>> Is this really a problem? >>>>>>>> >>>>>>>> Not really. >>>>>>>> Someone need to guess the node id when it is not specified, right? >>>>>>> >>>>>>> No, why? Guessing guarantees you will get it wrong on some systems. >>>>>>> >>>>>>> Are you seeing real problems because the id is not being set? What >>>>>>> problem is this fixing that you can actually observe? >>>>>> >>>>>> When passing the return value of dev_to_node() to cpumask_of_node() >>>>>> without checking the node id if the node id is not valid, there is >>>>>> global-out-of-bounds detected by KASAN as below: >>>>> >>>>> OK, I seem to remember this being brought up already. And now when I >>>>> think about it, we really want to make cpumask_of_node NUMA_NO_NODE >>>>> aware. That means using the same trick the allocator does for this >>>>> special case. >>>> >>>> That seems reasonable to me, and much more "obvious" as to what is going >>>> on. >>>> >>> >>> Ok, thanks for the suggestion. >>> >>> For arm64 and x86, there are two versions of cpumask_of_node(). >>> >>> when CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node() >>> in arch/x86/mm/numa.c is used, which does partial node id checking: >>> >>> const struct cpumask *cpumask_of_node(int node) >>> { >>> if (node >= nr_node_ids) { >>> printk(KERN_WARNING >>> "cpumask_of_node(%d): node > nr_node_ids(%u)\n", >>> node, nr_node_ids); >>> dump_stack(); >>> return cpu_none_mask; >>> } >>> if (node_to_cpumask_map[node] == NULL) { >>> printk(KERN_WARNING >>> "cpumask_of_node(%d): no node_to_cpumask_map!\n", >>> node); >>> dump_stack(); >>> return cpu_online_mask; >>> } >>> return node_to_cpumask_map[node]; >>> } >>> >>> when CONFIG_DEBUG_PER_CPU_MAPS is undefined, the cpumask_of_node() >>> in arch/x86/include/asm/topology.h is used: >>> >>> static inline const struct cpumask *cpumask_of_node(int node) >>> { >>> return node_to_cpumask_map[node]; >>> } >> >> I would simply go with. There shouldn't be any need for heavy weight >> checks that CONFIG_DEBUG_PER_CPU_MAPS has. >> >> static inline const struct cpumask *cpumask_of_node(int node) >> { >> /* A nice comment goes here */ >> if (node == NUMA_NO_NODE) How about "(unsigned int)node >= nr_node_ids", this is suggested by Peter, it checks the case where the node id set by fw is bigger or equal than nr_node_ids, and still handle the < 0 case, which includes NUMA_NO_NODE. Maybe define a macro like below to do that in order to do the node checking consistently through kernel: #define numa_node_valid(node) ((unsigned int)(node) < nr_node_ids) >> return node_to_cpumask_map[numa_mem_id()]; >> return node_to_cpumask_map[node]; >> } > > Sleeping over this and thinking more about the actual semantic the above > is wrong. We cannot really copy the page allocator logic. Why? Simply > because the page allocator doesn't enforce the near node affinity. It > just picks it up as a preferred node but then it is free to fallback to > any other numa node. This is not the case here and node_to_cpumask_map will > only restrict to the particular node's cpus which would have really non > deterministic behavior depending on where the code is executed. So in > fact we really want to return cpu_online_mask for NUMA_NO_NODE. From below, if the __GFP_THISNODE is set, the fallback is not performed. For node_to_cpumask_map() case, maybe we can return the cpumask that is on the node of cpu_to_node(raw_smp_processor_id()) for NUMA_NO_NODE, because the current cpu does belong to a node, and the node does have at least one cpu, which is the cpu is calling the node_to_cpumask_map(). Make any sense? /* * Allocate pages, preferring the node given as nid. The node must be valid and * online. For more general interface, see alloc_pages_node(). */ static inline struct page * __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) { VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES); VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid)); return __alloc_pages(gfp_mask, order, nid); } > > Sorry about the confusion. >> -- >> Michal Hocko >> SUSE Labs > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-11 6:15 ` Yunsheng Lin @ 2019-09-11 6:49 ` Michal Hocko 2019-09-11 7:22 ` Yunsheng Lin 2019-09-23 15:09 ` Peter Zijlstra 0 siblings, 2 replies; 24+ messages in thread From: Michal Hocko @ 2019-09-11 6:49 UTC (permalink / raw) To: Yunsheng Lin; +Cc: Greg KH, rafael, linux-kernel, peterz, mingo, linuxarm On Wed 11-09-19 14:15:51, Yunsheng Lin wrote: > On 2019/9/11 13:33, Michal Hocko wrote: > > On Tue 10-09-19 14:53:39, Michal Hocko wrote: > >> On Tue 10-09-19 20:47:40, Yunsheng Lin wrote: > >>> On 2019/9/10 19:12, Greg KH wrote: > >>>> On Tue, Sep 10, 2019 at 01:04:51PM +0200, Michal Hocko wrote: > >>>>> On Tue 10-09-19 18:58:05, Yunsheng Lin wrote: > >>>>>> On 2019/9/10 17:31, Greg KH wrote: > >>>>>>> On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote: > >>>>>>>> On 2019/9/9 17:53, Greg KH wrote: > >>>>>>>>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote: > >>>>>>>>>> Currently a device does not belong to any of the numa nodes > >>>>>>>>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither > >>>>>>>>>> specified by fw nor by virtual device layer and the device has > >>>>>>>>>> no parent device. > >>>>>>>>> > >>>>>>>>> Is this really a problem? > >>>>>>>> > >>>>>>>> Not really. > >>>>>>>> Someone need to guess the node id when it is not specified, right? > >>>>>>> > >>>>>>> No, why? Guessing guarantees you will get it wrong on some systems. > >>>>>>> > >>>>>>> Are you seeing real problems because the id is not being set? What > >>>>>>> problem is this fixing that you can actually observe? > >>>>>> > >>>>>> When passing the return value of dev_to_node() to cpumask_of_node() > >>>>>> without checking the node id if the node id is not valid, there is > >>>>>> global-out-of-bounds detected by KASAN as below: > >>>>> > >>>>> OK, I seem to remember this being brought up already. And now when I > >>>>> think about it, we really want to make cpumask_of_node NUMA_NO_NODE > >>>>> aware. That means using the same trick the allocator does for this > >>>>> special case. > >>>> > >>>> That seems reasonable to me, and much more "obvious" as to what is going > >>>> on. > >>>> > >>> > >>> Ok, thanks for the suggestion. > >>> > >>> For arm64 and x86, there are two versions of cpumask_of_node(). > >>> > >>> when CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node() > >>> in arch/x86/mm/numa.c is used, which does partial node id checking: > >>> > >>> const struct cpumask *cpumask_of_node(int node) > >>> { > >>> if (node >= nr_node_ids) { > >>> printk(KERN_WARNING > >>> "cpumask_of_node(%d): node > nr_node_ids(%u)\n", > >>> node, nr_node_ids); > >>> dump_stack(); > >>> return cpu_none_mask; > >>> } > >>> if (node_to_cpumask_map[node] == NULL) { > >>> printk(KERN_WARNING > >>> "cpumask_of_node(%d): no node_to_cpumask_map!\n", > >>> node); > >>> dump_stack(); > >>> return cpu_online_mask; > >>> } > >>> return node_to_cpumask_map[node]; > >>> } > >>> > >>> when CONFIG_DEBUG_PER_CPU_MAPS is undefined, the cpumask_of_node() > >>> in arch/x86/include/asm/topology.h is used: > >>> > >>> static inline const struct cpumask *cpumask_of_node(int node) > >>> { > >>> return node_to_cpumask_map[node]; > >>> } > >> > >> I would simply go with. There shouldn't be any need for heavy weight > >> checks that CONFIG_DEBUG_PER_CPU_MAPS has. > >> > >> static inline const struct cpumask *cpumask_of_node(int node) > >> { > >> /* A nice comment goes here */ > >> if (node == NUMA_NO_NODE) > > How about "(unsigned int)node >= nr_node_ids", this is suggested > by Peter, it checks the case where the node id set by fw is bigger > or equal than nr_node_ids, and still handle the < 0 case, which > includes NUMA_NO_NODE. Isn't that a plain bug? Is something like that really happening? > Maybe define a macro like below to do that in order to do > the node checking consistently through kernel: > > #define numa_node_valid(node) ((unsigned int)(node) < nr_node_ids) > > > >> return node_to_cpumask_map[numa_mem_id()]; > >> return node_to_cpumask_map[node]; > >> } > > > > Sleeping over this and thinking more about the actual semantic the above > > is wrong. We cannot really copy the page allocator logic. Why? Simply > > because the page allocator doesn't enforce the near node affinity. It > > just picks it up as a preferred node but then it is free to fallback to > > any other numa node. This is not the case here and node_to_cpumask_map will > > only restrict to the particular node's cpus which would have really non > > deterministic behavior depending on where the code is executed. So in > > fact we really want to return cpu_online_mask for NUMA_NO_NODE. > > From below, if the __GFP_THISNODE is set, the fallback is not performed. __GFP_THISNODE is a very specific situation when the caller knows which node should be used for the allocation. NUMA_NO_NODE && __GFP_THISNODE is a bug and I would dare to call it an undefined behavior. > For node_to_cpumask_map() case, maybe we can return the cpumask that is > on the node of cpu_to_node(raw_smp_processor_id()) for NUMA_NO_NODE, > because the current cpu does belong to a node, and the node does have at > least one cpu, which is the cpu is calling the node_to_cpumask_map(). > > Make any sense? No. Please read the above paragraph again. NUMA_NO_NODE really means no node affinity. So all cpus should be usable. Making any assumptions about a local context is just wrong. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-11 6:49 ` Michal Hocko @ 2019-09-11 7:22 ` Yunsheng Lin 2019-09-11 7:34 ` Michal Hocko 2019-09-23 15:09 ` Peter Zijlstra 1 sibling, 1 reply; 24+ messages in thread From: Yunsheng Lin @ 2019-09-11 7:22 UTC (permalink / raw) To: Michal Hocko; +Cc: Greg KH, rafael, linux-kernel, peterz, mingo, linuxarm On 2019/9/11 14:49, Michal Hocko wrote: > On Wed 11-09-19 14:15:51, Yunsheng Lin wrote: >> On 2019/9/11 13:33, Michal Hocko wrote: >>> On Tue 10-09-19 14:53:39, Michal Hocko wrote: >>>> On Tue 10-09-19 20:47:40, Yunsheng Lin wrote: >>>>> On 2019/9/10 19:12, Greg KH wrote: >>>>>> On Tue, Sep 10, 2019 at 01:04:51PM +0200, Michal Hocko wrote: >>>>>>> On Tue 10-09-19 18:58:05, Yunsheng Lin wrote: >>>>>>>> On 2019/9/10 17:31, Greg KH wrote: >>>>>>>>> On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote: >>>>>>>>>> On 2019/9/9 17:53, Greg KH wrote: >>>>>>>>>>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote: >>>>>>>>>>>> Currently a device does not belong to any of the numa nodes >>>>>>>>>>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither >>>>>>>>>>>> specified by fw nor by virtual device layer and the device has >>>>>>>>>>>> no parent device. >>>>>>>>>>> >>>>>>>>>>> Is this really a problem? >>>>>>>>>> >>>>>>>>>> Not really. >>>>>>>>>> Someone need to guess the node id when it is not specified, right? >>>>>>>>> >>>>>>>>> No, why? Guessing guarantees you will get it wrong on some systems. >>>>>>>>> >>>>>>>>> Are you seeing real problems because the id is not being set? What >>>>>>>>> problem is this fixing that you can actually observe? >>>>>>>> >>>>>>>> When passing the return value of dev_to_node() to cpumask_of_node() >>>>>>>> without checking the node id if the node id is not valid, there is >>>>>>>> global-out-of-bounds detected by KASAN as below: >>>>>>> >>>>>>> OK, I seem to remember this being brought up already. And now when I >>>>>>> think about it, we really want to make cpumask_of_node NUMA_NO_NODE >>>>>>> aware. That means using the same trick the allocator does for this >>>>>>> special case. >>>>>> >>>>>> That seems reasonable to me, and much more "obvious" as to what is going >>>>>> on. >>>>>> >>>>> >>>>> Ok, thanks for the suggestion. >>>>> >>>>> For arm64 and x86, there are two versions of cpumask_of_node(). >>>>> >>>>> when CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node() >>>>> in arch/x86/mm/numa.c is used, which does partial node id checking: >>>>> >>>>> const struct cpumask *cpumask_of_node(int node) >>>>> { >>>>> if (node >= nr_node_ids) { >>>>> printk(KERN_WARNING >>>>> "cpumask_of_node(%d): node > nr_node_ids(%u)\n", >>>>> node, nr_node_ids); >>>>> dump_stack(); >>>>> return cpu_none_mask; >>>>> } >>>>> if (node_to_cpumask_map[node] == NULL) { >>>>> printk(KERN_WARNING >>>>> "cpumask_of_node(%d): no node_to_cpumask_map!\n", >>>>> node); >>>>> dump_stack(); >>>>> return cpu_online_mask; >>>>> } >>>>> return node_to_cpumask_map[node]; >>>>> } >>>>> >>>>> when CONFIG_DEBUG_PER_CPU_MAPS is undefined, the cpumask_of_node() >>>>> in arch/x86/include/asm/topology.h is used: >>>>> >>>>> static inline const struct cpumask *cpumask_of_node(int node) >>>>> { >>>>> return node_to_cpumask_map[node]; >>>>> } >>>> >>>> I would simply go with. There shouldn't be any need for heavy weight >>>> checks that CONFIG_DEBUG_PER_CPU_MAPS has. >>>> >>>> static inline const struct cpumask *cpumask_of_node(int node) >>>> { >>>> /* A nice comment goes here */ >>>> if (node == NUMA_NO_NODE) >> >> How about "(unsigned int)node >= nr_node_ids", this is suggested >> by Peter, it checks the case where the node id set by fw is bigger >> or equal than nr_node_ids, and still handle the < 0 case, which >> includes NUMA_NO_NODE. > > Isn't that a plain bug? Is something like that really happening? I have not seen one happened before except the NUMA_NO_NODE case. Even with NUMA_NO_NODE case, we did not see it until we turn on the KASAN detection. It seems that there is no protection that prevent setting the node of device to an invalid node. And the kernel does have a few different check now: 1) some does " < 0" check; 2) some does "== NUMA_NO_NODE" check; 3) some does ">= MAX_NUMNODES" check; 4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check. We need to be consistent about the checking, right? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-11 7:22 ` Yunsheng Lin @ 2019-09-11 7:34 ` Michal Hocko 2019-09-11 11:03 ` Yunsheng Lin 0 siblings, 1 reply; 24+ messages in thread From: Michal Hocko @ 2019-09-11 7:34 UTC (permalink / raw) To: Yunsheng Lin; +Cc: Greg KH, rafael, linux-kernel, peterz, mingo, linuxarm On Wed 11-09-19 15:22:30, Yunsheng Lin wrote: [...] > It seems that there is no protection that prevent setting the node > of device to an invalid node. > And the kernel does have a few different check now: > 1) some does " < 0" check; > 2) some does "== NUMA_NO_NODE" check; > 3) some does ">= MAX_NUMNODES" check; > 4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check. > > We need to be consistent about the checking, right? You can try and chase each of them and see what to do with them. I suspect they are a result of random attempts to fortify the code in many cases. Consistency is certainly good but spreading more checks all over the place just adds more cargo cult. Each check should be reasonably justified. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-11 7:34 ` Michal Hocko @ 2019-09-11 11:03 ` Yunsheng Lin 2019-09-11 11:41 ` Yunsheng Lin 0 siblings, 1 reply; 24+ messages in thread From: Yunsheng Lin @ 2019-09-11 11:03 UTC (permalink / raw) To: Michal Hocko; +Cc: Greg KH, rafael, linux-kernel, peterz, mingo, linuxarm On 2019/9/11 15:34, Michal Hocko wrote: > On Wed 11-09-19 15:22:30, Yunsheng Lin wrote: > [...] >> It seems that there is no protection that prevent setting the node >> of device to an invalid node. >> And the kernel does have a few different check now: >> 1) some does " < 0" check; >> 2) some does "== NUMA_NO_NODE" check; >> 3) some does ">= MAX_NUMNODES" check; >> 4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check. >> >> We need to be consistent about the checking, right? > > You can try and chase each of them and see what to do with them. I > suspect they are a result of random attempts to fortify the code in many > cases. Consistency is certainly good but spreading more checks all over > the place just adds more cargo cult. Each check should be reasonably > justified. Ok, Let me focus on making the node_to_cpumask_map() NUMA_NO_NODE aware by only checking "node == NUMA_NO_NODE" first. > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-11 11:03 ` Yunsheng Lin @ 2019-09-11 11:41 ` Yunsheng Lin 2019-09-11 12:02 ` Michal Hocko 0 siblings, 1 reply; 24+ messages in thread From: Yunsheng Lin @ 2019-09-11 11:41 UTC (permalink / raw) To: Michal Hocko; +Cc: Greg KH, rafael, linux-kernel, peterz, mingo, linuxarm On 2019/9/11 19:03, Yunsheng Lin wrote: > On 2019/9/11 15:34, Michal Hocko wrote: >> On Wed 11-09-19 15:22:30, Yunsheng Lin wrote: >> [...] >>> It seems that there is no protection that prevent setting the node >>> of device to an invalid node. >>> And the kernel does have a few different check now: >>> 1) some does " < 0" check; >>> 2) some does "== NUMA_NO_NODE" check; >>> 3) some does ">= MAX_NUMNODES" check; >>> 4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check. >>> >>> We need to be consistent about the checking, right? >> >> You can try and chase each of them and see what to do with them. I >> suspect they are a result of random attempts to fortify the code in many >> cases. Consistency is certainly good but spreading more checks all over >> the place just adds more cargo cult. Each check should be reasonably >> justified. > > Ok, Let me focus on making the node_to_cpumask_map() NUMA_NO_NODE aware > by only checking "node == NUMA_NO_NODE" first. Hi, Michal It that ok for me to add your name to "Suggested-by" tag, since I am going to quote some of your words on the commit log. > >> > > > . > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-11 11:41 ` Yunsheng Lin @ 2019-09-11 12:02 ` Michal Hocko 0 siblings, 0 replies; 24+ messages in thread From: Michal Hocko @ 2019-09-11 12:02 UTC (permalink / raw) To: Yunsheng Lin; +Cc: Greg KH, rafael, linux-kernel, peterz, mingo, linuxarm On Wed 11-09-19 19:41:44, Yunsheng Lin wrote: > On 2019/9/11 19:03, Yunsheng Lin wrote: > > On 2019/9/11 15:34, Michal Hocko wrote: > >> On Wed 11-09-19 15:22:30, Yunsheng Lin wrote: > >> [...] > >>> It seems that there is no protection that prevent setting the node > >>> of device to an invalid node. > >>> And the kernel does have a few different check now: > >>> 1) some does " < 0" check; > >>> 2) some does "== NUMA_NO_NODE" check; > >>> 3) some does ">= MAX_NUMNODES" check; > >>> 4) some does "< 0 || >= MAX_NUMNODES || !node_online(node)" check. > >>> > >>> We need to be consistent about the checking, right? > >> > >> You can try and chase each of them and see what to do with them. I > >> suspect they are a result of random attempts to fortify the code in many > >> cases. Consistency is certainly good but spreading more checks all over > >> the place just adds more cargo cult. Each check should be reasonably > >> justified. > > > > Ok, Let me focus on making the node_to_cpumask_map() NUMA_NO_NODE aware > > by only checking "node == NUMA_NO_NODE" first. > > Hi, Michal > It that ok for me to add your name to "Suggested-by" tag, since I am > going to quote some of your words on the commit log. Sure, no problem. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-11 6:49 ` Michal Hocko 2019-09-11 7:22 ` Yunsheng Lin @ 2019-09-23 15:09 ` Peter Zijlstra 1 sibling, 0 replies; 24+ messages in thread From: Peter Zijlstra @ 2019-09-23 15:09 UTC (permalink / raw) To: Michal Hocko; +Cc: Yunsheng Lin, Greg KH, rafael, linux-kernel, mingo, linuxarm On Wed, Sep 11, 2019 at 08:49:26AM +0200, Michal Hocko wrote: > On Wed 11-09-19 14:15:51, Yunsheng Lin wrote: > > >>>>>> When passing the return value of dev_to_node() to cpumask_of_node() > > >>>>>> without checking the node id if the node id is not valid, there is > > >>>>>> global-out-of-bounds detected by KASAN as below: > > >>>>> > > >>>>> OK, I seem to remember this being brought up already. And now when I > > >>>>> think about it, we really want to make cpumask_of_node NUMA_NO_NODE > > >>>>> aware. That means using the same trick the allocator does for this > > >>>>> special case. > No. Please read the above paragraph again. NUMA_NO_NODE really means no > node affinity. So all cpus should be usable. Making any assumptions > about a local context is just wrong. So none of this makes sense to me. How can a device have NUMA_NO_NODE on a NUMA machine. It needs to have a physical presence _somwhere_; and that cannot be equidistant from all CPUs. Please explain how this makes physical sense. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-09 6:04 [PATCH] driver core: ensure a device has valid node id in device_add() Yunsheng Lin 2019-09-09 9:53 ` Greg KH @ 2019-09-09 18:50 ` Michal Hocko 2019-09-10 7:08 ` Yunsheng Lin 1 sibling, 1 reply; 24+ messages in thread From: Michal Hocko @ 2019-09-09 18:50 UTC (permalink / raw) To: Yunsheng Lin; +Cc: gregkh, rafael, linux-kernel, peterz, mingo, linuxarm On Mon 09-09-19 14:04:23, Yunsheng Lin wrote: > Currently a device does not belong to any of the numa nodes > (dev->numa_node is NUMA_NO_NODE) when the node id is neither > specified by fw nor by virtual device layer and the device has > no parent device. > > According to discussion in [1]: Please do not reference important parts of the justification via a link. Just quote the relevant part to the changelog. It is just too easy that external links die - not to mention lkml.org. > Even if a device's numa node is not specified, the device really > does belong to a node. What does this mean? > This patch sets the device node to node 0 in device_add() if the > device's node id is not specified and it either has no parent > device, or the parent device also does not have a valid node id. Why is node 0 special? I have seen platforms with node 0 missing or being memory less. The changelog also lacks an actual problem descripton. Why do we even care about NUMA_NO_NODE? E.g. the page allocator interprets NUMA_NO_NODE as the closest node with a memory. And by closest it really means to the CPU which is performing the allocation. > [1] https://lkml.org/lkml/2019/9/2/466 > > Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> > --- > Changelog RFC -> v1: > 1. Drop log error message and use a "if" instead of "? :". > 2. Drop the RFC tag. > --- > drivers/base/core.c | 10 +++++++--- > include/linux/numa.h | 2 ++ > 2 files changed, 9 insertions(+), 3 deletions(-) > > diff --git a/drivers/base/core.c b/drivers/base/core.c > index 1669d41..f79ad20 100644 > --- a/drivers/base/core.c > +++ b/drivers/base/core.c > @@ -2107,9 +2107,13 @@ int device_add(struct device *dev) > if (kobj) > dev->kobj.parent = kobj; > > - /* use parent numa_node */ > - if (parent && (dev_to_node(dev) == NUMA_NO_NODE)) > - set_dev_node(dev, dev_to_node(parent)); > + /* use parent numa_node or default node 0 */ > + if (!numa_node_valid(dev_to_node(dev))) { > + if (parent && numa_node_valid(dev_to_node(parent))) > + set_dev_node(dev, dev_to_node(parent)); > + else > + set_dev_node(dev, 0); > + } > > /* first, register with generic layer. */ > /* we require the name to be set before, and pass NULL */ > diff --git a/include/linux/numa.h b/include/linux/numa.h > index 110b0e5..eccc757 100644 > --- a/include/linux/numa.h > +++ b/include/linux/numa.h > @@ -13,4 +13,6 @@ > > #define NUMA_NO_NODE (-1) > > +#define numa_node_valid(node) ((unsigned int)(node) < nr_node_ids) > + > #endif /* _LINUX_NUMA_H */ > -- > 2.8.1 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-09 18:50 ` Michal Hocko @ 2019-09-10 7:08 ` Yunsheng Lin 2019-09-10 7:24 ` Michal Hocko 0 siblings, 1 reply; 24+ messages in thread From: Yunsheng Lin @ 2019-09-10 7:08 UTC (permalink / raw) To: Michal Hocko; +Cc: gregkh, rafael, linux-kernel, peterz, mingo, linuxarm On 2019/9/10 2:50, Michal Hocko wrote: > On Mon 09-09-19 14:04:23, Yunsheng Lin wrote: >> Currently a device does not belong to any of the numa nodes >> (dev->numa_node is NUMA_NO_NODE) when the node id is neither >> specified by fw nor by virtual device layer and the device has >> no parent device. >> >> According to discussion in [1]: > > Please do not reference important parts of the justification via a link. > Just quote the relevant part to the changelog. It is just too easy that > external links die - not to mention lkml.org. Ok > >> Even if a device's numa node is not specified, the device really >> does belong to a node. > > What does this mean? It means some one need to guess the node id if the node is not specified. > >> This patch sets the device node to node 0 in device_add() if the >> device's node id is not specified and it either has no parent >> device, or the parent device also does not have a valid node id. > > Why is node 0 special? I have seen platforms with node 0 missing or > being memory less. The changelog also lacks an actual problem by node 0 missing, how do we know if node 0 is missing? by node_online(0)? > descripton. Why do we even care about NUMA_NO_NODE? E.g. the page > allocator interprets NUMA_NO_NODE as the closest node with a memory. > And by closest it really means to the CPU which is performing the > allocation. Yes, I should have mentioned that in the commit log. I mentioned the below in the RFC, but somehow deleted when sending V1: "There may be explicit handling out there relying on NUMA_NO_NODE, like in nvme_probe()." > >> [1] https://lkml.org/lkml/2019/9/2/466 >> >> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> >> --- >> Changelog RFC -> v1: >> 1. Drop log error message and use a "if" instead of "? :". >> 2. Drop the RFC tag. >> --- >> drivers/base/core.c | 10 +++++++--- >> include/linux/numa.h | 2 ++ >> 2 files changed, 9 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/base/core.c b/drivers/base/core.c >> index 1669d41..f79ad20 100644 >> --- a/drivers/base/core.c >> +++ b/drivers/base/core.c >> @@ -2107,9 +2107,13 @@ int device_add(struct device *dev) >> if (kobj) >> dev->kobj.parent = kobj; >> >> - /* use parent numa_node */ >> - if (parent && (dev_to_node(dev) == NUMA_NO_NODE)) >> - set_dev_node(dev, dev_to_node(parent)); >> + /* use parent numa_node or default node 0 */ >> + if (!numa_node_valid(dev_to_node(dev))) { >> + if (parent && numa_node_valid(dev_to_node(parent))) >> + set_dev_node(dev, dev_to_node(parent)); >> + else >> + set_dev_node(dev, 0); >> + } >> >> /* first, register with generic layer. */ >> /* we require the name to be set before, and pass NULL */ >> diff --git a/include/linux/numa.h b/include/linux/numa.h >> index 110b0e5..eccc757 100644 >> --- a/include/linux/numa.h >> +++ b/include/linux/numa.h >> @@ -13,4 +13,6 @@ >> >> #define NUMA_NO_NODE (-1) >> >> +#define numa_node_valid(node) ((unsigned int)(node) < nr_node_ids) >> + >> #endif /* _LINUX_NUMA_H */ >> -- >> 2.8.1 > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-10 7:08 ` Yunsheng Lin @ 2019-09-10 7:24 ` Michal Hocko 2019-09-10 10:40 ` Yunsheng Lin 0 siblings, 1 reply; 24+ messages in thread From: Michal Hocko @ 2019-09-10 7:24 UTC (permalink / raw) To: Yunsheng Lin; +Cc: gregkh, rafael, linux-kernel, peterz, mingo, linuxarm Our emails crossed, sorry about that. On Tue 10-09-19 15:08:20, Yunsheng Lin wrote: > On 2019/9/10 2:50, Michal Hocko wrote: > > On Mon 09-09-19 14:04:23, Yunsheng Lin wrote: [...] > >> Even if a device's numa node is not specified, the device really > >> does belong to a node. > > > > What does this mean? > > It means some one need to guess the node id if the node is not > specified. I have asked about this in other email so let's not cross the communication even more. > >> This patch sets the device node to node 0 in device_add() if the > >> device's node id is not specified and it either has no parent > >> device, or the parent device also does not have a valid node id. > > > > Why is node 0 special? I have seen platforms with node 0 missing or > > being memory less. The changelog also lacks an actual problem > > by node 0 missing, how do we know if node 0 is missing? > by node_online(0)? No, this is a dynamic situation. Node might get offline via hotremove. In most cases it wouldn't because there will likely be some kernel memory on node0 but you cannot really make any assumptions here. Besides that nothing should really care. > > descripton. Why do we even care about NUMA_NO_NODE? E.g. the page > > allocator interprets NUMA_NO_NODE as the closest node with a memory. > > And by closest it really means to the CPU which is performing the > > allocation. > > Yes, I should have mentioned that in the commit log. > > I mentioned the below in the RFC, but somehow deleted when sending > V1: > "There may be explicit handling out there relying on NUMA_NO_NODE, > like in nvme_probe()." This code, and other doing similar things, is very likely bogus. Just look at what the code does. It takes the node affinity from the dev and uses it for an allocation. So far so good. But it tries to be clever and special cases NUMA_NO_NODE to be first_node. So now the allocator has used a proper fallback to the nearest node with memory for the current CPU that is executing the code while dev will point to a first_node which might be a completely different one. See the discrepancy? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-10 7:24 ` Michal Hocko @ 2019-09-10 10:40 ` Yunsheng Lin 2019-09-10 11:01 ` Michal Hocko 0 siblings, 1 reply; 24+ messages in thread From: Yunsheng Lin @ 2019-09-10 10:40 UTC (permalink / raw) To: Michal Hocko; +Cc: gregkh, rafael, linux-kernel, peterz, mingo, linuxarm On 2019/9/10 15:24, Michal Hocko wrote: > Our emails crossed, sorry about that. > > On Tue 10-09-19 15:08:20, Yunsheng Lin wrote: >> On 2019/9/10 2:50, Michal Hocko wrote: >>> On Mon 09-09-19 14:04:23, Yunsheng Lin wrote: > [...] >>>> Even if a device's numa node is not specified, the device really >>>> does belong to a node. >>> >>> What does this mean? >> >> It means some one need to guess the node id if the node is not >> specified. > > I have asked about this in other email so let's not cross the > communication even more. I may just answer your question here. Besides the page allocator, cpu allocator or scheduler may need to know about the node id to figure out which cpu to run is more optimized, like in workqueue_select_cpu_near(). > >>>> This patch sets the device node to node 0 in device_add() if the >>>> device's node id is not specified and it either has no parent >>>> device, or the parent device also does not have a valid node id. >>> >>> Why is node 0 special? I have seen platforms with node 0 missing or >>> being memory less. The changelog also lacks an actual problem >> >> by node 0 missing, how do we know if node 0 is missing? >> by node_online(0)? > > No, this is a dynamic situation. Node might get offline via hotremove. > In most cases it wouldn't because there will likely be some kernel > memory on node0 but you cannot really make any assumptions here. Besides > that nothing should really care. From the node checking: '(unsigned)node_id >= nr_node_ids' If the nr_node_ids > 0, then node 0 is a valid node according to the above checking, is there a function to check if a node is missing? Also, I am not sure if I understand "nothing should really care". Does it means a device still can be a numa that is missing, just have some performance degradation? > >>> descripton. Why do we even care about NUMA_NO_NODE? E.g. the page >>> allocator interprets NUMA_NO_NODE as the closest node with a memory. >>> And by closest it really means to the CPU which is performing the >>> allocation. >> >> Yes, I should have mentioned that in the commit log. >> >> I mentioned the below in the RFC, but somehow deleted when sending >> V1: >> "There may be explicit handling out there relying on NUMA_NO_NODE, >> like in nvme_probe()." > > This code, and other doing similar things, is very likely bogus. Just > look at what the code does. It takes the node affinity from the dev and > uses it for an allocation. So far so good. But it tries to be clever > and special cases NUMA_NO_NODE to be first_node. So now the allocator > has used a proper fallback to the nearest node with memory for the > current CPU that is executing the code while dev will point to a > first_node which might be a completely different one. See the > discrepancy? Do you mean let kzalloc_node handle the NUMA_NO_NODE case, if node id is NUMA_NO_NODE, kzalloc_node handles it as numa_mem_id(). If yes, above makes more sense. > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] driver core: ensure a device has valid node id in device_add() 2019-09-10 10:40 ` Yunsheng Lin @ 2019-09-10 11:01 ` Michal Hocko 0 siblings, 0 replies; 24+ messages in thread From: Michal Hocko @ 2019-09-10 11:01 UTC (permalink / raw) To: Yunsheng Lin; +Cc: gregkh, rafael, linux-kernel, peterz, mingo, linuxarm On Tue 10-09-19 18:40:12, Yunsheng Lin wrote: > On 2019/9/10 15:24, Michal Hocko wrote: > > Our emails crossed, sorry about that. > > > > On Tue 10-09-19 15:08:20, Yunsheng Lin wrote: > >> On 2019/9/10 2:50, Michal Hocko wrote: > >>> On Mon 09-09-19 14:04:23, Yunsheng Lin wrote: > > [...] > >>>> Even if a device's numa node is not specified, the device really > >>>> does belong to a node. > >>> > >>> What does this mean? > >> > >> It means some one need to guess the node id if the node is not > >> specified. > > > > I have asked about this in other email so let's not cross the > > communication even more. > > I may just answer your question here. > > Besides the page allocator, cpu allocator or scheduler may need to > know about the node id to figure out which cpu to run is more > optimized, like in workqueue_select_cpu_near(). OK, I can see some point there. Which of them do not handle this case properly? At least workqueue_select_cpu_near seems to handle it reasonably by not making any numa topology assumptions. > >>>> This patch sets the device node to node 0 in device_add() if the > >>>> device's node id is not specified and it either has no parent > >>>> device, or the parent device also does not have a valid node id. > >>> > >>> Why is node 0 special? I have seen platforms with node 0 missing or > >>> being memory less. The changelog also lacks an actual problem > >> > >> by node 0 missing, how do we know if node 0 is missing? > >> by node_online(0)? > > > > No, this is a dynamic situation. Node might get offline via hotremove. > > In most cases it wouldn't because there will likely be some kernel > > memory on node0 but you cannot really make any assumptions here. Besides > > that nothing should really care. > > >From the node checking: > '(unsigned)node_id >= nr_node_ids' > > If the nr_node_ids > 0, then node 0 is a valid node according to > the above checking, is there a function to check if a node is > missing? What does that mean, really? > Also, I am not sure if I understand "nothing should really care". > Does it means a device still can be a numa that is missing, just > have some performance degradation? It really depends on what that numa affinity really represent. If the affinity is to a non existent NUMA node then it likely has to hop over some interconnect and pay some performance penalty. If there is no affinity then it likely makes no difference on which node it sits and which memory it gets. > >>> descripton. Why do we even care about NUMA_NO_NODE? E.g. the page > >>> allocator interprets NUMA_NO_NODE as the closest node with a memory. > >>> And by closest it really means to the CPU which is performing the > >>> allocation. > >> > >> Yes, I should have mentioned that in the commit log. > >> > >> I mentioned the below in the RFC, but somehow deleted when sending > >> V1: > >> "There may be explicit handling out there relying on NUMA_NO_NODE, > >> like in nvme_probe()." > > > > This code, and other doing similar things, is very likely bogus. Just > > look at what the code does. It takes the node affinity from the dev and > > uses it for an allocation. So far so good. But it tries to be clever > > and special cases NUMA_NO_NODE to be first_node. So now the allocator > > has used a proper fallback to the nearest node with memory for the > > current CPU that is executing the code while dev will point to a > > first_node which might be a completely different one. See the > > discrepancy? > > Do you mean let kzalloc_node handle the NUMA_NO_NODE case, if node > id is NUMA_NO_NODE, kzalloc_node handles it as numa_mem_id(). Which is what the page allocator does and makes sense IMHO. So what is the actual problem? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2019-09-23 15:09 UTC | newest] Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-09-09 6:04 [PATCH] driver core: ensure a device has valid node id in device_add() Yunsheng Lin 2019-09-09 9:53 ` Greg KH 2019-09-10 6:43 ` Yunsheng Lin 2019-09-10 7:13 ` Michal Hocko 2019-09-10 9:31 ` Greg KH 2019-09-10 10:58 ` Yunsheng Lin 2019-09-10 11:04 ` Michal Hocko 2019-09-10 11:12 ` Greg KH 2019-09-10 12:47 ` Yunsheng Lin 2019-09-10 12:53 ` Michal Hocko 2019-09-11 5:33 ` Michal Hocko 2019-09-11 6:15 ` Yunsheng Lin 2019-09-11 6:49 ` Michal Hocko 2019-09-11 7:22 ` Yunsheng Lin 2019-09-11 7:34 ` Michal Hocko 2019-09-11 11:03 ` Yunsheng Lin 2019-09-11 11:41 ` Yunsheng Lin 2019-09-11 12:02 ` Michal Hocko 2019-09-23 15:09 ` Peter Zijlstra 2019-09-09 18:50 ` Michal Hocko 2019-09-10 7:08 ` Yunsheng Lin 2019-09-10 7:24 ` Michal Hocko 2019-09-10 10:40 ` Yunsheng Lin 2019-09-10 11:01 ` Michal Hocko
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).