From: Michal Hocko <mhocko@kernel.org>
To: Yunsheng Lin <linyunsheng@huawei.com>
Cc: Greg KH <gregkh@linuxfoundation.org>,
rafael@kernel.org, linux-kernel@vger.kernel.org,
peterz@infradead.org, mingo@kernel.org, linuxarm@huawei.com
Subject: Re: [PATCH] driver core: ensure a device has valid node id in device_add()
Date: Wed, 11 Sep 2019 08:49:26 +0200 [thread overview]
Message-ID: <20190911064926.GJ4023@dhcp22.suse.cz> (raw)
In-Reply-To: <ca590101-bfc8-3934-d803-537aacb707e0@huawei.com>
On Wed 11-09-19 14:15:51, Yunsheng Lin wrote:
> On 2019/9/11 13:33, Michal Hocko wrote:
> > On Tue 10-09-19 14:53:39, Michal Hocko wrote:
> >> On Tue 10-09-19 20:47:40, Yunsheng Lin wrote:
> >>> On 2019/9/10 19:12, Greg KH wrote:
> >>>> On Tue, Sep 10, 2019 at 01:04:51PM +0200, Michal Hocko wrote:
> >>>>> On Tue 10-09-19 18:58:05, Yunsheng Lin wrote:
> >>>>>> On 2019/9/10 17:31, Greg KH wrote:
> >>>>>>> On Tue, Sep 10, 2019 at 02:43:32PM +0800, Yunsheng Lin wrote:
> >>>>>>>> On 2019/9/9 17:53, Greg KH wrote:
> >>>>>>>>> On Mon, Sep 09, 2019 at 02:04:23PM +0800, Yunsheng Lin wrote:
> >>>>>>>>>> Currently a device does not belong to any of the numa nodes
> >>>>>>>>>> (dev->numa_node is NUMA_NO_NODE) when the node id is neither
> >>>>>>>>>> specified by fw nor by virtual device layer and the device has
> >>>>>>>>>> no parent device.
> >>>>>>>>>
> >>>>>>>>> Is this really a problem?
> >>>>>>>>
> >>>>>>>> Not really.
> >>>>>>>> Someone need to guess the node id when it is not specified, right?
> >>>>>>>
> >>>>>>> No, why? Guessing guarantees you will get it wrong on some systems.
> >>>>>>>
> >>>>>>> Are you seeing real problems because the id is not being set? What
> >>>>>>> problem is this fixing that you can actually observe?
> >>>>>>
> >>>>>> When passing the return value of dev_to_node() to cpumask_of_node()
> >>>>>> without checking the node id if the node id is not valid, there is
> >>>>>> global-out-of-bounds detected by KASAN as below:
> >>>>>
> >>>>> OK, I seem to remember this being brought up already. And now when I
> >>>>> think about it, we really want to make cpumask_of_node NUMA_NO_NODE
> >>>>> aware. That means using the same trick the allocator does for this
> >>>>> special case.
> >>>>
> >>>> That seems reasonable to me, and much more "obvious" as to what is going
> >>>> on.
> >>>>
> >>>
> >>> Ok, thanks for the suggestion.
> >>>
> >>> For arm64 and x86, there are two versions of cpumask_of_node().
> >>>
> >>> when CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node()
> >>> in arch/x86/mm/numa.c is used, which does partial node id checking:
> >>>
> >>> const struct cpumask *cpumask_of_node(int node)
> >>> {
> >>> if (node >= nr_node_ids) {
> >>> printk(KERN_WARNING
> >>> "cpumask_of_node(%d): node > nr_node_ids(%u)\n",
> >>> node, nr_node_ids);
> >>> dump_stack();
> >>> return cpu_none_mask;
> >>> }
> >>> if (node_to_cpumask_map[node] == NULL) {
> >>> printk(KERN_WARNING
> >>> "cpumask_of_node(%d): no node_to_cpumask_map!\n",
> >>> node);
> >>> dump_stack();
> >>> return cpu_online_mask;
> >>> }
> >>> return node_to_cpumask_map[node];
> >>> }
> >>>
> >>> when CONFIG_DEBUG_PER_CPU_MAPS is undefined, the cpumask_of_node()
> >>> in arch/x86/include/asm/topology.h is used:
> >>>
> >>> static inline const struct cpumask *cpumask_of_node(int node)
> >>> {
> >>> return node_to_cpumask_map[node];
> >>> }
> >>
> >> I would simply go with. There shouldn't be any need for heavy weight
> >> checks that CONFIG_DEBUG_PER_CPU_MAPS has.
> >>
> >> static inline const struct cpumask *cpumask_of_node(int node)
> >> {
> >> /* A nice comment goes here */
> >> if (node == NUMA_NO_NODE)
>
> How about "(unsigned int)node >= nr_node_ids", this is suggested
> by Peter, it checks the case where the node id set by fw is bigger
> or equal than nr_node_ids, and still handle the < 0 case, which
> includes NUMA_NO_NODE.
Isn't that a plain bug? Is something like that really happening?
> Maybe define a macro like below to do that in order to do
> the node checking consistently through kernel:
>
> #define numa_node_valid(node) ((unsigned int)(node) < nr_node_ids)
>
>
> >> return node_to_cpumask_map[numa_mem_id()];
> >> return node_to_cpumask_map[node];
> >> }
> >
> > Sleeping over this and thinking more about the actual semantic the above
> > is wrong. We cannot really copy the page allocator logic. Why? Simply
> > because the page allocator doesn't enforce the near node affinity. It
> > just picks it up as a preferred node but then it is free to fallback to
> > any other numa node. This is not the case here and node_to_cpumask_map will
> > only restrict to the particular node's cpus which would have really non
> > deterministic behavior depending on where the code is executed. So in
> > fact we really want to return cpu_online_mask for NUMA_NO_NODE.
>
> From below, if the __GFP_THISNODE is set, the fallback is not performed.
__GFP_THISNODE is a very specific situation when the caller knows which
node should be used for the allocation. NUMA_NO_NODE && __GFP_THISNODE
is a bug and I would dare to call it an undefined behavior.
> For node_to_cpumask_map() case, maybe we can return the cpumask that is
> on the node of cpu_to_node(raw_smp_processor_id()) for NUMA_NO_NODE,
> because the current cpu does belong to a node, and the node does have at
> least one cpu, which is the cpu is calling the node_to_cpumask_map().
>
> Make any sense?
No. Please read the above paragraph again. NUMA_NO_NODE really means no
node affinity. So all cpus should be usable. Making any assumptions
about a local context is just wrong.
--
Michal Hocko
SUSE Labs
next prev parent reply other threads:[~2019-09-11 6:49 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-09-09 6:04 [PATCH] driver core: ensure a device has valid node id in device_add() Yunsheng Lin
2019-09-09 9:53 ` Greg KH
2019-09-10 6:43 ` Yunsheng Lin
2019-09-10 7:13 ` Michal Hocko
2019-09-10 9:31 ` Greg KH
2019-09-10 10:58 ` Yunsheng Lin
2019-09-10 11:04 ` Michal Hocko
2019-09-10 11:12 ` Greg KH
2019-09-10 12:47 ` Yunsheng Lin
2019-09-10 12:53 ` Michal Hocko
2019-09-11 5:33 ` Michal Hocko
2019-09-11 6:15 ` Yunsheng Lin
2019-09-11 6:49 ` Michal Hocko [this message]
2019-09-11 7:22 ` Yunsheng Lin
2019-09-11 7:34 ` Michal Hocko
2019-09-11 11:03 ` Yunsheng Lin
2019-09-11 11:41 ` Yunsheng Lin
2019-09-11 12:02 ` Michal Hocko
2019-09-23 15:09 ` Peter Zijlstra
2019-09-09 18:50 ` Michal Hocko
2019-09-10 7:08 ` Yunsheng Lin
2019-09-10 7:24 ` Michal Hocko
2019-09-10 10:40 ` Yunsheng Lin
2019-09-10 11:01 ` Michal Hocko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190911064926.GJ4023@dhcp22.suse.cz \
--to=mhocko@kernel.org \
--cc=gregkh@linuxfoundation.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linuxarm@huawei.com \
--cc=linyunsheng@huawei.com \
--cc=mingo@kernel.org \
--cc=peterz@infradead.org \
--cc=rafael@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).