Re: [PATCH] arm64: numa: check the node id before accessing node_to_cpumask_map

From: Michal Hocko <mhocko@kernel.org>
To: Yunsheng Lin <linyunsheng@huawei.com>
Cc: will@kernel.org, akpm@linux-foundation.org, rppt@linux.ibm.com,
	anshuman.khandual@arm.com, adobriyan@gmail.com, cai@lca.pw,
	robin.murphy@arm.com, tglx@linutronix.de,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linuxarm@huawei.com
Subject: Re: [PATCH] arm64: numa: check the node id before accessing node_to_cpumask_map
Date: Fri, 30 Aug 2019 10:39:25 +0200	[thread overview]
Message-ID: <20190830083925.GV28313@dhcp22.suse.cz> (raw)
In-Reply-To: <740cae36-f1a9-4d20-e4ea-3100076de533@huawei.com>

On Fri 30-08-19 16:08:14, Yunsheng Lin wrote:
> On 2019/8/30 14:44, Michal Hocko wrote:
> > On Fri 30-08-19 14:35:26, Yunsheng Lin wrote:
> >> On 2019/8/30 13:55, Michal Hocko wrote:
> >>> On Fri 30-08-19 10:26:31, Yunsheng Lin wrote:
> >>>> Some buggy bios may not set the device' numa id, and dev_to_node
> >>>> will return -1, which may cause global-out-of-bounds error
> >>>> detected by KASAN.
> >>>
> >>> Why should we workaround a buggy bios like that? Is it so widespread and
> >>> no BIOS update available? Also, why is this arm64 specific?
> >>
> >> For our case, there is BIOS update available. I just thought it might
> >> be better to protect from this case when BIOS has not implemented the
> >> device' numa id setting feature or the feature from BIOS has some bug.
> >>
> >> It is not arm64 specific, right now I only have arm64 board. If it is
> >> ok to protect this from the buggy BIOS, maybe all other arch can be
> >> changed too.
> > 
> > If we are to really care then this should be consistent among
> > architectures IMHO. But I am not really sure this is really worth it.
> > The code is quite old and I do not really remember any reports. 
> 
> It is only detected by enabling KASAN, the system seems to run fine without
> any visible error if KASAN is disabled. Maybe there is why no report has
> been seen?
> 
> Also according to Section 6.2.14 from ACPI spec 6.3 [1], the setting of proximity
> domain is optional, as below:
> 
> This optional object is used to describe proximity domain
> associations within a machine. _PXM evaluates to an integer
> that identifies a device as belonging to a Proximity Domain
> defined in the System Resource Affinity Table (SRAT).
> 
> 
> Do you think it is ok to resend the fix with above clarification and below
> log:

OK, if the specification really allows to provide NUMA_NO_NODE (-1) then
the code really has to be prepared for that. And ideally all arches
should deal with that.

> [   42.970381] ==================================================================
> [   42.977595] BUG: KASAN: global-out-of-bounds in __bitmap_weight+0x48/0xb0
> [   42.984370] Read of size 8 at addr ffff20008cdf8790 by task kworker/0:1/13
> [   42.991230]
> [   42.992712] CPU: 0 PID: 13 Comm: kworker/0:1 Tainted: G           O      5.2.0-rc4-g8bde06a-dirty #3
> [   43.001830] Hardware name: Huawei TaiShan 2280 V2/BC82AMDA, BIOS TA BIOS 2280-A CS V2.B050.01 08/08/2019
> [   43.011298] Workqueue: events work_for_cpu_fn
> [   43.015643] Call trace:
> [   43.018078]  dump_backtrace+0x0/0x1e8
> [   43.021727]  show_stack+0x14/0x20
> [   43.025031]  dump_stack+0xc4/0xfc
> [   43.028335]  print_address_description+0x178/0x270
> [   43.033113]  __kasan_report+0x164/0x1b8
> [   43.036936]  kasan_report+0xc/0x18
> [   43.040325]  __asan_load8+0x84/0xa8
> [   43.043801]  __bitmap_weight+0x48/0xb0
> [   43.047552]  hclge_init_ae_dev+0x988/0x1e78 [hclge]
> [   43.052418]  hnae3_register_ae_dev+0xcc/0x278 [hnae3]
> [   43.057467]  hns3_probe+0xe0/0x120 [hns3]
> [   43.061464]  local_pci_probe+0x74/0xf0
> [   43.065200]  work_for_cpu_fn+0x2c/0x48
> [   43.068937]  process_one_work+0x3c0/0x878
> [   43.072934]  worker_thread+0x400/0x670
> [   43.076670]  kthread+0x1b0/0x1b8
> [   43.079885]  ret_from_fork+0x10/0x18
> [   43.083446]
> [   43.084925] The buggy address belongs to the variable:
> [   43.090052]  numa_distance+0x30/0x40
> [   43.093613]
> [   43.095091] Memory state around the buggy address:
> [   43.099870]  ffff20008cdf8680: fa fa fa fa 04 fa fa fa fa fa fa fa 00 00 fa fa
> [   43.107078]  ffff20008cdf8700: fa fa fa fa 04 fa fa fa fa fa fa fa 00 fa fa fa
> [   43.114286] >ffff20008cdf8780: fa fa fa fa 00 00 00 00 00 00 00 00 fa fa fa fa
> [   43.121494]                          ^
> [   43.125230]  ffff20008cdf8800: 01 fa fa fa fa fa fa fa 04 fa fa fa fa fa fa fa
> [   43.132439]  ffff20008cdf8880: fa fa fa fa fa fa fa fa 00 00 fa fa fa fa fa fa
> [   43.139646] ==================================================================
> 
> 
> > 
> >>>> This patch changes cpumask_of_node to return cpu_none_mask if the
> >>>> node is not valid, and sync the cpumask_of_node between the
> >>>> cpumask_of_node function in numa.h and numa.c.
> >>>
> >>> Why?
> >>
> >> When CONFIG_DEBUG_PER_CPU_MAPS is defined, the cpumask_of_node() in
> >> numa.c is used, if not, the cpumask_of_node() in numa.h is used.
> >>
> >> I am not sure why there is difference between them, and it is there
> >> when since the below commit:
> >> 1a2db300348b ("arm64, numa: Add NUMA support for arm64 platforms.")
> >>
> >> I synced them to keep them consistent whether CONFIG_DEBUG_PER_CPU_MAPS
> >> is defined.
> > 
> > Such a change should be made in a separate patch with a full
> > clarification/justification. From the above it is still not clear why
> > this is needed though.
> 
> Ok.
> 
> How about:
> 
> Currently there are different implementations of cpumask_of_node() depend
> on the arch, for example:
> 
> ia64:
> #define cpumask_of_node(node) ((node) == -1 ?				\
> 			       cpu_all_mask :				\
> 			       &node_to_cpu_mask[node])
> 
> 
> alpha:
> static const struct cpumask *cpumask_of_node(int node)
> {
> 	int cpu;
> 
> 	if (node == NUMA_NO_NODE)
> 		return cpu_all_mask;
> 
> 	cpumask_clear(&node_to_cpumask_map[node]);
> 
> 	for_each_online_cpu(cpu) {
> 		if (cpu_to_node(cpu) == node)
> 			cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
> 	}
> 
> 	return &node_to_cpumask_map[node];
> }
> 
> Even for the same arch, there are two implementations depend on the
> CONFIG_DEBUG_PER_CPU_MAPS configuration.
> 
> arm64/x86 without CONFIG_DEBUG_PER_CPU_MAPS:
> static inline const struct cpumask *cpumask_of_node(int node)
> {
> 	return node_to_cpumask_map[node];
> }
> 
> arm64/x86 with CONFIG_DEBUG_PER_CPU_MAPS:
> const struct cpumask *cpumask_of_node(int node)
> {
> 	if (WARN_ON(node >= nr_node_ids))
> 		return cpu_none_mask;
> 
> 	if (WARN_ON(node_to_cpumask_map[node] == NULL))
> 		return cpu_online_mask;
> 
> 	return node_to_cpumask_map[node];
> }
> 
> It seems the cpumask_of_node with CONFIG_DEBUG_PER_CPU_MAPS is used
> to catch the erorr case and give a warning to user when node id is not
> valid.

Yeah the config help text
config DEBUG_PER_CPU_MAPS
        bool "Debug access to per_cpu maps"
        depends on DEBUG_KERNEL
        depends on SMP
        help
          Say Y to verify that the per_cpu map being accessed has
          been set up. This adds a fair amount of code to kernel memory
          and decreases performance.

          Say N if unsure.

suggests that this is intentionally hidden behind a config so a normal
path shouldn't really duplicate it. If those checks make sense in
general then the config option should be dropped I think.
-- 
Michal Hocko
SUSE Labs