From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.8 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 094D9C32789 for ; Fri, 2 Nov 2018 12:09:14 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id A1BC4204FD for ; Fri, 2 Nov 2018 12:09:13 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org A1BC4204FD Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=huawei.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726721AbeKBVQK (ORCPT ); Fri, 2 Nov 2018 17:16:10 -0400 Received: from szxga06-in.huawei.com ([45.249.212.32]:38275 "EHLO huawei.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726008AbeKBVQK (ORCPT ); Fri, 2 Nov 2018 17:16:10 -0400 Received: from DGGEMS410-HUB.china.huawei.com (unknown [172.30.72.58]) by Forcepoint Email with ESMTP id F056097B68C14; Fri, 2 Nov 2018 20:09:06 +0800 (CST) Received: from [127.0.0.1] (10.202.226.41) by DGGEMS410-HUB.china.huawei.com (10.3.19.210) with Microsoft SMTP Server id 14.3.408.0; Fri, 2 Nov 2018 20:08:59 +0800 Subject: Re: Crash report: Broken NUMA distance map causes crash on arm64 system To: Peter Zijlstra References: <20181030092640.GE1459@hirez.programming.kicks-ass.net> <20181031204622.GB3141@hirez.programming.kicks-ass.net> <20181102105003.GK3178@hirez.programming.kicks-ass.net> CC: "devicetree@vger.kernel.org" , "Anshuman Khandual" , Catalin Marinas , Will Deacon , Linuxarm , , Rob Herring , Frank Rowand , Ingo Molnar , "linux-arm-kernel@lists.infradead.org" From: John Garry Message-ID: Date: Fri, 2 Nov 2018 12:08:53 +0000 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.3.0 MIME-Version: 1.0 In-Reply-To: <20181102105003.GK3178@hirez.programming.kicks-ass.net> Content-Type: text/plain; charset="windows-1252"; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: [10.202.226.41] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 02/11/2018 10:50, Peter Zijlstra wrote: > On Thu, Nov 01, 2018 at 10:01:05AM +0000, John Garry wrote: >> On 31/10/2018 20:46, Peter Zijlstra wrote: > >>>> I also note that if I apply the patch, below, to reject the invalid NUMA >>>> distance, we're still getting a warning/error: >>>> >>>> [ 7.144407] CPU: All CPU(s) started at EL2 >>>> [ 7.148678] alternatives: patching kernel code >>>> [ 7.153557] ERROR: Node-0 not representative >>>> [ 7.153557] >>>> [ 7.159365] 10 15 20 25 >>>> [ 7.162097] 15 10 25 30 >>>> [ 7.164832] 20 25 10 15 >>>> [ 7.167562] 25 30 15 10 >>> >>> Yeah, that's an 'obviously' broken topology too. >>> >> >> AFAICT, this conforms to ACPI spec SLIT rules, and the kernel SLIT >> validation allows this also. So maybe we should shout louder here or even >> mark the SLIT as invalid if totally broken. > > Right. Part of the problem is that I only have a few necessary > conditions (symmetry, trace-identity, node0-complete-distance) for the > distance table, but together they are not sufficient to determine if a > topology is 'good'. > > (also, strictly speaking, non-symmetric topologies are possible -- think > uni-directional mesh links -- we've just not ever seen and validated > those) > > That said; I can certainly make this code give up and warn harder on > those conditions. That sounds reasonable, but I think that you would know better on the policy of whether it's better to give up when conditions are not met. Regardless, a louder warn sounds good. > > How is something like the below? > > --- > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index 9d74371e4aad..41e703dd875f 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -1207,6 +1218,11 @@ sd_init(struct sched_domain_topology_level *tl, > return sd; > } > > +static const struct cpumask *cpu_system_mask(int cpu) > +{ > + return cpu_online_mask; > +} > + > /* > * Topology list, bottom-up. > */ > @@ -1337,7 +1353,7 @@ void sched_init_numa(void) > int level = 0; > int i, j, k; > > - sched_domains_numa_distance = kzalloc(sizeof(int) * nr_node_ids, GFP_KERNEL); > + sched_domains_numa_distance = kzalloc(sizeof(int) * (nr_node_ids + 1), GFP_KERNEL); Ahhhh, unfortuately this change in isolation is not fixing the crash for me. I'm still crashing in the same place repeatedly. Here's relevant log with KASAN enabled: [ 10.655465] ================================================================== [ 10.662772] BUG: KASAN: use-after-free in free_sched_groups.part.1+0x4c/0xf0 [ 10.669896] Read of size 8 at addr ffff801db4668000 by task swapper/0/1 [ 10.676578] [ 10.678079] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0-rc6-00018-gf417e94-dirty #945 [ 10.686432] Hardware name: Hisilicon Hip07 D05 Development Board (DT) [ 10.692940] Call trace: [ 10.695407] dump_backtrace+0x0/0x230 [ 10.699104] show_stack+0x14/0x20 [ 10.702450] dump_stack+0xa0/0xc4 [ 10.705796] print_address_description+0x60/0x270 [ 10.710546] kasan_report+0x258/0x360 [ 10.714243] __asan_load8+0x84/0xa8 [ 10.717764] free_sched_groups.part.1+0x4c/0xf0 [ 10.722339] destroy_sched_domain+0x50/0x118 [ 10.726651] cpu_attach_domain+0x140/0x450 [ 10.730787] build_sched_domains+0x1750/0x1820 [ 10.735274] sched_init_domains+0x8c/0xb0 [ 10.739324] sched_init_smp+0x2c/0x7c [ 10.743021] kernel_init_freeable+0x138/0x2f0 [ 10.747421] kernel_init+0x10/0x120 [ 10.750941] ret_from_fork+0x10/0x18 [ 10.754547] [ 10.756046] Allocated by task 1: [ 10.759303] kasan_kmalloc+0xd0/0x180 [ 10.763000] build_sched_domains+0x2d0/0x1820 [ 10.767400] sched_init_domains+0x8c/0xb0 [ 10.771448] sched_init_smp+0x2c/0x7c [ 10.775145] kernel_init_freeable+0x138/0x2f0 [ 10.779544] kernel_init+0x10/0x120 [ 10.783064] ret_from_fork+0x10/0x18 [ 10.786671] [ 10.788168] Freed by task 1: [ 10.791074] __kasan_slab_free+0x114/0x220 [ 10.795209] kasan_slab_free+0x10/0x18 [ 10.798993] kfree+0x74/0x210 [ 10.801986] free_sched_groups.part.1+0x74/0xf0 [ 10.806562] destroy_sched_domain+0x50/0x118 [ 10.810874] cpu_attach_domain+0x140/0x450 [ 10.815009] build_sched_domains+0x1750/0x1820 [ 10.819496] sched_init_domains+0x8c/0xb0 [ 10.823544] sched_init_smp+0x2c/0x7c [ 10.827241] kernel_init_freeable+0x138/0x2f0 [ 10.831639] kernel_init+0x10/0x120 [ 10.835159] ret_from_fork+0x10/0x18 [ 10.838766] [ 10.840264] The buggy address belongs to the object at ffff801db4668000 [ 10.840264] which belongs to the cache kmalloc-128 of size 128 [ 10.852926] The buggy address is located 0 bytes inside of [ 10.852926] 128-byte region [ffff801db4668000, ffff801db4668080) [ 10.864619] The buggy address belongs to the page: [ 10.869459] page:ffff7e0076d19a00 count:1 mapcount:0 mapping:ffff801dbac0fc00 index:0xffff801db4669f00 compound_mapcount: 0 [ 10.880717] flags: 0x1fffc00000008100(slab|head) [ 10.885383] raw: 1fffc00000008100 ffff7e0076d19888 ffff7e0076d19a88 ffff801dbac0fc00 [ 10.893211] raw: ffff801db4669f00 0000000000200001 00000001ffffffff 0000000000000000 [ 10.901037] page dumped because: kasan: bad access detected [ 10.906664] [ 10.908162] Memory state around the buggy address: [ 10.913001] ffff801db4667f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 10.920300] ffff801db4667f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 10.927599] >ffff801db4668000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 10.934897] ^ [ 10.938153] ffff801db4668080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 10.945453] ffff801db4668100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 10.952752] ================================================================== [ 10.960050] Disabling lock debugging due to kernel taint [ 10.982852] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 10.991738] Mem abort info: [ 10.994554] ESR = 0x96000004 [ 10.997641] Exception class = DABT (current EL), IL = 32 bits [ 11.003626] SET = 0, FnV = 0 [ 11.006706] EA = 0, S1PTW = 0 [ 11.009878] Data abort info: [ 11.012782] ISV = 0, ISS = 0x00000004 [ 11.016659] CM = 0, WnR = 0 [ 11.019651] [0000000000000000] user address but active_mm is swapper [ 11.026078] Internal error: Oops: 96000004 [#1] PREEMPT SMP [ 11.031707] Modules linked in: [ 11.034791] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G B 4.19.0-rc6-00018-gf417e94-dirty #945 [ 11.044550] Hardware name: Hisilicon Hip07 D05 Development Board (DT) [ 11.051058] pstate: 40000005 (nZcv daif -PAN -UAO) [ 11.055898] pc : free_sched_groups.part.1+0x4c/0xf0 [ 11.060825] lr : free_sched_groups.part.1+0x4c/0xf0 [ 11.065750] sp : ffff801db5d4fb60 [ 11.069093] x29: ffff801db5d4fb60 x28: 00000000ffff0000 [ 11.074462] x27: ffff2000094ad000 x26: ffff801db4b34000 [ 11.079830] x25: ffff200008c7e080 x24: 0000000000000001 [ 11.085197] x23: ffff200008c7e0c0 x22: ffff801db45b9810 [ 11.090565] x21: ffff801db45c8000 x20: 0000000000000000 [ 11.095932] x19: 0000000000000000 x18: 0000000034d5d91d [ 11.101300] x17: 00000000b44797ff x16: 0000000000000000 [ 11.106667] x15: 0000000000000000 x14: 3d6769726f5f6773 [ 11.112035] x13: 2030303230633534 x12: 6264313038666666 [ 11.117403] x11: 6678303d6367733e x10: 2d677320313d6367 [ 11.122770] x9 : 735f656572662073 x8 : 3038633534626431 [ 11.128138] x7 : 3038666666667830 x6 : 0000000000000000 [ 11.133505] x5 : ffff2000094ad6c8 x4 : 0000000000000000 [ 11.138872] x3 : ffff200008154bf4 x2 : 0000000000000000 [ 11.144239] x1 : 176366c3c3548000 x0 : 0000000000000000 [ 11.149608] Process swapper/0 (pid: 1, stack limit = 0x(____ptrval____)) [ 11.156379] Call trace: [ 11.158845] free_sched_groups.part.1+0x4c/0xf0 [ 11.163421] destroy_sched_domain+0x50/0x118 [ 11.167732] cpu_attach_domain+0x140/0x450 [ 11.171868] build_sched_domains+0x1750/0x1820 [ 11.176356] sched_init_domains+0x8c/0xb0 [ 11.180404] sched_init_smp+0x2c/0x7c [ 11.184100] kernel_init_freeable+0x138/0x2f0 [ 11.188499] kernel_init+0x10/0x120 [ 11.192020] ret_from_fork+0x10/0x18 [ 11.195629] Code: eb15029f 54000200 aa1303e0 94059f92 (f9400274) [ 11.201867] ---[ end trace 794769e2506810a5 ]--- [ 11.206558] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b [ 11.206558] [ 11.215809] SMP: stopping secondary CPUs [ 11.219795] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b [ 11.219795] ]--- The KASAN bug report is essentially the same with and without this change. I did not try the full patch, and I guess the problem would go away after the full patch is applied, but you may want to know the issue still.. Thanks, John > if (!sched_domains_numa_distance) > return; > > @@ -1353,39 +1369,22 @@ void sched_init_numa(void) > * node_distance(i,j) in order to avoid cubic time. > */ > next_distance = curr_distance; > + > for (i = 0; i < nr_node_ids; i++) { > for (j = 0; j < nr_node_ids; j++) { > - for (k = 0; k < nr_node_ids; k++) { > - int distance = node_distance(i, k); > - > - if (distance > curr_distance && > - (distance < next_distance || > - next_distance == curr_distance)) > - next_distance = distance; > - > - /* > - * While not a strong assumption it would be nice to know > - * about cases where if node A is connected to B, B is not > - * equally connected to A. > - */ > - if (sched_debug() && node_distance(k, i) != distance) > - sched_numa_warn("Node-distance not symmetric"); > - > - if (sched_debug() && i && !find_numa_distance(distance)) > - sched_numa_warn("Node-0 not representative"); > - } > - if (next_distance != curr_distance) { > - sched_domains_numa_distance[level++] = next_distance; > - sched_domains_numa_levels = level; > - curr_distance = next_distance; > - } else break; > - } > + int distance = node_distance(0, j); > > - /* > - * In case of sched_debug() we verify the above assumption. > - */ > - if (!sched_debug()) > - break; > + if (distance > curr_distance && > + (distance < next_distance || > + next_distance == curr_distance)) > + next_distance = distance; > + > + } > + if (next_distance != curr_distance) { > + sched_domains_numa_distance[level++] = next_distance; > + sched_domains_numa_levels = level; > + curr_distance = next_distance; > + } else break; > } > > /* > @@ -1428,7 +1427,24 @@ void sched_init_numa(void) > sched_domains_numa_masks[i][j] = mask; > > for_each_node(k) { > - if (node_distance(j, k) > sched_domains_numa_distance[i]) > + int distance = node_distance(j, k); > + > + if (!i && j != k) { > + sched_numa_warn("Non-trace locality\n"); > + goto fail; > + } > + > + if (distance > curr_distance) { > + sched_numa_warn("Node-0 not complete\n"); > + goto fail; > + } > + > + if (distance != node_distance(k, j)) { > + sched_numa_warn("Non symmetric\n"); > + goto fail; > + } > + > + if (distance > sched_domains_numa_distance[i]) > continue; > > cpumask_or(mask, mask, cpumask_of_node(k)); > @@ -1437,9 +1453,10 @@ void sched_init_numa(void) > } > > /* Compute default topology size */ > - for (i = 0; sched_domain_topology[i].mask; i++); > + for (k = 0; sched_domain_topology[k].mask; k++) > + ; > > - tl = kzalloc((i + level + 1) * > + tl = kzalloc((k + level + 1) * > sizeof(struct sched_domain_topology_level), GFP_KERNEL); > if (!tl) > return; > @@ -1447,7 +1464,7 @@ void sched_init_numa(void) > /* > * Copy the default topology bits.. > */ > - for (i = 0; sched_domain_topology[i].mask; i++) > + for (i = 0; i < k; i++) > tl[i] = sched_domain_topology[i]; > > /* > @@ -1478,6 +1495,31 @@ void sched_init_numa(void) > sched_max_numa_distance = sched_domains_numa_distance[level - 1]; > > init_numa_topology_type(); > + > + return; > + > +fail: > + /* Compute default topology size */ > + for (k = 0; sched_domain_topology[k].mask; k++) > + ; > + > + tl = kzalloc((k + 2) * > + sizeof(struct sched_domain_topology_level), GFP_KERNEL); > + if (!tl) > + return; > + > + /* > + * Copy the default topology bits.. > + */ > + for (i = 0; i < k; i++) > + tl[i] = sched_domain_topology[i]; > + > + tl[i] = (struct sched_domain_topology_level){ > + .mask = cpu_system_mask, > + SD_INIT_NAME(SYSTEM) > + }; > + > + sched_domain_topology = tl; > } > > void sched_domains_numa_masks_set(unsigned int cpu) > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel > > . >