From: Michal Hocko <mhocko@kernel.org>
To: Pingfan Liu <kernelfans@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>
Subject: Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline
Date: Fri, 7 Dec 2018 12:30:44 +0100
Message-ID: <20181207113044.GB1286@dhcp22.suse.cz>
In-Reply-To: <CAFgQCTsFBUcOE9UKQ2vz=hg2FWp_QurZMQmJZ2wYLBqXkFHKHQ@mail.gmail.com>

On Fri 07-12-18 17:40:09, Pingfan Liu wrote:
> On Fri, Dec 7, 2018 at 3:53 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Fri 07-12-18 10:56:51, Pingfan Liu wrote:
> > [...]
> > > In a short word, the fix method should consider about the two factors:
> > > semantic of online-node and the effect on all archs
> >
> > I am pretty sure there is a lot of room for unification in this area.
> > Nevertheless I strongly believe the bug should be fixed first with the
> > simplest way and all the cleanup should be done on top.
> >
> > Do I get it right that the diff worked for you and I can prepare a full
> > patch?
> >
> Sure, I am glad to test your new patch.

From 46e68be89d9c299fd497b2b8bea3f2add144f17f Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 7 Dec 2018 12:23:32 +0100
Subject: [PATCH] x86, numa: always initialize all possible nodes

Pingfan Liu has reported the following splat
[    5.772742] BUG: unable to handle kernel paging request at 0000000000002088
[    5.773618] PGD 0 P4D 0
[    5.773618] Oops: 0000 [#1] SMP NOPTI
[    5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
[    5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[    5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
[    5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89 e1 44 89 e6 89
[    5.773618] RSP: 0018:ffffaa600005fb20 EFLAGS: 00010246
[    5.773618] RAX: 0000000000000000 RBX: 00000000006012c0 RCX: 0000000000000000
[    5.773618] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
[    5.773618] RBP: 00000000006012c0 R08: 0000000000000000 R09: 0000000000000002
[    5.773618] R10: 00000000006080c0 R11: 0000000000000002 R12: 0000000000000000
[    5.773618] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000002
[    5.773618] FS:  0000000000000000(0000) GS:ffff8c69afe00000(0000) knlGS:0000000000000000
[    5.773618] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.773618] CR2: 0000000000002088 CR3: 000000087e00a000 CR4: 00000000003406e0
[    5.773618] Call Trace:
[    5.773618]  new_slab+0xa9/0x570
[    5.773618]  ___slab_alloc+0x375/0x540
[    5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[    5.773618]  __slab_alloc+0x1c/0x38
[    5.773618]  __kmalloc_node_track_caller+0xc8/0x270
[    5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[    5.773618]  devm_kmalloc+0x28/0x60
[    5.773618]  pinctrl_bind_pins+0x2b/0x2a0
[    5.773618]  really_probe+0x73/0x420
[    5.773618]  driver_probe_device+0x115/0x130
[    5.773618]  __driver_attach+0x103/0x110
[    5.773618]  ? driver_probe_device+0x130/0x130
[    5.773618]  bus_for_each_dev+0x67/0xc0
[    5.773618]  ? klist_add_tail+0x3b/0x70
[    5.773618]  bus_add_driver+0x41/0x260
[    5.773618]  ? pcie_port_setup+0x4d/0x4d
[    5.773618]  driver_register+0x5b/0xe0
[    5.773618]  ? pcie_port_setup+0x4d/0x4d
[    5.773618]  do_one_initcall+0x4e/0x1d4
[    5.773618]  ? init_setup+0x25/0x28
[    5.773618]  kernel_init_freeable+0x1c1/0x26e
[    5.773618]  ? loglevel+0x5b/0x5b
[    5.773618]  ? rest_init+0xb0/0xb0
[    5.773618]  kernel_init+0xa/0x110
[    5.773618]  ret_from_fork+0x22/0x40
[    5.773618] Modules linked in:
[    5.773618] CR2: 0000000000002088
[    5.773618] ---[ end trace 1030c9120a03d081 ]---

with his AMD machine with the following topology
  NUMA node0 CPU(s):     0,8,16,24
  NUMA node1 CPU(s):     2,10,18,26
  NUMA node2 CPU(s):     4,12,20,28
  NUMA node3 CPU(s):     6,14,22,30
  NUMA node4 CPU(s):     1,9,17,25
  NUMA node5 CPU(s):     3,11,19,27
  NUMA node6 CPU(s):     5,13,21,29
  NUMA node7 CPU(s):     7,15,23,31

[    0.007418] Early memory node ranges
[    0.007419]   node   1: [mem 0x0000000000001000-0x000000000008efff]
[    0.007420]   node   1: [mem 0x0000000000090000-0x000000000009ffff]
[    0.007422]   node   1: [mem 0x0000000000100000-0x000000005c3d6fff]
[    0.007422]   node   1: [mem 0x00000000643df000-0x0000000068ff7fff]
[    0.007423]   node   1: [mem 0x000000006c528000-0x000000006fffffff]
[    0.007424]   node   1: [mem 0x0000000100000000-0x000000047fffffff]
[    0.007425]   node   5: [mem 0x0000000480000000-0x000000087effffff]

and nr_cpus set to 4. The underlying reason is that the device is bound
to node 2, which doesn't have any memory, and init_cpu_to_node only
initializes memory-less nodes for possible cpus, which nr_cpus
restricts. This in turn means that proper zonelists are not allocated
and the page allocator blows up.

Fix the issue by moving init_memory_less_node into numa_register_memblks
and always initialize all possible nodes consistently at a single place.

Reported-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/x86/mm/numa.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f5408bf7..4575ae4d5449 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -527,6 +527,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
 	}
 }
 
+static void __init init_memory_less_node(int nid)
+{
+	unsigned long zones_size[MAX_NR_ZONES] = {0};
+	unsigned long zholes_size[MAX_NR_ZONES] = {0};
+
+	free_area_init_node(nid, zones_size, 0, zholes_size);
+
+	/*
+	 * All zonelists will be built later in start_kernel() after per cpu
+	 * areas are initialized.
+	 */
+}
+
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
 	unsigned long uninitialized_var(pfn_align);
@@ -592,6 +605,8 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			continue;
 
 		alloc_node_data(nid);
+		if (!end)
+			init_memory_less_node(nid);
 	}
 
 	/* Dump memblock with node info and return. */
@@ -721,21 +736,6 @@ void __init x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }
 
-static void __init init_memory_less_node(int nid)
-{
-	unsigned long zones_size[MAX_NR_ZONES] = {0};
-	unsigned long zholes_size[MAX_NR_ZONES] = {0};
-
-	/* Allocate and initialize node data. Memory-less node is now online.*/
-	alloc_node_data(nid);
-	free_area_init_node(nid, zones_size, 0, zholes_size);
-
-	/*
-	 * All zonelists will be built later in start_kernel() after per cpu
-	 * areas are initialized.
-	 */
-}
-
 /*
  * Setup early cpu_to_node.
  *
@@ -763,9 +763,6 @@ void __init init_cpu_to_node(void)
 		if (node == NUMA_NO_NODE)
 			continue;
 
-		if (!node_online(node))
-			init_memory_less_node(node);
-
 		numa_set_node(cpu, node);
 	}
 }
-- 
2.19.2

-- 
Michal Hocko
SUSE Labs
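
The failure mode described in the changelog can be modeled in a few lines
of userspace C. This is only a sketch under stated assumptions, not kernel
code: every name in it (model_cpu_to_node, model_node_init,
allocation_on_node_is_safe, the flow enum) is hypothetical, the CPU-to-node
table is copied from the topology quoted in the report, and nr_cpus=4 is
assumed to leave CPUs 0-3 as the only possible CPUs. The "after" flow models
the intent stated in the changelog (every possible node gets initialized in
one place), not the exact control flow of the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_NODES 8 /* nodes 0-7 in the reported topology */

/* From the "Early memory node ranges" output: only nodes 1 and 5 have memory. */
static const bool node_has_memory[NR_NODES] = { [1] = true, [5] = true };

enum init_flow {
	FLOW_BEFORE_PATCH,         /* memory-less nodes set up per possible CPU */
	FLOW_INTENDED_AFTER_PATCH, /* all possible nodes set up in one place */
};

/* CPU -> node mapping taken from the reported topology (hypothetical helper). */
static int model_cpu_to_node(int cpu)
{
	static const int node_of_cpu_mod8[8] = { 0, 4, 1, 5, 2, 6, 3, 7 };

	return node_of_cpu_mod8[cpu % 8];
}

/* Record which nodes end up with initialized node data in each flow. */
static void model_node_init(enum init_flow flow, int nr_possible_cpus,
			    bool initialized[NR_NODES])
{
	/* Nodes with memory are set up in both flows. */
	for (int nid = 0; nid < NR_NODES; nid++)
		initialized[nid] = node_has_memory[nid];

	if (flow == FLOW_BEFORE_PATCH) {
		/* Only nodes reachable from a possible CPU get initialized. */
		for (int cpu = 0; cpu < nr_possible_cpus; cpu++)
			initialized[model_cpu_to_node(cpu)] = true;
	} else {
		/* Stated intent of the patch: cover every possible node. */
		for (int nid = 0; nid < NR_NODES; nid++)
			initialized[nid] = true;
	}
}

/* Would a node-local allocation for a device on @nid find valid node data? */
static bool allocation_on_node_is_safe(enum init_flow flow,
				       int nr_possible_cpus, int nid)
{
	bool initialized[NR_NODES];

	assert(nid >= 0 && nid < NR_NODES);
	model_node_init(flow, nr_possible_cpus, initialized);
	return initialized[nid];
}
```

Under these assumptions, allocation_on_node_is_safe(FLOW_BEFORE_PATCH, 4, 2)
is false, since CPUs 0-3 only reach nodes 0, 4, 1 and 5, while the same
query with FLOW_INTENDED_AFTER_PATCH is true. That matches the report of a
device bound to memory-less node 2.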