* Allocation failure of ring buffer for trace
@ 2017-11-13 17:48 YASUAKI ISHIMATSU
  2017-11-13 23:53 ` Yang Shi
  2017-11-14 11:46 ` Mel Gorman
  0 siblings, 2 replies; 8+ messages in thread

From: YASUAKI ISHIMATSU @ 2017-11-13 17:48 UTC (permalink / raw)
  To: Mel Gorman; +Cc: rostedt, mingo, linux-kernel, linux-mm, koki.sanagi

When using the trace_buf_size= boot option, memory allocation of the ring
buffer for trace fails as follows:

[ ] x86: Booting SMP configuration:
[ ] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23
[ ] .... node #1, CPUs: #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47
[ ] .... node #2, CPUs: #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 #58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71
[ ] .... node #3, CPUs: #72 #73 #74 #75 #76 #77 #78 #79 #80 #81 #82 #83 #84 #85 #86 #87 #88 #89 #90 #91 #92 #93 #94 #95
[ ] .... node #4, CPUs: #96 #97 #98 #99 #100 #101 #102 #103 #104 #105 #106 #107 #108 #109 #110 #111 #112 #113 #114 #115 #116 #117 #118 #119
[ ] .... node #5, CPUs: #120 #121 #122 #123 #124 #125 #126 #127 #128 #129 #130 #131 #132 #133 #134 #135 #136 #137 #138 #139 #140 #141 #142 #143
[ ] .... node #6, CPUs: #144 #145 #146 #147 #148 #149 #150 #151 #152 #153 #154
[ ] swapper/0: page allocation failure: order:0, mode:0x16004c0(GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOTRACK), nodemask=(null)
[ ] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.0-rc8+ #13
[ ] Hardware name: ...
[ ] Call Trace:
[ ]  dump_stack+0x63/0x89
[ ]  warn_alloc+0x114/0x1c0
[ ]  ? _find_next_bit+0x60/0x60
[ ]  __alloc_pages_slowpath+0x9a6/0xba7
[ ]  __alloc_pages_nodemask+0x26a/0x290
[ ]  new_slab+0x297/0x500
[ ]  ___slab_alloc+0x335/0x4a0
[ ]  ? __rb_allocate_pages+0xae/0x180
[ ]  ? __rb_allocate_pages+0xae/0x180
[ ]  __slab_alloc+0x40/0x66
[ ]  __kmalloc_node+0xbd/0x270
[ ]  __rb_allocate_pages+0xae/0x180
[ ]  rb_allocate_cpu_buffer+0x204/0x2f0
[ ]  trace_rb_cpu_prepare+0x7e/0xc5
[ ]  cpuhp_invoke_callback+0x3ea/0x5c0
[ ]  ? init_idle+0x1a7/0x1c0
[ ]  ? ring_buffer_record_is_on+0x20/0x20
[ ]  _cpu_up+0xbc/0x190
[ ]  do_cpu_up+0x87/0xb0
[ ]  cpu_up+0x13/0x20
[ ]  smp_init+0x69/0xca
[ ]  kernel_init_freeable+0x115/0x244
[ ]  ? rest_init+0xb0/0xb0
[ ]  kernel_init+0xe/0x109
[ ]  ret_from_fork+0x25/0x30
[ ] Mem-Info:
[ ] active_anon:0 inactive_anon:0 isolated_anon:0
[ ]  active_file:0 inactive_file:0 isolated_file:0
[ ]  unevictable:0 dirty:0 writeback:0 unstable:0
[ ]  slab_reclaimable:1260 slab_unreclaimable:489185
[ ]  mapped:0 shmem:0 pagetables:0 bounce:0
[ ]  free:46 free_pcp:1421 free_cma:0
.
[ ] failed to allocate ring buffer on CPU 155

In my server, there are 384 CPUs, 512 GB of memory and 8 nodes, and
"trace_buf_size=100M" is set.

With trace_buf_size=100M, the kernel allocates 100 MB of memory per CPU
before calling free_area_init_core(). The kernel therefore tries to
allocate 38.4 GB (100 MB * 384 CPUs) of memory, but the memory available
at that point is only about 16 GB (2 GB * 8 nodes) due to the following
commit:

3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages
if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")

So the allocation fails.

Thanks,
Yasuaki Ishimatsu

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: Allocation failure of ring buffer for trace
  2017-11-13 17:48 Allocation failure of ring buffer for trace YASUAKI ISHIMATSU
@ 2017-11-13 23:53 ` Yang Shi
  2017-11-14 11:46 ` Mel Gorman
  1 sibling, 0 replies; 8+ messages in thread

From: Yang Shi @ 2017-11-13 23:53 UTC (permalink / raw)
  To: YASUAKI ISHIMATSU
  Cc: Mel Gorman, rostedt, mingo, Linux Kernel Mailing List, linux-mm, koki.sanagi

AFAIK, CONFIG_DEFERRED_STRUCT_PAGE_INIT initialises only a small number
of page structs at first, then defers the initialisation of the remaining
page structs to kernel threads (one thread per node, called
pgdatinit0/1/2/3). So, if your trace buffer allocation happens *before*
those kernel threads have finished the page struct initialisation, you can
run into this case.

Yang

On Mon, Nov 13, 2017 at 9:48 AM, YASUAKI ISHIMATSU <yasu.isimatu@gmail.com> wrote:
> When using the trace_buf_size= boot option, memory allocation of the ring
> buffer for trace fails as follows:
>
> [ ] x86: Booting SMP configuration:
> <SNIP>
>
> In my server, there are 384 CPUs, 512 GB of memory and 8 nodes, and
> "trace_buf_size=100M" is set.
>
> With trace_buf_size=100M, the kernel allocates 100 MB of memory per CPU
> before calling free_area_init_core(). The kernel therefore tries to
> allocate 38.4 GB (100 MB * 384 CPUs) of memory, but the memory available
> at that point is only about 16 GB (2 GB * 8 nodes) due to the following
> commit:
>
> 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages
> if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
>
> So the allocation fails.
>
> Thanks,
> Yasuaki Ishimatsu
* Re: Allocation failure of ring buffer for trace
  2017-11-13 17:48 Allocation failure of ring buffer for trace YASUAKI ISHIMATSU
  2017-11-13 23:53 ` Yang Shi
@ 2017-11-14 11:46 ` Mel Gorman
  2017-11-14 15:39 ` YASUAKI ISHIMATSU
  2017-11-15  4:11 ` YASUAKI ISHIMATSU
  1 sibling, 2 replies; 8+ messages in thread

From: Mel Gorman @ 2017-11-14 11:46 UTC (permalink / raw)
  To: YASUAKI ISHIMATSU; +Cc: rostedt, mingo, linux-kernel, linux-mm, koki.sanagi

On Mon, Nov 13, 2017 at 12:48:36PM -0500, YASUAKI ISHIMATSU wrote:
> When using the trace_buf_size= boot option, memory allocation of the ring
> buffer for trace fails as follows:
>
> [ ] x86: Booting SMP configuration:
> <SNIP>
>
> In my server, there are 384 CPUs, 512 GB of memory and 8 nodes, and
> "trace_buf_size=100M" is set.
>
> With trace_buf_size=100M, the kernel allocates 100 MB of memory per CPU
> before calling free_area_init_core(). The kernel therefore tries to
> allocate 38.4 GB (100 MB * 384 CPUs) of memory, but the memory available
> at that point is only about 16 GB (2 GB * 8 nodes) due to the following
> commit:
>
> 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages
> if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
>

1. What is the use case for such a large trace buffer being allocated at
   boot time?
2. Is disabling CONFIG_DEFERRED_STRUCT_PAGE_INIT at compile time an
   option for you, given that it's a custom-built kernel and not a
   distribution kernel?

Basically, as the allocation context is within smp_init(), there are no
opportunities to do the deferred meminit early. Furthermore, the partial
initialisation of memory occurs before the size of the trace buffers is
set, so there is no opportunity to adjust the amount of memory that is
pre-initialised. We could potentially catch when memory is low during
system boot and adjust the amount that is initialised serially, but the
complexity would be high. Given that deferred meminit is basically a minor
optimisation that only affects very large machines, and trace_buf_size
being used is somewhat specialised, I think the most straightforward option
is to go back to serialised meminit if trace_buf_size is specified, like
this:

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 710143741eb5..6ef0ab13f774 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -558,6 +558,19 @@ void drain_local_pages(struct zone *zone);
 
 void page_alloc_init_late(void);
 
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+extern void __init disable_deferred_meminit(void);
+extern void page_alloc_init_late_prepare(void);
+#else
+static inline void disable_deferred_meminit(void)
+{
+}
+
+static inline void page_alloc_init_late_prepare(void)
+{
+}
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+
 /*
  * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
  * GFP flags are used before interrupts are enabled. Once interrupts are
diff --git a/init/main.c b/init/main.c
index 0ee9c6866ada..0248b8b5bc3a 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1058,6 +1058,8 @@ static noinline void __init kernel_init_freeable(void)
 	do_pre_smp_initcalls();
 	lockup_detector_init();
 
+	page_alloc_init_late_prepare();
+
 	smp_init();
 	sched_init_smp();
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 752e5daf0896..cfa7175ff093 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1115,6 +1115,13 @@ static int __init set_buf_size(char *str)
 	if (buf_size == 0)
 		return 0;
 	trace_buf_size = buf_size;
+
+	/*
+	 * The size of buffers are unpredictable so initialise all memory
+	 * before the allocation attempt occurs.
+	 */
+	disable_deferred_meminit();
+
 	return 1;
 }
 __setup("trace_buf_size=", set_buf_size);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77e4d3c5c57b..4dd0e153b0f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -290,6 +290,19 @@ EXPORT_SYMBOL(nr_online_nodes);
 int page_group_by_mobility_disabled __read_mostly;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+bool __initdata deferred_meminit_disabled;
+
+/*
+ * Allow deferred meminit to be disabled by subsystems that require large
+ * allocations before the memory allocator is fully initialised. It should
+ * only be used in cases where the size of the allocation may not fit into
+ * the 2G per node that is allocated serially.
+ */
+void __init disable_deferred_meminit(void)
+{
+	deferred_meminit_disabled = true;
+}
+
 static inline void reset_deferred_meminit(pg_data_t *pgdat)
 {
 	unsigned long max_initialise;
@@ -1567,6 +1580,23 @@ static int __init deferred_init_memmap(void *data)
 }
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+/*
+ * Serialised init of remaining memory if large buffers of unknown size
+ * are required that might fail before parallelised meminit can start
+ */
+void __init page_alloc_init_late_prepare(void)
+{
+	int nid;
+
+	if (!deferred_meminit_disabled)
+		return;
+
+	for_each_node_state(nid, N_MEMORY)
+		deferred_init_memmap(NODE_DATA(nid));
+}
+#endif
+
 void __init page_alloc_init_late(void)
 {
 	struct zone *zone;

--
Mel Gorman
SUSE Labs
* Re: Allocation failure of ring buffer for trace
  2017-11-14 11:46 ` Mel Gorman
@ 2017-11-14 15:39 ` YASUAKI ISHIMATSU
  2017-11-14 15:53 ` Mel Gorman
  2017-11-15  4:11 ` YASUAKI ISHIMATSU
  1 sibling, 1 reply; 8+ messages in thread

From: YASUAKI ISHIMATSU @ 2017-11-14 15:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: rostedt, mingo, linux-kernel, linux-mm, koki.sanagi, yasu.isimatu

On 11/14/2017 06:46 AM, Mel Gorman wrote:
> On Mon, Nov 13, 2017 at 12:48:36PM -0500, YASUAKI ISHIMATSU wrote:
>> When using the trace_buf_size= boot option, memory allocation of the ring
>> buffer for trace fails as follows:
>>
>> <SNIP>
>
> 1. What is the use case for such a large trace buffer being allocated at
>    boot time?

I'm not sure of the use case. I found the following commit log:

commit 864b9a393dcb5aed09b8fd31b9bbda0fdda99374
Author: Michal Hocko <mhocko@suse.com>
Date:   Fri Jun 2 14:46:49 2017 -0700

    mm: consider memblock reservations for deferred memory initialization sizing

So I thought similar memory exhaustion might occur with other boot options,
and I reproduced the issue.

> 2. Is disabling CONFIG_DEFERRED_STRUCT_PAGE_INIT at compile time an
>    option for you given that it's a custom-built kernel and not a
>    distribution kernel?

The issue also occurred on distribution kernels, so we have to fix it.

Thanks,
Yasuaki Ishimatsu

> Basically, as the allocation context is within smp_init(), there are no
> opportunities to do the deferred meminit early. Given that deferred
> meminit is basically a minor optimisation that only affects very large
> machines, I think the most straightforward option is to go back to
> serialised meminit if trace_buf_size is specified, like this:
>
> <SNIP>
* Re: Allocation failure of ring buffer for trace
  2017-11-14 15:39 ` YASUAKI ISHIMATSU
@ 2017-11-14 15:53 ` Mel Gorman
  2017-11-14 16:40 ` YASUAKI ISHIMATSU
  2017-11-14 17:05 ` Mel Gorman
  0 siblings, 2 replies; 8+ messages in thread

From: Mel Gorman @ 2017-11-14 15:53 UTC (permalink / raw)
  To: YASUAKI ISHIMATSU; +Cc: rostedt, mingo, linux-kernel, linux-mm, koki.sanagi

On Tue, Nov 14, 2017 at 10:39:19AM -0500, YASUAKI ISHIMATSU wrote:
> On 11/14/2017 06:46 AM, Mel Gorman wrote:
>> On Mon, Nov 13, 2017 at 12:48:36PM -0500, YASUAKI ISHIMATSU wrote:
>>> <SNIP>
>>
>> 1. What is the use case for such a large trace buffer being allocated at
>>    boot time?
>
> I'm not sure of the use case. I found the following commit log:
>
> commit 864b9a393dcb5aed09b8fd31b9bbda0fdda99374
> Author: Michal Hocko <mhocko@suse.com>
> Date:   Fri Jun 2 14:46:49 2017 -0700
>
>     mm: consider memblock reservations for deferred memory initialization sizing
>
> So I thought similar memory exhaustion might occur with other boot options,
> and I reproduced the issue.

That was different: it was a premature OOM caused by reservations that
were of a known size. It's not related to trace_buf_size in any fashion.

>> 2. Is disabling CONFIG_DEFERRED_STRUCT_PAGE_INIT at compile time an
>>    option for you given that it's a custom-built kernel and not a
>>    distribution kernel?
>
> The issue also occurred on distribution kernels, so we have to fix it.

I'm aware of now bugs against a distribution kernel. However, does the
patch work for you?

--
Mel Gorman
SUSE Labs
* Re: Allocation failure of ring buffer for trace
  2017-11-14 15:53 ` Mel Gorman
@ 2017-11-14 16:40 ` YASUAKI ISHIMATSU
  2017-11-14 17:05 ` Mel Gorman
  1 sibling, 0 replies; 8+ messages in thread

From: YASUAKI ISHIMATSU @ 2017-11-14 16:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: rostedt, mingo, linux-kernel, linux-mm, koki.sanagi, yasu.isimatu

On 11/14/2017 10:53 AM, Mel Gorman wrote:
> On Tue, Nov 14, 2017 at 10:39:19AM -0500, YASUAKI ISHIMATSU wrote:
>> <SNIP>
>>
>> So I thought similar memory exhaustion might occur with other boot options,
>> and I reproduced the issue.
>
> That was different: it was a premature OOM caused by reservations that
> were of a known size. It's not related to trace_buf_size in any fashion.

Yes, I know they are different bugs. I thought memory exhaustion at boot
time might also occur with other boot options, so I tried the
trace_buf_size boot option.

>> The issue also occurred on distribution kernels, so we have to fix it.
>
> I'm aware of now bugs against a distribution kernel. However, does the
> patch work for you?

I'll apply it.

Thanks,
Yasuaki Ishimatsu
* Re: Allocation failure of ring buffer for trace
  2017-11-14 15:53 ` Mel Gorman
  2017-11-14 16:40 ` YASUAKI ISHIMATSU
@ 2017-11-14 17:05 ` Mel Gorman
  1 sibling, 0 replies; 8+ messages in thread

From: Mel Gorman @ 2017-11-14 17:05 UTC (permalink / raw)
  To: YASUAKI ISHIMATSU; +Cc: rostedt, mingo, linux-kernel, linux-mm, koki.sanagi

On Tue, Nov 14, 2017 at 03:53:27PM +0000, Mel Gorman wrote:
>> The issue also occurred on distribution kernels, so we have to fix it.
>
> I'm aware of now bugs against a distribution kernel.

I don't know what happened there. I'm *not* aware of any bugs against a
distribution kernel.

--
Mel Gorman
SUSE Labs
* Re: Allocation failure of ring buffer for trace
  2017-11-14 11:46 ` Mel Gorman
  2017-11-14 15:39 ` YASUAKI ISHIMATSU
@ 2017-11-15  4:11 ` YASUAKI ISHIMATSU
  1 sibling, 0 replies; 8+ messages in thread

From: YASUAKI ISHIMATSU @ 2017-11-15 4:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: rostedt, mingo, linux-kernel, linux-mm, koki.sanagi, yasu.isimatu

Hi Mel,

Your patch works well. Here are the results with your patch applied.

- Boot without the trace_buf_size boot option

  When the system boots without the trace_buf_size boot option,
  deferred_init_memmap() runs after the SMP configuration has been brought
  up. There is no change in the boot sequence between 4.14.0 and 4.14.0
  with your patch.

[ 0.256285] x86: Booting SMP configuration:
...
[ 5.313195] node 0 initialised, 15530251 pages in 653ms
[ 5.330691] node 1 initialised, 15988494 pages in 670ms
[ 5.331746] node 2 initialised, 15988493 pages in 671ms
[ 5.332166] node 6 initialised, 15982779 pages in 670ms
[ 5.332673] node 3 initialised, 15988494 pages in 671ms
[ 5.332618] node 4 initialised, 15988494 pages in 672ms
[ 5.334187] node 7 initialised, 15987304 pages in 672ms
[ 5.334976] node 5 initialised, 15988494 pages in 673ms

- Boot with the trace_buf_size boot option

  When the system boots with the trace_buf_size boot option,
  deferred_init_memmap() runs before the SMP configuration is brought up,
  so every node's memory is fully initialised before the trace buffer is
  allocated, and the system boots even with the trace_buf_size boot
  option set.

[ 0.932114] node 0 initialised, 15530251 pages in 684ms
[ 1.604918] node 1 initialised, 15988494 pages in 671ms
[ 2.278933] node 2 initialised, 15988494 pages in 673ms
[ 2.965076] node 3 initialised, 15988494 pages in 686ms
[ 3.669064] node 4 initialised, 15988494 pages in 703ms
[ 4.354983] node 5 initialised, 15988493 pages in 684ms
[ 5.028681] node 6 initialised, 15982779 pages in 673ms
[ 5.716102] node 7 initialised, 15987304 pages in 687ms
[ 5.727855] smp: Bringing up secondary CPUs ...
[ 5.745937] x86: Booting SMP configuration:

Thanks,
Yasuaki Ishimatsu

On 11/14/2017 06:46 AM, Mel Gorman wrote:
> Given that deferred meminit is basically a minor optimisation that only
> affects very large machines, and trace_buf_size being used is somewhat
> specialised, I think the most straightforward option is to go back to
> serialised meminit if trace_buf_size is specified, like this:
>
> <SNIP>
end of thread, other threads:[~2017-11-15 4:11 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-13 17:48 Allocation failure of ring buffer for trace YASUAKI ISHIMATSU
2017-11-13 23:53 ` Yang Shi
2017-11-14 11:46 ` Mel Gorman
2017-11-14 15:39 ` YASUAKI ISHIMATSU
2017-11-14 15:53 ` Mel Gorman
2017-11-14 16:40 ` YASUAKI ISHIMATSU
2017-11-14 17:05 ` Mel Gorman
2017-11-15  4:11 ` YASUAKI ISHIMATSU