* Allocation failure of ring buffer for trace
@ 2017-11-13 17:48 ` YASUAKI ISHIMATSU
  0 siblings, 0 replies; 15+ messages in thread
From: YASUAKI ISHIMATSU @ 2017-11-13 17:48 UTC (permalink / raw)
  To: Mel Gorman; +Cc: rostedt, mingo, linux-kernel, linux-mm, koki.sanagi

When using trace_buf_size= boot option, memory allocation of ring buffer
for trace fails as follows:

[ ] x86: Booting SMP configuration:
[ ] .... node  #0, CPUs:          #1   #2   #3   #4   #5   #6   #7   #8   #9  #10  #11  #12  #13  #14  #15  #16  #17  #18  #19  #20  #21  #22  #23
[ ] .... node  #1, CPUs:    #24  #25  #26  #27  #28  #29  #30  #31  #32  #33  #34  #35  #36  #37  #38  #39  #40  #41  #42  #43  #44  #45  #46  #47
[ ] .... node  #2, CPUs:    #48  #49  #50  #51  #52  #53  #54  #55  #56  #57  #58  #59  #60  #61  #62  #63  #64  #65  #66  #67  #68  #69  #70  #71
[ ] .... node  #3, CPUs:    #72  #73  #74  #75  #76  #77  #78  #79  #80  #81  #82  #83  #84  #85  #86  #87  #88  #89  #90  #91  #92  #93  #94  #95
[ ] .... node  #4, CPUs:    #96  #97  #98  #99 #100 #101 #102 #103 #104 #105 #106 #107 #108 #109 #110 #111 #112 #113 #114 #115 #116 #117 #118 #119
[ ] .... node  #5, CPUs:   #120 #121 #122 #123 #124 #125 #126 #127 #128 #129 #130 #131 #132 #133 #134 #135 #136 #137 #138 #139 #140 #141 #142 #143
[ ] .... node  #6, CPUs:   #144 #145 #146 #147 #148 #149 #150 #151 #152 #153 #154
[ ] swapper/0: page allocation failure: order:0, mode:0x16004c0(GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOTRACK), nodemask=(null)
[ ] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.0-rc8+ #13
[ ] Hardware name: ...
[ ] Call Trace:
[ ]  dump_stack+0x63/0x89
[ ]  warn_alloc+0x114/0x1c0
[ ]  ? _find_next_bit+0x60/0x60
[ ]  __alloc_pages_slowpath+0x9a6/0xba7
[ ]  __alloc_pages_nodemask+0x26a/0x290
[ ]  new_slab+0x297/0x500
[ ]  ___slab_alloc+0x335/0x4a0
[ ]  ? __rb_allocate_pages+0xae/0x180
[ ]  ? __rb_allocate_pages+0xae/0x180
[ ]  __slab_alloc+0x40/0x66
[ ]  __kmalloc_node+0xbd/0x270
[ ]  __rb_allocate_pages+0xae/0x180
[ ]  rb_allocate_cpu_buffer+0x204/0x2f0
[ ]  trace_rb_cpu_prepare+0x7e/0xc5
[ ]  cpuhp_invoke_callback+0x3ea/0x5c0
[ ]  ? init_idle+0x1a7/0x1c0
[ ]  ? ring_buffer_record_is_on+0x20/0x20
[ ]  _cpu_up+0xbc/0x190
[ ]  do_cpu_up+0x87/0xb0
[ ]  cpu_up+0x13/0x20
[ ]  smp_init+0x69/0xca
[ ]  kernel_init_freeable+0x115/0x244
[ ]  ? rest_init+0xb0/0xb0
[ ]  kernel_init+0xe/0x109
[ ]  ret_from_fork+0x25/0x30
[ ] Mem-Info:
[ ] active_anon:0 inactive_anon:0 isolated_anon:0
[ ]  active_file:0 inactive_file:0 isolated_file:0
[ ]  unevictable:0 dirty:0 writeback:0 unstable:0
[ ]  slab_reclaimable:1260 slab_unreclaimable:489185
[ ]  mapped:0 shmem:0 pagetables:0 bounce:0
[ ]  free:46 free_pcp:1421 free_cma:0
.
[ ] failed to allocate ring buffer on CPU 155

In my server, there are 384 CPUs, 512 GB memory and 8 nodes. And
"trace_buf_size=100M" is set.

When using trace_buf_size=100M, kernel allocates 100 MB memory
per CPU before calling free_area_init_core(). Kernel tries to
allocate 38.4GB (100 MB * 384 CPU) memory. But available memory
at this time is about 16GB (2 GB * 8 nodes) due to the following commit:

  3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages
                 if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")

So allocation failure occurs.
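
The figures above can be checked with a small calculation. The sketch below is illustrative only: the helper names are hypothetical (none of them exist in the kernel), and it uses decimal units (1 GB = 1000 MB) to match the 38.4GB figure in the report.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; none of these exist in the kernel. */

/* Total ring-buffer demand in MB when every CPU gets buf_mb. */
static uint64_t trace_demand_mb(uint64_t buf_mb, uint64_t nr_cpus)
{
	return buf_mb * nr_cpus;
}

/* Memory usable before deferred struct page init completes, in MB. */
static uint64_t early_budget_mb(uint64_t per_node_mb, uint64_t nr_nodes)
{
	return per_node_mb * nr_nodes;
}

/* Upper bound on how many per-CPU buffers fit into the early budget. */
static uint64_t cpus_that_fit(uint64_t budget_mb, uint64_t buf_mb)
{
	assert(buf_mb > 0);
	return budget_mb / buf_mb;
}
```

With the reported values (100 MB per CPU, 384 CPUs, 2 GB on each of 8 nodes), the demand is 38400 MB against a budget of 16000 MB, so at most 160 buffers could fit. The log shows the failure on CPU 155, which seems consistent with that bound once other early boot allocations are taken into account.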

Thanks,
Yasuaki Ishimatsu

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Allocation failure of ring buffer for trace
  2017-11-13 17:48 ` YASUAKI ISHIMATSU
  (?)
@ 2017-11-13 23:53 ` Yang Shi
  -1 siblings, 0 replies; 15+ messages in thread
From: Yang Shi @ 2017-11-13 23:53 UTC (permalink / raw)
  To: YASUAKI ISHIMATSU
  Cc: Mel Gorman, rostedt, mingo, Linux Kernel Mailing List, linux-mm,
	koki.sanagi

AFAIK, CONFIG_DEFERRED_STRUCT_PAGE_INIT initializes only a small number
of page structs at first, then defers initialization of the remaining ones to
kernel threads (one thread per node, called pgdatinit0/1/2/3). So, if your
trace buffer allocation happens *before* those kernel threads finish the page
struct initialization, you may run into this case.
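
As a rough illustration of this split, the sketch below computes how many struct pages a node initialises early and how many are left to its deferred-init thread. It is not kernel code: the names are hypothetical, 4 KiB pages are assumed, and the 2 GB early limit per node is taken from the original report.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES   4096ULL        /* assumption: 4 KiB pages */
#define EARLY_LIMIT_BYTES (2ULL << 30)   /* 2 GiB per node, per the report */

struct node_split {
	uint64_t early_pages;     /* initialised serially during early boot */
	uint64_t deferred_pages;  /* left for the per-node kernel thread */
};

/* Hypothetical helper: split one node's memory into early and deferred pages. */
static struct node_split split_node(uint64_t node_bytes)
{
	struct node_split s;
	uint64_t total_pages = node_bytes / PAGE_SIZE_BYTES;
	uint64_t limit_pages = EARLY_LIMIT_BYTES / PAGE_SIZE_BYTES;

	/* The toy model only handles page-aligned node sizes. */
	assert(node_bytes % PAGE_SIZE_BYTES == 0);

	s.early_pages = total_pages < limit_pages ? total_pages : limit_pages;
	s.deferred_pages = total_pages - s.early_pages;
	return s;
}
```

For a 64 GiB node (roughly 512 GB spread evenly over 8 nodes, which is an assumption), only 524288 of 16777216 pages are usable until the deferred thread has run.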

Yang

On Mon, Nov 13, 2017 at 9:48 AM, YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
wrote:

> When using trace_buf_size= boot option, memory allocation of ring buffer
> for trace fails as follows:
>
> [ ] x86: Booting SMP configuration:
> [ ] .... node  #0, CPUs:          #1   #2   #3   #4   #5   #6   #7   #8   #9  #10  #11  #12  #13  #14  #15  #16  #17  #18  #19  #20  #21  #22  #23
> [ ] .... node  #1, CPUs:    #24  #25  #26  #27  #28  #29  #30  #31  #32  #33  #34  #35  #36  #37  #38  #39  #40  #41  #42  #43  #44  #45  #46  #47
> [ ] .... node  #2, CPUs:    #48  #49  #50  #51  #52  #53  #54  #55  #56  #57  #58  #59  #60  #61  #62  #63  #64  #65  #66  #67  #68  #69  #70  #71
> [ ] .... node  #3, CPUs:    #72  #73  #74  #75  #76  #77  #78  #79  #80  #81  #82  #83  #84  #85  #86  #87  #88  #89  #90  #91  #92  #93  #94  #95
> [ ] .... node  #4, CPUs:    #96  #97  #98  #99 #100 #101 #102 #103 #104 #105 #106 #107 #108 #109 #110 #111 #112 #113 #114 #115 #116 #117 #118 #119
> [ ] .... node  #5, CPUs:   #120 #121 #122 #123 #124 #125 #126 #127 #128 #129 #130 #131 #132 #133 #134 #135 #136 #137 #138 #139 #140 #141 #142 #143
> [ ] .... node  #6, CPUs:   #144 #145 #146 #147 #148 #149 #150 #151 #152 #153 #154
> [ ] swapper/0: page allocation failure: order:0, mode:0x16004c0(GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOTRACK), nodemask=(null)
> [ ] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.0-rc8+ #13
> [ ] Hardware name: ...
> [ ] Call Trace:
> [ ]  dump_stack+0x63/0x89
> [ ]  warn_alloc+0x114/0x1c0
> [ ]  ? _find_next_bit+0x60/0x60
> [ ]  __alloc_pages_slowpath+0x9a6/0xba7
> [ ]  __alloc_pages_nodemask+0x26a/0x290
> [ ]  new_slab+0x297/0x500
> [ ]  ___slab_alloc+0x335/0x4a0
> [ ]  ? __rb_allocate_pages+0xae/0x180
> [ ]  ? __rb_allocate_pages+0xae/0x180
> [ ]  __slab_alloc+0x40/0x66
> [ ]  __kmalloc_node+0xbd/0x270
> [ ]  __rb_allocate_pages+0xae/0x180
> [ ]  rb_allocate_cpu_buffer+0x204/0x2f0
> [ ]  trace_rb_cpu_prepare+0x7e/0xc5
> [ ]  cpuhp_invoke_callback+0x3ea/0x5c0
> [ ]  ? init_idle+0x1a7/0x1c0
> [ ]  ? ring_buffer_record_is_on+0x20/0x20
> [ ]  _cpu_up+0xbc/0x190
> [ ]  do_cpu_up+0x87/0xb0
> [ ]  cpu_up+0x13/0x20
> [ ]  smp_init+0x69/0xca
> [ ]  kernel_init_freeable+0x115/0x244
> [ ]  ? rest_init+0xb0/0xb0
> [ ]  kernel_init+0xe/0x109
> [ ]  ret_from_fork+0x25/0x30
> [ ] Mem-Info:
> [ ] active_anon:0 inactive_anon:0 isolated_anon:0
> [ ]  active_file:0 inactive_file:0 isolated_file:0
> [ ]  unevictable:0 dirty:0 writeback:0 unstable:0
> [ ]  slab_reclaimable:1260 slab_unreclaimable:489185
> [ ]  mapped:0 shmem:0 pagetables:0 bounce:0
> [ ]  free:46 free_pcp:1421 free_cma:0
> .
> [ ] failed to allocate ring buffer on CPU 155
>
> In my server, there are 384 CPUs, 512 GB memory and 8 nodes. And
> "trace_buf_size=100M" is set.
>
> When using trace_buf_size=100M, kernel allocates 100 MB memory
> per CPU before calling free_area_init_core(). Kernel tries to
> allocate 38.4GB (100 MB * 384 CPU) memory. But available memory
> at this time is about 16GB (2 GB * 8 nodes) due to the following commit:
>
>   3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages
>                  if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
>
> So allocation failure occurs.
>
> Thanks,
> Yasuaki Ishimatsu
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Allocation failure of ring buffer for trace
  2017-11-13 17:48 ` YASUAKI ISHIMATSU
@ 2017-11-14 11:46   ` Mel Gorman
  -1 siblings, 0 replies; 15+ messages in thread
From: Mel Gorman @ 2017-11-14 11:46 UTC (permalink / raw)
  To: YASUAKI ISHIMATSU; +Cc: rostedt, mingo, linux-kernel, linux-mm, koki.sanagi

On Mon, Nov 13, 2017 at 12:48:36PM -0500, YASUAKI ISHIMATSU wrote:
> When using trace_buf_size= boot option, memory allocation of ring buffer
> for trace fails as follows:
> 
> [ ] x86: Booting SMP configuration:
> <SNIP>
> 
> In my server, there are 384 CPUs, 512 GB memory and 8 nodes. And
> "trace_buf_size=100M" is set.
> 
> When using trace_buf_size=100M, kernel allocates 100 MB memory
> per CPU before calling free_area_init_core(). Kernel tries to
> allocate 38.4GB (100 MB * 384 CPU) memory. But available memory
> at this time is about 16GB (2 GB * 8 nodes) due to the following commit:
> 
>   3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages
>                  if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
> 

1. What is the use case for such a large trace buffer being allocated at
   boot time?
2. Is disabling CONFIG_DEFERRED_STRUCT_PAGE_INIT at compile time an
   option for you given that it's a custom-built kernel and not a
   distribution kernel?

Basically, as the allocation context is within smp_init(), there are no
opportunities to do the deferred meminit early. Furthermore, the partial
initialisation of memory occurs before the size of the trace buffers is
set so there is no opportunity to adjust the amount of memory that is
pre-initialised. We could potentially catch when memory is low during
system boot and adjust the amount that is initialised serially but the
complexity would be high. Given that deferred meminit is basically a minor
optimisation that only affects very large machines and trace_buf_size being
used is somewhat specialised, I think the most straight-forward option is
to go back to serialised meminit if trace_buf_size is specified like this;

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 710143741eb5..6ef0ab13f774 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -558,6 +558,19 @@ void drain_local_pages(struct zone *zone);
 
 void page_alloc_init_late(void);
 
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+extern void __init disable_deferred_meminit(void);
+extern void page_alloc_init_late_prepare(void);
+#else
+static inline void disable_deferred_meminit(void)
+{
+}
+
+static inline void page_alloc_init_late_prepare(void)
+{
+}
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+
 /*
  * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
  * GFP flags are used before interrupts are enabled. Once interrupts are
diff --git a/init/main.c b/init/main.c
index 0ee9c6866ada..0248b8b5bc3a 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1058,6 +1058,8 @@ static noinline void __init kernel_init_freeable(void)
 	do_pre_smp_initcalls();
 	lockup_detector_init();
 
+	page_alloc_init_late_prepare();
+
 	smp_init();
 	sched_init_smp();
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 752e5daf0896..cfa7175ff093 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1115,6 +1115,13 @@ static int __init set_buf_size(char *str)
 	if (buf_size == 0)
 		return 0;
 	trace_buf_size = buf_size;
+
+	/*
+	 * The size of buffers are unpredictable so initialise all memory
+	 * before the allocation attempt occurs.
+	 */
+	disable_deferred_meminit();
+
 	return 1;
 }
 __setup("trace_buf_size=", set_buf_size);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77e4d3c5c57b..4dd0e153b0f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -290,6 +290,19 @@ EXPORT_SYMBOL(nr_online_nodes);
 int page_group_by_mobility_disabled __read_mostly;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+bool __initdata deferred_meminit_disabled;
+
+/*
+ * Allow deferred meminit to be disabled by subsystems that require large
+ * allocations before the memory allocator is fully initialised. It should
+ * only be used in cases where the size of the allocation may not fit into
+ * the 2G per node that is allocated serially.
+ */
+void __init disable_deferred_meminit(void)
+{
+	deferred_meminit_disabled = true;
+}
+
 static inline void reset_deferred_meminit(pg_data_t *pgdat)
 {
 	unsigned long max_initialise;
@@ -1567,6 +1580,23 @@ static int __init deferred_init_memmap(void *data)
 }
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+/*
+ * Serialised init of remaining memory if large buffers of unknown size
+ * are required that might fail before parallelised meminit can start
+ */
+void __init page_alloc_init_late_prepare(void)
+{
+	int nid;
+
+	if (!deferred_meminit_disabled)
+		return;
+
+	for_each_node_state(nid, N_MEMORY)
+		deferred_init_memmap(NODE_DATA(nid));
+}
+#endif
+
 void __init page_alloc_init_late(void)
 {
 	struct zone *zone;
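
The control flow of the patch above can be modelled in userspace as follows. This is only a sketch of the ordering, not the kernel implementation: the `model_` functions are hypothetical stand-ins, the node count of 8 is taken from the original report, and the deferred initialisation is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_NODES 8	/* assumption: 8 nodes, as in the report */

static bool deferred_meminit_disabled;
static int nodes_initialised_serially;

/* Stand-in for disable_deferred_meminit(): called while parsing trace_buf_size=. */
static void model_disable_deferred_meminit(void)
{
	deferred_meminit_disabled = true;
}

/* Stand-in for page_alloc_init_late_prepare(): runs just before smp_init(). */
static void model_page_alloc_init_late_prepare(void)
{
	int nid;

	if (!deferred_meminit_disabled)
		return;

	/* Initialise every node's remaining memory, one node at a time. */
	for (nid = 0; nid < MODEL_NR_NODES; nid++)
		nodes_initialised_serially++;
}

/* Model of one boot: returns how many nodes were initialised before smp_init(). */
static int model_boot(bool trace_buf_size_given)
{
	deferred_meminit_disabled = false;
	nodes_initialised_serially = 0;

	if (trace_buf_size_given)
		model_disable_deferred_meminit();

	model_page_alloc_init_late_prepare();

	assert(nodes_initialised_serially <= MODEL_NR_NODES);
	return nodes_initialised_serially;
}
```

Without the boot option the prepare step is a no-op and the parallel deferred init is left untouched. With it, all nodes are initialised before the secondary CPUs, and therefore their ring buffers, are brought up.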

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Allocation failure of ring buffer for trace
  2017-11-14 11:46   ` Mel Gorman
@ 2017-11-14 15:39     ` YASUAKI ISHIMATSU
  -1 siblings, 0 replies; 15+ messages in thread
From: YASUAKI ISHIMATSU @ 2017-11-14 15:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: rostedt, mingo, linux-kernel, linux-mm, koki.sanagi, yasu.isimatu



On 11/14/2017 06:46 AM, Mel Gorman wrote:
> On Mon, Nov 13, 2017 at 12:48:36PM -0500, YASUAKI ISHIMATSU wrote:
>> When using trace_buf_size= boot option, memory allocation of ring buffer
>> for trace fails as follows:
>>
>> [ ] x86: Booting SMP configuration:
>> <SNIP>
>>
>> In my server, there are 384 CPUs, 512 GB memory and 8 nodes. And
>> "trace_buf_size=100M" is set.
>>
>> When using trace_buf_size=100M, kernel allocates 100 MB memory
>> per CPU before calling free_area_init_core(). Kernel tries to
>> allocate 38.4GB (100 MB * 384 CPU) memory. But available memory
>> at this time is about 16GB (2 GB * 8 nodes) due to the following commit:
>>
>>   3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages
>>                  if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
>>
> 
> 1. What is the use case for such a large trace buffer being allocated at
>    boot time?

I'm not sure of the use case. I found the following commit log:

  commit 864b9a393dcb5aed09b8fd31b9bbda0fdda99374
  Author: Michal Hocko <mhocko@suse.com>
  Date:   Fri Jun 2 14:46:49 2017 -0700

      mm: consider memblock reservations for deferred memory initialization sizing

So I thought similar memory exhaustion may occur with other boot options.
And I reproduced the issue.


> 2. Is disabling CONFIG_DEFERRED_STRUCT_PAGE_INIT at compile time an
>    option for you given that it's a custom-built kernel and not a
>    distribution kernel?

The issue also occurred on distribution kernels. So we have to fix the issue.

Thanks,
Yasuaki Ishimatsu

> 
> Basically, as the allocation context is within smp_init(), there are no
> opportunities to do the deferred meminit early. Furthermore, the partial
> initialisation of memory occurs before the size of the trace buffers is
> set so there is no opportunity to adjust the amount of memory that is
> pre-initialised. We could potentially catch when memory is low during
> system boot and adjust the amount that is initialised serially but the
> complexity would be high. Given that deferred meminit is basically a minor
> optimisation that only affects very large machines and trace_buf_size being
> used is somewhat specialised, I think the most straight-forward option is
> to go back to serialised meminit if trace_buf_size is specified like this;
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 710143741eb5..6ef0ab13f774 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -558,6 +558,19 @@ void drain_local_pages(struct zone *zone);
>  
>  void page_alloc_init_late(void);
>  
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +extern void __init disable_deferred_meminit(void);
> +extern void page_alloc_init_late_prepare(void);
> +#else
> +static inline void disable_deferred_meminit(void)
> +{
> +}
> +
> +static inline void page_alloc_init_late_prepare(void)
> +{
> +}
> +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
> +
>  /*
>   * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
>   * GFP flags are used before interrupts are enabled. Once interrupts are
> diff --git a/init/main.c b/init/main.c
> index 0ee9c6866ada..0248b8b5bc3a 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1058,6 +1058,8 @@ static noinline void __init kernel_init_freeable(void)
>  	do_pre_smp_initcalls();
>  	lockup_detector_init();
>  
> +	page_alloc_init_late_prepare();
> +
>  	smp_init();
>  	sched_init_smp();
>  
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 752e5daf0896..cfa7175ff093 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -1115,6 +1115,13 @@ static int __init set_buf_size(char *str)
>  	if (buf_size == 0)
>  		return 0;
>  	trace_buf_size = buf_size;
> +
> +	/*
> +	 * The size of buffers are unpredictable so initialise all memory
> +	 * before the allocation attempt occurs.
> +	 */
> +	disable_deferred_meminit();
> +
>  	return 1;
>  }
>  __setup("trace_buf_size=", set_buf_size);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 77e4d3c5c57b..4dd0e153b0f2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -290,6 +290,19 @@ EXPORT_SYMBOL(nr_online_nodes);
>  int page_group_by_mobility_disabled __read_mostly;
>  
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +bool __initdata deferred_meminit_disabled;
> +
> +/*
> + * Allow deferred meminit to be disabled by subsystems that require large
> + * allocations before the memory allocator is fully initialised. It should
> + * only be used in cases where the size of the allocation may not fit into
> + * the 2G per node that is allocated serially.
> + */
> +void __init disable_deferred_meminit(void)
> +{
> +	deferred_meminit_disabled = true;
> +}
> +
>  static inline void reset_deferred_meminit(pg_data_t *pgdat)
>  {
>  	unsigned long max_initialise;
> @@ -1567,6 +1580,23 @@ static int __init deferred_init_memmap(void *data)
>  }
>  #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
>  
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +/*
> + * Serialised init of remaining memory if large buffers of unknown size
> + * are required that might fail before parallelised meminit can start
> + */
> +void __init page_alloc_init_late_prepare(void)
> +{
> +	int nid;
> +
> +	if (!deferred_meminit_disabled)
> +		return;
> +
> +	for_each_node_state(nid, N_MEMORY)
> +		deferred_init_memmap(NODE_DATA(nid));
> +}
> +#endif
> +
>  void __init page_alloc_init_late(void)
>  {
>  	struct zone *zone;
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Allocation failure of ring buffer for trace
  2017-11-14 15:39     ` YASUAKI ISHIMATSU
@ 2017-11-14 15:53       ` Mel Gorman
  -1 siblings, 0 replies; 15+ messages in thread
From: Mel Gorman @ 2017-11-14 15:53 UTC (permalink / raw)
  To: YASUAKI ISHIMATSU; +Cc: rostedt, mingo, linux-kernel, linux-mm, koki.sanagi

On Tue, Nov 14, 2017 at 10:39:19AM -0500, YASUAKI ISHIMATSU wrote:
> 
> 
> On 11/14/2017 06:46 AM, Mel Gorman wrote:
> > On Mon, Nov 13, 2017 at 12:48:36PM -0500, YASUAKI ISHIMATSU wrote:
> >> When using trace_buf_size= boot option, memory allocation of ring buffer
> >> for trace fails as follows:
> >>
> >> [ ] x86: Booting SMP configuration:
> >> <SNIP>
> >>
> >> In my server, there are 384 CPUs, 512 GB memory and 8 nodes. And
> >> "trace_buf_size=100M" is set.
> >>
> >> When using trace_buf_size=100M, kernel allocates 100 MB memory
> >> per CPU before calling free_are_init_core(). Kernel tries to
> >> allocates 38.4GB (100 MB * 384 CPU) memory. But available memory
> >> at this time is about 16GB (2 GB * 8 nodes) due to the following commit:
> >>
> >>   3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages
> >>                  if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
> >>
> > 
> > 1. What is the use case for such a large trace buffer being allocated at
> >    boot time?
> 
> I'm not sure the use case. I found the following commit log:
> 
>   commit 864b9a393dcb5aed09b8fd31b9bbda0fdda99374
>   Author: Michal Hocko <mhocko@suse.com>
>   Date:   Fri Jun 2 14:46:49 2017 -0700
> 
>       mm: consider memblock reservations for deferred memory initialization sizing
> 
> So I thought similar memory exhaustion may occurs on other boot option.
> And I reproduced the issue.
> 

That was different, it was a premature OOM caused by reservations that
were of a known size. It's not related to trace_buf_size in any fashion.
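
For context, the shortfall in the original report comes down to simple arithmetic. The following is a minimal userspace sketch, not kernel code: the helper names are hypothetical, and the figures (100 MB per CPU, 384 CPUs, 8 nodes, about 2 GB usable per node before deferred init completes) are the ones from the report quoted above, in decimal megabytes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Total ring buffer demand in MB: one buffer of per_cpu_mb for each CPU. */
static uint64_t trace_buf_demand_mb(uint64_t per_cpu_mb, uint64_t nr_cpus)
{
	return per_cpu_mb * nr_cpus;
}

/* Memory usable before deferred init has finished, in MB. */
static uint64_t early_available_mb(uint64_t per_node_mb, uint64_t nr_nodes)
{
	return per_node_mb * nr_nodes;
}

/* True if the request fits into the memory initialised so far. */
static bool trace_buf_fits(uint64_t demand_mb, uint64_t available_mb)
{
	return demand_mb <= available_mb;
}
```

With the reported values, trace_buf_demand_mb(100, 384) is 38400 MB against early_available_mb(2000, 8) = 16000 MB, so the request cannot fit. Unlike the memblock reservation case, the demand depends on a user-supplied size, so it is not known when the pre-initialised amount is chosen.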

> 
> > 2. Is disabling CONFIG_DEFERRED_STRUCT_PAGE_INIT at compile time an
> >    option for you given that it's a custom-built kernel and not a
> >    distribution kernel?
> 
> The issue also occurred on distribution kernels. So we have to fix the issue.
> 

I'm aware of now bugs against a distribution kernel. However, does the
patch work for you?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Allocation failure of ring buffer for trace
  2017-11-14 15:53       ` Mel Gorman
@ 2017-11-14 16:40         ` YASUAKI ISHIMATSU
  -1 siblings, 0 replies; 15+ messages in thread
From: YASUAKI ISHIMATSU @ 2017-11-14 16:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: rostedt, mingo, linux-kernel, linux-mm, koki.sanagi, yasu.isimatu



On 11/14/2017 10:53 AM, Mel Gorman wrote:
> On Tue, Nov 14, 2017 at 10:39:19AM -0500, YASUAKI ISHIMATSU wrote:
>>
>>
>> On 11/14/2017 06:46 AM, Mel Gorman wrote:
>>> On Mon, Nov 13, 2017 at 12:48:36PM -0500, YASUAKI ISHIMATSU wrote:
>>>> When using trace_buf_size= boot option, memory allocation of ring buffer
>>>> for trace fails as follows:
>>>>
>>>> [ ] x86: Booting SMP configuration:
>>>> <SNIP>
>>>>
>>>> In my server, there are 384 CPUs, 512 GB memory and 8 nodes. And
>>>> "trace_buf_size=100M" is set.
>>>>
>>>> When using trace_buf_size=100M, kernel allocates 100 MB memory
>>>> per CPU before calling free_are_init_core(). Kernel tries to
>>>> allocates 38.4GB (100 MB * 384 CPU) memory. But available memory
>>>> at this time is about 16GB (2 GB * 8 nodes) due to the following commit:
>>>>
>>>>   3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages
>>>>                  if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
>>>>
>>>
>>> 1. What is the use case for such a large trace buffer being allocated at
>>>    boot time?
>>
>> I'm not sure the use case. I found the following commit log:
>>
>>   commit 864b9a393dcb5aed09b8fd31b9bbda0fdda99374
>>   Author: Michal Hocko <mhocko@suse.com>
>>   Date:   Fri Jun 2 14:46:49 2017 -0700
>>
>>       mm: consider memblock reservations for deferred memory initialization sizing
>>
>> So I thought similar memory exhaustion may occurs on other boot option.
>> And I reproduced the issue.
>>
> 
> That was different, it was a premature OOM caused by reservations that
> were of a known size. It's not related to trace_buf_size in any fashion.

Yes, I know they are different bugs. I thought that memory exhaustion at boot
time might also be triggered by other boot options, so I tried the
trace_buf_size boot option.

> 
>>
>>> 2. Is disabling CONFIG_DEFERRED_STRUCT_PAGE_INIT at compile time an
>>>    option for you given that it's a custom-built kernel and not a
>>>    distribution kernel?
>>
>> The issue also occurred on distribution kernels. So we have to fix the issue.
>>
> 
> I'm aware of now bugs against a distribution kernel. However, does the
> patch work for you?
> 

I'll apply it.

Thanks,
Yasuaki Ishimatsu

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Allocation failure of ring buffer for trace
  2017-11-14 15:53       ` Mel Gorman
@ 2017-11-14 17:05         ` Mel Gorman
  -1 siblings, 0 replies; 15+ messages in thread
From: Mel Gorman @ 2017-11-14 17:05 UTC (permalink / raw)
  To: YASUAKI ISHIMATSU; +Cc: rostedt, mingo, linux-kernel, linux-mm, koki.sanagi

On Tue, Nov 14, 2017 at 03:53:27PM +0000, Mel Gorman wrote:
> > The issue also occurred on distribution kernels. So we have to fix the issue.
> > 
> 
> I'm aware of now bugs against a distribution kernel.

I don't know what happened there. I'm *not* aware of any bugs against a
distribution kernel.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Allocation failure of ring buffer for trace
  2017-11-14 11:46   ` Mel Gorman
@ 2017-11-15  4:11     ` YASUAKI ISHIMATSU
  -1 siblings, 0 replies; 15+ messages in thread
From: YASUAKI ISHIMATSU @ 2017-11-15  4:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: rostedt, mingo, linux-kernel, linux-mm, koki.sanagi, yasu.isimatu

Hi Mel,

Your patch works well.

Here are the results with your patch applied.

- boot up without trace_buf_size boot option

When the system boots without the trace_buf_size boot option,
deferred_init_memmap() runs after the SMP configuration is booted. There is
no change in the boot sequence between 4.14.0 and 4.14.0 with your patch.

[    0.256285] x86: Booting SMP configuration:
...
[    5.313195] node 0 initialised, 15530251 pages in 653ms
[    5.330691] node 1 initialised, 15988494 pages in 670ms
[    5.331746] node 2 initialised, 15988493 pages in 671ms
[    5.332166] node 6 initialised, 15982779 pages in 670ms
[    5.332673] node 3 initialised, 15988494 pages in 671ms
[    5.332618] node 4 initialised, 15988494 pages in 672ms
[    5.334187] node 7 initialised, 15987304 pages in 672ms
[    5.334976] node 5 initialised, 15988494 pages in 673ms

- boot up with trace_buf_size boot option

When the system boots with the trace_buf_size boot option,
deferred_init_memmap() runs before the SMP configuration is booted, so all
memory on every node is initialised before the trace buffers are allocated.
The system now boots successfully even with the trace_buf_size boot option set.

[    0.932114] node 0 initialised, 15530251 pages in 684ms
[    1.604918] node 1 initialised, 15988494 pages in 671ms
[    2.278933] node 2 initialised, 15988494 pages in 673ms
[    2.965076] node 3 initialised, 15988494 pages in 686ms
[    3.669064] node 4 initialised, 15988494 pages in 703ms
[    4.354983] node 5 initialised, 15988493 pages in 684ms
[    5.028681] node 6 initialised, 15982779 pages in 673ms
[    5.716102] node 7 initialised, 15987304 pages in 687ms
[    5.727855] smp: Bringing up secondary CPUs ...
[    5.745937] x86: Booting SMP configuration:
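
The cost of the serialised path can be estimated from the per-node times in the two logs above. This is a rough userspace sketch with hypothetical helper names; it assumes the serial case adds up the per-node times, and that in the deferred case all nodes are initialised concurrently so the wall time is bounded by the slowest node.

```c
#include <stddef.h>

/* Serial init: nodes are initialised one after another, so times add up. */
static unsigned int serial_init_ms(const unsigned int *node_ms, size_t n)
{
	unsigned int total = 0;

	for (size_t i = 0; i < n; i++)
		total += node_ms[i];
	return total;
}

/*
 * Parallel init (assumption: every node starts at the same time), so the
 * wall time is the time taken by the slowest node.
 */
static unsigned int parallel_init_ms(const unsigned int *node_ms, size_t n)
{
	unsigned int max = 0;

	for (size_t i = 0; i < n; i++)
		if (node_ms[i] > max)
			max = node_ms[i];
	return max;
}
```

Feeding in the logged values gives about 5461 ms for the serial case and about 673 ms for the deferred case, which matches the timestamps: with trace_buf_size set, SMP bring-up starts at roughly 5.7 s instead of 0.26 s.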

Thanks,
Yasuaki Ishimatsu

On 11/14/2017 06:46 AM, Mel Gorman wrote:
> On Mon, Nov 13, 2017 at 12:48:36PM -0500, YASUAKI ISHIMATSU wrote:
>> When using trace_buf_size= boot option, memory allocation of ring buffer
>> for trace fails as follows:
>>
>> [ ] x86: Booting SMP configuration:
>> <SNIP>
>>
>> In my server, there are 384 CPUs, 512 GB memory and 8 nodes. And
>> "trace_buf_size=100M" is set.
>>
>> When using trace_buf_size=100M, kernel allocates 100 MB memory
>> per CPU before calling free_are_init_core(). Kernel tries to
>> allocates 38.4GB (100 MB * 384 CPU) memory. But available memory
>> at this time is about 16GB (2 GB * 8 nodes) due to the following commit:
>>
>>   3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages
>>                  if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
>>
> 
> 1. What is the use case for such a large trace buffer being allocated at
>    boot time?
> 2. Is disabling CONFIG_DEFERRED_STRUCT_PAGE_INIT at compile time an
>    option for you given that it's a custom-built kernel and not a
>    distribution kernel?
> 
> Basically, as the allocation context is within smp_init(), there are no
> opportunities to do the deferred meminit early. Furthermore, the partial
> initialisation of memory occurs before the size of the trace buffers is
> set so there is no opportunity to adjust the amount of memory that is
> pre-initialised. We could potentially catch when memory is low during
> system boot and adjust the amount that is initialised serially but the
> complexity would be high. Given that deferred meminit is basically a minor
> optimisation that only affects very large machines and trace_buf_size being
> used is somewhat specialised, I think the most straight-forward option is
> to go back to serialised meminit if trace_buf_size is specified like this;
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 710143741eb5..6ef0ab13f774 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -558,6 +558,19 @@ void drain_local_pages(struct zone *zone);
>  
>  void page_alloc_init_late(void);
>  
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +extern void __init disable_deferred_meminit(void);
> +extern void page_alloc_init_late_prepare(void);
> +#else
> +static inline void disable_deferred_meminit(void)
> +{
> +}
> +
> +static inline void page_alloc_init_late_prepare(void)
> +{
> +}
> +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
> +
>  /*
>   * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
>   * GFP flags are used before interrupts are enabled. Once interrupts are
> diff --git a/init/main.c b/init/main.c
> index 0ee9c6866ada..0248b8b5bc3a 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1058,6 +1058,8 @@ static noinline void __init kernel_init_freeable(void)
>  	do_pre_smp_initcalls();
>  	lockup_detector_init();
>  
> +	page_alloc_init_late_prepare();
> +
>  	smp_init();
>  	sched_init_smp();
>  
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 752e5daf0896..cfa7175ff093 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -1115,6 +1115,13 @@ static int __init set_buf_size(char *str)
>  	if (buf_size == 0)
>  		return 0;
>  	trace_buf_size = buf_size;
> +
> +	/*
> +	 * The size of buffers are unpredictable so initialise all memory
> +	 * before the allocation attempt occurs.
> +	 */
> +	disable_deferred_meminit();
> +
>  	return 1;
>  }
>  __setup("trace_buf_size=", set_buf_size);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 77e4d3c5c57b..4dd0e153b0f2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -290,6 +290,19 @@ EXPORT_SYMBOL(nr_online_nodes);
>  int page_group_by_mobility_disabled __read_mostly;
>  
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +bool __initdata deferred_meminit_disabled;
> +
> +/*
> + * Allow deferred meminit to be disabled by subsystems that require large
> + * allocations before the memory allocator is fully initialised. It should
> + * only be used in cases where the size of the allocation may not fit into
> + * the 2G per node that is allocated serially.
> + */
> +void __init disable_deferred_meminit(void)
> +{
> +	deferred_meminit_disabled = true;
> +}
> +
>  static inline void reset_deferred_meminit(pg_data_t *pgdat)
>  {
>  	unsigned long max_initialise;
> @@ -1567,6 +1580,23 @@ static int __init deferred_init_memmap(void *data)
>  }
>  #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
>  
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +/*
> + * Serialised init of remaining memory if large buffers of unknown size
> + * are required that might fail before parallelised meminit can start
> + */
> +void __init page_alloc_init_late_prepare(void)
> +{
> +	int nid;
> +
> +	if (!deferred_meminit_disabled)
> +		return;
> +
> +	for_each_node_state(nid, N_MEMORY)
> +		deferred_init_memmap(NODE_DATA(nid));
> +}
> +#endif
> +
>  void __init page_alloc_init_late(void)
>  {
>  	struct zone *zone;
> 
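
The control flow of the quoted patch can be modelled in userspace as follows. This is only an illustrative sketch under stated assumptions: every model_* name and MODEL_NR_NODES are hypothetical, the parameter value is not parsed (the real set_buf_size() returns early when the size is 0), and marking a node ready stands in for the call to deferred_init_memmap().

```c
#include <stdbool.h>
#include <string.h>

#define MODEL_NR_NODES 8	/* hypothetical node count for the model */

static bool model_deferred_disabled;		/* mirrors deferred_meminit_disabled */
static bool model_node_ready[MODEL_NR_NODES];	/* true once a node is fully initialised */

/* Models set_buf_size(): seeing trace_buf_size= requests full serial init. */
static void model_parse_param(const char *param)
{
	static const char key[] = "trace_buf_size=";

	if (strncmp(param, key, strlen(key)) == 0)
		model_deferred_disabled = true;
}

/*
 * Models page_alloc_init_late_prepare(): a no-op unless deferred init was
 * disabled, otherwise initialises every node in turn before SMP bring-up.
 * Returns the number of nodes initialised.
 */
static int model_init_late_prepare(void)
{
	int nid, done = 0;

	if (!model_deferred_disabled)
		return 0;

	for (nid = 0; nid < MODEL_NR_NODES; nid++) {
		model_node_ready[nid] = true;
		done++;
	}
	return done;
}
```

The point of the design is that the flag is set while the command line is parsed, well before kernel_init_freeable() reaches smp_init(), so boots without trace_buf_size= keep the parallel deferred path unchanged.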

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-11-15  4:12 UTC | newest]

Thread overview: 15+ messages
2017-11-13 17:48 Allocation failure of ring buffer for trace YASUAKI ISHIMATSU
2017-11-13 17:48 ` YASUAKI ISHIMATSU
2017-11-13 23:53 ` Yang Shi
2017-11-14 11:46 ` Mel Gorman
2017-11-14 11:46   ` Mel Gorman
2017-11-14 15:39   ` YASUAKI ISHIMATSU
2017-11-14 15:39     ` YASUAKI ISHIMATSU
2017-11-14 15:53     ` Mel Gorman
2017-11-14 15:53       ` Mel Gorman
2017-11-14 16:40       ` YASUAKI ISHIMATSU
2017-11-14 16:40         ` YASUAKI ISHIMATSU
2017-11-14 17:05       ` Mel Gorman
2017-11-14 17:05         ` Mel Gorman
2017-11-15  4:11   ` YASUAKI ISHIMATSU
2017-11-15  4:11     ` YASUAKI ISHIMATSU
