linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
@ 2017-06-30 14:18 Michal Hocko
  2017-06-30 15:42 ` Michal Hocko
  2017-07-03 11:48 ` Vlastimil Babka
  0 siblings, 2 replies; 15+ messages in thread
From: Michal Hocko @ 2017-06-30 14:18 UTC (permalink / raw)
  To: Yang Shi
  Cc: Joonsoo Kim, Mel Gorman, Andrew Morton, Vlastimil Babka, linux-mm, LKML

fe53ca54270a ("mm: use early_pfn_to_nid in page_ext_init") seem
to silently depend on CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID resp.
CONFIG_HAVE_MEMBLOCK_NODE_MAP. early_pfn_to_nid is returning zero with
!defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
I am not sure how widely is this used but such a code is tricky. I see
how catching early allocations during defered initialization might be
useful but a subtly broken code sounds like a problem to me.  So is
fe53ca54270a worth this or we should revert it?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
  2017-06-30 14:18 "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations? Michal Hocko
@ 2017-06-30 15:42 ` Michal Hocko
  2017-06-30 15:44   ` Michal Hocko
  2017-07-03 11:48 ` Vlastimil Babka
  1 sibling, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2017-06-30 15:42 UTC (permalink / raw)
  To: Yang Shi
  Cc: Joonsoo Kim, Mel Gorman, Andrew Morton, Vlastimil Babka, linux-mm, LKML

On Fri 30-06-17 16:18:47, Michal Hocko wrote:
> fe53ca54270a ("mm: use early_pfn_to_nid in page_ext_init") seem
> to silently depend on CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID resp.
> CONFIG_HAVE_MEMBLOCK_NODE_MAP. early_pfn_to_nid is returning zero with
> !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
> I am not sure how widely is this used but such a code is tricky. I see
> how catching early allocations during defered initialization might be
> useful but a subtly broken code sounds like a problem to me.  So is
> fe53ca54270a worth this or we should revert it?

I've dug little bit further. It seems that only s390 and ia64 select
HAVE_ARCH_EARLY_PFN_TO_NID. Much more architectures enabled
HAVE_MEMBLOCK_NODE_MAP though but still
alpha, arc, arm, avr32, blackfin, c6x, cris, frv, h8300, hexagon,
Kconfig, m32r, m68k, mn10300, nios2, openrisc, parisc, tile, um,
unicore32, xtensa do not. I can only see alpha having NUMA and even that
is marked BROKEN. So it seems that this is not a real problem after all.

Still subtle, so I guess we want to have the following. What do you
think?

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 16532fa0bb64..894697c1e6f5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1055,6 +1055,7 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 	!defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
 static inline unsigned long early_pfn_to_nid(unsigned long pfn)
 {
+	BUILD_BUG_ON(!IS_ENABLED(CONFIG_NUMA));
 	return 0;
 }
 #endif
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
  2017-06-30 15:42 ` Michal Hocko
@ 2017-06-30 15:44   ` Michal Hocko
  2017-07-04  5:11     ` Joonsoo Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2017-06-30 15:44 UTC (permalink / raw)
  To: Yang Shi
  Cc: Joonsoo Kim, Mel Gorman, Andrew Morton, Vlastimil Babka, linux-mm, LKML

On Fri 30-06-17 17:42:24, Michal Hocko wrote:
[...]
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 16532fa0bb64..894697c1e6f5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1055,6 +1055,7 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
>  	!defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
>  static inline unsigned long early_pfn_to_nid(unsigned long pfn)
>  {
> +	BUILD_BUG_ON(!IS_ENABLED(CONFIG_NUMA));

Err, this should read BUILD_BUG_ON(IS_ENABLED(CONFIG_NUMA)) of course

>  	return 0;
>  }
>  #endif

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
  2017-06-30 14:18 "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations? Michal Hocko
  2017-06-30 15:42 ` Michal Hocko
@ 2017-07-03 11:48 ` Vlastimil Babka
  2017-07-03 14:18   ` Vlastimil Babka
  2017-07-04  5:17   ` Joonsoo Kim
  1 sibling, 2 replies; 15+ messages in thread
From: Vlastimil Babka @ 2017-07-03 11:48 UTC (permalink / raw)
  To: Michal Hocko, Yang Shi
  Cc: Joonsoo Kim, Mel Gorman, Andrew Morton, linux-mm, LKML

On 06/30/2017 04:18 PM, Michal Hocko wrote:
> fe53ca54270a ("mm: use early_pfn_to_nid in page_ext_init") seem
> to silently depend on CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID resp.
> CONFIG_HAVE_MEMBLOCK_NODE_MAP. early_pfn_to_nid is returning zero with
> !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
> I am not sure how widely is this used but such a code is tricky. I see
> how catching early allocations during defered initialization might be
> useful but a subtly broken code sounds like a problem to me.  So is
> fe53ca54270a worth this or we should revert it?

There might be more issues with fe53ca54270a, I think. This I've
observed on our 4.4-based kernel, which has deferred page struct init,
but doesn't have b8f1a75d61d8 ("mm: call page_ext_init() after all
struct pages are initialized") nor aforementioned fe53ca54270a:

[    0.000000] allocated 421003264 bytes of page_ext
[    0.000000] Node 0, zone      DMA: page owner found early allocated 0 pages
[    0.000000] Node 0, zone    DMA32: page owner found early allocated 33 pages
[    0.000000] Node 0, zone   Normal: page owner found early allocated 2842622 pages
[    0.000000] BUG: unable to handle kernel NULL pointer dereference at           (null)
[    0.000000] IP: [<ffffffff811f090a>] init_page_owner+0x12a/0x240
[    0.000000] PGD 0 
[    0.000000] Oops: 0000 [#1] SMP 
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.74+ #7
[    0.000000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
[    0.000000] task: ffffffff81e104c0 ti: ffffffff81e00000 task.ti: ffffffff81e00000
[    0.000000] RIP: 0010:[<ffffffff811f090a>]  [<ffffffff811f090a>] init_page_owner+0x12a/0x240
[    0.000000] RSP: 0000:ffffffff81e03ed0  EFLAGS: 00010046
[    0.000000] RAX: 0000000000000000 RBX: ffff88083ffe0210 RCX: ffffea0013000000
[    0.000000] RDX: 0000000000000300 RSI: ffffffff81f57437 RDI: 00000000004c0000
[    0.000000] RBP: ffffffff81e03f20 R08: ffffffff81e03e90 R09: 0000000000000000
[    0.000000] R10: 00000000004c0200 R11: 0000000000000000 R12: ffffea0000000000
[    0.000000] R13: 00000000004c0200 R14: 00000000004c0000 R15: 0000000000840000
[    0.000000] FS:  0000000000000000(0000) GS:ffff88042fc00000(0000) knlGS:0000000000000000
[    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.000000] CR2: 0000000000000000 CR3: 0000000001e0b000 CR4: 00000000000406b0
[    0.000000] Stack:
[    0.000000]  0000000000000206 ffff88083ffe0f90 ffff88083ffdf000 0000000000003181
[    0.000000]  ffffea0013000000 0000000000000040 ffffea0000000000 0000000000840000
[    0.000000]  0000000000840000 000000008e000010 ffffffff81e03f50 ffffffff81f84145
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff81f84145>] page_ext_init+0x15e/0x167
[    0.000000]  [<ffffffff81f57e6a>] start_kernel+0x351/0x418
[    0.000000]  [<ffffffff81f57120>] ? early_idt_handler_array+0x120/0x120
[    0.000000]  [<ffffffff81f57309>] x86_64_start_reservations+0x2a/0x2c
[    0.000000]  [<ffffffff81f57437>] x86_64_start_kernel+0x12c/0x13b
[    0.000000] Code: 81 e2 00 fe ff ff 4d 39 fa 4d 0f 47 d7 4d 39 f2 4d 89 d5 77 34 eb 5e 48 8b 01 f6 c4 04 75 21 48 89 cf 48 89 4d d0 e8 b6 35 00 00 <48> 8b 00 a8 04 75 0e 48 8b 4d d0 e9 c2 00 00 00 48 83 45 c8 01 
[    0.000000] RIP  [<ffffffff811f090a>] init_page_owner+0x12a/0x240
[    0.000000]  RSP <ffffffff81e03ed0>
[    0.000000] CR2: 0000000000000000
[    0.000000] ---[ end trace 19e05592f03a690f ]---

Note that this is different backtrace than in b8f1a75d61d8 log.

Still, backporting b8f1a75d61d8 fixes this:

[    1.538379] allocated 738197504 bytes of page_ext
[    1.539340] Node 0, zone      DMA: page owner found early allocated 0 pages
[    1.540179] Node 0, zone    DMA32: page owner found early allocated 33 pages
[    1.611173] Node 0, zone   Normal: page owner found early allocated 96755 pages
[    1.683167] Node 1, zone   Normal: page owner found early allocated 96575 pages

No panic, notice how it allocated more for page_ext, and found smaller number of
early allocated pages.

Now backporting fe53ca54270a on top:

[    0.000000] allocated 738197504 bytes of page_ext
[    0.000000] Node 0, zone      DMA: page owner found early allocated 0 pages
[    0.000000] Node 0, zone    DMA32: page owner found early allocated 33 pages
[    0.000000] Node 0, zone   Normal: page owner found early allocated 2842622 pages
[    0.000000] Node 1, zone   Normal: page owner found early allocated 3694362 pages

Again no panic, and same amount of page_ext usage. But the "early allocated" numbers
seem bogus to me. I think it's because init_pages_in_zone() is running and inspecting
struct pages that have not been yet initialized. It doesn't end up crashing, but
still doesn't seem correct?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
  2017-07-03 11:48 ` Vlastimil Babka
@ 2017-07-03 14:18   ` Vlastimil Babka
  2017-07-04  5:23     ` Joonsoo Kim
  2017-07-04  5:17   ` Joonsoo Kim
  1 sibling, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2017-07-03 14:18 UTC (permalink / raw)
  To: Michal Hocko, Yang Shi
  Cc: Joonsoo Kim, Mel Gorman, Andrew Morton, linux-mm, LKML

On 07/03/2017 01:48 PM, Vlastimil Babka wrote:
> On 06/30/2017 04:18 PM, Michal Hocko wrote:
>> fe53ca54270a ("mm: use early_pfn_to_nid in page_ext_init") seem
>> to silently depend on CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID resp.
>> CONFIG_HAVE_MEMBLOCK_NODE_MAP. early_pfn_to_nid is returning zero with
>> !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
>> I am not sure how widely is this used but such a code is tricky. I see
>> how catching early allocations during defered initialization might be
>> useful but a subtly broken code sounds like a problem to me.  So is
>> fe53ca54270a worth this or we should revert it?
> 
> There might be more issues with fe53ca54270a, I think. This I've
> observed on our 4.4-based kernel, which has deferred page struct init,
> but doesn't have b8f1a75d61d8 ("mm: call page_ext_init() after all
> struct pages are initialized") nor aforementioned fe53ca54270a:
> 
> [    0.000000] allocated 421003264 bytes of page_ext
> [    0.000000] Node 0, zone      DMA: page owner found early allocated 0 pages
> [    0.000000] Node 0, zone    DMA32: page owner found early allocated 33 pages
> [    0.000000] Node 0, zone   Normal: page owner found early allocated 2842622 pages
> [    0.000000] BUG: unable to handle kernel NULL pointer dereference at           (null)
> [    0.000000] IP: [<ffffffff811f090a>] init_page_owner+0x12a/0x240
> [    0.000000] PGD 0 
> [    0.000000] Oops: 0000 [#1] SMP 
> [    0.000000] Modules linked in:
> [    0.000000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.74+ #7
> [    0.000000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> [    0.000000] task: ffffffff81e104c0 ti: ffffffff81e00000 task.ti: ffffffff81e00000
> [    0.000000] RIP: 0010:[<ffffffff811f090a>]  [<ffffffff811f090a>] init_page_owner+0x12a/0x240
> [    0.000000] RSP: 0000:ffffffff81e03ed0  EFLAGS: 00010046
> [    0.000000] RAX: 0000000000000000 RBX: ffff88083ffe0210 RCX: ffffea0013000000
> [    0.000000] RDX: 0000000000000300 RSI: ffffffff81f57437 RDI: 00000000004c0000
> [    0.000000] RBP: ffffffff81e03f20 R08: ffffffff81e03e90 R09: 0000000000000000
> [    0.000000] R10: 00000000004c0200 R11: 0000000000000000 R12: ffffea0000000000
> [    0.000000] R13: 00000000004c0200 R14: 00000000004c0000 R15: 0000000000840000
> [    0.000000] FS:  0000000000000000(0000) GS:ffff88042fc00000(0000) knlGS:0000000000000000
> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.000000] CR2: 0000000000000000 CR3: 0000000001e0b000 CR4: 00000000000406b0
> [    0.000000] Stack:
> [    0.000000]  0000000000000206 ffff88083ffe0f90 ffff88083ffdf000 0000000000003181
> [    0.000000]  ffffea0013000000 0000000000000040 ffffea0000000000 0000000000840000
> [    0.000000]  0000000000840000 000000008e000010 ffffffff81e03f50 ffffffff81f84145
> [    0.000000] Call Trace:
> [    0.000000]  [<ffffffff81f84145>] page_ext_init+0x15e/0x167
> [    0.000000]  [<ffffffff81f57e6a>] start_kernel+0x351/0x418
> [    0.000000]  [<ffffffff81f57120>] ? early_idt_handler_array+0x120/0x120
> [    0.000000]  [<ffffffff81f57309>] x86_64_start_reservations+0x2a/0x2c
> [    0.000000]  [<ffffffff81f57437>] x86_64_start_kernel+0x12c/0x13b
> [    0.000000] Code: 81 e2 00 fe ff ff 4d 39 fa 4d 0f 47 d7 4d 39 f2 4d 89 d5 77 34 eb 5e 48 8b 01 f6 c4 04 75 21 48 89 cf 48 89 4d d0 e8 b6 35 00 00 <48> 8b 00 a8 04 75 0e 48 8b 4d d0 e9 c2 00 00 00 48 83 45 c8 01 
> [    0.000000] RIP  [<ffffffff811f090a>] init_page_owner+0x12a/0x240
> [    0.000000]  RSP <ffffffff81e03ed0>
> [    0.000000] CR2: 0000000000000000
> [    0.000000] ---[ end trace 19e05592f03a690f ]---
> 
> Note that this is different backtrace than in b8f1a75d61d8 log.
> 
> Still, backporting b8f1a75d61d8 fixes this:
> 
> [    1.538379] allocated 738197504 bytes of page_ext
> [    1.539340] Node 0, zone      DMA: page owner found early allocated 0 pages
> [    1.540179] Node 0, zone    DMA32: page owner found early allocated 33 pages
> [    1.611173] Node 0, zone   Normal: page owner found early allocated 96755 pages
> [    1.683167] Node 1, zone   Normal: page owner found early allocated 96575 pages
> 
> No panic, notice how it allocated more for page_ext, and found smaller number of
> early allocated pages.
> 
> Now backporting fe53ca54270a on top:
> 
> [    0.000000] allocated 738197504 bytes of page_ext
> [    0.000000] Node 0, zone      DMA: page owner found early allocated 0 pages
> [    0.000000] Node 0, zone    DMA32: page owner found early allocated 33 pages
> [    0.000000] Node 0, zone   Normal: page owner found early allocated 2842622 pages
> [    0.000000] Node 1, zone   Normal: page owner found early allocated 3694362 pages
> 
> Again no panic, and same amount of page_ext usage. But the "early allocated" numbers
> seem bogus to me. I think it's because init_pages_in_zone() is running and inspecting
> struct pages that have not been yet initialized. It doesn't end up crashing, but
> still doesn't seem correct?

Hmm, and for the record, this is recent mainline. Wonder why the "early
allocated" looks much more sane there. But there's a warning nevertheless.

[    0.001000] allocated 201326592 bytes of page_ext
[    0.001000] ------------[ cut here ]------------
[    0.001000] WARNING: CPU: 0 PID: 0 at mm/page_alloc.c:2467 drain_all_pages+0x1a9/0x1d0
[    0.001000] Modules linked in:
[    0.001000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.12.0-rc7+ #179
[    0.001000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
[    0.001000] task: ffffffff8200f4c0 task.stack: ffffffff82000000
[    0.001000] RIP: 0010:drain_all_pages+0x1a9/0x1d0
[    0.001000] RSP: 0000:ffffffff82003e18 EFLAGS: 00010246
[    0.001000] RAX: ffffffff8200f4c0 RBX: 0000000000000040 RCX: 0000000000000000
[    0.001000] RDX: ffffffff8200f4c0 RSI: 0000000000000000 RDI: 0000000000000000
[    0.001000] RBP: ffffffff82003e80 R08: 0000000000000040 R09: 0000000000000001
[    0.001000] R10: 0000000000000080 R11: ffffffff81086356 R12: 0000000000840000
[    0.001000] R13: 0000000000000040 R14: 0000000000840000 R15: 000000008e000010
[    0.001000] FS:  0000000000000000(0000) GS:ffff88042fc00000(0000) knlGS:0000000000000000
[    0.001000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.001000] CR2: 00000000ffffffff CR3: 000000000200a000 CR4: 00000000000406b0
[    0.001000] Call Trace:
[    0.001000]  ? init_page_owner+0x39/0x250
[    0.001000]  ? printk+0x3e/0x46
[    0.001000]  page_ext_init+0x195/0x19e
[    0.001000]  start_kernel+0x31a/0x3dc
[    0.001000]  ? early_idt_handler_array+0x120/0x120
[    0.001000]  x86_64_start_reservations+0x2a/0x2c
[    0.001000]  x86_64_start_kernel+0x12d/0x13c
[    0.001000]  secondary_startup_64+0x9f/0x9f
[    0.001000] Code: 52 82 48 63 d2 e8 a8 09 26 00 3b 05 d6 fb fa 00 89 c3 7c cd 48 c7 c7 c0 02 06 82 e8 b2 05 87 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> ff c3 4d 85 e4 74 ed 48 c7 c7 c0 02 06 82 e8 83 0b 87 00 e9 
[    0.001000] ---[ end trace 392dd6a55122ccf6 ]---
[    0.001000] Node 0, zone      DMA: page owner found early allocated 0 pages
[    0.001000] Node 0, zone    DMA32: page owner found early allocated 33 pages
[    0.001000] Node 0, zone   Normal: page owner found early allocated 24732 pages
[    0.001000] Node 1, zone   Normal: page owner found early allocated 24577 pages



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
  2017-06-30 15:44   ` Michal Hocko
@ 2017-07-04  5:11     ` Joonsoo Kim
  2017-07-04  6:51       ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Joonsoo Kim @ 2017-07-04  5:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yang Shi, Mel Gorman, Andrew Morton, Vlastimil Babka, linux-mm, LKML

On Fri, Jun 30, 2017 at 05:44:16PM +0200, Michal Hocko wrote:
> On Fri 30-06-17 17:42:24, Michal Hocko wrote:
> [...]
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 16532fa0bb64..894697c1e6f5 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1055,6 +1055,7 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
> >  	!defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
> >  static inline unsigned long early_pfn_to_nid(unsigned long pfn)
> >  {
> > +	BUILD_BUG_ON(!IS_ENABLED(CONFIG_NUMA));
> 
> Err, this should read BUILD_BUG_ON(IS_ENABLED(CONFIG_NUMA)) of course

Agreed.

However, AFAIK, ARM can set CONFIG_NUMA but it doesn't have
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID and CONFIG_HAVE_MEMBLOCK_NODE_MAP.

If page_ext uses early_pfn_to_nid(), it will cause build error in ARM.

Therefore, I suggest following change.
CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on proper early_pfn_to_nid().
So, following code will always work as long as
CONFIG_DEFERRED_STRUCT_PAGE_INIT works.

Thanks.

----------->8---------------
diff --git a/mm/page_ext.c b/mm/page_ext.c
index 88ccc044..e3db259 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -384,6 +384,7 @@ void __init page_ext_init(void)
 
        for_each_node_state(nid, N_MEMORY) {
                unsigned long start_pfn, end_pfn;
+               int page_nid;
 
                start_pfn = node_start_pfn(nid);
                end_pfn = node_end_pfn(nid);
@@ -405,8 +406,15 @@ void __init page_ext_init(void)
                         *
                         * Take into account DEFERRED_STRUCT_PAGE_INIT.
                         */
-                       if (early_pfn_to_nid(pfn) != nid)
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+                       page_nid = early_pfn_to_nid(pfn);
+#else
+                       page_nid = pfn_to_nid(pfn);
+#endif
+
+                       if (page_nid != nid)
                                continue;
+
                        if (init_section_page_ext(pfn, nid))
                                goto oom;
                }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
  2017-07-03 11:48 ` Vlastimil Babka
  2017-07-03 14:18   ` Vlastimil Babka
@ 2017-07-04  5:17   ` Joonsoo Kim
  2017-07-07 12:00     ` Vlastimil Babka
  1 sibling, 1 reply; 15+ messages in thread
From: Joonsoo Kim @ 2017-07-04  5:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Yang Shi, Mel Gorman, Andrew Morton, linux-mm, LKML

On Mon, Jul 03, 2017 at 01:48:05PM +0200, Vlastimil Babka wrote:
> On 06/30/2017 04:18 PM, Michal Hocko wrote:
> > fe53ca54270a ("mm: use early_pfn_to_nid in page_ext_init") seem
> > to silently depend on CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID resp.
> > CONFIG_HAVE_MEMBLOCK_NODE_MAP. early_pfn_to_nid is returning zero with
> > !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
> > I am not sure how widely is this used but such a code is tricky. I see
> > how catching early allocations during defered initialization might be
> > useful but a subtly broken code sounds like a problem to me.  So is
> > fe53ca54270a worth this or we should revert it?
> 
> There might be more issues with fe53ca54270a, I think. This I've
> observed on our 4.4-based kernel, which has deferred page struct init,
> but doesn't have b8f1a75d61d8 ("mm: call page_ext_init() after all
> struct pages are initialized") nor aforementioned fe53ca54270a:
> 
> [    0.000000] allocated 421003264 bytes of page_ext
> [    0.000000] Node 0, zone      DMA: page owner found early allocated 0 pages
> [    0.000000] Node 0, zone    DMA32: page owner found early allocated 33 pages
> [    0.000000] Node 0, zone   Normal: page owner found early allocated 2842622 pages
> [    0.000000] BUG: unable to handle kernel NULL pointer dereference at           (null)
> [    0.000000] IP: [<ffffffff811f090a>] init_page_owner+0x12a/0x240
> [    0.000000] PGD 0 
> [    0.000000] Oops: 0000 [#1] SMP 
> [    0.000000] Modules linked in:
> [    0.000000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.74+ #7
> [    0.000000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> [    0.000000] task: ffffffff81e104c0 ti: ffffffff81e00000 task.ti: ffffffff81e00000
> [    0.000000] RIP: 0010:[<ffffffff811f090a>]  [<ffffffff811f090a>] init_page_owner+0x12a/0x240
> [    0.000000] RSP: 0000:ffffffff81e03ed0  EFLAGS: 00010046
> [    0.000000] RAX: 0000000000000000 RBX: ffff88083ffe0210 RCX: ffffea0013000000
> [    0.000000] RDX: 0000000000000300 RSI: ffffffff81f57437 RDI: 00000000004c0000
> [    0.000000] RBP: ffffffff81e03f20 R08: ffffffff81e03e90 R09: 0000000000000000
> [    0.000000] R10: 00000000004c0200 R11: 0000000000000000 R12: ffffea0000000000
> [    0.000000] R13: 00000000004c0200 R14: 00000000004c0000 R15: 0000000000840000
> [    0.000000] FS:  0000000000000000(0000) GS:ffff88042fc00000(0000) knlGS:0000000000000000
> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.000000] CR2: 0000000000000000 CR3: 0000000001e0b000 CR4: 00000000000406b0
> [    0.000000] Stack:
> [    0.000000]  0000000000000206 ffff88083ffe0f90 ffff88083ffdf000 0000000000003181
> [    0.000000]  ffffea0013000000 0000000000000040 ffffea0000000000 0000000000840000
> [    0.000000]  0000000000840000 000000008e000010 ffffffff81e03f50 ffffffff81f84145
> [    0.000000] Call Trace:
> [    0.000000]  [<ffffffff81f84145>] page_ext_init+0x15e/0x167
> [    0.000000]  [<ffffffff81f57e6a>] start_kernel+0x351/0x418
> [    0.000000]  [<ffffffff81f57120>] ? early_idt_handler_array+0x120/0x120
> [    0.000000]  [<ffffffff81f57309>] x86_64_start_reservations+0x2a/0x2c
> [    0.000000]  [<ffffffff81f57437>] x86_64_start_kernel+0x12c/0x13b
> [    0.000000] Code: 81 e2 00 fe ff ff 4d 39 fa 4d 0f 47 d7 4d 39 f2 4d 89 d5 77 34 eb 5e 48 8b 01 f6 c4 04 75 21 48 89 cf 48 89 4d d0 e8 b6 35 00 00 <48> 8b 00 a8 04 75 0e 48 8b 4d d0 e9 c2 00 00 00 48 83 45 c8 01 
> [    0.000000] RIP  [<ffffffff811f090a>] init_page_owner+0x12a/0x240
> [    0.000000]  RSP <ffffffff81e03ed0>
> [    0.000000] CR2: 0000000000000000
> [    0.000000] ---[ end trace 19e05592f03a690f ]---
> 
> Note that this is different backtrace than in b8f1a75d61d8 log.
> 
> Still, backporting b8f1a75d61d8 fixes this:
> 
> [    1.538379] allocated 738197504 bytes of page_ext
> [    1.539340] Node 0, zone      DMA: page owner found early allocated 0 pages
> [    1.540179] Node 0, zone    DMA32: page owner found early allocated 33 pages
> [    1.611173] Node 0, zone   Normal: page owner found early allocated 96755 pages
> [    1.683167] Node 1, zone   Normal: page owner found early allocated 96575 pages
> 
> No panic, notice how it allocated more for page_ext, and found smaller number of
> early allocated pages.
> 
> Now backporting fe53ca54270a on top:
> 
> [    0.000000] allocated 738197504 bytes of page_ext
> [    0.000000] Node 0, zone      DMA: page owner found early allocated 0 pages
> [    0.000000] Node 0, zone    DMA32: page owner found early allocated 33 pages
> [    0.000000] Node 0, zone   Normal: page owner found early allocated 2842622 pages
> [    0.000000] Node 1, zone   Normal: page owner found early allocated 3694362 pages
> 
> Again no panic, and same amount of page_ext usage. But the "early allocated" numbers
> seem bogus to me. I think it's because init_pages_in_zone() is running and inspecting
> struct pages that have not been yet initialized. It doesn't end up crashing, but
> still doesn't seem correct?

Numbers looks sane to me. fe53ca54270a makes init_pages_in_zone()
called before page_alloc_init_late(). So, there would be many
uninitialized pages with PageReserved(). Page owner regarded these
PageReserved() page as allocated page.

We can change the message to "page owner found early reserved N pages"

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
  2017-07-03 14:18   ` Vlastimil Babka
@ 2017-07-04  5:23     ` Joonsoo Kim
  2017-07-04  9:39       ` Vlastimil Babka
  0 siblings, 1 reply; 15+ messages in thread
From: Joonsoo Kim @ 2017-07-04  5:23 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Yang Shi, Mel Gorman, Andrew Morton, linux-mm, LKML

On Mon, Jul 03, 2017 at 04:18:01PM +0200, Vlastimil Babka wrote:
> On 07/03/2017 01:48 PM, Vlastimil Babka wrote:
> > On 06/30/2017 04:18 PM, Michal Hocko wrote:
> >> fe53ca54270a ("mm: use early_pfn_to_nid in page_ext_init") seem
> >> to silently depend on CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID resp.
> >> CONFIG_HAVE_MEMBLOCK_NODE_MAP. early_pfn_to_nid is returning zero with
> >> !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID) && !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
> >> I am not sure how widely is this used but such a code is tricky. I see
> >> how catching early allocations during defered initialization might be
> >> useful but a subtly broken code sounds like a problem to me.  So is
> >> fe53ca54270a worth this or we should revert it?
> > 
> > There might be more issues with fe53ca54270a, I think. This I've
> > observed on our 4.4-based kernel, which has deferred page struct init,
> > but doesn't have b8f1a75d61d8 ("mm: call page_ext_init() after all
> > struct pages are initialized") nor aforementioned fe53ca54270a:
> > 
> > [    0.000000] allocated 421003264 bytes of page_ext
> > [    0.000000] Node 0, zone      DMA: page owner found early allocated 0 pages
> > [    0.000000] Node 0, zone    DMA32: page owner found early allocated 33 pages
> > [    0.000000] Node 0, zone   Normal: page owner found early allocated 2842622 pages
> > [    0.000000] BUG: unable to handle kernel NULL pointer dereference at           (null)
> > [    0.000000] IP: [<ffffffff811f090a>] init_page_owner+0x12a/0x240
> > [    0.000000] PGD 0 
> > [    0.000000] Oops: 0000 [#1] SMP 
> > [    0.000000] Modules linked in:
> > [    0.000000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.74+ #7
> > [    0.000000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> > [    0.000000] task: ffffffff81e104c0 ti: ffffffff81e00000 task.ti: ffffffff81e00000
> > [    0.000000] RIP: 0010:[<ffffffff811f090a>]  [<ffffffff811f090a>] init_page_owner+0x12a/0x240
> > [    0.000000] RSP: 0000:ffffffff81e03ed0  EFLAGS: 00010046
> > [    0.000000] RAX: 0000000000000000 RBX: ffff88083ffe0210 RCX: ffffea0013000000
> > [    0.000000] RDX: 0000000000000300 RSI: ffffffff81f57437 RDI: 00000000004c0000
> > [    0.000000] RBP: ffffffff81e03f20 R08: ffffffff81e03e90 R09: 0000000000000000
> > [    0.000000] R10: 00000000004c0200 R11: 0000000000000000 R12: ffffea0000000000
> > [    0.000000] R13: 00000000004c0200 R14: 00000000004c0000 R15: 0000000000840000
> > [    0.000000] FS:  0000000000000000(0000) GS:ffff88042fc00000(0000) knlGS:0000000000000000
> > [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    0.000000] CR2: 0000000000000000 CR3: 0000000001e0b000 CR4: 00000000000406b0
> > [    0.000000] Stack:
> > [    0.000000]  0000000000000206 ffff88083ffe0f90 ffff88083ffdf000 0000000000003181
> > [    0.000000]  ffffea0013000000 0000000000000040 ffffea0000000000 0000000000840000
> > [    0.000000]  0000000000840000 000000008e000010 ffffffff81e03f50 ffffffff81f84145
> > [    0.000000] Call Trace:
> > [    0.000000]  [<ffffffff81f84145>] page_ext_init+0x15e/0x167
> > [    0.000000]  [<ffffffff81f57e6a>] start_kernel+0x351/0x418
> > [    0.000000]  [<ffffffff81f57120>] ? early_idt_handler_array+0x120/0x120
> > [    0.000000]  [<ffffffff81f57309>] x86_64_start_reservations+0x2a/0x2c
> > [    0.000000]  [<ffffffff81f57437>] x86_64_start_kernel+0x12c/0x13b
> > [    0.000000] Code: 81 e2 00 fe ff ff 4d 39 fa 4d 0f 47 d7 4d 39 f2 4d 89 d5 77 34 eb 5e 48 8b 01 f6 c4 04 75 21 48 89 cf 48 89 4d d0 e8 b6 35 00 00 <48> 8b 00 a8 04 75 0e 48 8b 4d d0 e9 c2 00 00 00 48 83 45 c8 01 
> > [    0.000000] RIP  [<ffffffff811f090a>] init_page_owner+0x12a/0x240
> > [    0.000000]  RSP <ffffffff81e03ed0>
> > [    0.000000] CR2: 0000000000000000
> > [    0.000000] ---[ end trace 19e05592f03a690f ]---
> > 
> > Note that this is different backtrace than in b8f1a75d61d8 log.
> > 
> > Still, backporting b8f1a75d61d8 fixes this:
> > 
> > [    1.538379] allocated 738197504 bytes of page_ext
> > [    1.539340] Node 0, zone      DMA: page owner found early allocated 0 pages
> > [    1.540179] Node 0, zone    DMA32: page owner found early allocated 33 pages
> > [    1.611173] Node 0, zone   Normal: page owner found early allocated 96755 pages
> > [    1.683167] Node 1, zone   Normal: page owner found early allocated 96575 pages
> > 
> > No panic, notice how it allocated more for page_ext, and found smaller number of
> > early allocated pages.
> > 
> > Now backporting fe53ca54270a on top:
> > 
> > [    0.000000] allocated 738197504 bytes of page_ext
> > [    0.000000] Node 0, zone      DMA: page owner found early allocated 0 pages
> > [    0.000000] Node 0, zone    DMA32: page owner found early allocated 33 pages
> > [    0.000000] Node 0, zone   Normal: page owner found early allocated 2842622 pages
> > [    0.000000] Node 1, zone   Normal: page owner found early allocated 3694362 pages
> > 
> > Again no panic, and same amount of page_ext usage. But the "early allocated" numbers
> > seem bogus to me. I think it's because init_pages_in_zone() is running and inspecting
> > struct pages that have not been yet initialized. It doesn't end up crashing, but
> > still doesn't seem correct?
> 
> Hmm, and for the record, this is recent mainline. Wonder why the "early

Hmm... I don't know the reason.

> allocated" looks much more sane there. But there's a warning nevertheless.

Warning would comes from the fact that drain_all_pages() is called
before mm_percpu_wq is initialised. We could remove WARN_ON_ONCE() and add
drain_local_page(zone) to fix the problem.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
  2017-07-04  5:11     ` Joonsoo Kim
@ 2017-07-04  6:51       ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2017-07-04  6:51 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Yang Shi, Mel Gorman, Andrew Morton, Vlastimil Babka, linux-mm, LKML

On Tue 04-07-17 14:11:41, Joonsoo Kim wrote:
> On Fri, Jun 30, 2017 at 05:44:16PM +0200, Michal Hocko wrote:
> > On Fri 30-06-17 17:42:24, Michal Hocko wrote:
> > [...]
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 16532fa0bb64..894697c1e6f5 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -1055,6 +1055,7 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
> > >  	!defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP)
> > >  static inline unsigned long early_pfn_to_nid(unsigned long pfn)
> > >  {
> > > +	BUILD_BUG_ON(!IS_ENABLED(CONFIG_NUMA));
> > 
> > Err, this should read BUILD_BUG_ON(IS_ENABLED(CONFIG_NUMA)) of course
> 
> Agreed.
> 
> However, AFAIK, ARM can set CONFIG_NUMA but it doesn't have
> CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID and CONFIG_HAVE_MEMBLOCK_NODE_MAP.

$ git grep "config NUMA\|select NUMA" arch/arm
$

Did you mean arch64? If yes this one looks ok
$ git grep "HAVE_MEMBLOCK_NODE_MAP\|HAVE_ARCH_EARLY_PFN_TO_NID" arch/arm64/
arch/arm64/Kconfig:     select HAVE_MEMBLOCK_NODE_MAP if NUMA

> If page_ext uses early_pfn_to_nid(), it will cause build error in ARM.

Which would be intentional if it doesn't provide a proper implementation
of the function.
 
> Therefore, I suggest following change.
> CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on proper early_pfn_to_nid().
> So, following code will always work as long as
> CONFIG_DEFERRED_STRUCT_PAGE_INIT works.

I haven't checked all other callers of early_pfn_to_nid yet but I have
run my original patch (with !IS_ENABLED...) just to see whether anybody
actually uses this function from an innvalid context and it hasn't blown
up. So I suspect that all current users simply use the function from the
proper context. So if nobody objects I would just post the patch for
inclusion. If the compilation breaks we can think of a proper
implementation.

> 
> Thanks.
> 
> ----------->8---------------
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 88ccc044..e3db259 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -384,6 +384,7 @@ void __init page_ext_init(void)
>  
>         for_each_node_state(nid, N_MEMORY) {
>                 unsigned long start_pfn, end_pfn;
> +               int page_nid;
>  
>                 start_pfn = node_start_pfn(nid);
>                 end_pfn = node_end_pfn(nid);
> @@ -405,8 +406,15 @@ void __init page_ext_init(void)
>                          *
>                          * Take into account DEFERRED_STRUCT_PAGE_INIT.
>                          */
> -                       if (early_pfn_to_nid(pfn) != nid)
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +                       page_nid = early_pfn_to_nid(pfn);
> +#else
> +                       page_nid = pfn_to_nid(pfn);
> +#endif

I cannot say I would be happy about this ifdefery. Especially when there
is no existing user which would need it. 

> +
> +                       if (page_nid != nid)
>                                 continue;
> +
>                         if (init_section_page_ext(pfn, nid))
>                                 goto oom;
>                 }

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
  2017-07-04  5:23     ` Joonsoo Kim
@ 2017-07-04  9:39       ` Vlastimil Babka
  2017-07-04 10:53         ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2017-07-04  9:39 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Michal Hocko, Yang Shi, Mel Gorman, Andrew Morton, linux-mm, LKML

On 07/04/2017 07:23 AM, Joonsoo Kim wrote:
> On Mon, Jul 03, 2017 at 04:18:01PM +0200, Vlastimil Babka wrote:
>> allocated" looks much more sane there. But there's a warning nevertheless.
> 
> Warning would comes from the fact that drain_all_pages() is called
> before mm_percpu_wq is initialised. We could remove WARN_ON_ONCE() and add
> drain_local_page(zone) to fix the problem.

Wouldn't that still leave some period during boot where kernel already
runs on multiple CPU's, but mm_percpu_wq is not yet initialized and
somebody tries to use it? We want to catch such cases, right?

> Thanks.
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
  2017-07-04  9:39       ` Vlastimil Babka
@ 2017-07-04 10:53         ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2017-07-04 10:53 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joonsoo Kim, Yang Shi, Mel Gorman, Andrew Morton, linux-mm, LKML

On Tue 04-07-17 11:39:57, Vlastimil Babka wrote:
> On 07/04/2017 07:23 AM, Joonsoo Kim wrote:
> > On Mon, Jul 03, 2017 at 04:18:01PM +0200, Vlastimil Babka wrote:
> >> allocated" looks much more sane there. But there's a warning nevertheless.
> > 
> > Warning would comes from the fact that drain_all_pages() is called
> > before mm_percpu_wq is initialised. We could remove WARN_ON_ONCE() and add
> > drain_local_page(zone) to fix the problem.
> 
> Wouldn't that still leave some period during boot where kernel already
> runs on multiple CPU's, but mm_percpu_wq is not yet initialized and
> somebody tries to use it? We want to catch such cases, right?

I haven't checked the boot sequence but if we know that we need
mm_percpu_wq initialized earlier than moving it should be not a big deal
I guess.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
  2017-07-04  5:17   ` Joonsoo Kim
@ 2017-07-07 12:00     ` Vlastimil Babka
  2017-07-14  9:13       ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2017-07-07 12:00 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Michal Hocko, Yang Shi, Mel Gorman, Andrew Morton, linux-mm, LKML

On 07/04/2017 07:17 AM, Joonsoo Kim wrote:
>> 
>> Still, backporting b8f1a75d61d8 fixes this:
>> 
>> [    1.538379] allocated 738197504 bytes of page_ext
>> [    1.539340] Node 0, zone      DMA: page owner found early allocated 0 pages
>> [    1.540179] Node 0, zone    DMA32: page owner found early allocated 33 pages
>> [    1.611173] Node 0, zone   Normal: page owner found early allocated 96755 pages
>> [    1.683167] Node 1, zone   Normal: page owner found early allocated 96575 pages
>> 
>> No panic, notice how it allocated more for page_ext, and found smaller number of
>> early allocated pages.
>> 
>> Now backporting fe53ca54270a on top:
>> 
>> [    0.000000] allocated 738197504 bytes of page_ext
>> [    0.000000] Node 0, zone      DMA: page owner found early allocated 0 pages
>> [    0.000000] Node 0, zone    DMA32: page owner found early allocated 33 pages
>> [    0.000000] Node 0, zone   Normal: page owner found early allocated 2842622 pages
>> [    0.000000] Node 1, zone   Normal: page owner found early allocated 3694362 pages
>> 
>> Again no panic, and same amount of page_ext usage. But the "early allocated" numbers
>> seem bogus to me. I think it's because init_pages_in_zone() is running and inspecting
>> struct pages that have not been yet initialized. It doesn't end up crashing, but
>> still doesn't seem correct?
> 
> Numbers looks sane to me. fe53ca54270a makes init_pages_in_zone()
> called before page_alloc_init_late(). So, there would be many
> uninitialized pages with PageReserved(). Page owner regarded these
> PageReserved() page as allocated page.

That seems incorrect for two reasons:
- init_pages_in_zone() actually skips PageReserved() pages
- the pages don't have PageReserved() flag, until the deferred struct page init
thread processes them via deferred_init_memmap() -> __init_single_page() AFAICS

Now I've found out why upstream reports much less early allocated pages than our
kernel. We're missing 9d43f5aec950 ("mm/page_owner: add zone range overlapping
check") which adds a "page_zone(page) != zone" check. I think this only works
because the pages are not initialized and thus have no nid/zone links. Probably
page_zone() only doesn't break because it's all zeroed. I don't think it's safe
to rely on this?

> We can change the message to "page owner found early reserved N pages"
> 
> Thanks.
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
  2017-07-07 12:00     ` Vlastimil Babka
@ 2017-07-14  9:13       ` Michal Hocko
  2017-07-14  9:34         ` Vlastimil Babka
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2017-07-14  9:13 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joonsoo Kim, Yang Shi, Mel Gorman, Andrew Morton, linux-mm, LKML

On Fri 07-07-17 14:00:03, Vlastimil Babka wrote:
> On 07/04/2017 07:17 AM, Joonsoo Kim wrote:
> >> 
> >> Still, backporting b8f1a75d61d8 fixes this:
> >> 
> >> [    1.538379] allocated 738197504 bytes of page_ext
> >> [    1.539340] Node 0, zone      DMA: page owner found early allocated 0 pages
> >> [    1.540179] Node 0, zone    DMA32: page owner found early allocated 33 pages
> >> [    1.611173] Node 0, zone   Normal: page owner found early allocated 96755 pages
> >> [    1.683167] Node 1, zone   Normal: page owner found early allocated 96575 pages
> >> 
> >> No panic, notice how it allocated more for page_ext, and found smaller number of
> >> early allocated pages.
> >> 
> >> Now backporting fe53ca54270a on top:
> >> 
> >> [    0.000000] allocated 738197504 bytes of page_ext
> >> [    0.000000] Node 0, zone      DMA: page owner found early allocated 0 pages
> >> [    0.000000] Node 0, zone    DMA32: page owner found early allocated 33 pages
> >> [    0.000000] Node 0, zone   Normal: page owner found early allocated 2842622 pages
> >> [    0.000000] Node 1, zone   Normal: page owner found early allocated 3694362 pages
> >> 
> >> Again no panic, and same amount of page_ext usage. But the "early allocated" numbers
> >> seem bogus to me. I think it's because init_pages_in_zone() is running and inspecting
> >> struct pages that have not been yet initialized. It doesn't end up crashing, but
> >> still doesn't seem correct?
> > 
> > Numbers looks sane to me. fe53ca54270a makes init_pages_in_zone()
> > called before page_alloc_init_late(). So, there would be many
> > uninitialized pages with PageReserved(). Page owner regarded these
> > PageReserved() page as allocated page.
> 
> That seems incorrect for two reasons:
> - init_pages_in_zone() actually skips PageReserved() pages
> - the pages don't have PageReserved() flag, until the deferred struct page init
> thread processes them via deferred_init_memmap() -> __init_single_page() AFAICS
> 
> Now I've found out why upstream reports much less early allocated pages than our
> kernel. We're missing 9d43f5aec950 ("mm/page_owner: add zone range overlapping
> check") which adds a "page_zone(page) != zone" check. I think this only works
> because the pages are not initialized and thus have no nid/zone links. Probably
> page_zone() only doesn't break because it's all zeroed. I don't think it's safe
> to rely on this?

Yes, if anything PageReserved should be checked before the zone check.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
  2017-07-14  9:13       ` Michal Hocko
@ 2017-07-14  9:34         ` Vlastimil Babka
  2017-07-14 11:35           ` Michal Hocko
  0 siblings, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2017-07-14  9:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Yang Shi, Mel Gorman, Andrew Morton, linux-mm, LKML

On 07/14/2017 11:13 AM, Michal Hocko wrote:
> On Fri 07-07-17 14:00:03, Vlastimil Babka wrote:
>> On 07/04/2017 07:17 AM, Joonsoo Kim wrote:
>>>>
>>>> Still, backporting b8f1a75d61d8 fixes this:
>>>>
>>>> [    1.538379] allocated 738197504 bytes of page_ext
>>>> [    1.539340] Node 0, zone      DMA: page owner found early allocated 0 pages
>>>> [    1.540179] Node 0, zone    DMA32: page owner found early allocated 33 pages
>>>> [    1.611173] Node 0, zone   Normal: page owner found early allocated 96755 pages
>>>> [    1.683167] Node 1, zone   Normal: page owner found early allocated 96575 pages
>>>>
>>>> No panic, notice how it allocated more for page_ext, and found smaller number of
>>>> early allocated pages.
>>>>
>>>> Now backporting fe53ca54270a on top:
>>>>
>>>> [    0.000000] allocated 738197504 bytes of page_ext
>>>> [    0.000000] Node 0, zone      DMA: page owner found early allocated 0 pages
>>>> [    0.000000] Node 0, zone    DMA32: page owner found early allocated 33 pages
>>>> [    0.000000] Node 0, zone   Normal: page owner found early allocated 2842622 pages
>>>> [    0.000000] Node 1, zone   Normal: page owner found early allocated 3694362 pages
>>>>
>>>> Again no panic, and same amount of page_ext usage. But the "early allocated" numbers
>>>> seem bogus to me. I think it's because init_pages_in_zone() is running and inspecting
>>>> struct pages that have not been yet initialized. It doesn't end up crashing, but
>>>> still doesn't seem correct?
>>>
>>> Numbers looks sane to me. fe53ca54270a makes init_pages_in_zone()
>>> called before page_alloc_init_late(). So, there would be many
>>> uninitialized pages with PageReserved(). Page owner regarded these
>>> PageReserved() page as allocated page.
>>
>> That seems incorrect for two reasons:
>> - init_pages_in_zone() actually skips PageReserved() pages
>> - the pages don't have PageReserved() flag, until the deferred struct page init
>> thread processes them via deferred_init_memmap() -> __init_single_page() AFAICS
>>
>> Now I've found out why upstream reports much less early allocated pages than our
>> kernel. We're missing 9d43f5aec950 ("mm/page_owner: add zone range overlapping
>> check") which adds a "page_zone(page) != zone" check. I think this only works
>> because the pages are not initialized and thus have no nid/zone links. Probably
>> page_zone() only doesn't break because it's all zeroed. I don't think it's safe
>> to rely on this?
> 
> Yes, if anything PageReserved should be checked before the zone check.

That wouldn't change anything, because we skip PageReserved and it's not
set. Perhaps we could skip pages that have the raw page flags value
zero, but then a) we should make sure that the allocation of the struct
page array zeroes the range, and b) the first modification of struct
page in the initialization is setting the PageReserved flag.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations?
  2017-07-14  9:34         ` Vlastimil Babka
@ 2017-07-14 11:35           ` Michal Hocko
  0 siblings, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2017-07-14 11:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joonsoo Kim, Yang Shi, Mel Gorman, Andrew Morton, linux-mm, LKML

On Fri 14-07-17 11:34:31, Vlastimil Babka wrote:
> On 07/14/2017 11:13 AM, Michal Hocko wrote:
> > On Fri 07-07-17 14:00:03, Vlastimil Babka wrote:
> >> On 07/04/2017 07:17 AM, Joonsoo Kim wrote:
> >>>>
> >>>> Still, backporting b8f1a75d61d8 fixes this:
> >>>>
> >>>> [    1.538379] allocated 738197504 bytes of page_ext
> >>>> [    1.539340] Node 0, zone      DMA: page owner found early allocated 0 pages
> >>>> [    1.540179] Node 0, zone    DMA32: page owner found early allocated 33 pages
> >>>> [    1.611173] Node 0, zone   Normal: page owner found early allocated 96755 pages
> >>>> [    1.683167] Node 1, zone   Normal: page owner found early allocated 96575 pages
> >>>>
> >>>> No panic, notice how it allocated more for page_ext, and found smaller number of
> >>>> early allocated pages.
> >>>>
> >>>> Now backporting fe53ca54270a on top:
> >>>>
> >>>> [    0.000000] allocated 738197504 bytes of page_ext
> >>>> [    0.000000] Node 0, zone      DMA: page owner found early allocated 0 pages
> >>>> [    0.000000] Node 0, zone    DMA32: page owner found early allocated 33 pages
> >>>> [    0.000000] Node 0, zone   Normal: page owner found early allocated 2842622 pages
> >>>> [    0.000000] Node 1, zone   Normal: page owner found early allocated 3694362 pages
> >>>>
> >>>> Again no panic, and same amount of page_ext usage. But the "early allocated" numbers
> >>>> seem bogus to me. I think it's because init_pages_in_zone() is running and inspecting
> >>>> struct pages that have not been yet initialized. It doesn't end up crashing, but
> >>>> still doesn't seem correct?
> >>>
> >>> Numbers looks sane to me. fe53ca54270a makes init_pages_in_zone()
> >>> called before page_alloc_init_late(). So, there would be many
> >>> uninitialized pages with PageReserved(). Page owner regarded these
> >>> PageReserved() page as allocated page.
> >>
> >> That seems incorrect for two reasons:
> >> - init_pages_in_zone() actually skips PageReserved() pages
> >> - the pages don't have PageReserved() flag, until the deferred struct page init
> >> thread processes them via deferred_init_memmap() -> __init_single_page() AFAICS
> >>
> >> Now I've found out why upstream reports much less early allocated pages than our
> >> kernel. We're missing 9d43f5aec950 ("mm/page_owner: add zone range overlapping
> >> check") which adds a "page_zone(page) != zone" check. I think this only works
> >> because the pages are not initialized and thus have no nid/zone links. Probably
> >> page_zone() only doesn't break because it's all zeroed. I don't think it's safe
> >> to rely on this?
> > 
> > Yes, if anything PageReserved should be checked before the zone check.
> 
> That wouldn't change anything, because we skip PageReserved and it's not
> set.

I thought they were still marked reserved from the bootmem allocator I
would have to go through the initialization code again to be sure.

> Perhaps we could skip pages that have the raw page flags value
> zero, but then a) we should make sure that the allocation of the struct
> page array zeroes the range, and b) the first modification of struct
> page in the initialization is setting the PageReserved flag.

I would rather not depend on the page state. There are plans to not
initialize the struct page (even to 0 during memmap init) until
__init_single_page.

Either the page is fully initialized or we are touching invalid pfn
range. end_pfn = pfn + zone->spanned_pages but I guess we should in fact
consider first_deferred_pfn as well (calculate_node_totalpages is not
deffered initialization aware).
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-07-14 11:35 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-30 14:18 "mm: use early_pfn_to_nid in page_ext_init" broken on some configurations? Michal Hocko
2017-06-30 15:42 ` Michal Hocko
2017-06-30 15:44   ` Michal Hocko
2017-07-04  5:11     ` Joonsoo Kim
2017-07-04  6:51       ` Michal Hocko
2017-07-03 11:48 ` Vlastimil Babka
2017-07-03 14:18   ` Vlastimil Babka
2017-07-04  5:23     ` Joonsoo Kim
2017-07-04  9:39       ` Vlastimil Babka
2017-07-04 10:53         ` Michal Hocko
2017-07-04  5:17   ` Joonsoo Kim
2017-07-07 12:00     ` Vlastimil Babka
2017-07-14  9:13       ` Michal Hocko
2017-07-14  9:34         ` Vlastimil Babka
2017-07-14 11:35           ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).