linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm: skip checking poison pattern for page_to_nid()
       [not found] <1545172285.18411.26.camel@lca.pw>
@ 2018-12-19  1:57 ` Qian Cai
  2018-12-19 10:20   ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Qian Cai @ 2018-12-19  1:57 UTC (permalink / raw)
  To: akpm; +Cc: mingo, mhocko, hpa, mgorman, tglx, linux-mm, linux-kernel, Qian Cai

Kernel panic with page_owner=on

CONFIG_DEBUG_VM_PGFLAGS=y
PAGE_OWNER=y
NODE_NOT_IN_PAGE_FLAGS=n

This is due to f165b378bbd (mm: uninitialized struct page poisoning
sanity checking) shoots itself in the foot.

[   11.917212] page:ffffea0004200000 is uninitialized and poisoned
[   11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[   11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[   11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[   11.926498] page_owner info is not active (free page?)
[   12.329560] kernel BUG at include/linux/mm.h:990!
[   12.337632] RIP: 0010:init_page_owner+0x486/0x520

At first,

start_kernel
  setup_arch
    pagetable_init
      paging_init
        sparse_init
          sparse_init_nid
            sparse_buffer_init
              memblock_virt_alloc_try_nid_raw

It poisons all the allocated pages there.

memset(ptr, PAGE_POISON_PATTERN, size)

Later,

page_ext_init
  invoke_init_callbacks
    init_section_page_ext
      init_page_owner
        init_early_allocated_pages
          init_zones_in_node
            init_pages_in_zone
              lookup_page_ext
                page_to_nid
                  PF_POISONED_CHECK <--- panic here.

This because all allocated pages are not initialized until later.

init_pages_in_zone
  __set_page_owner_handle

Fixes: f165b378bbd (mm: uninitialized struct page poisoning
sanity checking)
Signed-off-by: Qian Cai <cai@lca.pw>
---
 include/linux/mm.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..f083f366ea90 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -985,9 +985,7 @@ extern int page_to_nid(const struct page *page);
 #else
 static inline int page_to_nid(const struct page *page)
 {
-	struct page *p = (struct page *)page;
-
-	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
+	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
 }
 #endif
 
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH] mm: skip checking poison pattern for page_to_nid()
  2018-12-19  1:57 ` [PATCH] mm: skip checking poison pattern for page_to_nid() Qian Cai
@ 2018-12-19 10:20   ` Michal Hocko
  2018-12-19 12:46     ` Qian Cai
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2018-12-19 10:20 UTC (permalink / raw)
  To: Qian Cai; +Cc: akpm, mingo, hpa, mgorman, tglx, linux-mm, linux-kernel

On Tue 18-12-18 20:57:32, Qian Cai wrote:
[...]
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5411de93a363..f083f366ea90 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -985,9 +985,7 @@ extern int page_to_nid(const struct page *page);
>  #else
>  static inline int page_to_nid(const struct page *page)
>  {
> -	struct page *p = (struct page *)page;
> -
> -	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
> +	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
>  }
>  #endif

I didn't get to think about a proper fix but this is clearly worng. If
the page is still poisoned then flags are clearly bogus and the node you
get is a garbage as well. Have you actually tested this patch?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] mm: skip checking poison pattern for page_to_nid()
  2018-12-19 10:20   ` Michal Hocko
@ 2018-12-19 12:46     ` Qian Cai
  2018-12-20  6:03       ` [PATCH] mm/page_owner: fix for deferred struct page init Qian Cai
  0 siblings, 1 reply; 32+ messages in thread
From: Qian Cai @ 2018-12-19 12:46 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, mingo, hpa, mgorman, tglx, linux-mm, linux-kernel

On 12/19/18 5:20 AM, Michal Hocko wrote:
> On Tue 18-12-18 20:57:32, Qian Cai wrote:
> [...]
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 5411de93a363..f083f366ea90 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -985,9 +985,7 @@ extern int page_to_nid(const struct page *page);
>>  #else
>>  static inline int page_to_nid(const struct page *page)
>>  {
>> -	struct page *p = (struct page *)page;
>> -
>> -	return (PF_POISONED_CHECK(p)->flags >> NODES_PGSHIFT) & NODES_MASK;
>> +	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
>>  }
>>  #endif
> 
> I didn't get to think about a proper fix but this is clearly worng. If
> the page is still poisoned then flags are clearly bogus and the node you
> get is a garbage as well. Have you actually tested this patch?
> 

Yes, I did notice that after running for a while triggering some UBSAN
out-of-bounds access warnings. I am still trying to figure out how those
uninitialized page flags survived though after

mm_init
  mem_init
    memblock_free_all
      init_single_page()

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] mm/page_owner: fix for deferred struct page init
  2018-12-19 12:46     ` Qian Cai
@ 2018-12-20  6:03       ` Qian Cai
  2018-12-20  9:22         ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Qian Cai @ 2018-12-20  6:03 UTC (permalink / raw)
  To: akpm
  Cc: Pavel.Tatashin, mingo, mhocko, hpa, mgorman, tglx, linux-mm,
	linux-kernel, Qian Cai

When booting a system with "page_owner=on",

start_kernel
  page_ext_init
    invoke_init_callbacks
      init_section_page_ext
        init_page_owner
          init_early_allocated_pages
            init_zones_in_node
              init_pages_in_zone
                lookup_page_ext
                  page_to_nid

The issue here is that page_to_nid() will not work since some page
flags have no node information until later in page_alloc_init_late() due
to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
access with an invalid nid.

[    8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
[    8.672603] index 7 is out of range for type 'zone [5]'

Also, kernel will panic since flags were poisoned earlier with,

CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_NODE_NOT_IN_PAGE_FLAGS=n

start_kernel
  setup_arch
    pagetable_init
      paging_init
        sparse_init
          sparse_init_nid
            memblock_alloc_try_nid_raw

Although later it tries to set page flags for pages in reserved bootmem
regions,

mm_init
  mem_init
    memblock_free_all
      free_low_memory_core_early
        reserve_bootmem_region

there could still have some freed pages from the page allocator but yet
to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
dealt with a bit in page_ext_init().

* Take into account DEFERRED_STRUCT_PAGE_INIT.
*/
if (early_pfn_to_nid(pfn) != nid)
	continue;

However it did not handle it well in init_pages_in_zone() which end up
calling page_to_nid().

[   11.917212] page:ffffea0004200000 is uninitialized and poisoned
[   11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[   11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[   11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[   11.926498] page_owner info is not active (free page?)
[   12.329560] kernel BUG at include/linux/mm.h:990!
[   12.337632] RIP: 0010:init_page_owner+0x486/0x520

Since init_pages_in_zone() has already had the node information, there
is no need to call page_to_nid() at all during the page ext lookup, and
also replace calls that could incorrectly checked for poisoned page
structs.

It ends up wasting some memory to allocate page ext for those already
freed pages, but there is no sane ways to tell those freed pages apart
from uninitialized valid pages due to DEFERRED_STRUCT_PAGE_INIT. It
looks quite reasonable on an arm64 server though.

allocated 83230720 bytes of page_ext
Node 0, zone    DMA32: page owner found early allocated 0 pages
Node 0, zone   Normal: page owner found early allocated 2048214 pages
Node 1, zone   Normal: page owner found early allocated 2080641 pages

Used more memory on a x86_64 server.

allocated 334233600 bytes of page_ext
Node 0, zone      DMA: page owner found early allocated 2 pages
Node 0, zone    DMA32: page owner found early allocated 24303 pages
Node 0, zone   Normal: page owner found early allocated 7545357 pages
Node 1, zone   Normal: page owner found early allocated 8331279 pages

Finally, rename get_entry() to get_ext_entry(), so it can be exported
without a naming collision.

Signed-off-by: Qian Cai <cai@lca.pw>
---
 include/linux/page_ext.h |  6 ++++++
 mm/page_ext.c            |  8 ++++----
 mm/page_owner.c          | 39 ++++++++++++++++++++++++++++++++-------
 3 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index f84f167ec04c..e95cb6198014 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -51,6 +51,7 @@ static inline void page_ext_init(void)
 #endif
 
 struct page_ext *lookup_page_ext(const struct page *page);
+struct page_ext *get_ext_entry(void *base, unsigned long index);
 
 #else /* !CONFIG_PAGE_EXTENSION */
 struct page_ext;
@@ -64,6 +65,11 @@ static inline struct page_ext *lookup_page_ext(const struct page *page)
 	return NULL;
 }
 
+static inline struct page_ext *get_ext_entry(void *base, unsigned long index)
+{
+	return NULL;
+}
+
 static inline void page_ext_init(void)
 {
 }
diff --git a/mm/page_ext.c b/mm/page_ext.c
index ae44f7adbe07..3cd8f0c13057 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -107,7 +107,7 @@ static unsigned long get_entry_size(void)
 	return sizeof(struct page_ext) + extra_mem;
 }
 
-static inline struct page_ext *get_entry(void *base, unsigned long index)
+struct page_ext *get_ext_entry(void *base, unsigned long index)
 {
 	return base + get_entry_size() * index;
 }
@@ -137,7 +137,7 @@ struct page_ext *lookup_page_ext(const struct page *page)
 		return NULL;
 	index = pfn - round_down(node_start_pfn(page_to_nid(page)),
 					MAX_ORDER_NR_PAGES);
-	return get_entry(base, index);
+	return get_ext_entry(base, index);
 }
 
 static int __init alloc_node_page_ext(int nid)
@@ -207,7 +207,7 @@ struct page_ext *lookup_page_ext(const struct page *page)
 	 */
 	if (!section->page_ext)
 		return NULL;
-	return get_entry(section->page_ext, pfn);
+	return get_ext_entry(section->page_ext, pfn);
 }
 
 static void *__meminit alloc_page_ext(size_t size, int nid)
@@ -285,7 +285,7 @@ static void __free_page_ext(unsigned long pfn)
 	ms = __pfn_to_section(pfn);
 	if (!ms || !ms->page_ext)
 		return;
-	base = get_entry(ms->page_ext, pfn);
+	base = get_ext_entry(ms->page_ext, pfn);
 	free_page_ext(base);
 	ms->page_ext = NULL;
 }
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 87bc0dfdb52b..c27712c9a764 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -531,6 +531,7 @@ static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
 	unsigned long pfn = zone->zone_start_pfn;
 	unsigned long end_pfn = zone_end_pfn(zone);
 	unsigned long count = 0;
+	struct page_ext *base;
 
 	/*
 	 * Walk the zone in pageblock_nr_pages steps. If a page block spans
@@ -555,11 +556,11 @@ static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
 			if (!pfn_valid_within(pfn))
 				continue;
 
-			page = pfn_to_page(pfn);
-
-			if (page_zone(page) != zone)
+			if (pfn < zone->zone_start_pfn || pfn >= end_pfn)
 				continue;
 
+			page = pfn_to_page(pfn);
+
 			/*
 			 * To avoid having to grab zone->lock, be a little
 			 * careful when reading buddy page order. The only
@@ -575,13 +576,37 @@ static void init_pages_in_zone(pg_data_t *pgdat, struct zone *zone)
 				continue;
 			}
 
-			if (PageReserved(page))
+#ifdef CONFIG_SPARSEMEM
+			base = __pfn_to_section(pfn)->page_ext;
+#else
+			base = pgdat->node_page_ext;
+#endif
+			/*
+			 * The sanity checks the page allocator does upon
+			 * freeing a page can reach here before the page_ext
+			 * arrays are allocated when feeding a range of pages to
+			 * the allocator for the first time during bootup or
+			 * memory hotplug.
+			 */
+			if (unlikely(!base))
 				continue;
 
-			page_ext = lookup_page_ext(page);
-			if (unlikely(!page_ext))
+			/*
+			 * Those pages reached here might had already been freed
+			 * due to the deferred struct page init.
+			 */
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+			if (pfn < pgdat->first_deferred_pfn)
+#endif
+			if (PageReserved(page))
 				continue;
-
+#ifdef CONFIG_SPARSEMEM
+			page_ext = get_ext_entry(base, pfn);
+#else
+			page_ext = get_ext_entry(base, pfn -
+						 round_down(pgdat->node_start_pfn,
+							    MAX_ORDER_NR_PAGES));
+#endif
 			/* Maybe overlapping zone */
 			if (test_bit(PAGE_EXT_OWNER, &page_ext->flags))
 				continue;
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH] mm/page_owner: fix for deferred struct page init
  2018-12-20  6:03       ` [PATCH] mm/page_owner: fix for deferred struct page init Qian Cai
@ 2018-12-20  9:22         ` Michal Hocko
  2018-12-20 18:50           ` [PATCH v2] " Qian Cai
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2018-12-20  9:22 UTC (permalink / raw)
  To: Qian Cai
  Cc: akpm, Pavel.Tatashin, mingo, hpa, mgorman, tglx, linux-mm, linux-kernel

On Thu 20-12-18 01:03:03, Qian Cai wrote:
> When booting a system with "page_owner=on",
> 
> start_kernel
>   page_ext_init
>     invoke_init_callbacks
>       init_section_page_ext
>         init_page_owner
>           init_early_allocated_pages
>             init_zones_in_node
>               init_pages_in_zone
>                 lookup_page_ext
>                   page_to_nid
> 
> The issue here is that page_to_nid() will not work since some page
> flags have no node information until later in page_alloc_init_late() due
> to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
> access with an invalid nid.
> 
> [    8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
> [    8.672603] index 7 is out of range for type 'zone [5]'
> 
> Also, kernel will panic since flags were poisoned earlier with,
> 
> CONFIG_DEBUG_VM_PGFLAGS=y
> CONFIG_NODE_NOT_IN_PAGE_FLAGS=n
> 
> start_kernel
>   setup_arch
>     pagetable_init
>       paging_init
>         sparse_init
>           sparse_init_nid
>             memblock_alloc_try_nid_raw
> 
> Although later it tries to set page flags for pages in reserved bootmem
> regions,
> 
> mm_init
>   mem_init
>     memblock_free_all
>       free_low_memory_core_early
>         reserve_bootmem_region
> 
> there could still have some freed pages from the page allocator but yet
> to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
> dealt with a bit in page_ext_init().

Is there any reason why we cannot postpone page_ext initialization to
after all the memory is initialized?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2] mm/page_owner: fix for deferred struct page init
  2018-12-20  9:22         ` Michal Hocko
@ 2018-12-20 18:50           ` Qian Cai
  2018-12-20 20:31             ` [PATCH v3] " Qian Cai
  2019-01-09  7:34             ` [PATCH v2] " Michal Hocko
  0 siblings, 2 replies; 32+ messages in thread
From: Qian Cai @ 2018-12-20 18:50 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, Pavel.Tatashin, mingo, hpa, mgorman, tglx, linux-mm,
	linux-kernel, Qian Cai

When booting a system with "page_owner=on",

start_kernel
  page_ext_init
    invoke_init_callbacks
      init_section_page_ext
        init_page_owner
          init_early_allocated_pages
            init_zones_in_node
              init_pages_in_zone
                lookup_page_ext
                  page_to_nid

The issue here is that page_to_nid() will not work since some page
flags have no node information until later in page_alloc_init_late() due
to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
access with an invalid nid.

[    8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
[    8.672603] index 7 is out of range for type 'zone [5]'

Also, kernel will panic since flags were poisoned earlier with,

CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_NODE_NOT_IN_PAGE_FLAGS=n

start_kernel
  setup_arch
    pagetable_init
      paging_init
        sparse_init
          sparse_init_nid
            memblock_alloc_try_nid_raw

Although later it tries to set page flags for pages in reserved bootmem
regions,

mm_init
  mem_init
    memblock_free_all
      free_low_memory_core_early
        reserve_bootmem_region

there could still have some freed pages from the page allocator but yet
to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
dealt with a bit in page_ext_init().

* Take into account DEFERRED_STRUCT_PAGE_INIT.
*/
if (early_pfn_to_nid(pfn) != nid)
	continue;

However it did not handle it well in init_pages_in_zone() which end up
calling page_to_nid().

[   11.917212] page:ffffea0004200000 is uninitialized and poisoned
[   11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[   11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[   11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[   11.926498] page_owner info is not active (free page?)
[   12.329560] kernel BUG at include/linux/mm.h:990!
[   12.337632] RIP: 0010:init_page_owner+0x486/0x520

Since there is no other routines depend on page_ext_init() in
start_kernel() and no real benefit to call it so early, just move it
after page_alloc_init_late() to ensure that there is no deferred pages
need to de dealt with.

Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Qian Cai <cai@lca.pw>
---

v2: postpone init_page_ext() to after page_alloc_init_late().

 init/main.c   | 2 +-
 mm/page_ext.c | 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/init/main.c b/init/main.c
index 2b7b7fe173c9..1aeb062b2cb7 100644
--- a/init/main.c
+++ b/init/main.c
@@ -696,7 +696,6 @@ asmlinkage __visible void __init start_kernel(void)
 		initrd_start = 0;
 	}
 #endif
-	page_ext_init();
 	kmemleak_init();
 	setup_per_cpu_pageset();
 	numa_policy_init();
@@ -1147,6 +1146,7 @@ static noinline void __init kernel_init_freeable(void)
 	sched_init_smp();
 
 	page_alloc_init_late();
+	page_ext_init();
 
 	do_basic_setup();
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index ae44f7adbe07..d76fd51e312a 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -399,9 +399,8 @@ void __init page_ext_init(void)
 			 * -------------pfn-------------->
 			 * N0 | N1 | N2 | N0 | N1 | N2|....
 			 *
-			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
 			 */
-			if (early_pfn_to_nid(pfn) != nid)
+			if (pfn_to_nid(pfn) != nid)
 				continue;
 			if (init_section_page_ext(pfn, nid))
 				goto oom;
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3] mm/page_owner: fix for deferred struct page init
  2018-12-20 18:50           ` [PATCH v2] " Qian Cai
@ 2018-12-20 20:31             ` Qian Cai
  2018-12-20 21:00               ` William Kucharski
  2019-01-03 11:51               ` Michal Hocko
  2019-01-09  7:34             ` [PATCH v2] " Michal Hocko
  1 sibling, 2 replies; 32+ messages in thread
From: Qian Cai @ 2018-12-20 20:31 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, Pavel.Tatashin, mingo, hpa, mgorman, iamjoonsoo.kim,
	yang.shi, tglx, linux-mm, linux-kernel, Qian Cai

When booting a system with "page_owner=on",

start_kernel
  page_ext_init
    invoke_init_callbacks
      init_section_page_ext
        init_page_owner
          init_early_allocated_pages
            init_zones_in_node
              init_pages_in_zone
                lookup_page_ext
                  page_to_nid

The issue here is that page_to_nid() will not work since some page
flags have no node information until later in page_alloc_init_late() due
to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
access with an invalid nid.

[    8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
[    8.672603] index 7 is out of range for type 'zone [5]'

Also, kernel will panic since flags were poisoned earlier with,

CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_NODE_NOT_IN_PAGE_FLAGS=n

start_kernel
  setup_arch
    pagetable_init
      paging_init
        sparse_init
          sparse_init_nid
            memblock_alloc_try_nid_raw

Although later it tries to set page flags for pages in reserved bootmem
regions,

mm_init
  mem_init
    memblock_free_all
      free_low_memory_core_early
        reserve_bootmem_region

there could still have some freed pages from the page allocator but yet
to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
dealt with a bit in page_ext_init().

* Take into account DEFERRED_STRUCT_PAGE_INIT.
*/
if (early_pfn_to_nid(pfn) != nid)
	continue;

However, it did not handle it well in init_pages_in_zone() which end up
calling page_to_nid().

[   11.917212] page:ffffea0004200000 is uninitialized and poisoned
[   11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[   11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[   11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[   11.926498] page_owner info is not active (free page?)
[   12.329560] kernel BUG at include/linux/mm.h:990!
[   12.337632] RIP: 0010:init_page_owner+0x486/0x520

Since there is no other routines depend on page_ext_init() in
start_kernel(), just move it after page_alloc_init_late() to ensure that
there is no deferred pages need to de dealt with. If deselected
DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init()
earlier, so page owner could catch more early page allocation call
sites. This gives us a good compromise between catching good and bad
call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT.

[1] https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/

Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init)
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Qian Cai <cai@lca.pw>
---

v3: still call page_ext_init() earlier if DEFERRED_STRUCT_PAGE_INIT=n.

v2: postpone page_ext_init() to after page_alloc_init_late().

 init/main.c   | 5 +++++
 mm/page_ext.c | 3 +--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/init/main.c b/init/main.c
index 2b7b7fe173c9..5d9904370f76 100644
--- a/init/main.c
+++ b/init/main.c
@@ -696,7 +696,9 @@ asmlinkage __visible void __init start_kernel(void)
 		initrd_start = 0;
 	}
 #endif
+#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	page_ext_init();
+#endif
 	kmemleak_init();
 	setup_per_cpu_pageset();
 	numa_policy_init();
@@ -1147,6 +1149,9 @@ static noinline void __init kernel_init_freeable(void)
 	sched_init_smp();
 
 	page_alloc_init_late();
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+	page_ext_init();
+#endif
 
 	do_basic_setup();
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index ae44f7adbe07..d76fd51e312a 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -399,9 +399,8 @@ void __init page_ext_init(void)
 			 * -------------pfn-------------->
 			 * N0 | N1 | N2 | N0 | N1 | N2|....
 			 *
-			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
 			 */
-			if (early_pfn_to_nid(pfn) != nid)
+			if (pfn_to_nid(pfn) != nid)
 				continue;
 			if (init_section_page_ext(pfn, nid))
 				goto oom;
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2018-12-20 20:31             ` [PATCH v3] " Qian Cai
@ 2018-12-20 21:00               ` William Kucharski
  2018-12-20 21:04                 ` Qian Cai
  2019-01-03 11:51               ` Michal Hocko
  1 sibling, 1 reply; 32+ messages in thread
From: William Kucharski @ 2018-12-20 21:00 UTC (permalink / raw)
  To: Qian Cai
  Cc: akpm, mhocko, Pavel.Tatashin, mingo, hpa, mgorman,
	iamjoonsoo.kim, yang.shi, tglx, linux-mm, linux-kernel



> On Dec 20, 2018, at 1:31 PM, Qian Cai <cai@lca.pw> wrote:
> 
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index ae44f7adbe07..d76fd51e312a 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
> 			 * -------------pfn-------------->
> 			 * N0 | N1 | N2 | N0 | N1 | N2|....
> 			 *
> -			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
> 			 */
> -			if (early_pfn_to_nid(pfn) != nid)
> +			if (pfn_to_nid(pfn) != nid)
> 				continue;
> 			if (init_section_page_ext(pfn, nid))
> 				goto oom;
> -- 
> 2.17.2 (Apple Git-113)
> 

Is there any danger in the fact that in the CONFIG_NUMA case in mmzone.h (around line 1261), pfn_to_nid() calls page_to_nid(), possibly causing the same issue seen in v2?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2018-12-20 21:00               ` William Kucharski
@ 2018-12-20 21:04                 ` Qian Cai
  0 siblings, 0 replies; 32+ messages in thread
From: Qian Cai @ 2018-12-20 21:04 UTC (permalink / raw)
  To: William Kucharski
  Cc: akpm, mhocko, Pavel.Tatashin, mingo, hpa, mgorman,
	iamjoonsoo.kim, tglx, linux-mm, linux-kernel

On Thu, 2018-12-20 at 14:00 -0700, William Kucharski wrote:
> > On Dec 20, 2018, at 1:31 PM, Qian Cai <cai@lca.pw> wrote:
> > 
> > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > index ae44f7adbe07..d76fd51e312a 100644
> > --- a/mm/page_ext.c
> > +++ b/mm/page_ext.c
> > @@ -399,9 +399,8 @@ void __init page_ext_init(void)
> > 			 * -------------pfn-------------->
> > 			 * N0 | N1 | N2 | N0 | N1 | N2|....
> > 			 *
> > -			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
> > 			 */
> > -			if (early_pfn_to_nid(pfn) != nid)
> > +			if (pfn_to_nid(pfn) != nid)
> > 				continue;
> > 			if (init_section_page_ext(pfn, nid))
> > 				goto oom;
> > -- 
> > 2.17.2 (Apple Git-113)
> > 
> 
> Is there any danger in the fact that in the CONFIG_NUMA case in mmzone.h
> (around line 1261), pfn_to_nid() calls page_to_nid(), possibly causing the
> same issue seen in v2?
> 

No. If CONFIG_DEFERRED_STRUCT_PAGE_INIT=y, page_ext_init() is called after
page_alloc_init_late() where all the memory has already been initialized,
so page_to_nid() will work then.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2018-12-20 20:31             ` [PATCH v3] " Qian Cai
  2018-12-20 21:00               ` William Kucharski
@ 2019-01-03 11:51               ` Michal Hocko
  2019-01-03 16:38                 ` Qian Cai
  1 sibling, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2019-01-03 11:51 UTC (permalink / raw)
  To: Qian Cai
  Cc: akpm, Pavel.Tatashin, mingo, hpa, mgorman, iamjoonsoo.kim,
	yang.shi, tglx, linux-mm, linux-kernel

On Thu 20-12-18 15:31:56, Qian Cai wrote:
> When booting a system with "page_owner=on",
> 
> start_kernel
>   page_ext_init
>     invoke_init_callbacks
>       init_section_page_ext
>         init_page_owner
>           init_early_allocated_pages
>             init_zones_in_node
>               init_pages_in_zone
>                 lookup_page_ext
>                   page_to_nid
> 
> The issue here is that page_to_nid() will not work since some page
> flags have no node information until later in page_alloc_init_late() due
> to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
> access with an invalid nid.
> 
> [    8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
> [    8.672603] index 7 is out of range for type 'zone [5]'
> 
> Also, kernel will panic since flags were poisoned earlier with,
> 
> CONFIG_DEBUG_VM_PGFLAGS=y
> CONFIG_NODE_NOT_IN_PAGE_FLAGS=n
> 
> start_kernel
>   setup_arch
>     pagetable_init
>       paging_init
>         sparse_init
>           sparse_init_nid
>             memblock_alloc_try_nid_raw
> 
> Although later it tries to set page flags for pages in reserved bootmem
> regions,
> 
> mm_init
>   mem_init
>     memblock_free_all
>       free_low_memory_core_early
>         reserve_bootmem_region
> 
> there could still have some freed pages from the page allocator but yet
> to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
> dealt with a bit in page_ext_init().
> 
> * Take into account DEFERRED_STRUCT_PAGE_INIT.
> */
> if (early_pfn_to_nid(pfn) != nid)
> 	continue;
> 
> However, it did not handle it well in init_pages_in_zone() which end up
> calling page_to_nid().
> 
> [   11.917212] page:ffffea0004200000 is uninitialized and poisoned
> [   11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
> ffffffffffffffff
> [   11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
> ffffffffffffffff
> [   11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> [   11.926498] page_owner info is not active (free page?)
> [   12.329560] kernel BUG at include/linux/mm.h:990!
> [   12.337632] RIP: 0010:init_page_owner+0x486/0x520
> 
> Since there is no other routines depend on page_ext_init() in
> start_kernel(), just move it after page_alloc_init_late() to ensure that
> there is no deferred pages need to de dealt with. If deselected
> DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init()
> earlier, so page owner could catch more early page allocation call
> sites. This gives us a good compromise between catching good and bad
> call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT.
> 
> [1] https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/
> 
> Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init)
> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
> 
> v3: still call page_ext_init() earlier if DEFERRED_STRUCT_PAGE_INIT=n.
> 
> v2: postpone page_ext_init() to after page_alloc_init_late().
> 
>  init/main.c   | 5 +++++
>  mm/page_ext.c | 3 +--
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/init/main.c b/init/main.c
> index 2b7b7fe173c9..5d9904370f76 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -696,7 +696,9 @@ asmlinkage __visible void __init start_kernel(void)
>  		initrd_start = 0;
>  	}
>  #endif
> +#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  	page_ext_init();
> +#endif
>  	kmemleak_init();
>  	setup_per_cpu_pageset();
>  	numa_policy_init();
> @@ -1147,6 +1149,9 @@ static noinline void __init kernel_init_freeable(void)
>  	sched_init_smp();
>  
>  	page_alloc_init_late();
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +	page_ext_init();
> +#endif
>  
>  	do_basic_setup();

Is this really necessary? Why cannot we simply postpone page_ext_init
unconditioanally?

>  
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index ae44f7adbe07..d76fd51e312a 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
>  			 * -------------pfn-------------->
>  			 * N0 | N1 | N2 | N0 | N1 | N2|....
>  			 *
> -			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
>  			 */
> -			if (early_pfn_to_nid(pfn) != nid)
> +			if (pfn_to_nid(pfn) != nid)
>  				continue;
>  			if (init_section_page_ext(pfn, nid))
>  				goto oom;

Also this doesn't seem to be related, right?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-03 11:51               ` Michal Hocko
@ 2019-01-03 16:38                 ` Qian Cai
  2019-01-03 16:59                   ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Qian Cai @ 2019-01-03 16:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, Pavel.Tatashin, mingo, hpa, mgorman, iamjoonsoo.kim,
	yang.shi, tglx, linux-mm, linux-kernel

On 1/3/19 6:51 AM, Michal Hocko wrote:
> On Thu 20-12-18 15:31:56, Qian Cai wrote:
>> When booting a system with "page_owner=on",
>>
>> start_kernel
>>   page_ext_init
>>     invoke_init_callbacks
>>       init_section_page_ext
>>         init_page_owner
>>           init_early_allocated_pages
>>             init_zones_in_node
>>               init_pages_in_zone
>>                 lookup_page_ext
>>                   page_to_nid
>>
>> The issue here is that page_to_nid() will not work since some page
>> flags have no node information until later in page_alloc_init_late() due
>> to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
>> access with an invalid nid.
>>
>> [    8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
>> [    8.672603] index 7 is out of range for type 'zone [5]'
>>
>> Also, kernel will panic since flags were poisoned earlier with,
>>
>> CONFIG_DEBUG_VM_PGFLAGS=y
>> CONFIG_NODE_NOT_IN_PAGE_FLAGS=n
>>
>> start_kernel
>>   setup_arch
>>     pagetable_init
>>       paging_init
>>         sparse_init
>>           sparse_init_nid
>>             memblock_alloc_try_nid_raw
>>
>> Although later it tries to set page flags for pages in reserved bootmem
>> regions,
>>
>> mm_init
>>   mem_init
>>     memblock_free_all
>>       free_low_memory_core_early
>>         reserve_bootmem_region
>>
>> there could still have some freed pages from the page allocator but yet
>> to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
>> dealt with a bit in page_ext_init().
>>
>> * Take into account DEFERRED_STRUCT_PAGE_INIT.
>> */
>> if (early_pfn_to_nid(pfn) != nid)
>> 	continue;
>>
>> However, it did not handle it well in init_pages_in_zone() which end up
>> calling page_to_nid().
>>
>> [   11.917212] page:ffffea0004200000 is uninitialized and poisoned
>> [   11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
>> ffffffffffffffff
>> [   11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
>> ffffffffffffffff
>> [   11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>> [   11.926498] page_owner info is not active (free page?)
>> [   12.329560] kernel BUG at include/linux/mm.h:990!
>> [   12.337632] RIP: 0010:init_page_owner+0x486/0x520
>>
>> Since there is no other routines depend on page_ext_init() in
>> start_kernel(), just move it after page_alloc_init_late() to ensure that
>> there is no deferred pages need to de dealt with. If deselected
>> DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init()
>> earlier, so page owner could catch more early page allocation call
>> sites. This gives us a good compromise between catching good and bad
>> call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT.
>>
>> [1] https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/
>>
>> Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init)
>> Suggested-by: Michal Hocko <mhocko@kernel.org>
>> Signed-off-by: Qian Cai <cai@lca.pw>
>> ---
>>
>> v3: still call page_ext_init() earlier if DEFERRED_STRUCT_PAGE_INIT=n.
>>
>> v2: postpone page_ext_init() to after page_alloc_init_late().
>>
>>  init/main.c   | 5 +++++
>>  mm/page_ext.c | 3 +--
>>  2 files changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/init/main.c b/init/main.c
>> index 2b7b7fe173c9..5d9904370f76 100644
>> --- a/init/main.c
>> +++ b/init/main.c
>> @@ -696,7 +696,9 @@ asmlinkage __visible void __init start_kernel(void)
>>  		initrd_start = 0;
>>  	}
>>  #endif
>> +#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>>  	page_ext_init();
>> +#endif
>>  	kmemleak_init();
>>  	setup_per_cpu_pageset();
>>  	numa_policy_init();
>> @@ -1147,6 +1149,9 @@ static noinline void __init kernel_init_freeable(void)
>>  	sched_init_smp();
>>  
>>  	page_alloc_init_late();
>> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>> +	page_ext_init();
>> +#endif
>>  
>>  	do_basic_setup();
> 
> Is this really necessary? Why cannot we simply postpone page_ext_init
> unconditioanally?

As mentioned above, "If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better
to call page_ext_init() earlier, so page owner could catch more early page
allocation call sites."

> 
>>  
>> diff --git a/mm/page_ext.c b/mm/page_ext.c
>> index ae44f7adbe07..d76fd51e312a 100644
>> --- a/mm/page_ext.c
>> +++ b/mm/page_ext.c
>> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
>>  			 * -------------pfn-------------->
>>  			 * N0 | N1 | N2 | N0 | N1 | N2|....
>>  			 *
>> -			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
>>  			 */
>> -			if (early_pfn_to_nid(pfn) != nid)
>> +			if (pfn_to_nid(pfn) != nid)
>>  				continue;
>>  			if (init_section_page_ext(pfn, nid))
>>  				goto oom;
> 
> Also this doesn't seem to be related, right?

No, it is related. Because of this patch, page_ext_init() is called after all
the memory has already been initialized,
so no longer necessary to call early_pfn_to_nid().

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-03 16:38                 ` Qian Cai
@ 2019-01-03 16:59                   ` Michal Hocko
  2019-01-03 17:38                     ` Qian Cai
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2019-01-03 16:59 UTC (permalink / raw)
  To: Qian Cai
  Cc: akpm, Pavel.Tatashin, mingo, hpa, mgorman, iamjoonsoo.kim,
	yang.shi, tglx, linux-mm, linux-kernel

On Thu 03-01-19 11:38:31, Qian Cai wrote:
> On 1/3/19 6:51 AM, Michal Hocko wrote:
> > On Thu 20-12-18 15:31:56, Qian Cai wrote:
> >> When booting a system with "page_owner=on",
> >>
> >> start_kernel
> >>   page_ext_init
> >>     invoke_init_callbacks
> >>       init_section_page_ext
> >>         init_page_owner
> >>           init_early_allocated_pages
> >>             init_zones_in_node
> >>               init_pages_in_zone
> >>                 lookup_page_ext
> >>                   page_to_nid
> >>
> >> The issue here is that page_to_nid() will not work since some page
> >> flags have no node information until later in page_alloc_init_late() due
> >> to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
> >> access with an invalid nid.
> >>
> >> [    8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
> >> [    8.672603] index 7 is out of range for type 'zone [5]'
> >>
> >> Also, kernel will panic since flags were poisoned earlier with,
> >>
> >> CONFIG_DEBUG_VM_PGFLAGS=y
> >> CONFIG_NODE_NOT_IN_PAGE_FLAGS=n
> >>
> >> start_kernel
> >>   setup_arch
> >>     pagetable_init
> >>       paging_init
> >>         sparse_init
> >>           sparse_init_nid
> >>             memblock_alloc_try_nid_raw
> >>
> >> Although later it tries to set page flags for pages in reserved bootmem
> >> regions,
> >>
> >> mm_init
> >>   mem_init
> >>     memblock_free_all
> >>       free_low_memory_core_early
> >>         reserve_bootmem_region
> >>
> >> there could still have some freed pages from the page allocator but yet
> >> to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
> >> dealt with a bit in page_ext_init().
> >>
> >> * Take into account DEFERRED_STRUCT_PAGE_INIT.
> >> */
> >> if (early_pfn_to_nid(pfn) != nid)
> >> 	continue;
> >>
> >> However, it did not handle it well in init_pages_in_zone() which end up
> >> calling page_to_nid().
> >>
> >> [   11.917212] page:ffffea0004200000 is uninitialized and poisoned
> >> [   11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
> >> ffffffffffffffff
> >> [   11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
> >> ffffffffffffffff
> >> [   11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >> [   11.926498] page_owner info is not active (free page?)
> >> [   12.329560] kernel BUG at include/linux/mm.h:990!
> >> [   12.337632] RIP: 0010:init_page_owner+0x486/0x520
> >>
> >> Since there is no other routines depend on page_ext_init() in
> >> start_kernel(), just move it after page_alloc_init_late() to ensure that
> >> there is no deferred pages need to de dealt with. If deselected
> >> DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init()
> >> earlier, so page owner could catch more early page allocation call
> >> sites. This gives us a good compromise between catching good and bad
> >> call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT.
> >>
> >> [1] https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/
> >>
> >> Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init)
> >> Suggested-by: Michal Hocko <mhocko@kernel.org>
> >> Signed-off-by: Qian Cai <cai@lca.pw>
> >> ---
> >>
> >> v3: still call page_ext_init() earlier if DEFERRED_STRUCT_PAGE_INIT=n.
> >>
> >> v2: postpone page_ext_init() to after page_alloc_init_late().
> >>
> >>  init/main.c   | 5 +++++
> >>  mm/page_ext.c | 3 +--
> >>  2 files changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/init/main.c b/init/main.c
> >> index 2b7b7fe173c9..5d9904370f76 100644
> >> --- a/init/main.c
> >> +++ b/init/main.c
> >> @@ -696,7 +696,9 @@ asmlinkage __visible void __init start_kernel(void)
> >>  		initrd_start = 0;
> >>  	}
> >>  #endif
> >> +#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> >>  	page_ext_init();
> >> +#endif
> >>  	kmemleak_init();
> >>  	setup_per_cpu_pageset();
> >>  	numa_policy_init();
> >> @@ -1147,6 +1149,9 @@ static noinline void __init kernel_init_freeable(void)
> >>  	sched_init_smp();
> >>  
> >>  	page_alloc_init_late();
> >> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> >> +	page_ext_init();
> >> +#endif
> >>  
> >>  	do_basic_setup();
> > 
> > Is this really necessary? Why cannot we simply postpone page_ext_init
> > unconditioanally?
> 
> As mentioned above, "If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better
> to call page_ext_init() earlier, so page owner could catch more early page
> allocation call sites."

Do you have any numbers to show how many allocation are we losing that
way? In other words, do we care enough to create an ugly code?

> >> diff --git a/mm/page_ext.c b/mm/page_ext.c
> >> index ae44f7adbe07..d76fd51e312a 100644
> >> --- a/mm/page_ext.c
> >> +++ b/mm/page_ext.c
> >> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
> >>  			 * -------------pfn-------------->
> >>  			 * N0 | N1 | N2 | N0 | N1 | N2|....
> >>  			 *
> >> -			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
> >>  			 */
> >> -			if (early_pfn_to_nid(pfn) != nid)
> >> +			if (pfn_to_nid(pfn) != nid)
> >>  				continue;
> >>  			if (init_section_page_ext(pfn, nid))
> >>  				goto oom;
> > 
> > Also this doesn't seem to be related, right?
> 
> No, it is related. Because of this patch, page_ext_init() is called after all
> the memory has already been initialized,
> so no longer necessary to call early_pfn_to_nid().

Yes, but it looks like a follow up cleanup/optimization to me.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-03 16:59                   ` Michal Hocko
@ 2019-01-03 17:38                     ` Qian Cai
  2019-01-03 19:07                       ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Qian Cai @ 2019-01-03 17:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, Pavel.Tatashin, mingo, hpa, mgorman, iamjoonsoo.kim,
	yang.shi, tglx, linux-mm, linux-kernel

On 1/3/19 11:59 AM, Michal Hocko wrote:
>> As mentioned above, "If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better
>> to call page_ext_init() earlier, so page owner could catch more early page
>> allocation call sites."
> 
> Do you have any numbers to show how many allocation are we losing that
> way? In other words, do we care enough to create an ugly code?

Well, I don't have any numbers, but I read that Joonsoo did not really like to
defer page_ext_init() unconditionally.

"because deferring page_ext_init() would make page owner which uses page_ext
miss some early page allocation callsites. Although it already miss some early
page allocation callsites, we don't need to miss more."

https://lore.kernel.org/lkml/20160524053714.GB32186@js1304-P5Q-DELUXE/

>>>> diff --git a/mm/page_ext.c b/mm/page_ext.c
>>>> index ae44f7adbe07..d76fd51e312a 100644
>>>> --- a/mm/page_ext.c
>>>> +++ b/mm/page_ext.c
>>>> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
>>>>  			 * -------------pfn-------------->
>>>>  			 * N0 | N1 | N2 | N0 | N1 | N2|....
>>>>  			 *
>>>> -			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
>>>>  			 */
>>>> -			if (early_pfn_to_nid(pfn) != nid)
>>>> +			if (pfn_to_nid(pfn) != nid)
>>>>  				continue;
>>>>  			if (init_section_page_ext(pfn, nid))
>>>>  				goto oom;
>>>
>>> Also this doesn't seem to be related, right?
>>
>> No, it is related. Because of this patch, page_ext_init() is called after all
>> the memory has already been initialized,
>> so no longer necessary to call early_pfn_to_nid().
> 
> Yes, but it looks like a follow up cleanup/optimization to me.

That early_pfn_to_nid() was introduced in fe53ca54270 (mm: use early_pfn_to_nid
in page_ext_init) which also messed up the order of page_ext_init() in
start_kernel(), so this patch basically revert that commit.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-03 17:38                     ` Qian Cai
@ 2019-01-03 19:07                       ` Michal Hocko
  2019-01-03 19:53                         ` Qian Cai
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2019-01-03 19:07 UTC (permalink / raw)
  To: Qian Cai
  Cc: akpm, Pavel.Tatashin, mingo, hpa, mgorman, iamjoonsoo.kim,
	yang.shi, tglx, linux-mm, linux-kernel

On Thu 03-01-19 12:38:59, Qian Cai wrote:
> On 1/3/19 11:59 AM, Michal Hocko wrote:
> >> As mentioned above, "If deselected DEFERRED_STRUCT_PAGE_INIT, it is still better
> >> to call page_ext_init() earlier, so page owner could catch more early page
> >> allocation call sites."
> > 
> > Do you have any numbers to show how many allocation are we losing that
> > way? In other words, do we care enough to create an ugly code?
> 
> Well, I don't have any numbers, but I read that Joonsoo did not really like to
> defer page_ext_init() unconditionally.
> 
> "because deferring page_ext_init() would make page owner which uses page_ext
> miss some early page allocation callsites. Although it already miss some early
> page allocation callsites, we don't need to miss more."

This is quite unspecific.

> https://lore.kernel.org/lkml/20160524053714.GB32186@js1304-P5Q-DELUXE/
> 
> >>>> diff --git a/mm/page_ext.c b/mm/page_ext.c
> >>>> index ae44f7adbe07..d76fd51e312a 100644
> >>>> --- a/mm/page_ext.c
> >>>> +++ b/mm/page_ext.c
> >>>> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
> >>>>  			 * -------------pfn-------------->
> >>>>  			 * N0 | N1 | N2 | N0 | N1 | N2|....
> >>>>  			 *
> >>>> -			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
> >>>>  			 */
> >>>> -			if (early_pfn_to_nid(pfn) != nid)
> >>>> +			if (pfn_to_nid(pfn) != nid)
> >>>>  				continue;
> >>>>  			if (init_section_page_ext(pfn, nid))
> >>>>  				goto oom;
> >>>
> >>> Also this doesn't seem to be related, right?
> >>
> >> No, it is related. Because of this patch, page_ext_init() is called after all
> >> the memory has already been initialized,
> >> so no longer necessary to call early_pfn_to_nid().
> > 
> > Yes, but it looks like a follow up cleanup/optimization to me.
> 
> That early_pfn_to_nid() was introduced in fe53ca54270 (mm: use early_pfn_to_nid
> in page_ext_init) which also messed up the order of page_ext_init() in
> start_kernel(), so this patch basically revert that commit.

So can we make the revert with an explanation that the patch was wrong?
If we want to make hacks to catch more objects to be tracked then it
would be great to have some numbers in hands.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-03 19:07                       ` Michal Hocko
@ 2019-01-03 19:53                         ` Qian Cai
  2019-01-03 20:22                           ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Qian Cai @ 2019-01-03 19:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, Pavel.Tatashin, mingo, mgorman, iamjoonsoo.kim, tglx,
	linux-mm, linux-kernel

On 1/3/19 2:07 PM, Michal Hocko wrote> So can we make the revert with an
explanation that the patch was wrong?
> If we want to make hacks to catch more objects to be tracked then it
> would be great to have some numbers in hands.

Well, those numbers are subject to change depends on future start_kernel()
order. Right now, there are many functions could be caught earlier by page owner.

	kmemleak_init();
	debug_objects_mem_init();
	setup_per_cpu_pageset();
	numa_policy_init();
	acpi_early_init();
	if (late_time_init)
		late_time_init();
	sched_clock_init();
	calibrate_delay();
	pid_idr_init();
	anon_vma_init();
#ifdef CONFIG_X86
	if (efi_enabled(EFI_RUNTIME_SERVICES))
		efi_enter_virtual_mode();
#endif
	thread_stack_cache_init();
	cred_init();
	fork_init();
	proc_caches_init();
	uts_ns_init();
	buffer_init();
	key_init();
	security_init();
	dbg_late_init();
	vfs_caches_init();
	pagecache_init();
	signals_init();
	seq_file_init();
	proc_root_init();
	nsfs_init();
	cpuset_init();
	cgroup_init();
	taskstats_init_early();
	delayacct_init();

	check_bugs();

	acpi_subsystem_init();
	arch_post_acpi_subsys_init();
	sfi_init_late();

	if (efi_enabled(EFI_RUNTIME_SERVICES)) {
		efi_free_boot_services();

	rcu_scheduler_starting();
	/*
	 * Wait until kthreadd is all set-up.
	 */
	wait_for_completion(&kthreadd_done);

	/* Now the scheduler is fully set up and can do blocking allocations */
	gfp_allowed_mask = __GFP_BITS_MASK;

	/*
	 * init can allocate pages on any node
	 */
	set_mems_allowed(node_states[N_MEMORY]);

	cad_pid = task_pid(current);

	smp_prepare_cpus(setup_max_cpus);

	workqueue_init();

	init_mm_internals();

	do_pre_smp_initcalls();
	lockup_detector_init();

	smp_init();
	sched_init_smp();

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-03 19:53                         ` Qian Cai
@ 2019-01-03 20:22                           ` Michal Hocko
  2019-01-03 22:22                             ` Qian Cai
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2019-01-03 20:22 UTC (permalink / raw)
  To: Qian Cai
  Cc: akpm, Pavel.Tatashin, mingo, mgorman, iamjoonsoo.kim, tglx,
	linux-mm, linux-kernel

On Thu 03-01-19 14:53:47, Qian Cai wrote:
> On 1/3/19 2:07 PM, Michal Hocko wrote> So can we make the revert with an
> explanation that the patch was wrong?
> > If we want to make hacks to catch more objects to be tracked then it
> > would be great to have some numbers in hands.
> 
> Well, those numbers are subject to change depends on future start_kernel()
> order. Right now, there are many functions could be caught earlier by page owner.
> 
> 	kmemleak_init();
[...]
> 	sched_init_smp();

The kernel source dump will not tell us much of course. A ball park
number whether we are talking about dozen, hundreds or thousands of
allocations would tell us something at least, doesn't it.

Handwaving that it might help us some is not particurarly useful. We are
already losing some allocations already. Does it matter? Well, that
depends, sometimes we do want to catch an owner of particular page and
it is sad to find nothing. But how many times have you or somebody else
encountered that in practice. That is exactly a useful information to
judge an ugly ifdefery in the code. See my point?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-03 20:22                           ` Michal Hocko
@ 2019-01-03 22:22                             ` Qian Cai
  2019-01-04 13:09                               ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Qian Cai @ 2019-01-03 22:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, Pavel.Tatashin, mingo, mgorman, iamjoonsoo.kim, tglx,
	linux-mm, linux-kernel

On 1/3/19 3:22 PM, Michal Hocko wrote:
> On Thu 03-01-19 14:53:47, Qian Cai wrote:
>> On 1/3/19 2:07 PM, Michal Hocko wrote> So can we make the revert with an
>> explanation that the patch was wrong?
>>> If we want to make hacks to catch more objects to be tracked then it
>>> would be great to have some numbers in hands.
>>
>> Well, those numbers are subject to change depends on future start_kernel()
>> order. Right now, there are many functions could be caught earlier by page owner.
>>
>> 	kmemleak_init();
> [...]
>> 	sched_init_smp();
> 
> The kernel source dump will not tell us much of course. A ball park
> number whether we are talking about dozen, hundreds or thousands of
> allocations would tell us something at least, doesn't it.
> 
> Handwaving that it might help us some is not particurarly useful. We are
> already losing some allocations already. Does it matter? Well, that
> depends, sometimes we do want to catch an owner of particular page and
> it is sad to find nothing. But how many times have you or somebody else
> encountered that in practice. That is exactly a useful information to
> judge an ugly ifdefery in the code. See my point?

Here is the number without DEFERRED_STRUCT_PAGE_INIT.

== page_ext_init() after page_alloc_init_late() ==
Node 0, zone DMA: page owner found early allocated 0 pages
Node 0, zone DMA32: page owner found early allocated 7009 pages
Node 0, zone Normal: page owner found early allocated 85827 pages
Node 4, zone Normal: page owner found early allocated 75063 pages

== page_ext_init() before kmemleak_init() ==
Node 0, zone DMA: page owner found early allocated 0 pages
Node 0, zone DMA32: page owner found early allocated 6654 pages
Node 0, zone Normal: page owner found early allocated 41907 pages
Node 4, zone Normal: page owner found early allocated 41356 pages

So, it told us that it will miss tens of thousands of early page allocation call
sites.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-03 22:22                             ` Qian Cai
@ 2019-01-04 13:09                               ` Michal Hocko
  2019-01-04 15:01                                 ` Qian Cai
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2019-01-04 13:09 UTC (permalink / raw)
  To: Qian Cai
  Cc: akpm, Pavel.Tatashin, mingo, mgorman, iamjoonsoo.kim, tglx,
	linux-mm, linux-kernel

On Thu 03-01-19 17:22:29, Qian Cai wrote:
> On 1/3/19 3:22 PM, Michal Hocko wrote:
> > On Thu 03-01-19 14:53:47, Qian Cai wrote:
> >> On 1/3/19 2:07 PM, Michal Hocko wrote> So can we make the revert with an
> >> explanation that the patch was wrong?
> >>> If we want to make hacks to catch more objects to be tracked then it
> >>> would be great to have some numbers in hands.
> >>
> >> Well, those numbers are subject to change depends on future start_kernel()
> >> order. Right now, there are many functions could be caught earlier by page owner.
> >>
> >> 	kmemleak_init();
> > [...]
> >> 	sched_init_smp();
> > 
> > The kernel source dump will not tell us much of course. A ball park
> > number whether we are talking about dozen, hundreds or thousands of
> > allocations would tell us something at least, doesn't it.
> > 
> > Handwaving that it might help us some is not particurarly useful. We are
> > already losing some allocations already. Does it matter? Well, that
> > depends, sometimes we do want to catch an owner of particular page and
> > it is sad to find nothing. But how many times have you or somebody else
> > encountered that in practice. That is exactly a useful information to
> > judge an ugly ifdefery in the code. See my point?
> 
> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
> 
> == page_ext_init() after page_alloc_init_late() ==
> Node 0, zone DMA: page owner found early allocated 0 pages
> Node 0, zone DMA32: page owner found early allocated 7009 pages
> Node 0, zone Normal: page owner found early allocated 85827 pages
> Node 4, zone Normal: page owner found early allocated 75063 pages
> 
> == page_ext_init() before kmemleak_init() ==
> Node 0, zone DMA: page owner found early allocated 0 pages
> Node 0, zone DMA32: page owner found early allocated 6654 pages
> Node 0, zone Normal: page owner found early allocated 41907 pages
> Node 4, zone Normal: page owner found early allocated 41356 pages
> 
> So, it told us that it will miss tens of thousands of early page allocation call
> sites.

This is an answer for the first part of the question (how much). The
second is _do_we_care_?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-04 13:09                               ` Michal Hocko
@ 2019-01-04 15:01                                 ` Qian Cai
  2019-01-04 15:17                                   ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Qian Cai @ 2019-01-04 15:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, Pavel.Tatashin, mingo, mgorman, iamjoonsoo.kim, tglx,
	linux-mm, linux-kernel

On 1/4/19 8:09 AM, Michal Hocko wrote:
>> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
>>
>> == page_ext_init() after page_alloc_init_late() ==
>> Node 0, zone DMA: page owner found early allocated 0 pages
>> Node 0, zone DMA32: page owner found early allocated 7009 pages
>> Node 0, zone Normal: page owner found early allocated 85827 pages
>> Node 4, zone Normal: page owner found early allocated 75063 pages
>>
>> == page_ext_init() before kmemleak_init() ==
>> Node 0, zone DMA: page owner found early allocated 0 pages
>> Node 0, zone DMA32: page owner found early allocated 6654 pages
>> Node 0, zone Normal: page owner found early allocated 41907 pages
>> Node 4, zone Normal: page owner found early allocated 41356 pages
>>
>> So, it told us that it will miss tens of thousands of early page allocation call
>> sites.
> 
> This is an answer for the first part of the question (how much). The
> second is _do_we_care_?

Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the
existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
start to miss tens of thousands early page allocation call sites.

The other option I can think of to not hurt your eyes is to rewrite the whole
page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early
functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without.
However, I have a hard-time to convince myself it is a sensible thing to do.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-04 15:01                                 ` Qian Cai
@ 2019-01-04 15:17                                   ` Michal Hocko
  2019-01-04 15:25                                     ` Qian Cai
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2019-01-04 15:17 UTC (permalink / raw)
  To: Qian Cai
  Cc: akpm, Pavel.Tatashin, mingo, mgorman, iamjoonsoo.kim, tglx,
	linux-mm, linux-kernel

On Fri 04-01-19 10:01:40, Qian Cai wrote:
> On 1/4/19 8:09 AM, Michal Hocko wrote:
> >> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
> >>
> >> == page_ext_init() after page_alloc_init_late() ==
> >> Node 0, zone DMA: page owner found early allocated 0 pages
> >> Node 0, zone DMA32: page owner found early allocated 7009 pages
> >> Node 0, zone Normal: page owner found early allocated 85827 pages
> >> Node 4, zone Normal: page owner found early allocated 75063 pages
> >>
> >> == page_ext_init() before kmemleak_init() ==
> >> Node 0, zone DMA: page owner found early allocated 0 pages
> >> Node 0, zone DMA32: page owner found early allocated 6654 pages
> >> Node 0, zone Normal: page owner found early allocated 41907 pages
> >> Node 4, zone Normal: page owner found early allocated 41356 pages
> >>
> >> So, it told us that it will miss tens of thousands of early page allocation call
> >> sites.
> > 
> > This is an answer for the first part of the question (how much). The
> > second is _do_we_care_?
> 
> Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the
> existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
> start to miss tens of thousands early page allocation call sites.

I am pretty sure we will hear about that when that happens. And act
accordingly.

> The other option I can think of to not hurt your eyes is to rewrite the whole
> page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early
> functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without.
> However, I have a hard-time to convince myself it is a sensible thing to do.

Or simply make the page_owner initialization only touch the already
initialized memory. Have you explored that option as well?

Look, I am trying to push for a clean solution. Hacks I have seen so
far are not convincing. You have identified a regression and as such I
would consider the most straightforward to revert the buggy commit. If
you want to improve the situation then I would suggest to think about
something that is more robust than ifdefed hacks. It might be more work
but it also might be a better thing long term.

If you think I am asking too much then you are free to ignore my
opinion. I am not a maintainer of the page_owner code.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-04 15:17                                   ` Michal Hocko
@ 2019-01-04 15:25                                     ` Qian Cai
  2019-01-04 15:32                                       ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Qian Cai @ 2019-01-04 15:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, Pavel.Tatashin, mingo, mgorman, iamjoonsoo.kim, tglx,
	linux-mm, linux-kernel

On 1/4/19 10:17 AM, Michal Hocko wrote:
> On Fri 04-01-19 10:01:40, Qian Cai wrote:
>> On 1/4/19 8:09 AM, Michal Hocko wrote:
>>>> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
>>>>
>>>> == page_ext_init() after page_alloc_init_late() ==
>>>> Node 0, zone DMA: page owner found early allocated 0 pages
>>>> Node 0, zone DMA32: page owner found early allocated 7009 pages
>>>> Node 0, zone Normal: page owner found early allocated 85827 pages
>>>> Node 4, zone Normal: page owner found early allocated 75063 pages
>>>>
>>>> == page_ext_init() before kmemleak_init() ==
>>>> Node 0, zone DMA: page owner found early allocated 0 pages
>>>> Node 0, zone DMA32: page owner found early allocated 6654 pages
>>>> Node 0, zone Normal: page owner found early allocated 41907 pages
>>>> Node 4, zone Normal: page owner found early allocated 41356 pages
>>>>
>>>> So, it told us that it will miss tens of thousands of early page allocation call
>>>> sites.
>>>
>>> This is an answer for the first part of the question (how much). The
>>> second is _do_we_care_?
>>
>> Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the
>> existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
>> start to miss tens of thousands early page allocation call sites.
> 
> I am pretty sure we will hear about that when that happens. And act
> accordingly.
> 
>> The other option I can think of to not hurt your eyes is to rewrite the whole
>> page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early
>> functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without.
>> However, I have a hard-time to convince myself it is a sensible thing to do.
> 
> Or simply make the page_owner initialization only touch the already
> initialized memory. Have you explored that option as well?

Yes, a proof-of-concept version is v1 where ends up with more ifdefs due to
dealing with all the low-level details,

https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-04 15:25                                     ` Qian Cai
@ 2019-01-04 15:32                                       ` Michal Hocko
  2019-01-04 20:18                                         ` Qian Cai
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2019-01-04 15:32 UTC (permalink / raw)
  To: Qian Cai
  Cc: akpm, Pavel.Tatashin, mingo, mgorman, iamjoonsoo.kim, tglx,
	linux-mm, linux-kernel

On Fri 04-01-19 10:25:12, Qian Cai wrote:
> On 1/4/19 10:17 AM, Michal Hocko wrote:
> > On Fri 04-01-19 10:01:40, Qian Cai wrote:
> >> On 1/4/19 8:09 AM, Michal Hocko wrote:
> >>>> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
> >>>>
> >>>> == page_ext_init() after page_alloc_init_late() ==
> >>>> Node 0, zone DMA: page owner found early allocated 0 pages
> >>>> Node 0, zone DMA32: page owner found early allocated 7009 pages
> >>>> Node 0, zone Normal: page owner found early allocated 85827 pages
> >>>> Node 4, zone Normal: page owner found early allocated 75063 pages
> >>>>
> >>>> == page_ext_init() before kmemleak_init() ==
> >>>> Node 0, zone DMA: page owner found early allocated 0 pages
> >>>> Node 0, zone DMA32: page owner found early allocated 6654 pages
> >>>> Node 0, zone Normal: page owner found early allocated 41907 pages
> >>>> Node 4, zone Normal: page owner found early allocated 41356 pages
> >>>>
> >>>> So, it told us that it will miss tens of thousands of early page allocation call
> >>>> sites.
> >>>
> >>> This is an answer for the first part of the question (how much). The
> >>> second is _do_we_care_?
> >>
> >> Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the
> >> existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
> >> start to miss tens of thousands early page allocation call sites.
> > 
> > I am pretty sure we will hear about that when that happens. And act
> > accordingly.
> > 
> >> The other option I can think of to not hurt your eyes is to rewrite the whole
> >> page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early
> >> functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without.
> >> However, I have a hard-time to convince myself it is a sensible thing to do.
> > 
> > Or simply make the page_owner initialization only touch the already
> > initialized memory. Have you explored that option as well?
> 
> Yes, a proof-of-concept version is v1 where ends up with more ifdefs due to
> dealing with all the low-level details,
> 
> https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/

That is obviously not what I've had in mind. We have __init_single_page
which initializes a single struct page. Is there any way to hook
page_ext initialization there?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-04 15:32                                       ` Michal Hocko
@ 2019-01-04 20:18                                         ` Qian Cai
  2019-01-07 18:43                                           ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Qian Cai @ 2019-01-04 20:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, Pavel.Tatashin, mingo, mgorman, iamjoonsoo.kim, tglx,
	linux-mm, linux-kernel

On 1/4/19 10:32 AM, Michal Hocko wrote:
> On Fri 04-01-19 10:25:12, Qian Cai wrote:
>> On 1/4/19 10:17 AM, Michal Hocko wrote:
>>> On Fri 04-01-19 10:01:40, Qian Cai wrote:
>>>> On 1/4/19 8:09 AM, Michal Hocko wrote:
>>>>>> Here is the number without DEFERRED_STRUCT_PAGE_INIT.
>>>>>>
>>>>>> == page_ext_init() after page_alloc_init_late() ==
>>>>>> Node 0, zone DMA: page owner found early allocated 0 pages
>>>>>> Node 0, zone DMA32: page owner found early allocated 7009 pages
>>>>>> Node 0, zone Normal: page owner found early allocated 85827 pages
>>>>>> Node 4, zone Normal: page owner found early allocated 75063 pages
>>>>>>
>>>>>> == page_ext_init() before kmemleak_init() ==
>>>>>> Node 0, zone DMA: page owner found early allocated 0 pages
>>>>>> Node 0, zone DMA32: page owner found early allocated 6654 pages
>>>>>> Node 0, zone Normal: page owner found early allocated 41907 pages
>>>>>> Node 4, zone Normal: page owner found early allocated 41356 pages
>>>>>>
>>>>>> So, it told us that it will miss tens of thousands of early page allocation call
>>>>>> sites.
>>>>>
>>>>> This is an answer for the first part of the question (how much). The
>>>>> second is _do_we_care_?
>>>>
>>>> Well, the purpose of this simple "ugly" ifdef is to avoid a regression for the
>>>> existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
>>>> start to miss tens of thousands early page allocation call sites.
>>>
>>> I am pretty sure we will hear about that when that happens. And act
>>> accordingly.
>>>
>>>> The other option I can think of to not hurt your eyes is to rewrite the whole
>>>> page_ext_init(), init_page_owner(), init_debug_guardpage() to use all early
>>>> functions, so it can work in both with DEFERRED_STRUCT_PAGE_INIT=y and without.
>>>> However, I have a hard-time to convince myself it is a sensible thing to do.
>>>
>>> Or simply make the page_owner initialization only touch the already
>>> initialized memory. Have you explored that option as well?
>>
>> Yes, a proof-of-concept version is v1 where ends up with more ifdefs due to
>> dealing with all the low-level details,
>>
>> https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/
> 
> That is obviously not what I've had in mind. We have __init_single_page
> which initializes a single struct page. Is there any way to hook
> page_ext initialization there?

Well, the current design is,

(1) allocate page_ext physical-contiguous pages for all
    nodes.
(2) page owner gathers all early allocation pages.
(3) page owner is fully operational.

It may be possible to move (1) as early as possible just after vmalloc_init() in
start_kernel() because it depends on vmalloc(), but it still needs to call (2)
as there have had many early allocation pages already.

Node 0, zone DMA: page owner found early allocated 0 pages
Node 0, zone DMA32: page owner found early allocated 6654 pages
Node 0, zone Normal: page owner found early allocated 40972 pages
Node 4, zone Normal: page owner found early allocated 40968 pages

But, (2) still depends on DEFERRED_STRUCT_PAGE_INIT. To get ride of this
dependency, it may be possible to add some checking in __init_single_page():

- if not done (1), save those pages to an array and defer
  page_owner early_handle initialization until done (1).

Though, I can't see any really benefit of this approach apart from "beautify"
the code.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-04 20:18                                         ` Qian Cai
@ 2019-01-07 18:43                                           ` Michal Hocko
  2019-01-08  1:53                                             ` Qian Cai
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2019-01-07 18:43 UTC (permalink / raw)
  To: Qian Cai
  Cc: akpm, Pavel.Tatashin, mingo, mgorman, iamjoonsoo.kim, tglx,
	linux-mm, linux-kernel

On Fri 04-01-19 15:18:08, Qian Cai wrote:
[...]
> Though, I can't see any really benefit of this approach apart from "beautify"

This is not about beautifying! This is about making the code long term
maintainable. As you can see it is just too easy to break it with the
current scheme. And that is bad especially when the code is broken
because of an optimization.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-07 18:43                                           ` Michal Hocko
@ 2019-01-08  1:53                                             ` Qian Cai
  2019-01-08  8:20                                               ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Qian Cai @ 2019-01-08  1:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, Pavel.Tatashin, mingo, mgorman, iamjoonsoo.kim, tglx,
	linux-mm, linux-kernel



On 1/7/19 1:43 PM, Michal Hocko wrote:
> On Fri 04-01-19 15:18:08, Qian Cai wrote:
> [...]
>> Though, I can't see any really benefit of this approach apart from "beautify"
> 
> This is not about beautifying! This is about making the code long term
> maintainable. As you can see it is just too easy to break it with the
> current scheme. And that is bad especially when the code is broken
> because of an optimization.
> 

Understood, but the code is now fixed. If there is something fundamentally
broken in the future, it may be a good time then to create a looks like
hundred-line cleanup patch for long-term maintenance at the same time to fix
real bugs.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-08  1:53                                             ` Qian Cai
@ 2019-01-08  8:20                                               ` Michal Hocko
  2019-01-08 13:19                                                 ` Qian Cai
  0 siblings, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2019-01-08  8:20 UTC (permalink / raw)
  To: Qian Cai
  Cc: akpm, Pavel.Tatashin, mingo, mgorman, iamjoonsoo.kim, tglx,
	linux-mm, linux-kernel

On Mon 07-01-19 20:53:08, Qian Cai wrote:
> 
> 
> On 1/7/19 1:43 PM, Michal Hocko wrote:
> > On Fri 04-01-19 15:18:08, Qian Cai wrote:
> > [...]
> >> Though, I can't see any really benefit of this approach apart from "beautify"
> > 
> > This is not about beautifying! This is about making the code long term
> > maintainable. As you can see it is just too easy to break it with the
> > current scheme. And that is bad especially when the code is broken
> > because of an optimization.
> > 
> 
> Understood, but the code is now fixed. If there is something fundamentally
> broken in the future, it may be a good time then to create a looks like
> hundred-line cleanup patch for long-term maintenance at the same time to fix
> real bugs.

Yeah, so revert = fix and redisign the thing to make the code more
robust longterm + allow to catch more allocation. I really fail to see
why this has to be repeated several times in this thread. Really.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-08  8:20                                               ` Michal Hocko
@ 2019-01-08 13:19                                                 ` Qian Cai
  2019-01-08 22:02                                                   ` Andrew Morton
  0 siblings, 1 reply; 32+ messages in thread
From: Qian Cai @ 2019-01-08 13:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, Pavel.Tatashin, mingo, mgorman, iamjoonsoo.kim, tglx,
	linux-mm, linux-kernel

On Tue, 2019-01-08 at 09:20 +0100, Michal Hocko wrote:
> On Mon 07-01-19 20:53:08, Qian Cai wrote:
> > 
> > 
> > On 1/7/19 1:43 PM, Michal Hocko wrote:
> > > On Fri 04-01-19 15:18:08, Qian Cai wrote:
> > > [...]
> > > > Though, I can't see any really benefit of this approach apart from
> > > > "beautify"
> > > 
> > > This is not about beautifying! This is about making the code long term
> > > maintainable. As you can see it is just too easy to break it with the
> > > current scheme. And that is bad especially when the code is broken
> > > because of an optimization.
> > > 
> > 
> > Understood, but the code is now fixed. If there is something fundamentally
> > broken in the future, it may be a good time then to create a looks like
> > hundred-line cleanup patch for long-term maintenance at the same time to fix
> > real bugs.
> 
> Yeah, so revert = fix and redisign the thing to make the code more
> robust longterm + allow to catch more allocation. I really fail to see
> why this has to be repeated several times in this thread. Really.
> 

Again, this will introduce a immediately regression (arguably small) that
existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected that would
start to miss tens of thousands early page allocation call sites.

I think the disagreement comes from that you want to deal with this passively
rather than proactively that you said "I am pretty sure we will hear about that
when that happens. And act accordingly", but I think it is better to fix it now
rather than later with a 4-line ifdef which you don't like.

I suppose someone else needs to make a judgment call for this as we are in a
"you can't convince me and I can't convince you" situation right now.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-08 13:19                                                 ` Qian Cai
@ 2019-01-08 22:02                                                   ` Andrew Morton
  2019-01-08 22:13                                                     ` Qian Cai
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2019-01-08 22:02 UTC (permalink / raw)
  To: Qian Cai
  Cc: Michal Hocko, Pavel.Tatashin, mingo, mgorman, iamjoonsoo.kim,
	tglx, linux-mm, linux-kernel


It's unclear (to me) where we stand with this patch.  Shold we proceed
with v3 for now, or is something else planned?


From: Qian Cai <cai@lca.pw>
Subject: mm/page_owner: fix for deferred struct page init

When booting a system with "page_owner=on",

start_kernel
  page_ext_init
    invoke_init_callbacks
      init_section_page_ext
        init_page_owner
          init_early_allocated_pages
            init_zones_in_node
              init_pages_in_zone
                lookup_page_ext
                  page_to_nid

The issue here is that page_to_nid() will not work since some page
flags have no node information until later in page_alloc_init_late() due
to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
access with an invalid nid.

[    8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
[    8.672603] index 7 is out of range for type 'zone [5]'

Also, kernel will panic since flags were poisoned earlier with,

CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_NODE_NOT_IN_PAGE_FLAGS=n

start_kernel
  setup_arch
    pagetable_init
      paging_init
        sparse_init
          sparse_init_nid
            memblock_alloc_try_nid_raw

Although later it tries to set page flags for pages in reserved bootmem
regions,

mm_init
  mem_init
    memblock_free_all
      free_low_memory_core_early
        reserve_bootmem_region

there could still have some freed pages from the page allocator but yet
to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
dealt with a bit in page_ext_init().

* Take into account DEFERRED_STRUCT_PAGE_INIT.
*/
if (early_pfn_to_nid(pfn) != nid)
	continue;

However, it did not handle it well in init_pages_in_zone() which end up
calling page_to_nid().

[   11.917212] page:ffffea0004200000 is uninitialized and poisoned
[   11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[   11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
ffffffffffffffff
[   11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[   11.926498] page_owner info is not active (free page?)
[   12.329560] kernel BUG at include/linux/mm.h:990!
[   12.337632] RIP: 0010:init_page_owner+0x486/0x520

Since there is no other routines depend on page_ext_init() in
start_kernel(), just move it after page_alloc_init_late() to ensure that
there is no deferred pages need to de dealt with. If deselected
DEFERRED_STRUCT_PAGE_INIT, it is still better to call page_ext_init()
earlier, so page owner could catch more early page allocation call
sites. This gives us a good compromise between catching good and bad
call sites (See the v1 patch [1]) in case of DEFERRED_STRUCT_PAGE_INIT.

[1] https://lore.kernel.org/lkml/20181220060303.38686-1-cai@lca.pw/

Link: http://lkml.kernel.org/r/20181220203156.43441-1-cai@lca.pw
Fixes: fe53ca54270 (mm: use early_pfn_to_nid in page_ext_init)
Signed-off-by: Qian Cai <cai@lca.pw>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 init/main.c   |    5 +++++
 mm/page_ext.c |    3 +--
 2 files changed, 6 insertions(+), 2 deletions(-)

--- a/init/main.c~mm-page_owner-fix-for-deferred-struct-page-init
+++ a/init/main.c
@@ -695,7 +695,9 @@ asmlinkage __visible void __init start_k
 		initrd_start = 0;
 	}
 #endif
+#ifndef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	page_ext_init();
+#endif
 	kmemleak_init();
 	setup_per_cpu_pageset();
 	numa_policy_init();
@@ -1131,6 +1133,9 @@ static noinline void __init kernel_init_
 	sched_init_smp();
 
 	page_alloc_init_late();
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+	page_ext_init();
+#endif
 
 	do_basic_setup();
 
--- a/mm/page_ext.c~mm-page_owner-fix-for-deferred-struct-page-init
+++ a/mm/page_ext.c
@@ -399,9 +399,8 @@ void __init page_ext_init(void)
 			 * -------------pfn-------------->
 			 * N0 | N1 | N2 | N0 | N1 | N2|....
 			 *
-			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
 			 */
-			if (early_pfn_to_nid(pfn) != nid)
+			if (pfn_to_nid(pfn) != nid)
 				continue;
 			if (init_section_page_ext(pfn, nid))
 				goto oom;
_



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-08 22:02                                                   ` Andrew Morton
@ 2019-01-08 22:13                                                     ` Qian Cai
  2019-01-08 22:40                                                       ` Michal Hocko
  0 siblings, 1 reply; 32+ messages in thread
From: Qian Cai @ 2019-01-08 22:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Pavel.Tatashin, mingo, mgorman, iamjoonsoo.kim,
	tglx, linux-mm, linux-kernel



On 1/8/19 5:02 PM, Andrew Morton wrote:
> 
> It's unclear (to me) where we stand with this patch.  Shold we proceed
> with v3 for now, or is something else planned?

I don't have anything else plan for this right now. Michal particular don't like
that 4-line ifdef which supposes to avoid an immediately regression (arguably
small) that existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected
that would start to miss tens of thousands early page allocation call sites. So,
feel free to choose v2 of this which has no ifdef if you agree with Michal too.
I am fine either way.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] mm/page_owner: fix for deferred struct page init
  2019-01-08 22:13                                                     ` Qian Cai
@ 2019-01-08 22:40                                                       ` Michal Hocko
  0 siblings, 0 replies; 32+ messages in thread
From: Michal Hocko @ 2019-01-08 22:40 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Pavel.Tatashin, mingo, mgorman, iamjoonsoo.kim,
	tglx, linux-mm, linux-kernel

On Tue 08-01-19 17:13:41, Qian Cai wrote:
> 
> 
> On 1/8/19 5:02 PM, Andrew Morton wrote:
> > 
> > It's unclear (to me) where we stand with this patch.  Shold we proceed
> > with v3 for now, or is something else planned?
> 
> I don't have anything else plan for this right now. Michal particular don't like
> that 4-line ifdef which supposes to avoid an immediately regression (arguably
> small) that existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected
> that would start to miss tens of thousands early page allocation call sites. So,
> feel free to choose v2 of this which has no ifdef if you agree with Michal too.
> I am fine either way.

Yes I would prefer to revert the faulty commit (fe53ca54270) and work on
a more robust page_owner initialization to cover earlier allocations on
top of that.

Not that I would insist but this would be a more straightforward
approach and I hope it will result in a better long term maintainable
code in the end.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2] mm/page_owner: fix for deferred struct page init
  2018-12-20 18:50           ` [PATCH v2] " Qian Cai
  2018-12-20 20:31             ` [PATCH v3] " Qian Cai
@ 2019-01-09  7:34             ` Michal Hocko
  2019-01-15 20:28               ` [PATCH] Revert "mm: use early_pfn_to_nid in page_ext_init" Qian Cai
  1 sibling, 1 reply; 32+ messages in thread
From: Michal Hocko @ 2019-01-09  7:34 UTC (permalink / raw)
  To: Qian Cai
  Cc: akpm, Pavel.Tatashin, mingo, hpa, mgorman, tglx, linux-mm, linux-kernel

On Thu 20-12-18 13:50:31, Qian Cai wrote:
> When booting a system with "page_owner=on",
> 
> start_kernel
>   page_ext_init
>     invoke_init_callbacks
>       init_section_page_ext
>         init_page_owner
>           init_early_allocated_pages
>             init_zones_in_node
>               init_pages_in_zone
>                 lookup_page_ext
>                   page_to_nid
> 
> The issue here is that page_to_nid() will not work since some page
> flags have no node information until later in page_alloc_init_late() due
> to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
> access with an invalid nid.
> 
> [    8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
> [    8.672603] index 7 is out of range for type 'zone [5]'
> 
> Also, kernel will panic since flags were poisoned earlier with,
> 
> CONFIG_DEBUG_VM_PGFLAGS=y
> CONFIG_NODE_NOT_IN_PAGE_FLAGS=n
> 
> start_kernel
>   setup_arch
>     pagetable_init
>       paging_init
>         sparse_init
>           sparse_init_nid
>             memblock_alloc_try_nid_raw
> 
> Although later it tries to set page flags for pages in reserved bootmem
> regions,
> 
> mm_init
>   mem_init
>     memblock_free_all
>       free_low_memory_core_early
>         reserve_bootmem_region
> 
> there could still have some freed pages from the page allocator but yet
> to be initialized due to DEFERRED_STRUCT_PAGE_INIT. It have already been
> dealt with a bit in page_ext_init().
> 
> * Take into account DEFERRED_STRUCT_PAGE_INIT.
> */
> if (early_pfn_to_nid(pfn) != nid)
> 	continue;
> 
> However it did not handle it well in init_pages_in_zone() which end up
> calling page_to_nid().
> 
> [   11.917212] page:ffffea0004200000 is uninitialized and poisoned
> [   11.917220] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
> ffffffffffffffff
> [   11.921745] raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff
> ffffffffffffffff
> [   11.924523] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> [   11.926498] page_owner info is not active (free page?)
> [   12.329560] kernel BUG at include/linux/mm.h:990!
> [   12.337632] RIP: 0010:init_page_owner+0x486/0x520
> 
> Since there is no other routines depend on page_ext_init() in
> start_kernel() and no real benefit to call it so early, just move it
> after page_alloc_init_late() to ensure that there is no deferred pages
> need to de dealt with.

The last paragraph should be updated. I would propose something like
this.

"
This means that assumptions behind fe53ca54270a ("mm: use
early_pfn_to_nid in page_ext_init") are incomplete. Therefore revert the
commit for now. A proper way to move the page_owner initialization to
sooner is to hook into memmap initialization.
"
> 

Fixes: fe53ca54270a ("mm: use early_pfn_to_nid in page_ext_init")

> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Qian Cai <cai@lca.pw>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
> 
> v2: postpone init_page_ext() to after page_alloc_init_late().
> 
>  init/main.c   | 2 +-
>  mm/page_ext.c | 3 +--
>  2 files changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/init/main.c b/init/main.c
> index 2b7b7fe173c9..1aeb062b2cb7 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -696,7 +696,6 @@ asmlinkage __visible void __init start_kernel(void)
>  		initrd_start = 0;
>  	}
>  #endif
> -	page_ext_init();
>  	kmemleak_init();
>  	setup_per_cpu_pageset();
>  	numa_policy_init();
> @@ -1147,6 +1146,7 @@ static noinline void __init kernel_init_freeable(void)
>  	sched_init_smp();
>  
>  	page_alloc_init_late();
> +	page_ext_init();
>  
>  	do_basic_setup();
>  
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index ae44f7adbe07..d76fd51e312a 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -399,9 +399,8 @@ void __init page_ext_init(void)
>  			 * -------------pfn-------------->
>  			 * N0 | N1 | N2 | N0 | N1 | N2|....
>  			 *
> -			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
>  			 */
> -			if (early_pfn_to_nid(pfn) != nid)
> +			if (pfn_to_nid(pfn) != nid)
>  				continue;
>  			if (init_section_page_ext(pfn, nid))
>  				goto oom;
> -- 
> 2.17.2 (Apple Git-113)

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] Revert "mm: use early_pfn_to_nid in page_ext_init"
  2019-01-09  7:34             ` [PATCH v2] " Michal Hocko
@ 2019-01-15 20:28               ` Qian Cai
  0 siblings, 0 replies; 32+ messages in thread
From: Qian Cai @ 2019-01-15 20:28 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, Pavel.Tatashin, mingo, hpa, mgorman, iamjoonsoo.kim,
	tglx, linux-mm, linux-kernel, Qian Cai

This reverts commit fe53ca54270a757f0a28ee6bf3a54d952b550ed0.

When booting a system with "page_owner=on",

start_kernel
  page_ext_init
    invoke_init_callbacks
      init_section_page_ext
        init_page_owner
          init_early_allocated_pages
            init_zones_in_node
              init_pages_in_zone
                lookup_page_ext
                  page_to_nid

The issue here is that page_to_nid() will not work since some page
flags have no node information until later in page_alloc_init_late() due
to DEFERRED_STRUCT_PAGE_INIT. Hence, it could trigger an out-of-bounds
access with an invalid nid.

[    8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
[    8.672603] index 7 is out of range for type 'zone [5]'

Also, kernel will panic since flags were poisoned earlier with,

CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_NODE_NOT_IN_PAGE_FLAGS=n

start_kernel
  setup_arch
    pagetable_init
      paging_init
        sparse_init
          sparse_init_nid
            memblock_alloc_try_nid_raw

It did not handle it well in init_pages_in_zone() which ends up calling
page_to_nid().

page:ffffea0004200000 is uninitialized and poisoned
raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
page_owner info is not active (free page?)
kernel BUG at include/linux/mm.h:990!
RIP: 0010:init_page_owner+0x486/0x520

This means that assumptions behind commit fe53ca54270a ("mm: use
early_pfn_to_nid in page_ext_init") are incomplete. Therefore, revert
the commit for now. A proper way to move the page_owner initialization
to sooner is to hook into memmap initialization.

Acked-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Qian Cai <cai@lca.pw>
---
 init/main.c   | 3 ++-
 mm/page_ext.c | 4 +---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/init/main.c b/init/main.c
index e2e80ca3165a..c86a1c8f19f4 100644
--- a/init/main.c
+++ b/init/main.c
@@ -695,7 +695,6 @@ asmlinkage __visible void __init start_kernel(void)
 		initrd_start = 0;
 	}
 #endif
-	page_ext_init();
 	kmemleak_init();
 	setup_per_cpu_pageset();
 	numa_policy_init();
@@ -1131,6 +1130,8 @@ static noinline void __init kernel_init_freeable(void)
 	sched_init_smp();
 
 	page_alloc_init_late();
+	/* Initialize page ext after all struct pages are initialized. */
+	page_ext_init();
 
 	do_basic_setup();
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index ae44f7adbe07..8c78b8d45117 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -398,10 +398,8 @@ void __init page_ext_init(void)
 			 * We know some arch can have a nodes layout such as
 			 * -------------pfn-------------->
 			 * N0 | N1 | N2 | N0 | N1 | N2|....
-			 *
-			 * Take into account DEFERRED_STRUCT_PAGE_INIT.
 			 */
-			if (early_pfn_to_nid(pfn) != nid)
+			if (pfn_to_nid(pfn) != nid)
 				continue;
 			if (init_section_page_ext(pfn, nid))
 				goto oom;
-- 
2.17.2 (Apple Git-113)


^ permalink raw reply related	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2019-01-15 20:28 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1545172285.18411.26.camel@lca.pw>
2018-12-19  1:57 ` [PATCH] mm: skip checking poison pattern for page_to_nid() Qian Cai
2018-12-19 10:20   ` Michal Hocko
2018-12-19 12:46     ` Qian Cai
2018-12-20  6:03       ` [PATCH] mm/page_owner: fix for deferred struct page init Qian Cai
2018-12-20  9:22         ` Michal Hocko
2018-12-20 18:50           ` [PATCH v2] " Qian Cai
2018-12-20 20:31             ` [PATCH v3] " Qian Cai
2018-12-20 21:00               ` William Kucharski
2018-12-20 21:04                 ` Qian Cai
2019-01-03 11:51               ` Michal Hocko
2019-01-03 16:38                 ` Qian Cai
2019-01-03 16:59                   ` Michal Hocko
2019-01-03 17:38                     ` Qian Cai
2019-01-03 19:07                       ` Michal Hocko
2019-01-03 19:53                         ` Qian Cai
2019-01-03 20:22                           ` Michal Hocko
2019-01-03 22:22                             ` Qian Cai
2019-01-04 13:09                               ` Michal Hocko
2019-01-04 15:01                                 ` Qian Cai
2019-01-04 15:17                                   ` Michal Hocko
2019-01-04 15:25                                     ` Qian Cai
2019-01-04 15:32                                       ` Michal Hocko
2019-01-04 20:18                                         ` Qian Cai
2019-01-07 18:43                                           ` Michal Hocko
2019-01-08  1:53                                             ` Qian Cai
2019-01-08  8:20                                               ` Michal Hocko
2019-01-08 13:19                                                 ` Qian Cai
2019-01-08 22:02                                                   ` Andrew Morton
2019-01-08 22:13                                                     ` Qian Cai
2019-01-08 22:40                                                       ` Michal Hocko
2019-01-09  7:34             ` [PATCH v2] " Michal Hocko
2019-01-15 20:28               ` [PATCH] Revert "mm: use early_pfn_to_nid in page_ext_init" Qian Cai

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).