mm-commits.vger.kernel.org archive mirror
* + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree
@ 2020-12-06  0:54 akpm
  2020-12-06  6:48 ` Mike Rapoport
  2020-12-07  8:58 ` David Hildenbrand
  0 siblings, 2 replies; 10+ messages in thread
From: akpm @ 2020-12-06  0:54 UTC (permalink / raw)
  To: aarcange, bhe, cai, david, mgorman, mhocko, mm-commits, rppt,
	stable, vbabka


The patch titled
     Subject: mm: initialize struct pages in reserved regions outside of the zone ranges
has been added to the -mm tree.  Its filename is
     mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch

This patch should soon appear at
    https://ozlabs.org/~akpm/mmots/broken-out/mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch
and later at
    https://ozlabs.org/~akpm/mmotm/broken-out/mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Andrea Arcangeli <aarcange@redhat.com>
Subject: mm: initialize struct pages in reserved regions outside of the zone ranges

Without this change, pfn 0 isn't in any zone spanned range, and it's
also not in any memblock.memory range, so the struct page of pfn 0 was
never initialized and PagePoison remained set when
reserve_bootmem_region called __SetPageReserved, inducing a silent boot
failure with DEBUG_VM (and correctly so, because the crash signaled
that the nodeid of pfn 0 would again be wrong).

There's no enforcement that all memblock.reserved ranges must overlap
memblock.memory ranges, so the memblock.reserved ranges also require
explicit initialization, and the zone ranges need to be extended to
include all memblock.reserved ranges with struct pages too, or those
pages will be left uninitialized with PagePoison, as happened to pfn 0.
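
For illustration only, here is a condensed sketch of the x86 flow that
produces the pfn 0 case; the addresses and the helper below are
assumptions for illustration and are not part of this patch:

/*
 * On a typical x86 machine only E820_TYPE_RAM ranges are added to
 * memblock.memory, and the first usable range starts at 0x1000, so
 * memblock.memory (and therefore the DMA zone) starts at pfn 1, while
 * pfn 0 is only memblock_reserve()d (done on x86 because of L1TF).
 */
static void __init pfn0_layout_sketch(void)
{
	memblock_add(0x1000, 0x9e000);		/* assumed first RAM range  */
	memblock_reserve(0, PAGE_SIZE);		/* pfn 0: reserved, not RAM */

	/*
	 * memmap_init() walks memblock.memory only, so the struct page
	 * of pfn 0 keeps its PagePoison pattern until
	 * reserve_bootmem_region() calls __SetPageReserved() on it and
	 * DEBUG_VM catches the poisoned page.
	 */
}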

Link: https://lkml.kernel.org/r/20201205013238.21663-2-aarcange@redhat.com
Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memblock.h |   17 ++++++++---
 mm/debug.c               |    3 +
 mm/memblock.c            |    4 +-
 mm/page_alloc.c          |   57 +++++++++++++++++++++++++++++--------
 4 files changed, 63 insertions(+), 18 deletions(-)

--- a/include/linux/memblock.h~mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges
+++ a/include/linux/memblock.h
@@ -251,7 +251,8 @@ static inline bool memblock_is_nomap(str
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
 			    unsigned long  *end_pfn);
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
-			  unsigned long *out_end_pfn, int *out_nid);
+			  unsigned long *out_end_pfn, int *out_nid,
+			  struct memblock_type *type);
 
 /**
  * for_each_mem_pfn_range - early memory pfn range iterator
@@ -263,9 +264,17 @@ void __next_mem_pfn_range(int *idx, int
  *
  * Walks over configured memory ranges.
  */
-#define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid)		\
-	for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid); \
-	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
+#define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid)		  \
+	for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \
+					  &memblock.memory);		  \
+	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \
+					  &memblock.memory))
+
+#define for_each_res_pfn_range(i, nid, p_start, p_end, p_nid)		  \
+	for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \
+					  &memblock.reserved);		  \
+	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \
+					  &memblock.reserved))
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
--- a/mm/debug.c~mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges
+++ a/mm/debug.c
@@ -64,7 +64,8 @@ void __dump_page(struct page *page, cons
 	 * dump_page() when detected.
 	 */
 	if (page_poisoned) {
-		pr_warn("page:%px is uninitialized and poisoned", page);
+		pr_warn("page:%px pfn:%ld is uninitialized and poisoned",
+			page, page_to_pfn(page));
 		goto hex_only;
 	}
 
--- a/mm/memblock.c~mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges
+++ a/mm/memblock.c
@@ -1198,9 +1198,9 @@ void __init_memblock __next_mem_range_re
  */
 void __init_memblock __next_mem_pfn_range(int *idx, int nid,
 				unsigned long *out_start_pfn,
-				unsigned long *out_end_pfn, int *out_nid)
+				unsigned long *out_end_pfn, int *out_nid,
+				struct memblock_type *type)
 {
-	struct memblock_type *type = &memblock.memory;
 	struct memblock_region *r;
 	int r_nid;
 
--- a/mm/page_alloc.c~mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges
+++ a/mm/page_alloc.c
@@ -1458,6 +1458,7 @@ static void __meminit init_reserved_page
 {
 	pg_data_t *pgdat;
 	int nid, zid;
+	bool found = false;
 
 	if (!early_page_uninitialised(pfn))
 		return;
@@ -1468,10 +1469,15 @@ static void __meminit init_reserved_page
 	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 		struct zone *zone = &pgdat->node_zones[zid];
 
-		if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
+		if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone)) {
+			found = true;
 			break;
+		}
 	}
-	__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
+	if (likely(found))
+		__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
+	else
+		WARN_ON_ONCE(1);
 }
 #else
 static inline void init_reserved_page(unsigned long pfn)
@@ -6227,7 +6233,7 @@ void __init __weak memmap_init(unsigned
 			       unsigned long zone,
 			       unsigned long range_start_pfn)
 {
-	unsigned long start_pfn, end_pfn, next_pfn = 0;
+	unsigned long start_pfn, end_pfn, prev_pfn = 0;
 	unsigned long range_end_pfn = range_start_pfn + size;
 	u64 pgcnt = 0;
 	int i;
@@ -6235,7 +6241,7 @@ void __init __weak memmap_init(unsigned
 	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
 		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
-		next_pfn = clamp(next_pfn, range_start_pfn, range_end_pfn);
+		prev_pfn = clamp(prev_pfn, range_start_pfn, range_end_pfn);
 
 		if (end_pfn > start_pfn) {
 			size = end_pfn - start_pfn;
@@ -6243,10 +6249,10 @@ void __init __weak memmap_init(unsigned
 					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
 		}
 
-		if (next_pfn < start_pfn)
-			pgcnt += init_unavailable_range(next_pfn, start_pfn,
+		if (prev_pfn < start_pfn)
+			pgcnt += init_unavailable_range(prev_pfn, start_pfn,
 							zone, nid);
-		next_pfn = end_pfn;
+		prev_pfn = end_pfn;
 	}
 
 	/*
@@ -6256,12 +6262,31 @@ void __init __weak memmap_init(unsigned
 	 * considered initialized. Make sure that memmap has a well defined
 	 * state.
 	 */
-	if (next_pfn < range_end_pfn)
-		pgcnt += init_unavailable_range(next_pfn, range_end_pfn,
+	if (prev_pfn < range_end_pfn)
+		pgcnt += init_unavailable_range(prev_pfn, range_end_pfn,
 						zone, nid);
 
+	/*
+	 * memblock.reserved isn't enforced to overlap with
+	 * memblock.memory so initialize the struct pages for
+	 * memblock.reserved too in case it wasn't overlapping.
+	 *
+	 * If any struct page associated with a memblock.reserved
+	 * range isn't overlapping with a zone range, it'll be left
+	 * uninitialized, ideally with PagePoison, and it'll be a more
+	 * easily detectable error.
+	 */
+	for_each_res_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
+		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
+		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
+
+		if (end_pfn > start_pfn)
+			pgcnt += init_unavailable_range(start_pfn, end_pfn,
+							zone, nid);
+	}
+
 	if (pgcnt)
-		pr_info("%s: Zeroed struct page in unavailable ranges: %lld\n",
+		pr_info("%s: pages in unavailable ranges: %lld\n",
 			zone_names[zone], pgcnt);
 }
 
@@ -6499,6 +6524,10 @@ void __init get_pfn_range_for_nid(unsign
 		*start_pfn = min(*start_pfn, this_start_pfn);
 		*end_pfn = max(*end_pfn, this_end_pfn);
 	}
+	for_each_res_pfn_range(i, nid, &this_start_pfn, &this_end_pfn, NULL) {
+		*start_pfn = min(*start_pfn, this_start_pfn);
+		*end_pfn = max(*end_pfn, this_end_pfn);
+	}
 
 	if (*start_pfn == -1UL)
 		*start_pfn = 0;
@@ -7126,7 +7155,13 @@ unsigned long __init node_map_pfn_alignm
  */
 unsigned long __init find_min_pfn_with_active_regions(void)
 {
-	return PHYS_PFN(memblock_start_of_DRAM());
+	/*
+	 * reserved regions must be included so that their page
+	 * structure can be part of a zone and obtain a valid zoneid
+	 * before __SetPageReserved().
+	 */
+	return min(PHYS_PFN(memblock_start_of_DRAM()),
+		   PHYS_PFN(memblock.reserved.regions[0].base));
 }
 
 /*
_

Patches currently in -mm which might be from aarcange@redhat.com are

mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch



* Re: + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree
  2020-12-06  0:54 + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree akpm
@ 2020-12-06  6:48 ` Mike Rapoport
  2020-12-06 19:30   ` Andrea Arcangeli
  2020-12-07  8:58 ` David Hildenbrand
  1 sibling, 1 reply; 10+ messages in thread
From: Mike Rapoport @ 2020-12-06  6:48 UTC (permalink / raw)
  To: akpm
  Cc: aarcange, bhe, cai, david, mgorman, mhocko, mm-commits, stable, vbabka

Hi,

On Sat, Dec 05, 2020 at 04:54:01PM -0800, akpm@linux-foundation.org wrote:
> 
> The patch titled
>      Subject: mm: initialize struct pages in reserved regions outside of the zone ranges
> has been added to the -mm tree.  Its filename is
>      mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch
> 
> This patch should soon appear at
>     https://ozlabs.org/~akpm/mmots/broken-out/mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch
> and later at
>     https://ozlabs.org/~akpm/mmotm/broken-out/mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch
> 
> Before you just go and hit "reply", please:
>    a) Consider who else should be cc'ed
>    b) Prefer to cc a suitable mailing list as well
>    c) Ideally: find the original patch on the mailing list and do a
>       reply-to-all to that, adding suitable additional cc's
> 
> *** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
> 
> The -mm tree is included into linux-next and is updated
> there every 3-4 working days
> 
> ------------------------------------------------------
> From: Andrea Arcangeli <aarcange@redhat.com>
> Subject: mm: initialize struct pages in reserved regions outside of the zone ranges
> 
> Without this change, the pfn 0 isn't in any zone spanned range, and it's
> also not in any memory.memblock range, so the struct page of pfn 0 wasn't
> initialized and the PagePoison remained set when reserve_bootmem_region
> called __SetPageReserved, inducing a silent boot failure with DEBUG_VM
> (and correctly so, because the crash signaled the nodeid/nid of pfn 0
> would be again wrong).
> 
> There's no enforcement that all memblock.reserved ranges must overlap
> memblock.memory ranges, so the memblock.reserved ranges also require an
> explicit initialization and the zones ranges need to be extended to
> include all memblock.reserved ranges with struct pages too or they'll be
> left uninitialized with PagePoison as it happened to pfn 0.
> 
> Link: https://lkml.kernel.org/r/20201205013238.21663-2-aarcange@redhat.com
> Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  include/linux/memblock.h |   17 ++++++++---
>  mm/debug.c               |    3 +
>  mm/memblock.c            |    4 +-
>  mm/page_alloc.c          |   57 +++++++++++++++++++++++++++++--------
>  4 files changed, 63 insertions(+), 18 deletions(-)

I don't see why we need all this complexity when a simple fixup was
enough.

> --- a/include/linux/memblock.h~mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges
> +++ a/include/linux/memblock.h

...

> --- a/mm/page_alloc.c~mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges
> +++ a/mm/page_alloc.c

...

> @@ -6227,7 +6233,7 @@ void __init __weak memmap_init(unsigned
>  			       unsigned long zone,
>  			       unsigned long range_start_pfn)
>  {
> -	unsigned long start_pfn, end_pfn, next_pfn = 0;
> +	unsigned long start_pfn, end_pfn, prev_pfn = 0;
>  	unsigned long range_end_pfn = range_start_pfn + size;
>  	u64 pgcnt = 0;
>  	int i;
> @@ -6235,7 +6241,7 @@ void __init __weak memmap_init(unsigned
>  	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>  		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
>  		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> -		next_pfn = clamp(next_pfn, range_start_pfn, range_end_pfn);
> +		prev_pfn = clamp(prev_pfn, range_start_pfn, range_end_pfn);
>  
>  		if (end_pfn > start_pfn) {
>  			size = end_pfn - start_pfn;
> @@ -6243,10 +6249,10 @@ void __init __weak memmap_init(unsigned
>  					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
>  		}
>  
> -		if (next_pfn < start_pfn)
> -			pgcnt += init_unavailable_range(next_pfn, start_pfn,
> +		if (prev_pfn < start_pfn)
> +			pgcnt += init_unavailable_range(prev_pfn, start_pfn,
>  							zone, nid);
> -		next_pfn = end_pfn;
> +		prev_pfn = end_pfn;
>  	}
>  
>  	/*
> @@ -6256,12 +6262,31 @@ void __init __weak memmap_init(unsigned
>  	 * considered initialized. Make sure that memmap has a well defined
>  	 * state.
>  	 */
> -	if (next_pfn < range_end_pfn)
> -		pgcnt += init_unavailable_range(next_pfn, range_end_pfn,
> +	if (prev_pfn < range_end_pfn)
> +		pgcnt += init_unavailable_range(prev_pfn, range_end_pfn,
>  						zone, nid);
>  
> +	/*
> +	 * memblock.reserved isn't enforced to overlap with
> +	 * memblock.memory so initialize the struct pages for
> +	 * memblock.reserved too in case it wasn't overlapping.
> +	 *
> +	 * If any struct page associated with a memblock.reserved
> +	 * range isn't overlapping with a zone range, it'll be left
> +	 * uninitialized, ideally with PagePoison, and it'll be a more
> +	 * easily detectable error.
> +	 */
> +	for_each_res_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> +		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> +		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> +
> +		if (end_pfn > start_pfn)
> +			pgcnt += init_unavailable_range(start_pfn, end_pfn,
> +							zone, nid);
> +	}

This means we are going to iterate over all memory allocated from
memblock before free_area_init() an extra time: once here and another
time in reserve_bootmem_region().
And this can be substantial for CMA and alloc_large_system_hash().

> +
>  	if (pgcnt)
> -		pr_info("%s: Zeroed struct page in unavailable ranges: %lld\n",
> +		pr_info("%s: pages in unavailable ranges: %lld\n",
>  			zone_names[zone], pgcnt);
>  }
>  
> @@ -6499,6 +6524,10 @@ void __init get_pfn_range_for_nid(unsign
>  		*start_pfn = min(*start_pfn, this_start_pfn);
>  		*end_pfn = max(*end_pfn, this_end_pfn);
>  	}
> +	for_each_res_pfn_range(i, nid, &this_start_pfn, &this_end_pfn, NULL) {
> +		*start_pfn = min(*start_pfn, this_start_pfn);
> +		*end_pfn = max(*end_pfn, this_end_pfn);
> +	}
>  
>  	if (*start_pfn == -1UL)
>  		*start_pfn = 0;
> @@ -7126,7 +7155,13 @@ unsigned long __init node_map_pfn_alignm
>   */
>  unsigned long __init find_min_pfn_with_active_regions(void)
>  {
> -	return PHYS_PFN(memblock_start_of_DRAM());
> +	/*
> +	 * reserved regions must be included so that their page
> +	 * structure can be part of a zone and obtain a valid zoneid
> +	 * before __SetPageReserved().
> +	 */
> +	return min(PHYS_PFN(memblock_start_of_DRAM()),
> +		   PHYS_PFN(memblock.reserved.regions[0].base));

So this implies that reserved memory starts before memory. Don't you
find this weird?

>  }
>  
>  /*
> _
> 
> Patches currently in -mm which might be from aarcange@redhat.com are
> 
> mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch
> 

-- 
Sincerely yours,
Mike.


* Re: + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree
  2020-12-06  6:48 ` Mike Rapoport
@ 2020-12-06 19:30   ` Andrea Arcangeli
  2020-12-07 16:50     ` Mike Rapoport
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Arcangeli @ 2020-12-06 19:30 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: akpm, bhe, cai, david, mgorman, mhocko, mm-commits, stable, vbabka

On Sun, Dec 06, 2020 at 08:48:49AM +0200, Mike Rapoport wrote:
> I don't see why we need all this complexity when a simple fixup was
> enough.

Note the memblock bug crashed my systems:

	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

Please note how zone_spans_pfn expands:

static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
{
	return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
}

For pfn_to_page(0), zone_start_pfn is 1 for the DMA zone, while pfn is 0.

How is the above not going to trigger a false positive once again with
the simple fix?

Not just pfn 0 in fact: any reserved pfn that happens to be at the
start of each nid.

Doesn't the simple fix reintroduce the same broken invariant that the
patch (the one that now requires the simple fix in order to boot) was
meant to fix?

> > +	/*
> > +	 * memblock.reserved isn't enforced to overlap with
> > +	 * memblock.memory so initialize the struct pages for
> > +	 * memblock.reserved too in case it wasn't overlapping.
> > +	 *
> > +	 * If any struct page associated with a memblock.reserved
> > +	 * range isn't overlapping with a zone range, it'll be left
> > +	 * uninitialized, ideally with PagePoison, and it'll be a more
> > +	 * easily detectable error.
> > +	 */
> > +	for_each_res_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > +		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> > +		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> > +
> > +		if (end_pfn > start_pfn)
> > +			pgcnt += init_unavailable_range(start_pfn, end_pfn,
> > +							zone, nid);
> > +	}
> 
> This means we are going iterate over all memory allocated before
> free_area_ini() from memblock extra time. One time here and another time
> in reserve_bootmem_region().
> And this can be substantial for CMA and alloc_large_system_hash().

The above loops over memblock.reserved only.

Now the simple fix also has to loop over some of memblock.reserved or
it still wouldn't boot.

You worry about memblock.reserved, which is bound by RAM.

I worry about holes that aren't bound by RAM.

The simple fix won't restrict itself to the RAM in memblock.reserved;
it'll also go over the holes (that's all it can do in order to cover
the memblock.reserved ranges without looking at them).

The simple fix doesn't only go over pfn 0: it can keep looping from the
last page in the previous nid to the first page in the next nid, across
regions that weren't even in memblock.reserved and that may be larger
than any reserved memory region.

Reserved regions aren't holes, so the complex fix is more likely to
behave stably since, unlike the simple fix, it won't waste any CPU over
non-RAM holes that are part of neither memblock.reserved nor
memblock.memory.

In addition, the "static unsigned long hole_start_pfn" of the simple
fix pushed complexity into the common code caller: it made memmap_init
non-thread-safe and required memmap_init to be called in physical
address order, or it won't even boot anymore.

Maybe that is safe right now, but this memmap already gets initialized
in the background during boot, so the moment more than one thread does
it in parallel the simple patch will break because it's non-reentrant.
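
To illustrate what "non-reentrant" means here, a rough sketch of the
kind of pattern being criticized (not the actual patch posted on the
list; the details below are assumed):

/*
 * Sketch: the end of the last initialized range lives in a
 * function-static variable, so correctness depends on memmap_init()
 * being called exactly once per range, serially, and in ascending
 * physical address order.
 */
void __init __weak memmap_init(unsigned long size, int nid,
			       unsigned long zone,
			       unsigned long range_start_pfn)
{
	static unsigned long hole_start_pfn;	/* shared across calls */

	/* Fill the hole between the previous range and this one. */
	if (hole_start_pfn < range_start_pfn)
		init_unavailable_range(hole_start_pfn, range_start_pfn,
				       zone, nid);

	/* ... initialize the memblock.memory ranges as before ... */

	/* Two concurrent or out-of-order callers corrupt this state. */
	hole_start_pfn = range_start_pfn + size;
}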

> > @@ -7126,7 +7155,13 @@ unsigned long __init node_map_pfn_alignm
> >   */
> >  unsigned long __init find_min_pfn_with_active_regions(void)
> >  {
> > -	return PHYS_PFN(memblock_start_of_DRAM());
> > +	/*
> > +	 * reserved regions must be included so that their page
> > +	 * structure can be part of a zone and obtain a valid zoneid
> > +	 * before __SetPageReserved().
> > +	 */
> > +	return min(PHYS_PFN(memblock_start_of_DRAM()),
> > +		   PHYS_PFN(memblock.reserved.regions[0].base));
> 
> So this implies that reserved memory starts before memory. Don't you
> find this weird?

That happens because the current API doesn't require overlap.

	for (i = 0; i < e820_table->nr_entries; i++) {
		struct e820_entry *entry = &e820_table->entries[i];

		end = entry->addr + entry->size;
		if (end != (resource_size_t)end)
			continue;

		if (entry->type == E820_TYPE_SOFT_RESERVED)
			memblock_reserve(entry->addr, entry->size);

		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
			continue;

		memblock_add(entry->addr, entry->size);
	}

The above is a more complex way to write the below, which will make it
more clear there's never overlap:

		if (entry->type == E820_TYPE_SOFT_RESERVED)
			memblock_reserve(entry->addr, entry->size);
		else if (entry->type == E820_TYPE_RAM || entry->type == E820_TYPE_RESERVED_KERN)
			memblock_add(entry->addr, entry->size);

If you want to enforce that there is always overlap, and in turn that a
zone cannot start with a reserved pfn, then you could add a bugcheck
for memblock.reserved ranges that don't overlap with an already
registered memblock.memory range, and then the above e820 code will
break.

I thought the reason there couldn't be full overlap is the direct
mapping issue so there's a very good reason for it.

So given there's currently zero overlap, I'm not exactly sure what's
weird about memblock.reserved starting before memblock.memory. The fact
that pfn 0 is reserved is a BIOS thing; weird or not, it's not in our
control.

I don't see any alternative for how to give a consistent zoneid to pfn
0, if zoneid 0 still starts at pfn 1 and you're not going to enforce
full overlap between memblock.reserved and memblock.memory either.

The only alternative I can see is to leave PagePoison in there and add
PagePoison or PageReserved checks all over the place, starting from
reserve_bootmem_region, like this:

void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
{
	unsigned long start_pfn = PFN_DOWN(start);
	unsigned long end_pfn = PFN_UP(end);

	for (; start_pfn < end_pfn; start_pfn++) {
		if (pfn_valid(start_pfn)) {
			struct page *page = pfn_to_page(start_pfn);

+			if (PagePoison(page))
+				continue;

			init_reserved_page(start_pfn);

The above is still preferable since we could then add PagePoison
checks all over the VM code like this:

       VM_BUG_ON(!PagePoison(page) && page_to_pfn(page) < page_zone(page)->zone_start_pfn);

The above PagePoison mention assumes we'd need to extend PagePoison to
DEBUG_VM=n or switch to PageReserved or a NO_NODEID.

The whole point is that we don't need to do any of the above, since
zones can include all reserved pages too. Once the kernel has finished
booting, it'll be simpler if we have a valid zoneid, so the invariant
bugchecks will pass for all valid pfns and they'll stop crashing
kernels with false positives.

Keeping all the complexity in the memblock common code and keeping the
arch caller raw is not a bad thing, since memblock is common code and
it'll benefit all archs. I wouldn't go in the direction of
124049decbb121ec32742c94fb5d9d6bed8f24d8 once again.

Thanks,
Andrea



* Re: + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree
  2020-12-06  0:54 + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree akpm
  2020-12-06  6:48 ` Mike Rapoport
@ 2020-12-07  8:58 ` David Hildenbrand
  2020-12-07  9:45   ` Mike Rapoport
  1 sibling, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2020-12-07  8:58 UTC (permalink / raw)
  To: akpm, aarcange, bhe, cai, mgorman, mhocko, mm-commits, rppt,
	stable, vbabka

On 06.12.20 01:54, akpm@linux-foundation.org wrote:
> 
> The patch titled
>      Subject: mm: initialize struct pages in reserved regions outside of the zone ranges
> has been added to the -mm tree.  Its filename is
>      mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch
> 
> This patch should soon appear at
>     https://ozlabs.org/~akpm/mmots/broken-out/mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch
> and later at
>     https://ozlabs.org/~akpm/mmotm/broken-out/mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch
> 
> Before you just go and hit "reply", please:
>    a) Consider who else should be cc'ed
>    b) Prefer to cc a suitable mailing list as well
>    c) Ideally: find the original patch on the mailing list and do a
>       reply-to-all to that, adding suitable additional cc's
> 
> *** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
> 
> The -mm tree is included into linux-next and is updated
> there every 3-4 working days
> 
> ------------------------------------------------------
> From: Andrea Arcangeli <aarcange@redhat.com>
> Subject: mm: initialize struct pages in reserved regions outside of the zone ranges
> 
> Without this change, the pfn 0 isn't in any zone spanned range, and it's
> also not in any memory.memblock range, so the struct page of pfn 0 wasn't
> initialized and the PagePoison remained set when reserve_bootmem_region
> called __SetPageReserved, inducing a silent boot failure with DEBUG_VM
> (and correctly so, because the crash signaled the nodeid/nid of pfn 0
> would be again wrong).
> 
> There's no enforcement that all memblock.reserved ranges must overlap
> memblock.memory ranges, so the memblock.reserved ranges also require an
> explicit initialization and the zones ranges need to be extended to
> include all memblock.reserved ranges with struct pages too or they'll be
> left uninitialized with PagePoison as it happened to pfn 0.
> 
> Link: https://lkml.kernel.org/r/20201205013238.21663-2-aarcange@redhat.com
> Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

[...]

I've lost track which patches are we considering right now to solve the
overall issue. I'd appreciate a proper patch series with all relevant
patches.

-- 
Thanks,

David / dhildenb



* Re: + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree
  2020-12-07  8:58 ` David Hildenbrand
@ 2020-12-07  9:45   ` Mike Rapoport
  0 siblings, 0 replies; 10+ messages in thread
From: Mike Rapoport @ 2020-12-07  9:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, aarcange, bhe, cai, mgorman, mhocko, mm-commits, stable, vbabka

On Mon, Dec 07, 2020 at 09:58:37AM +0100, David Hildenbrand wrote:
> On 06.12.20 01:54, akpm@linux-foundation.org wrote:
> > 
> > The patch titled
> >      Subject: mm: initialize struct pages in reserved regions outside of the zone ranges
> > has been added to the -mm tree.  Its filename is
> >      mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch
> > 
> > This patch should soon appear at
> >     https://ozlabs.org/~akpm/mmots/broken-out/mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch
> > and later at
> >     https://ozlabs.org/~akpm/mmotm/broken-out/mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch
> > 
> > Before you just go and hit "reply", please:
> >    a) Consider who else should be cc'ed
> >    b) Prefer to cc a suitable mailing list as well
> >    c) Ideally: find the original patch on the mailing list and do a
> >       reply-to-all to that, adding suitable additional cc's
> > 
> > *** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
> > 
> > The -mm tree is included into linux-next and is updated
> > there every 3-4 working days
> > 
> > ------------------------------------------------------
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > Subject: mm: initialize struct pages in reserved regions outside of the zone ranges
> > 
> > Without this change, the pfn 0 isn't in any zone spanned range, and it's
> > also not in any memory.memblock range, so the struct page of pfn 0 wasn't
> > initialized and the PagePoison remained set when reserve_bootmem_region
> > called __SetPageReserved, inducing a silent boot failure with DEBUG_VM
> > (and correctly so, because the crash signaled the nodeid/nid of pfn 0
> > would be again wrong).
> > 
> > There's no enforcement that all memblock.reserved ranges must overlap
> > memblock.memory ranges, so the memblock.reserved ranges also require an
> > explicit initialization and the zones ranges need to be extended to
> > include all memblock.reserved ranges with struct pages too or they'll be
> > left uninitialized with PagePoison as it happened to pfn 0.
> > 
> > Link: https://lkml.kernel.org/r/20201205013238.21663-2-aarcange@redhat.com
> > Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Cc: Mike Rapoport <rppt@linux.ibm.com>
> > Cc: Baoquan He <bhe@redhat.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Qian Cai <cai@lca.pw>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: <stable@vger.kernel.org>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> 
> [...]
> 
> I've lost track which patches are we considering right now to solve the
> overall issue. I'd appreciate a proper patch series with all relevant
> patches.

I think none of the patches is 100% right.
I'll try to send a new version today.

> -- 
> Thanks,
> 
> David / dhildenb
> 

-- 
Sincerely yours,
Mike.


* Re: + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree
  2020-12-06 19:30   ` Andrea Arcangeli
@ 2020-12-07 16:50     ` Mike Rapoport
  2020-12-08  2:57       ` Andrea Arcangeli
  0 siblings, 1 reply; 10+ messages in thread
From: Mike Rapoport @ 2020-12-07 16:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: akpm, bhe, cai, david, mgorman, mhocko, mm-commits, stable, vbabka

On Sun, Dec 06, 2020 at 02:30:11PM -0500, Andrea Arcangeli wrote:
> On Sun, Dec 06, 2020 at 08:48:49AM +0200, Mike Rapoport wrote:
> > I don't see why we need all this complexity when a simple fixup was
> > enough.
> 
> Note the memblock bug crashed my systems:
> 
> 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> 
> Please note how zone_spans_pfn expands:
> 
> static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
> {
> 	return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
> }
> 
> For pfn_to_page(0), the zone_start_pfn is 1 for DMA zone. pfn is 0.

I don't know why you keep calling it a "memblock bug", as pfn 0 has not
been part of the DMA zone for at least ten years.

If you backport this VM_BUG_ON_PAGE() to 2.6.36 that has no memblock on
x86 it will happily trigger there.

The problem, as I see it, is in the, hmm, questionable semantics of
!E820_TYPE_RAM ranges on x86.

I keep asking this simple question again and again and I still have no
clear "yes" or "no" answer:

	Is PFN 0 on x86 memory?

Let's step back for a second and, instead of blaming memblock for the
lack of enforcement that .memory and .reserved overlap and talking
about how the memblock core was not robust enough, get to the point
where we have clear definitions of what e820 ranges represent.

> > > +	/*
> > > +	 * memblock.reserved isn't enforced to overlap with
> > > +	 * memblock.memory so initialize the struct pages for
> > > +	 * memblock.reserved too in case it wasn't overlapping.
> > > +	 *
> > > +	 * If any struct page associated with a memblock.reserved
> > > +	 * range isn't overlapping with a zone range, it'll be left
> > > +	 * uninitialized, ideally with PagePoison, and it'll be a more
> > > +	 * easily detectable error.
> > > +	 */
> > > +	for_each_res_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > > +		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> > > +		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> > > +
> > > +		if (end_pfn > start_pfn)
> > > +			pgcnt += init_unavailable_range(start_pfn, end_pfn,
> > > +							zone, nid);
> > > +	}
> > 
> > This means we are going iterate over all memory allocated before
> > free_area_ini() from memblock extra time. One time here and another time
> > in reserve_bootmem_region().
> > And this can be substantial for CMA and alloc_large_system_hash().
> 
> The above loops over memory.reserved only.

Right, this loops over memblock.reserved, which happens to include CMA
and the memory allocated in alloc_large_system_hash(). This can be a
lot. And memmap_init() runs before threads are available, so this
cannot be parallelized.
The deferred memmap initialization starts much later than this.

> > > @@ -7126,7 +7155,13 @@ unsigned long __init node_map_pfn_alignm
> > >   */
> > >  unsigned long __init find_min_pfn_with_active_regions(void)
> > >  {
> > > -	return PHYS_PFN(memblock_start_of_DRAM());
> > > +	/*
> > > +	 * reserved regions must be included so that their page
> > > +	 * structure can be part of a zone and obtain a valid zoneid
> > > +	 * before __SetPageReserved().
> > > +	 */
> > > +	return min(PHYS_PFN(memblock_start_of_DRAM()),
> > > +		   PHYS_PFN(memblock.reserved.regions[0].base));
> > 
> > So this implies that reserved memory starts before memory. Don't you
> > find this weird?
> 
> The above is a more complex way to write the below, which will make it
> more clear there's never overlap:
> 
> 		if (entry->type == E820_TYPE_SOFT_RESERVED)
> 			memblock_reserve(entry->addr, entry->size);
> 		else if (entry->type == E820_TYPE_RAM || entry->type == E820_TYPE_RESERVED_KERN)
> 			memblock_add(entry->addr, entry->size);

So what about E820_TYPE_ACPI or E820_TYPE_NVS?
How should they be treated?

It happened that we memblock_reserve() the first page, but what if the
BIOS marks pfn 1 as E820_TYPE_ACPI? It's neither in .memory nor in
.reserved. Does it belong to zone 0 and node 0?

Or what if, say, node 1 starts at 4G and the BIOS claimed the first few
pages of node 1 as type 20?
What should we do then?

> Thanks,
> Andrea
> 

-- 
Sincerely yours,
Mike.


* Re: + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree
  2020-12-07 16:50     ` Mike Rapoport
@ 2020-12-08  2:57       ` Andrea Arcangeli
  2020-12-08 21:46         ` Mike Rapoport
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Arcangeli @ 2020-12-08  2:57 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: akpm, bhe, cai, david, mgorman, mhocko, mm-commits, stable, vbabka

On Mon, Dec 07, 2020 at 06:50:37PM +0200, Mike Rapoport wrote:
> On Sun, Dec 06, 2020 at 02:30:11PM -0500, Andrea Arcangeli wrote:
> > On Sun, Dec 06, 2020 at 08:48:49AM +0200, Mike Rapoport wrote:
> > > I don't see why we need all this complexity when a simple fixup was
> > > enough.
> > 
> > Note the memblock bug crashed my systems:
> > 
> > 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> > 
> > Please note how zone_spans_pfn expands:
> > 
> > static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
> > {
> > 	return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
> > }
> > 
> > For pfn_to_page(0), the zone_start_pfn is 1 for DMA zone. pfn is 0.
> 
> I don't know why you keep calling it "memblock bug" as pfn 0 was not
> part of DMA zone at least for ten years.
> 
> If you backport this VM_BUG_ON_PAGE() to 2.6.36 that has no memblock on
> x86 it will happily trigger there.

I guess it wasn't considered a bug until your patch, which was meant to
enforce this invariant for all pfn_valid pfns, was merged in -mm?

> The problem, as I see it, in, hmm, questionable semantics for
> !E820_TYPE_RAM ranges on x86.
> 
> I keep asking this simple question again and again and I still have no
> clear "yes" or "no" answer:
> 
> 	Is PFN 0 on x86 memory?

The pfn is the same regardless of whether it's RAM or non-RAM, so I
don't think being RAM or not makes any difference in this respect.

> Let's step back for a second and instead of blaming memblock in lack of
> enforcement for .memory and .reserved to overlap and talking about how
> memblock core was not robust enough we'll get to the point where we have
> clear defintions of what e820 ranges represent.

I don't see why we should go back in the direction of requiring the
caller to be improved, like in commit
124049decbb121ec32742c94fb5d9d6bed8f24d8, and of trying to massage the
raw e820 data before giving it to the Linux common code.

Even if you wanted to enforce overlap at all times (not possible
because of the direct mapping issue), you could do it simply by calling
memblock_add as the first thing in memblock_reserve (if the range
wasn't already fully overlapping), without the need to add an explicit
memblock_add in the caller.
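
A minimal sketch of what that could look like (hypothetical helper, not
the current memblock API):

/*
 * Hypothetical: make every reserved range implicitly known as memory,
 * so memblock.reserved can never extend outside memblock.memory.
 */
static int __init_memblock memblock_reserve_and_add(phys_addr_t base,
						    phys_addr_t size)
{
	/* Register the range as memory first, unless it already is. */
	if (!memblock_is_region_memory(base, size))
		memblock_add(base, size);

	return memblock_reserve(base, size);
}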

> > > > +	/*
> > > > +	 * memblock.reserved isn't enforced to overlap with
> > > > +	 * memblock.memory so initialize the struct pages for
> > > > +	 * memblock.reserved too in case it wasn't overlapping.
> > > > +	 *
> > > > +	 * If any struct page associated with a memblock.reserved
> > > > +	 * range isn't overlapping with a zone range, it'll be left
> > > > +	 * uninitialized, ideally with PagePoison, and it'll be a more
> > > > +	 * easily detectable error.
> > > > +	 */
> > > > +	for_each_res_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > > > +		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> > > > +		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> > > > +
> > > > +		if (end_pfn > start_pfn)
> > > > +			pgcnt += init_unavailable_range(start_pfn, end_pfn,
> > > > +							zone, nid);
> > > > +	}
> > > 
> > > This means we are going iterate over all memory allocated before
> > > free_area_ini() from memblock extra time. One time here and another time
> > > in reserve_bootmem_region().
> > > And this can be substantial for CMA and alloc_large_system_hash().
> > 
> > The above loops over memory.reserved only.
> 
> Right, this loops over memory.reserved which happens to include CMA and
> the memory allocated in alloc_large_system_hash(). This can be a lot.
> And memmap_init() runs before threads are available so this cannot be
> parallelized.
> The deferred memmap initialization starts much later than this.

Can you convert memblock.memory and memblock.reserved to an interval
tree? An interval tree would work perfectly since the switch to long
mode, with no need for dynamic RAM allocations for its inner workings.
There's no need to keep memblock an O(N) list.

Such a change would benefit the code even if we remove the loop over
memblock.reserved.
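
For illustration, a rough sketch of that direction using the kernel's
generic interval tree; the structures and names below are hypothetical,
memblock does not actually do this today:

#include <linux/interval_tree.h>

#define SKETCH_MAX_REGIONS	128	/* assumed static pool, like memblock's */

struct sketch_region {
	struct interval_tree_node node;	/* node.start/node.last are PFNs */
	int nid;
};

static struct rb_root_cached sketch_reserved = RB_ROOT_CACHED;
static struct sketch_region sketch_pool[SKETCH_MAX_REGIONS];
static int sketch_nr;

/* O(log n) insertion, no dynamic allocation needed. */
static void __init sketch_reserve(unsigned long start_pfn,
				  unsigned long end_pfn, int nid)
{
	struct sketch_region *r = &sketch_pool[sketch_nr++];

	r->node.start = start_pfn;
	r->node.last = end_pfn - 1;
	r->nid = nid;
	interval_tree_insert(&r->node, &sketch_reserved);
}

/* O(log n) overlap query instead of a linear scan of the region array. */
static bool __init sketch_is_reserved(unsigned long start_pfn,
				      unsigned long end_pfn)
{
	return interval_tree_iter_first(&sketch_reserved, start_pfn,
					end_pfn - 1) != NULL;
}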

> > > > @@ -7126,7 +7155,13 @@ unsigned long __init node_map_pfn_alignm
> > > >   */
> > > >  unsigned long __init find_min_pfn_with_active_regions(void)
> > > >  {
> > > > -	return PHYS_PFN(memblock_start_of_DRAM());
> > > > +	/*
> > > > +	 * reserved regions must be included so that their page
> > > > +	 * structure can be part of a zone and obtain a valid zoneid
> > > > +	 * before __SetPageReserved().
> > > > +	 */
> > > > +	return min(PHYS_PFN(memblock_start_of_DRAM()),
> > > > +		   PHYS_PFN(memblock.reserved.regions[0].base));
> > > 
> > > So this implies that reserved memory starts before memory. Don't you
> > > find this weird?
> > 
> > The above is a more complex way to write the below, which will make it
> > more clear there's never overlap:
> > 
> > 		if (entry->type == E820_TYPE_SOFT_RESERVED)
> > 			memblock_reserve(entry->addr, entry->size);
> > 		else if (entry->type == E820_TYPE_RAM || entry->type == E820_TYPE_RESERVED_KERN)
> > 			memblock_add(entry->addr, entry->size);
> 
> So what about E820_TYPE_ACPI or E820_TYPE_NVS?
> How should they be treated?
> 
> It happend that we memblock_reserve() the first page, but what if the
> BIOS marks pfn 1 as E820_TYPE_ACPI? It's neither in .memory nor in
> .reserved. Does it belong to zone 0 and node 0?
> 
> Or, if, say, node 1 starts at 4G and the BIOS claimed several first pages
> of node 1 as type 20?
> What should we do then?

I see we still have dependencies on the caller to send down good data,
which it would be ideal to remove.

I guess that before worrying about pfn 1 not being added to
memblock.reserved, I'd worry about pfn 0 itself not being added to
memblock.reserved.

My above worry about pfn 0 not being added to memblock.reserved
doesn't change that:

- we still need to enlarge the DMA zone, even if pfn 0 is not added to
  memblock.reserved, if we really want to enforce the invariant that
  you can go to the start of the pageblock and pass the invariant on
  the first pfn, if any pfn in the first pageblock is in
  memblock.memory. So we also need to round down the start of the zone
  to the pageblock size (a minimal sketch of that rounddown follows
  this list), and adjust all range walks accordingly.

- Even after the rounddown, we still have to initialize the reserved
  ranges with pfn_valid or reserve_bootmem_region won't work at all if
  they don't happen to be covered by the rounddown described above.
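
A minimal sketch of the rounddown mentioned in the first point; the
helper is hypothetical and it assumes pageblock_nr_pages is the
alignment that matters for the invariant:

/* Align the zone start down to a pageblock boundary, so that if any
 * pfn in the first pageblock is in memblock.memory, the whole
 * pageblock is spanned by the zone.  (Hypothetical helper.)
 */
static unsigned long __init zone_start_rounddown(unsigned long zone_start_pfn)
{
	return ALIGN_DOWN(zone_start_pfn, pageblock_nr_pages);
}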

Still, with the complex patch applied, if something goes wrong
DEBUG_VM=y will make it reproducible; with the simple fix we return to
non-reproducible land.

So I agree there's room for improvement and I believe the rounddown to
the start of the pageblock is needed for those zones that have no
adjacent region.

The important thing is that uninitialized pages are now left with
PagePoison, at least with DEBUG_VM=y, and when a zoneid is assigned to
a page, that page really is part of the zone. Both were false before,
so things have already improved in that direction :).

Thanks,
Andrea



* Re: + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree
  2020-12-08  2:57       ` Andrea Arcangeli
@ 2020-12-08 21:46         ` Mike Rapoport
  2020-12-08 23:13           ` Andrea Arcangeli
  0 siblings, 1 reply; 10+ messages in thread
From: Mike Rapoport @ 2020-12-08 21:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: akpm, bhe, cai, david, mgorman, mhocko, mm-commits, stable, vbabka

On Mon, Dec 07, 2020 at 09:57:37PM -0500, Andrea Arcangeli wrote:
> On Mon, Dec 07, 2020 at 06:50:37PM +0200, Mike Rapoport wrote:
> > On Sun, Dec 06, 2020 at 02:30:11PM -0500, Andrea Arcangeli wrote:
> > > On Sun, Dec 06, 2020 at 08:48:49AM +0200, Mike Rapoport wrote:
> > > > I don't see why we need all this complexity when a simple fixup was
> > > > enough.
> > > 
> > > Note the memblock bug crashed my systems:
> > > 
> > > 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> > > 
> > > Please note how zone_spans_pfn expands:
> > > 
> > > static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
> > > {
> > > 	return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
> > > }
> > > 
> > > For pfn_to_page(0), the zone_start_pfn is 1 for DMA zone. pfn is 0.
> > 
> > I don't know why you keep calling it "memblock bug" as pfn 0 was not
> > part of DMA zone at least for ten years.
> > 
> > If you backport this VM_BUG_ON_PAGE() to 2.6.36 that has no memblock on
> > x86 it will happily trigger there.
> 
> I guess it wasn't considered a bug until your patch that was meant to
> enforce this invariant for all pfn_valid was merged in -mm?

What was not considered a bug? That node 0 and the DMA zone on x86
started at pfn 1, and that pfn 0's zone/node links were set to the
right values by chance?

There was no intention to enforce any invariant about overlapping or
non-overlapping ranges in memblock.memory and memblock.reserved.

The change in 73a6e474cb376921a311786652782155eac2fdf0 was to turn the
loop over all pfns in a zone, which among other things checked if each
pfn is in memblock.memory, into a loop over the intersection of the
pfns in a zone with memblock.memory.

This exposed that before 73a6e474cb376921a311786652782155eac2fdf0:
* small holes, like your "7a17b000-7a216fff : Unknown E820 type", got
their zone/node links right because of an optimization in
__early_pfn_to_nid() that didn't check memblock.memory for every pfn
but rather cached the results.
* on x86 pfn 0 was neither in node 0 nor in the DMA zone because they
both started with pfn 1; it got its zone/node links set to 0 because
that's what init_unavailable_range() did for all pfns in "unavailable
ranges".

So, if you add something like this

static int check_pfn0(void)
{
	unsigned long pfn = 0;
	struct page *page = pfn_to_page(pfn);

	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

	return 0;
}
late_initcall(check_pfn0);

to an older kernel, this will blow up.

So, my guess is that we never got to this check for pfn 0, and I
really don't know how your system crashed with both my patches applied,
unless you had a patch that checked pfn 0.

> > The problem, as I see it, in, hmm, questionable semantics for
> > !E820_TYPE_RAM ranges on x86.
> > 
> > I keep asking this simple question again and again and I still have no
> > clear "yes" or "no" answer:
> > 
> > 	Is PFN 0 on x86 memory?
> 
> The pfn is the same regardless if it's RAM or non RAM, so I don't
> think it makes any difference in this respect if it's RAM or not.
> 
> > Let's step back for a second and instead of blaming memblock in lack of
> > enforcement for .memory and .reserved to overlap and talking about how
> > memblock core was not robust enough we'll get to the point where we have
> > clear defintions of what e820 ranges represent.
> 
> I don't see why we should go back into the direction that the caller
> needs improvement like in commit
> 124049decbb121ec32742c94fb5d9d6bed8f24d8 and to try to massage the raw
> e820 data before giving it to the linux common code.

That's the entire point: the e820 data is massaged before it gets to
the common code. 124049decbb121ec32742c94fb5d9d6bed8f24d8 tried to
alter the way the raw e820 was transformed before going to the common
code, and my feeling is that it was a shot in the dark.

The current abstraction we have in the common code to describe physical
memory is memblock, and I want to make it explicit that every range in
the e820 table is a part of system memory and hence a part of
memblock.memory.

> Even if you wanted to enforce overlap at all times (not possible
> because of the direct mapping issue) you could do it simply by calling
> memblock_add first thing in memblock_reserve (if it wasn't already
> fully overlapping) without the need to add an explicit memblock_add in
> the caller.

The purpose here is not to enforce overlap at all times [and it is
possible despite the direct mapping issues ;-) ].
The idea is to have the abstract description of the physical memory
(i.e. memblock) as close as possible to the actual layout. The "simple"
code in e820__memblock_setup() only made things obfuscated and
inconsistent.

If you have memory populated at address 0 you would expect that
for_each_mem_pfn_range() will start from pfn 0.

In addition, an implicit call to memblock_add() in memblock_reserve()
would be a waste of CPU time, as most of the time memblock_reserve() is
called as part of a memory allocation before the page allocator is
ready.

> > > > > +	/*
> > > > > +	 * memblock.reserved isn't enforced to overlap with
> > > > > +	 * memblock.memory so initialize the struct pages for
> > > > > +	 * memblock.reserved too in case it wasn't overlapping.
> > > > > +	 *
> > > > > +	 * If any struct page associated with a memblock.reserved
> > > > > +	 * range isn't overlapping with a zone range, it'll be left
> > > > > +	 * uninitialized, ideally with PagePoison, and it'll be a more
> > > > > +	 * easily detectable error.
> > > > > +	 */
> > > > > +	for_each_res_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > > > > +		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> > > > > +		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> > > > > +
> > > > > +		if (end_pfn > start_pfn)
> > > > > +			pgcnt += init_unavailable_range(start_pfn, end_pfn,
> > > > > +							zone, nid);
> > > > > +	}
> > > > 
> > > > This means we are going iterate over all memory allocated before
> > > > free_area_ini() from memblock extra time. One time here and another time
> > > > in reserve_bootmem_region().
> > > > And this can be substantial for CMA and alloc_large_system_hash().
> > > 
> > > The above loops over memory.reserved only.
> > 
> > Right, this loops over memory.reserved which happens to include CMA and
> > the memory allocated in alloc_large_system_hash(). This can be a lot.
> > And memmap_init() runs before threads are available so this cannot be
> > parallelized.
> > The deferred memmap initialization starts much later than this.
> 
> Can you convert the memblock.memory and memblock.reserved to an
> interval tree? interval tree shall work perfectly since the switch to
> long mode, no need of dynamic RAM allocations for its inner
> working. There's no need to keep memblock a O(N) list.

Sorry, I was not clear. The penalty here is not the traversal of the
memblock.reserved array. The penalty is the redundant initialization of
the struct pages for the ranges memblock.reserved describes.
For instance, if the user specified hugetlb_cma=nnG, the loop over
memblock.reserved will cause a redundant initialization pass over the
pages in those nnG.

> Such change would benefit the code even if we remove the loop over
> memblock.reserved.
> 
> > > > > @@ -7126,7 +7155,13 @@ unsigned long __init node_map_pfn_alignm
> > > > >   */
> > > > >  unsigned long __init find_min_pfn_with_active_regions(void)
> > > > >  {
> > > > > -	return PHYS_PFN(memblock_start_of_DRAM());
> > > > > +	/*
> > > > > +	 * reserved regions must be included so that their page
> > > > > +	 * structure can be part of a zone and obtain a valid zoneid
> > > > > +	 * before __SetPageReserved().
> > > > > +	 */
> > > > > +	return min(PHYS_PFN(memblock_start_of_DRAM()),
> > > > > +		   PHYS_PFN(memblock.reserved.regions[0].base));
> > > > 
> > > > So this implies that reserved memory starts before memory. Don't you
> > > > find this weird?
> > > 
> > > The above is a more complex way to write the below, which will make it
> > > more clear there's never overlap:
> > > 
> > > 		if (entry->type == E820_TYPE_SOFT_RESERVED)
> > > 			memblock_reserve(entry->addr, entry->size);
> > > 		else if (entry->type == E820_TYPE_RAM || entry->type == E820_TYPE_RESERVED_KERN)
> > > 			memblock_add(entry->addr, entry->size);
> > 
> > So what about E820_TYPE_ACPI or E820_TYPE_NVS?
> > How should they be treated?
> > 
> > It happend that we memblock_reserve() the first page, but what if the
> > BIOS marks pfn 1 as E820_TYPE_ACPI? It's neither in .memory nor in
> > .reserved. Does it belong to zone 0 and node 0?
> > 
> > Or, if, say, node 1 starts at 4G and the BIOS claimed several first pages
> > of node 1 as type 20?
> > What should we do then?
> 
> I see we still have dependencies on the caller to send down good data,
> which would be ideal to remove.

Regardless of the particular implementation, the generic code cannot
detect the physical memory layout; it is up to the architecture to
detect which memory banks are populated and to provide this information
to the generic code.

So we'll always have what you call "dependencies on the caller" here.
The good thing is that we can fix "the caller" when we find it's wrong.

> I guess before worrying of pfn 1 not being added to memblock.reserved,
> I'd worry about pfn 0 itself not being added to memblock.reserved.

My point was that with a slightly different e820 table we may again hit
a corner case that we didn't anticipate.

pfn 0 is in memblock.reserved by coincidence and not by design, and
using memblock.reserved to detect zone/node boundaries is not reliable.

Moreover, if you look at the usage of for_each_mem_pfn_range() in
page_alloc.c, it very much looks like it is presumed to iterate over
*all* physical memory, including the mysterious pfn 0.

> Still with the complex patch applied, if something goes wrong
> DEBUG_VM=y will make it reproducible, with the simple fix we return in
> non-reproducible land.

I still think that the complex patch is going in the wrong direction.
We cannot rely on memblock.reserved to calculate the zone and node
spans.

This:

+       /*
+        * reserved regions must be included so that their page
+        * structure can be part of a zone and obtain a valid zoneid
+        * before __SetPageReserved().
+        */
+       return min(PHYS_PFN(memblock_start_of_DRAM()),
+                  PHYS_PFN(memblock.reserved.regions[0].base));

is definitely a hack that worked because "the caller" has this:

	/*
	 * Make sure page 0 is always reserved because on systems with
	 * L1TF its contents can be leaked to user processes.
	 */
	memblock_reserve(0, PAGE_SIZE);

So, what would have happened if L1TF had not yet been disclosed? ;-)

Anyway, I think we need something like the following (a rough sketch of
the first item is below the list):

* add everything in e820 to memblock.memory, while keeping any
!E820_TYPE_RAM range out of the direct map
* refactor the initialization of struct pages for holes in the memory
layout so that struct pages not backed by actual memory have a proper
state
* add validation of memory/nodes/zones to make sure the generic code
can deal with the data the architecture supplied; maybe only for the
DEBUG_VM=y case.
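
A rough sketch of the first item, assuming it would be done in x86's
e820__memblock_setup(); this is an illustration of the direction only,
not the final patch:

static void __init e820__memblock_setup_sketch(void)
{
	int i;

	for (i = 0; i < e820_table->nr_entries; i++) {
		struct e820_entry *entry = &e820_table->entries[i];

		/* Every e820 range becomes part of memblock.memory... */
		memblock_add(entry->addr, entry->size);

		/* ...non-RAM ranges are additionally reserved... */
		if (entry->type != E820_TYPE_RAM &&
		    entry->type != E820_TYPE_RESERVED_KERN)
			memblock_reserve(entry->addr, entry->size);

		/*
		 * ...and keeping !E820_TYPE_RAM out of the direct map
		 * would be handled separately in the mapping setup.
		 */
	}
}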

> Thanks,
> Andrea
> 

-- 
Sincerely yours,
Mike.


* Re: + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree
  2020-12-08 21:46         ` Mike Rapoport
@ 2020-12-08 23:13           ` Andrea Arcangeli
  2020-12-09 21:42             ` Mike Rapoport
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Arcangeli @ 2020-12-08 23:13 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: akpm, bhe, cai, david, mgorman, mhocko, mm-commits, stable, vbabka

On Tue, Dec 08, 2020 at 11:46:14PM +0200, Mike Rapoport wrote:
> On Mon, Dec 07, 2020 at 09:57:37PM -0500, Andrea Arcangeli wrote:
> > On Mon, Dec 07, 2020 at 06:50:37PM +0200, Mike Rapoport wrote:
> > > On Sun, Dec 06, 2020 at 02:30:11PM -0500, Andrea Arcangeli wrote:
> > > > On Sun, Dec 06, 2020 at 08:48:49AM +0200, Mike Rapoport wrote:
> > > > > I don't see why we need all this complexity when a simple fixup was
> > > > > enough.
> > > > 
> > > > Note the memblock bug crashed my systems:
> > > > 
> > > > 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> > > > 
> > > > Please note how zone_spans_pfn expands:
> > > > 
> > > > static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
> > > > {
> > > > 	return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
> > > > }
> > > > 
> > > > For pfn_to_page(0), the zone_start_pfn is 1 for DMA zone. pfn is 0.
> > > 
> > > I don't know why you keep calling it "memblock bug" as pfn 0 was not
> > > part of the DMA zone for at least ten years.
> > > 
> > > If you backport this VM_BUG_ON_PAGE() to 2.6.36 that has no memblock on
> > > x86 it will happily trigger there.
> > 
> > I guess it wasn't considered a bug until your patch that was meant to
> > enforce this invariant for all pfn_valid was merged in -mm?
> 
> What was not considered a bug? That node 0 and the DMA zone on x86 started
> at pfn 1 and that the pfn 0 zone/node links were set by chance to the right
> values?

Well, I don't think zoneid 0 is the "right value" for pfn 0, since zoneid
0 (the DMA zone) starts from pfn 1.

Now if you rephrase the above with "were set to the wrong values",
then the answer is yes.

> There was no intention to enforce any invariant of overlapping or not
> overlapping ranges in memblock.memory and memblock.reserved.
> 
> The change in 73a6e474cb376921a311786652782155eac2fdf0 replaced the
> loop over all pfns in a zone (which, among other things, checked whether
> each pfn is in memblock.memory) with a loop over the intersection of the
> zone's pfns with memblock.memory.
> 
> This exposed that before 73a6e474cb376921a311786652782155eac2fdf0:
> * small holes, like your "7a17b000-7a216fff : Unknown E820 type", got
> their zone/node links right because of an optimization in
> __early_pfn_to_nid() that didn't check memblock.memory for every pfn
> but rather cached the results.
> * on x86, pfn 0 was neither in node 0 nor in the DMA zone because they
> both started at pfn 1; it got its zone/node links set to 0 because that's
> what init_unavailable_range() did for all pfns in "unavailable ranges".
> 
> So, if you add something like this
> 
> static int check_pfn0(void)
> {
> 	unsigned long pfn = 0;
> 	struct page *page = pfn_to_page(pfn);
> 
> 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> 
> 	return 0;
> }
> late_initcall(check_pfn0);
> 
> to an older kernel, this will blow up.

Some older kernels wouldn't fail just at pfn 0: in v5.8/9 the above
bugcheck crashes on pfns > 0 too.

In all versions before your
https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org ,
worrying about pfn 0 and the first reserved pfn in each nid would have
been just window dressing.

Now the moment you apply
https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org ,
pfn 0 and every other pfn at the start of each nid that still breaks
the above bugcheck invariant gets the focus, because they are the things
still breaking that bugcheck, the very bugcheck crashing v5.8/9.

> 
> So, my guess is that we never got to this check for pfn 0 and I
> really don't know how your system crashed with both my patches applied,
> unless you had a patch that checked pfn 0.

I thought I already mentioned it: it's __SetPageReserved that checks
PagePoison, and __SetPageReserved is issued on every struct page in the
memblock.reserved ranges.

You're calling __SetPageReserved on each one of the reserved pages. If
you didn't go over each one of their struct pages to initialize them and
__SetPageReserved them, then
https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org
would be capable of booting without the simple fix too.

https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org is
correctly failing the boot, because what was missing was a proper zoneid
initialization that would wipe the PagePoison away on all the
memblock.reserved ranges that don't overlap memblock.memory ranges,
which is precisely what the complex patch does.
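
In other words, roughly (a simplified sketch only; it assumes the
for_each_res_pfn_range() helper from the complex patch mirrors
for_each_mem_pfn_range(), and init_reserved_pfn() is a placeholder for
the real per-page initialization):

	unsigned long start_pfn, end_pfn, pfn;
	int i;

	/*
	 * Walk memblock.reserved and give a valid node/zone to every pfn
	 * the memblock.memory walk didn't already initialize, so that
	 * PagePoison is cleared before __SetPageReserved() runs on it.
	 */
	for_each_res_pfn_range(i, nid, &start_pfn, &end_pfn, NULL)
		for (pfn = start_pfn; pfn < end_pfn; pfn++)
			init_reserved_pfn(pfn, nid);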

> > > The problem, as I see it, is in the, hmm, questionable semantics for
> > > !E820_TYPE_RAM ranges on x86.
> > > 
> > > I keep asking this simple question again and again and I still have no
> > > clear "yes" or "no" answer:
> > > 
> > > 	Is PFN 0 on x86 memory?
> > 
> > The pfn is the same regardless of whether it's RAM or non-RAM, so I don't
> > think it makes any difference in this respect if it's RAM or not.
> > 
> > > Let's step back for a second: instead of blaming memblock for the lack of
> > > enforcement that .memory and .reserved overlap, and talking about how the
> > > memblock core was not robust enough, we'll get to the point where we have
> > > clear definitions of what e820 ranges represent.
> > 
> > I don't see why we should go back in the direction of the caller
> > needing improvement, like in commit
> > 124049decbb121ec32742c94fb5d9d6bed8f24d8, and try to massage the raw
> > e820 data before giving it to the linux common code.
> 
> That's the entire point: the e820 data is massaged before it gets to
> the common code. 124049decbb121ec32742c94fb5d9d6bed8f24d8 tried to alter
> the way the raw e820 was transformed before going to the common code and
> my feeling is that it was a shot in the dark.
> 
> The current abstraction we have in the common code to describe physical
> memory is memblock, and I want to make it explicit that every range in
> the e820 table is a part of system memory and hence a part of
> memblock.memory.

What's the advantage of not initializing the struct pages in the
memblock.reserved ranges, if you add all memblock.reserved to
memblock.memory and you still initialize all of memblock.memory?

Isn't the cost of the loop and the number of pfns being initialized
then identical to the complex patch?

What you suggest would require adding extra information to flag which
ranges must not have a direct mapping, but isn't that information already
in memblock today, for each range in memblock.reserved but not in
memblock.memory? Or did I misunderstand how that no-direct-map detail works?

I see nothing wrong with adding memblock.reserved to memblock.memory,
but the total information in kernel memory will remain the same; it's
just a different representation that will require more CPU and RAM to
handle, not that it matters. All that matters is what is simpler in the
code, and the current version looks simpler. Or do you expect the
version that enforces the overlap to reduce the kernel code?

> The purpose here is not to enforce overlap at all times [and it is
> possible despite the direct mapping issues ;-) ].
> The idea is to have the abstract description of the physical memory
> (i.e. memblock) as close as possible to actual layout. The "simple" code
> in e820__memblock_setup() only made things obfuscated and inconsistent.
> 
> If you have memory populated at address 0 you would expect that
> for_each_mem_pfn_range() will start from pfn 0.

Specifically, which code do you envision that needs to get the RAM pfns in
physical order and cannot call:

for_each_mem_pfn_range(pfn)
	handle(pfn);
for_each_res_pfn_range(pfn)
	handle(pfn);

If you need it in physical order immediately then having
memblock.memory include memblock.reserved would simplify things for
sure, but who needs that when the two loops run at exactly the same speed?

> In addition, an implicit call to memblock_add() in memblock_reserve()
> would be a waste of CPU time, as most of the time memblock_reserve() is
> called as part of a memory allocation before the page allocator is ready.

Well, given such CPU time is so little, I would still spend that CPU
time to do the enforcement: a bugcheck and a manual call of memblock_add
if any memblock_reserve has no full overlap.

So I'd rather spend the CPU time to do the work itself than risk a
kernel-crashing issue if any of the callers gets a future patch wrong
or if the BIOS maps aren't fully right after a BIOS upgrade.

How much CPU time are we talking about here? Those aren't computations
on the pfns but on the ranges, and there aren't going to be millions of
e820 entries.

Even if you envision millions of e820 entries, the first thing to do
then to save CPU time in memblock is to drop all the lists and replace
them with interval trees based on rbtrees, so that the complexity of the
later loops that intersect the memblock ranges with the nid-zone ranges
goes down to O(log(N)), where N is the number of e820 ranges. Leaving
memblock at risk from a buggy e820 caller is not going to save nearly
as much CPU as such a data structure complexity enhancement would, in
my view. memblock_add would then also run in O(log(N)).
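
Purely to illustrate the complexity argument (not a proposal for an
actual interface, the function below is made up), even the existing
sorted region array already allows an O(log(N)) lookup:

	/*
	 * Illustrative only: binary-search the sorted, non-overlapping
	 * region array for the range containing @pfn; returns the index
	 * or -1 if the pfn falls in a hole.
	 */
	static int __init memblock_find_region(struct memblock_type *type,
					       unsigned long pfn)
	{
		phys_addr_t addr = PFN_PHYS(pfn);
		int lo = 0, hi = type->cnt - 1;

		while (lo <= hi) {
			int mid = lo + (hi - lo) / 2;
			struct memblock_region *r = &type->regions[mid];

			if (addr < r->base)
				hi = mid - 1;
			else if (addr >= r->base + r->size)
				lo = mid + 1;
			else
				return mid;
		}
		return -1;
	}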

> Sorry, I was not clear. The penalty here is not the traversal of
> memblock.reserved array. The penalty is for redundant initialization of
> struct page for ranges memblock.reserved describes.

But that's the fix... How can you avoid removing the PagePoison for
all pfn_valid in memblock.reserved, when the crash of
https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org is
in __SetPageReserved of reserve_bootmem_region?
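
For context, the loop in question looks roughly like this (paraphrased
from reserve_bootmem_region() in mm/page_alloc.c around v5.9; deferred
struct page init details omitted):

	void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
	{
		unsigned long start_pfn = PFN_DOWN(start);
		unsigned long end_pfn = PFN_UP(end);

		for (; start_pfn < end_pfn; start_pfn++) {
			if (pfn_valid(start_pfn)) {
				struct page *page = pfn_to_page(start_pfn);

				/* Avoid false-positive PageTail() */
				INIT_LIST_HEAD(&page->lru);

				/*
				 * With DEBUG_VM_PGFLAGS this is where a
				 * still-PagePoison struct page trips.
				 */
				__SetPageReserved(page);
			}
		}
	}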

> For instance, if the user specified hugetlb_cma=nnG, the loop over
> memblock.reserved will cause a redundant initialization pass over the
> pages in this nnG.

And if that didn't happen, then I'm left wondering if
free_low_memory_core_early would crash again with hugetlb_cma=nnG
specified, not just at pfn 0 anymore, even with the simple fix
applied.

Where does PagePoison get removed from the CMA range, or is
__SetPageReserved somehow not called on the CMA reserved range with the
simple fix applied?

The important thing that the simple fix breaks (and that gets us back to
square one) is that after
https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org
we're finally guaranteed each pfn_valid pfn gets either a "right zoneid"
or PagePoison. It is great in that respect, and I'd rather not break
such a guarantee again with the simple fix, which may still not boot with
a CMA region reservation and will also go back to breaking the check_pfn0
bugcheck above.

Could you test your simple fix with DEBUG_VM=y and hugetlb_cma=nnG,
does it boot? While I'm not entirely sure how the simple fix can boot
with DEBUG_VM=y and hugetlb_cma=nnG (I don't rule out it might boot),
I'm pretty confident my complex fix will still boot fine with
hugetlb_cma=nnG.

> Regardless of particular implementation, the generic code cannot detect
> physical memory layout, this is up to architecture to detect what memory
> banks are populated and provide this information to the generic code.
> 
> So we'll always have what you call "dependencies on the caller" here.
> The good thing is that we can fix "the caller" when we find it's wrong.

The caller that massages the e820 bios table is still
software. Software can go from arch code to common code. There's no
asm or arch dependency in there.

The caller in my view needs to know what e820 is and send down the raw
info... it's a 1:1 API translation; the caller's massaging of the data
can all go in the callee and benefit all callers.

Alternatively, if the caller logic doesn't go from arch code to common
code for whatever reason, the common code can still duplicate the caller
logic and use it to verify the caller did everything right, fix it up if
it didn't, and warn in the logs about it, using the same software
algorithm the caller uses to massage the e820 data.

The benefit is that once the callee does all the validation, all archs
that make a mistake will be told, boot will not fail, and it'll be
more likely to boot stably even in the presence of inconsistent hardware
tables.

The whole point of the above is to reduce the chance that we either
have to keep cut-and-pasting the same logic N times in all callers, or
that in the future a subtle change in the arch code or a subtle
difference in the bios tables (kind of precisely what's crashing v5.8
and v5.9) will cause the kernel to boot unstably in a way that is not
easily reproducible.

Looking at the current e820 code, it looks like it's already expecting
all the validation to happen in the callee.

> > I guess before worrying of pfn 1 not being added to memblock.reserved,
> > I'd worry about pfn 0 itself not being added to memblock.reserved.
> 
> My point was that with a bit different e820 table we may again hit a
> corner case that we didn't anticipate.
> 
> pfn 0 is in memblock.reserved by a coincidence and not by design and using
> memblock.reserved to detect zone/node boundaries is not reliable.
>
> Moreover, if you look for usage of for_each_mem_pfn_range() in
> page_alloc.c, it looks very much like it is presumed to iterate over
> *all* physical memory, including the mysterious pfn 0.

All the complex patch does is guarantee that all the reserved
pages can get a "right" zoneid.

Then we certainly should consider rounding it down by the pageblock in
case pfn 0 isn't reserved.

In fact, if you add all memblock.reserved ranges to memblock.memory,
you'll obtain exactly the same zone_start_pfn as with the complex patch,
and the exact same issue of PagePoison being left on pfn 0 will happen
if pfn 0 isn't added to memblock.memory with { memblock_add;
memblock_reserve }.
 
> > Still with the complex patch applied, if something goes wrong
> > DEBUG_VM=y will make it reproducible, with the simple fix we return in
> > non-reproducible land.
> 
> I still think that the complex patch is going in the wrong direction.
> We cannot rely on memblock.reserved to calculate zones and nodes span.
> 
> This:
> 
> +       /*
> +        * reserved regions must be included so that their page
> +        * structure can be part of a zone and obtain a valid zoneid
> +        * before __SetPageReserved().
> +        */
> +       return min(PHYS_PFN(memblock_start_of_DRAM()),
> +                  PHYS_PFN(memblock.reserved.regions[0].base));
> 
> is definitely a hack that worked because "the caller" has this:
> 
> 	/*
> 	 * Make sure page 0 is always reserved because on systems with
> 	 * L1TF its contents can be leaked to user processes.
> 	 */
> 	memblock_reserve(0, PAGE_SIZE);
> 
> So, what would have happened if L1TF was not yet disclosed? ;-)

It's actually a fine thing to delete the above memblock_reserve(0,
PAGE_SIZE) for all testing, so we can move on to the remaining issues.

After deleting the memblock_reserve of pfn 0, the VM_BUG_ON_PAGE will
show PagePoison in page->flags in the bugcheck above, which is
preferable to having a wrong zoneid. In addition, PagePoison is a
more detectable error than a wrong zoneid, so it's more likely to
notify us that something fundamentally went wrong in memblock.

With the simple fix we'll be left wondering once again, as happened
before
https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org .

Second, the removal of memblock_reserve(0, PAGE_SIZE) is the right
test to verify and exercise the needed round_down to the start
of the pageblock (if there's no adjacent zone preceding it), an
orthogonal enhancement that would be needed anyway even if we only
looked at memblock.memory there (again to avoid leaving PagePoison,
or setting it to a zoneid that won't pass the check_pfn0 above).
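
The round_down I mean is along these lines (a sketch, not taken from
the posted patch):

	/*
	 * Sketch only: align the computed zone start down to a pageblock
	 * boundary so the whole first pageblock gets initialized struct
	 * pages.
	 */
	zone_start_pfn = round_down(zone_start_pfn, pageblock_nr_pages);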

The end result of the "hack" is precisely what you obtain if you add
all memblock.reserved to memblock.memory, except it doesn't require to
add memblock.reserved to memblock.memory.

Thanks,
Andrea



* Re: + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree
  2020-12-08 23:13           ` Andrea Arcangeli
@ 2020-12-09 21:42             ` Mike Rapoport
  0 siblings, 0 replies; 10+ messages in thread
From: Mike Rapoport @ 2020-12-09 21:42 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: akpm, bhe, cai, david, mgorman, mhocko, mm-commits, stable, vbabka

On Tue, Dec 08, 2020 at 06:13:47PM -0500, Andrea Arcangeli wrote:
> On Tue, Dec 08, 2020 at 11:46:14PM +0200, Mike Rapoport wrote:
>
> > Sorry, I was not clear. The penalty here is not the traversal of
> > memblock.reserved array. The penalty is for redundant initialization of
> > struct page for ranges memblock.reserved describes.
> 
> But that's the fix... How can you avoid removing the PagePoison for
> all pfn_valid in memblock.reserved, when the crash of
> https://lore.kernel.org/lkml/20201201181502.2340-1-rppt@kernel.org is
> in __SetPageReserved of reserve_bootmem_region?

With your fix, the pages in memblock.reserved that intersect
with memblock.memory will be initialized twice: the first time in
for_each_mem_pfn_range() and the second time in for_each_res_pfn_range().

> > For instance, if the user specified hugetlb_cma=nnG, the loop over
> > memblock.reserved will cause a redundant initialization pass over the
> > pages in this nnG.
> 
> And if that didn't happen, then I'm left wondering if
> free_low_memory_core_early would crash again with hugetlb_cma=nnG
> specified, not just at pfn 0 anymore, even with the simple fix
> applied.
> 
> Where does PagePoison get removed from the CMA range, or is
> __SetPageReserved somehow not called on the CMA reserved range with the
> simple fix applied?

The CMA range would be in both memblock.memory and memblock.reserved, so
the pages for it will be initialized in for_each_mem_pfn_range().

> > Regardless of particular implementation, the generic code cannot detect
> > physical memory layout, this is up to architecture to detect what memory
> > banks are populated and provide this information to the generic code.
> > 
> > So we'll always have what you call "dependencies on the caller" here.
> > The good thing is that we can fix "the caller" when we find it's wrong.
> 
> The caller that massages the e820 bios table is still
> software. Software can go from arch code to common code. There's no
> asm or arch dependency in there.
> 
> The caller in my view needs to know what e820 is and send down the raw
> info... it's a 1:1 API translation; the caller's massaging of the data
> can all go in the callee and benefit all callers.

Well, the translation between e820 and memblock could have been 1:1, but
it is not. I think e820, and more broadly x86 memory initialization,
could benefit from a more straightforward translation to memblock. For
instance, I don't see why pfn 0 should be marked as reserved in both
e820 and memblock, or why some of the e820 reserved areas never make it
to memblock.reserved.

> The benefit is that once the callee does all the validation, all archs
> that make a mistake will be told, boot will not fail, and it'll be
> more likely to boot stably even in the presence of inconsistent hardware
> tables.

That's true. Making the generic code more robust will benefit everybody.

> > > I guess before worrying of pfn 1 not being added to memblock.reserved,
> > > I'd worry about pfn 0 itself not being added to memblock.reserved.
> > 
> > My point was that with a bit different e820 table we may again hit a
> > corner case that we didn't anticipate.
> > 
> > pfn 0 is in memblock.reserved by a coincidence and not by design and using
> > memblock.reserved to detect zone/node boundaries is not reliable.
> >
> > Moreover, if you look for usage of for_each_mem_pfn_range() in
> > page_alloc.c, it looks very much like it is presumed to iterate over
> > *all* physical memory, including the mysterious pfn 0.
> 
> All the complex patch does is guarantee that all the reserved
> pages can get a "right" zoneid.
> 
> Then we certainly should consider rounding it down by the pageblock in
> case pfn 0 isn't reserved.
> 
> In fact, if you add all memblock.reserved ranges to memblock.memory,
> you'll obtain exactly the same zone_start_pfn as with the complex patch,
> and the exact same issue of PagePoison being left on pfn 0 will happen
> if pfn 0 isn't added to memblock.memory with { memblock_add;
> memblock_reserve }.
>  
> > > Still with the complex patch applied, if something goes wrong
> > > DEBUG_VM=y will make it reproducible, with the simple fix we return in
> > > non-reproducible land.
> > 
> > I still think that the complex patch is going in the wrong direction.
> > We cannot rely on memblock.reserved to calculate zones and nodes span.
> > 
> > This:
> > 
> > +       /*
> > +        * reserved regions must be included so that their page
> > +        * structure can be part of a zone and obtain a valid zoneid
> > +        * before __SetPageReserved().
> > +        */
> > +       return min(PHYS_PFN(memblock_start_of_DRAM()),
> > +                  PHYS_PFN(memblock.reserved.regions[0].base));
> > 
> > is definitely a hack that worked because "the caller" has this:
> > 
> > 	/*
> > 	 * Make sure page 0 is always reserved because on systems with
> > 	 * L1TF its contents can be leaked to user processes.
> > 	 */
> > 	memblock_reserve(0, PAGE_SIZE);
> > 
> > So, what would have happened if L1TF was not yet disclosed? ;-)
> 
> It's actually a fine thing to delete the above memblock_reserve(0,
> PAGE_SIZE) for all testing, so we can move on to the remaining issues.

There are a few more places in arch/x86/kernel/setup.c that
memblock_reserve() one or several pages from address 0 :-)

> The end result of the "hack" is precisely what you obtain if you add
> all memblock.reserved to memblock.memory, except it doesn't require to
> add memblock.reserved to memblock.memory.

I still do not agree that there can be a minimum between
memblock_start_of_DRAM() and anything else. I think it's semantically
wrong.

And I really dislike the addition of memblock.reserved traversals, mostly
because of the areas that actually overlap.

However, I do agree that adding the non-overlapping parts of
memblock.reserved to memblock.memory will make the generic code more
robust. I just don't want to do this implicitly by calling memblock_add()
from memblock_reserve(). I think the cleaner way is to join them just
before free_area_init() starts the zone/node size detection, along the
lines of the sketch below.
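
Roughly something like this (an illustrative sketch only; the function
name is made up and this is not a posted patch):

	/*
	 * Fold reserved ranges that are not covered by memblock.memory
	 * into it, right before free_area_init(), so that zone/node
	 * detection sees them.
	 */
	static void __init memblock_join_reserved(void)
	{
		unsigned long i;

		for (i = 0; i < memblock.reserved.cnt; i++) {
			struct memblock_region *r = &memblock.reserved.regions[i];

			if (!memblock_is_region_memory(r->base, r->size))
				memblock_add(r->base, r->size);
		}
	}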

> Thanks,
> Andrea
> 

-- 
Sincerely yours,
Mike.


end of thread, other threads:[~2020-12-09 21:47 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-06  0:54 + mm-initialize-struct-pages-in-reserved-regions-outside-of-the-zone-ranges.patch added to -mm tree akpm
2020-12-06  6:48 ` Mike Rapoport
2020-12-06 19:30   ` Andrea Arcangeli
2020-12-07 16:50     ` Mike Rapoport
2020-12-08  2:57       ` Andrea Arcangeli
2020-12-08 21:46         ` Mike Rapoport
2020-12-08 23:13           ` Andrea Arcangeli
2020-12-09 21:42             ` Mike Rapoport
2020-12-07  8:58 ` David Hildenbrand
2020-12-07  9:45   ` Mike Rapoport
