[v5,1/1] mm: refactor initialization of struct page for holes in memory layout

Message ID 20210208110820.6269-1-rppt@kernel.org
State In Next
Commit 9c531325a228069fe8716bfa11a68ae4740c41fe

Commit Message

Mike Rapoport Feb. 8, 2021, 11:08 a.m. UTC
From: Mike Rapoport <rppt@linux.ibm.com>

There could be struct pages that are not backed by actual physical memory.
This can happen when the actual memory bank is not a multiple of
SECTION_SIZE or when an architecture does not register memory holes
reserved by the firmware as memblock.memory.

Such pages are currently initialized by the init_unavailable_mem() function,
which iterates through the PFNs in holes in memblock.memory and, if there is
a struct page corresponding to a PFN, sets the fields of that page to
default values and marks the page as Reserved.

init_unavailable_mem() does not take into account the zone and node the page
belongs to and sets both the zone and node links in struct page to zero.

On a system that has firmware reserved holes in a zone above ZONE_DMA, for
instance in a configuration below:

	# grep -A1 E820 /proc/iomem
	7a17b000-7a216fff : Unknown E820 type
	7a217000-7bffffff : System RAM

the unset zone link in struct page will trigger

	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

because the same pageblock contains both ZONE_DMA32 pages and pages whose
unset zone link in struct page reads as ZONE_DMA.

Moreover, it is possible that the start of the lowest node and zone is not
aligned to a section boundary, for example on x86:

[    0.078898] Zone ranges:
[    0.078899]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
...
[    0.078910] Early memory node ranges
[    0.078912]   node   0: [mem 0x0000000000001000-0x000000000009cfff]
[    0.078913]   node   0: [mem 0x0000000000100000-0x000000003fffffff]

and thus with SPARSEMEM memory model the beginning of the memory map will
have struct pages that are not spanned by any node and zone.

Update the detection of node boundaries in get_pfn_range_for_nid() so that
the node range is expanded to cover complete memory map sections. Since zone
spans are derived from the node span, there will always be a zone that
covers the part of the memory map with unavailable pages.

Interleave the initialization of the unavailable pages with the normal
initialization of the memory map, so that zone and node information is
properly set on struct pages that are not backed by actual memory.

Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 160 +++++++++++++++++++++++-------------------------
 1 file changed, 75 insertions(+), 85 deletions(-)

Comments

Andrew Morton Feb. 8, 2021, 9:11 p.m. UTC | #1
On Mon,  8 Feb 2021 13:08:20 +0200 Mike Rapoport <rppt@kernel.org> wrote:

> There could be struct pages that are not backed by actual physical memory.
> This can happen when the actual memory bank is not a multiple of
> SECTION_SIZE or when an architecture does not register memory holes
> reserved by the firmware as memblock.memory.
> 
> Such pages are currently initialized using init_unavailable_mem() function
> that iterates through PFNs in holes in memblock.memory and if there is a
> struct page corresponding to a PFN, the fields of this page are set to
> default values and it is marked as Reserved.
> 
> init_unavailable_mem() does not take into account zone and node the page
> belongs to and sets both zone and node links in struct page to zero.
> 
> On a system that has firmware reserved holes in a zone above ZONE_DMA, for
> instance in a configuration below:
> 
> 	# grep -A1 E820 /proc/iomem
> 	7a17b000-7a216fff : Unknown E820 type
> 	7a217000-7bffffff : System RAM
> 
> unset zone link in struct page will trigger
> 
> 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> 
> because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link
> in struct page) in the same pageblock.
> 
> ...
>
> 
> Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather
> that check each PFN")

What are your thoughts on the priority of this (rather large!) fix? 
Are such systems sufficiently common to warrant a 5.11 merge?  -stable?
Mike Rapoport Feb. 8, 2021, 9:25 p.m. UTC | #2
On Mon, Feb 08, 2021 at 01:11:00PM -0800, Andrew Morton wrote:
> On Mon,  8 Feb 2021 13:08:20 +0200 Mike Rapoport <rppt@kernel.org> wrote:
> 
> > There could be struct pages that are not backed by actual physical memory.
> > This can happen when the actual memory bank is not a multiple of
> > SECTION_SIZE or when an architecture does not register memory holes
> > reserved by the firmware as memblock.memory.
> > 
> > Such pages are currently initialized using init_unavailable_mem() function
> > that iterates through PFNs in holes in memblock.memory and if there is a
> > struct page corresponding to a PFN, the fields of this page are set to
> > default values and it is marked as Reserved.
> > 
> > init_unavailable_mem() does not take into account zone and node the page
> > belongs to and sets both zone and node links in struct page to zero.
> > 
> > On a system that has firmware reserved holes in a zone above ZONE_DMA, for
> > instance in a configuration below:
> > 
> > 	# grep -A1 E820 /proc/iomem
> > 	7a17b000-7a216fff : Unknown E820 type
> > 	7a217000-7bffffff : System RAM
> > 
> > unset zone link in struct page will trigger
> > 
> > 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> > 
> > because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link
> > in struct page) in the same pageblock.
> > 
> > ...
> >
> > 
> > Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather
> > that check each PFN")
> 
> What are your thoughts on the priority of this (rather large!) fix? 
> Are such systems sufficiently common to warrant a 5.11 merge?  -stable?

I don't know how common such systems are, but the bug is exposed only in
builds with DEBUG_VM=y, so after the problems with previous versions
discovered by various CI systems I'd say hold it off till 5.11 is out.

If this time the fix works it'll make it to -stable anyway :)
David Hildenbrand Feb. 12, 2021, 9:55 a.m. UTC | #3
On 08.02.21 12:08, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> There could be struct pages that are not backed by actual physical memory.
> This can happen when the actual memory bank is not a multiple of
> SECTION_SIZE or when an architecture does not register memory holes
> reserved by the firmware as memblock.memory.
> 
> Such pages are currently initialized using init_unavailable_mem() function
> that iterates through PFNs in holes in memblock.memory and if there is a
> struct page corresponding to a PFN, the fields of this page are set to
> default values and it is marked as Reserved.
> 
> init_unavailable_mem() does not take into account zone and node the page
> belongs to and sets both zone and node links in struct page to zero.
> 
> On a system that has firmware reserved holes in a zone above ZONE_DMA, for
> instance in a configuration below:
> 
> 	# grep -A1 E820 /proc/iomem
> 	7a17b000-7a216fff : Unknown E820 type
> 	7a217000-7bffffff : System RAM
> 
> unset zone link in struct page will trigger
> 
> 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> 
> because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link
> in struct page) in the same pageblock.
> 
> Moreover, it is possible that the lowest node and zone start is not aligned
> to the section boundarie, for example on x86:
> 
> [    0.078898] Zone ranges:
> [    0.078899]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> ...
> [    0.078910] Early memory node ranges
> [    0.078912]   node   0: [mem 0x0000000000001000-0x000000000009cfff]
> [    0.078913]   node   0: [mem 0x0000000000100000-0x000000003fffffff]
> 
> and thus with SPARSEMEM memory model the beginning of the memory map will
> have struct pages that are not spanned by any node and zone.
> 
> Update detection of node boundaries in get_pfn_range_for_nid() so that the
> node range will be expanded to cover memory map section. Since zone spans
> are derived from the node span, there always will be a zone that covers the
> part of the memory map with unavailable pages.
> 
> Interleave initialization of the unavailable pages with the normal
> initialization of memory map, so that zone and node information will be
> properly set on struct pages that are not backed by the actual memory.
> 
> Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather
> that check each PFN")
> Reported-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> ---
>   mm/page_alloc.c | 160 +++++++++++++++++++++++-------------------------
>   1 file changed, 75 insertions(+), 85 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6446778cbc6b..1c3f7521028f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6257,22 +6257,84 @@ static void __meminit zone_init_free_lists(struct zone *zone)
>   	}
>   }
>   
> +#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
> +/*
> + * Only struct pages that correspond to ranges defined by memblock.memory
> + * are zeroed and initialized by going through __init_single_page() during
> + * memmap_init_zone().
> + *
> + * But, there could be struct pages that correspond to holes in
> + * memblock.memory. This can happen because of the following reasons:
> + * - phyiscal memory bank size is not necessarily the exact multiple of the
> + *   arbitrary section size
> + * - early reserved memory may not be listed in memblock.memory
> + * - memory layouts defined with memmap= kernel parameter may not align
> + *   nicely with memmap sections
> + *
> + * Explicitly initialize those struct pages so that:
> + * - PG_Reserved is set
> + * - zone and node links point to zone and node that span the page
> + */
> +static u64 __meminit init_unavailable_range(unsigned long spfn,
> +					    unsigned long epfn,
> +					    int zone, int node)
> +{
> +	unsigned long pfn;
> +	u64 pgcnt = 0;
> +
> +	for (pfn = spfn; pfn < epfn; pfn++) {
> +		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
> +			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
> +				+ pageblock_nr_pages - 1;
> +			continue;
> +		}
> +		__init_single_page(pfn_to_page(pfn), pfn, zone, node);
> +		__SetPageReserved(pfn_to_page(pfn));
> +		pgcnt++;
> +	}
> +
> +	return pgcnt;
> +}
> +#else
> +static inline u64 init_unavailable_range(unsigned long spfn, unsigned long epfn,
> +					 int zone, int node)
> +{
> +	return 0;
> +}
> +#endif
> +
>   void __meminit __weak memmap_init_zone(struct zone *zone)
>   {
>   	unsigned long zone_start_pfn = zone->zone_start_pfn;
>   	unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
>   	int i, nid = zone_to_nid(zone), zone_id = zone_idx(zone);
>   	unsigned long start_pfn, end_pfn;
> +	unsigned long hole_pfn = 0;
> +	u64 pgcnt = 0;
>   
>   	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>   		start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
>   		end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
> +		hole_pfn = clamp(hole_pfn, zone_start_pfn, zone_end_pfn);
>   
>   		if (end_pfn > start_pfn)
>   			memmap_init_range(end_pfn - start_pfn, nid,
>   					zone_id, start_pfn, zone_end_pfn,
>   					MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
> +
> +		if (hole_pfn < start_pfn)
> +			pgcnt += init_unavailable_range(hole_pfn, start_pfn,
> +							zone_id, nid);
> +		hole_pfn = end_pfn;
>   	}
> +
> +	if (hole_pfn < zone_end_pfn)
> +		pgcnt += init_unavailable_range(hole_pfn, zone_end_pfn,
> +						zone_id, nid);
> +
> +	if (pgcnt)
> +		pr_info("  %s zone: %lld pages in unavailable ranges\n",
> +			zone->name, pgcnt);
>   }
>   
>   static int zone_batchsize(struct zone *zone)
> @@ -6519,8 +6581,19 @@ void __init get_pfn_range_for_nid(unsigned int nid,
>   		*end_pfn = max(*end_pfn, this_end_pfn);
>   	}
>   
> -	if (*start_pfn == -1UL)
> +	if (*start_pfn == -1UL) {
>   		*start_pfn = 0;
> +		return;
> +	}
> +
> +#ifdef CONFIG_SPARSEMEM
> +	/*
> +	 * Sections in the memory map may not match actual populated
> +	 * memory, extend the node span to cover the entire section.
> +	 */
> +	*start_pfn = round_down(*start_pfn, PAGES_PER_SECTION);
> +	*end_pfn = round_up(*end_pfn, PAGES_PER_SECTION);

Does that mean that we might create overlapping zones when one node 
starts in the middle of a section and the other one ends in the middle 
of a section?

Could it be a problem? (e.g., would we have to look at neighboring nodes 
when making the decision to extend, and how far to extend?)
David Hildenbrand Feb. 12, 2021, 9:56 a.m. UTC | #4
On 12.02.21 10:55, David Hildenbrand wrote:
> On 08.02.21 12:08, Mike Rapoport wrote:
>> From: Mike Rapoport <rppt@linux.ibm.com>
>>
>> There could be struct pages that are not backed by actual physical memory.
>> This can happen when the actual memory bank is not a multiple of
>> SECTION_SIZE or when an architecture does not register memory holes
>> reserved by the firmware as memblock.memory.
>>
>> Such pages are currently initialized using init_unavailable_mem() function
>> that iterates through PFNs in holes in memblock.memory and if there is a
>> struct page corresponding to a PFN, the fields of this page are set to
>> default values and it is marked as Reserved.
>>
>> init_unavailable_mem() does not take into account zone and node the page
>> belongs to and sets both zone and node links in struct page to zero.
>>
>> On a system that has firmware reserved holes in a zone above ZONE_DMA, for
>> instance in a configuration below:
>>
>> 	# grep -A1 E820 /proc/iomem
>> 	7a17b000-7a216fff : Unknown E820 type
>> 	7a217000-7bffffff : System RAM
>>
>> unset zone link in struct page will trigger
>>
>> 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
>>
>> because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link
>> in struct page) in the same pageblock.
>>
>> Moreover, it is possible that the lowest node and zone start is not aligned
>> to the section boundarie, for example on x86:
>>
>> [    0.078898] Zone ranges:
>> [    0.078899]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
>> ...
>> [    0.078910] Early memory node ranges
>> [    0.078912]   node   0: [mem 0x0000000000001000-0x000000000009cfff]
>> [    0.078913]   node   0: [mem 0x0000000000100000-0x000000003fffffff]
>>
>> and thus with SPARSEMEM memory model the beginning of the memory map will
>> have struct pages that are not spanned by any node and zone.
>>
>> Update detection of node boundaries in get_pfn_range_for_nid() so that the
>> node range will be expanded to cover memory map section. Since zone spans
>> are derived from the node span, there always will be a zone that covers the
>> part of the memory map with unavailable pages.
>>
>> Interleave initialization of the unavailable pages with the normal
>> initialization of memory map, so that zone and node information will be
>> properly set on struct pages that are not backed by the actual memory.
>>
>> Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather
>> that check each PFN")
>> Reported-by: Andrea Arcangeli <aarcange@redhat.com>
>> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
>> Cc: Baoquan He <bhe@redhat.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Qian Cai <cai@lca.pw>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> ---
>>    mm/page_alloc.c | 160 +++++++++++++++++++++++-------------------------
>>    1 file changed, 75 insertions(+), 85 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6446778cbc6b..1c3f7521028f 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6257,22 +6257,84 @@ static void __meminit zone_init_free_lists(struct zone *zone)
>>    	}
>>    }
>>    
>> +#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
>> +/*
>> + * Only struct pages that correspond to ranges defined by memblock.memory
>> + * are zeroed and initialized by going through __init_single_page() during
>> + * memmap_init_zone().
>> + *
>> + * But, there could be struct pages that correspond to holes in
>> + * memblock.memory. This can happen because of the following reasons:
>> + * - phyiscal memory bank size is not necessarily the exact multiple of the
>> + *   arbitrary section size
>> + * - early reserved memory may not be listed in memblock.memory
>> + * - memory layouts defined with memmap= kernel parameter may not align
>> + *   nicely with memmap sections
>> + *
>> + * Explicitly initialize those struct pages so that:
>> + * - PG_Reserved is set
>> + * - zone and node links point to zone and node that span the page
>> + */
>> +static u64 __meminit init_unavailable_range(unsigned long spfn,
>> +					    unsigned long epfn,
>> +					    int zone, int node)
>> +{
>> +	unsigned long pfn;
>> +	u64 pgcnt = 0;
>> +
>> +	for (pfn = spfn; pfn < epfn; pfn++) {
>> +		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
>> +			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
>> +				+ pageblock_nr_pages - 1;
>> +			continue;
>> +		}
>> +		__init_single_page(pfn_to_page(pfn), pfn, zone, node);
>> +		__SetPageReserved(pfn_to_page(pfn));
>> +		pgcnt++;
>> +	}
>> +
>> +	return pgcnt;
>> +}
>> +#else
>> +static inline u64 init_unavailable_range(unsigned long spfn, unsigned long epfn,
>> +					 int zone, int node)
>> +{
>> +	return 0;
>> +}
>> +#endif
>> +
>>    void __meminit __weak memmap_init_zone(struct zone *zone)
>>    {
>>    	unsigned long zone_start_pfn = zone->zone_start_pfn;
>>    	unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
>>    	int i, nid = zone_to_nid(zone), zone_id = zone_idx(zone);
>>    	unsigned long start_pfn, end_pfn;
>> +	unsigned long hole_pfn = 0;
>> +	u64 pgcnt = 0;
>>    
>>    	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>>    		start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
>>    		end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
>> +		hole_pfn = clamp(hole_pfn, zone_start_pfn, zone_end_pfn);
>>    
>>    		if (end_pfn > start_pfn)
>>    			memmap_init_range(end_pfn - start_pfn, nid,
>>    					zone_id, start_pfn, zone_end_pfn,
>>    					MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
>> +
>> +		if (hole_pfn < start_pfn)
>> +			pgcnt += init_unavailable_range(hole_pfn, start_pfn,
>> +							zone_id, nid);
>> +		hole_pfn = end_pfn;
>>    	}
>> +
>> +	if (hole_pfn < zone_end_pfn)
>> +		pgcnt += init_unavailable_range(hole_pfn, zone_end_pfn,
>> +						zone_id, nid);
>> +
>> +	if (pgcnt)
>> +		pr_info("  %s zone: %lld pages in unavailable ranges\n",
>> +			zone->name, pgcnt);
>>    }
>>    
>>    static int zone_batchsize(struct zone *zone)
>> @@ -6519,8 +6581,19 @@ void __init get_pfn_range_for_nid(unsigned int nid,
>>    		*end_pfn = max(*end_pfn, this_end_pfn);
>>    	}
>>    
>> -	if (*start_pfn == -1UL)
>> +	if (*start_pfn == -1UL) {
>>    		*start_pfn = 0;
>> +		return;
>> +	}
>> +
>> +#ifdef CONFIG_SPARSEMEM
>> +	/*
>> +	 * Sections in the memory map may not match actual populated
>> +	 * memory, extend the node span to cover the entire section.
>> +	 */
>> +	*start_pfn = round_down(*start_pfn, PAGES_PER_SECTION);
>> +	*end_pfn = round_up(*end_pfn, PAGES_PER_SECTION);
> 
> Does that mean that we might create overlapping zones when one node

s/overlapping zones/overlapping nodes/

> starts in the middle of a section and the other one ends in the middle
> of a section?
> 
> Could it be a problem? (e.g., would we have to look at neighboring nodes
> when making the decision to extend, and how far to extend?)
>
Michal Hocko Feb. 12, 2021, 10:11 a.m. UTC | #5
On Fri 12-02-21 10:56:19, David Hildenbrand wrote:
> On 12.02.21 10:55, David Hildenbrand wrote:
> > On 08.02.21 12:08, Mike Rapoport wrote:
[...]
> > > @@ -6519,8 +6581,19 @@ void __init get_pfn_range_for_nid(unsigned int nid,
> > >    		*end_pfn = max(*end_pfn, this_end_pfn);
> > >    	}
> > > -	if (*start_pfn == -1UL)
> > > +	if (*start_pfn == -1UL) {
> > >    		*start_pfn = 0;
> > > +		return;
> > > +	}
> > > +
> > > +#ifdef CONFIG_SPARSEMEM
> > > +	/*
> > > +	 * Sections in the memory map may not match actual populated
> > > +	 * memory, extend the node span to cover the entire section.
> > > +	 */
> > > +	*start_pfn = round_down(*start_pfn, PAGES_PER_SECTION);
> > > +	*end_pfn = round_up(*end_pfn, PAGES_PER_SECTION);
> > 
> > Does that mean that we might create overlapping zones when one node
> 
> s/overlapping zones/overlapping nodes/

I didn't get to review the patch yet. Just wanted to note that we can
interleave nodes/zones. Or what kind of concern do you have in mind?
David Hildenbrand Feb. 12, 2021, 10:16 a.m. UTC | #6
On 12.02.21 11:11, Michal Hocko wrote:
> On Fri 12-02-21 10:56:19, David Hildenbrand wrote:
>> On 12.02.21 10:55, David Hildenbrand wrote:
>>> On 08.02.21 12:08, Mike Rapoport wrote:
> [...]
>>>> @@ -6519,8 +6581,19 @@ void __init get_pfn_range_for_nid(unsigned int nid,
>>>>     		*end_pfn = max(*end_pfn, this_end_pfn);
>>>>     	}
>>>> -	if (*start_pfn == -1UL)
>>>> +	if (*start_pfn == -1UL) {
>>>>     		*start_pfn = 0;
>>>> +		return;
>>>> +	}
>>>> +
>>>> +#ifdef CONFIG_SPARSEMEM
>>>> +	/*
>>>> +	 * Sections in the memory map may not match actual populated
>>>> +	 * memory, extend the node span to cover the entire section.
>>>> +	 */
>>>> +	*start_pfn = round_down(*start_pfn, PAGES_PER_SECTION);
>>>> +	*end_pfn = round_up(*end_pfn, PAGES_PER_SECTION);
>>>
>>> Does that mean that we might create overlapping zones when one node
>>
>> s/overlapping zones/overlapping nodes/
> 
> I didn't get to review the patch yet. Just wanted to note that we can
> interleave nodes/zone. Or what kind of concern do you have in mind?

I know that we can have it after boot, when hotplugging memory. How 
about during boot?

For example, which node will a PFN then actually be assigned to?

I was just wondering if this might result in issues - if that can 
already happen, then it's just fine I guess.
Michal Hocko Feb. 12, 2021, 10:33 a.m. UTC | #7
On Mon 08-02-21 13:08:20, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> There could be struct pages that are not backed by actual physical memory.
> This can happen when the actual memory bank is not a multiple of
> SECTION_SIZE or when an architecture does not register memory holes
> reserved by the firmware as memblock.memory.
> 
> Such pages are currently initialized using init_unavailable_mem() function
> that iterates through PFNs in holes in memblock.memory and if there is a
> struct page corresponding to a PFN, the fields of this page are set to
> default values and it is marked as Reserved.
> 
> init_unavailable_mem() does not take into account zone and node the page
> belongs to and sets both zone and node links in struct page to zero.

IIUC the zone should be associated based on the pfn and architecture
constraints on zones. The node is then guessed based on the last existing
range, right?

> On a system that has firmware reserved holes in a zone above ZONE_DMA, for
> instance in a configuration below:
> 
> 	# grep -A1 E820 /proc/iomem
> 	7a17b000-7a216fff : Unknown E820 type
> 	7a217000-7bffffff : System RAM

I like the description here though. Thanks, very useful.

> unset zone link in struct page will trigger
> 
> 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

I guess you mean set_pfnblock_flags_mask, right? Is this BUG_ON really
needed? Maybe we just need to skip over reserved pages?

> because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link
> in struct page) in the same pageblock.
> 
> Moreover, it is possible that the lowest node and zone start is not aligned
> to the section boundarie, for example on x86:
> 
> [    0.078898] Zone ranges:
> [    0.078899]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> ...
> [    0.078910] Early memory node ranges
> [    0.078912]   node   0: [mem 0x0000000000001000-0x000000000009cfff]
> [    0.078913]   node   0: [mem 0x0000000000100000-0x000000003fffffff]
> 
> and thus with SPARSEMEM memory model the beginning of the memory map will
> have struct pages that are not spanned by any node and zone.
> 
> Update detection of node boundaries in get_pfn_range_for_nid() so that the
> node range will be expanded to cover memory map section. Since zone spans
> are derived from the node span, there always will be a zone that covers the
> part of the memory map with unavailable pages.
> 
> Interleave initialization of the unavailable pages with the normal
> initialization of memory map, so that zone and node information will be
> properly set on struct pages that are not backed by the actual memory.

I have to digest this but my first impression is that this is more
heavyweight than it needs to be. PFN walkers should normally obey the node
range at least. The first pfn is usually excluded but I haven't seen real
problems with that. The VM_BUG_ON blowing up is really bad, but as said
above we can simply make it less offensive in the presence of reserved
pages, as those shouldn't reach that path normally AFAICS. Or we can just
remove it. It is not really clear to me whether it has any value beyond
debugging TBH.
Michal Hocko Feb. 12, 2021, 10:37 a.m. UTC | #8
On Fri 12-02-21 11:16:28, David Hildenbrand wrote:
> On 12.02.21 11:11, Michal Hocko wrote:
> > On Fri 12-02-21 10:56:19, David Hildenbrand wrote:
> > > On 12.02.21 10:55, David Hildenbrand wrote:
> > > > On 08.02.21 12:08, Mike Rapoport wrote:
> > [...]
> > > > > @@ -6519,8 +6581,19 @@ void __init get_pfn_range_for_nid(unsigned int nid,
> > > > >     		*end_pfn = max(*end_pfn, this_end_pfn);
> > > > >     	}
> > > > > -	if (*start_pfn == -1UL)
> > > > > +	if (*start_pfn == -1UL) {
> > > > >     		*start_pfn = 0;
> > > > > +		return;
> > > > > +	}
> > > > > +
> > > > > +#ifdef CONFIG_SPARSEMEM
> > > > > +	/*
> > > > > +	 * Sections in the memory map may not match actual populated
> > > > > +	 * memory, extend the node span to cover the entire section.
> > > > > +	 */
> > > > > +	*start_pfn = round_down(*start_pfn, PAGES_PER_SECTION);
> > > > > +	*end_pfn = round_up(*end_pfn, PAGES_PER_SECTION);
> > > > 
> > > > Does that mean that we might create overlapping zones when one node
> > > 
> > > s/overlapping zones/overlapping nodes/
> > 
> > I didn't get to review the patch yet. Just wanted to note that we can
> > interleave nodes/zone. Or what kind of concern do you have in mind?
> 
> I know that we can have it after boot, when hotplugging memory. How about
> during boot?

I have seen systems where NUMA nodes are interleaved.

> For example, which node will a PFN then actually be assigned to?

I would have to double check all the details, but I do remember there was
non-trivial work to handle those systems.
David Hildenbrand Feb. 12, 2021, 10:42 a.m. UTC | #9
On 12.02.21 11:33, Michal Hocko wrote:
> On Mon 08-02-21 13:08:20, Mike Rapoport wrote:
>> From: Mike Rapoport <rppt@linux.ibm.com>
>>
>> There could be struct pages that are not backed by actual physical memory.
>> This can happen when the actual memory bank is not a multiple of
>> SECTION_SIZE or when an architecture does not register memory holes
>> reserved by the firmware as memblock.memory.
>>
>> Such pages are currently initialized using init_unavailable_mem() function
>> that iterates through PFNs in holes in memblock.memory and if there is a
>> struct page corresponding to a PFN, the fields of this page are set to
>> default values and it is marked as Reserved.
>>
>> init_unavailable_mem() does not take into account zone and node the page
>> belongs to and sets both zone and node links in struct page to zero.
> 
> IIUC the zone should be associated based on the pfn and architecture
> constraines on zones. Node is then guessed based on the last existing
> range, right?
> 
>> On a system that has firmware reserved holes in a zone above ZONE_DMA, for
>> instance in a configuration below:
>>
>> 	# grep -A1 E820 /proc/iomem
>> 	7a17b000-7a216fff : Unknown E820 type
>> 	7a217000-7bffffff : System RAM
> 
> I like the description here though. Thanks very useful.
> 
>> unset zone link in struct page will trigger
>>
>> 	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> 
> I guess you mean set_pfnblock_flags_mask, right? Is this bug on really
> needed? Maybe we just need to skip over reserved pages?
> 
>> because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link
>> in struct page) in the same pageblock.
>>
>> Moreover, it is possible that the lowest node and zone start is not aligned
>> to the section boundarie, for example on x86:
>>
>> [    0.078898] Zone ranges:
>> [    0.078899]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
>> ...
>> [    0.078910] Early memory node ranges
>> [    0.078912]   node   0: [mem 0x0000000000001000-0x000000000009cfff]
>> [    0.078913]   node   0: [mem 0x0000000000100000-0x000000003fffffff]
>>
>> and thus with SPARSEMEM memory model the beginning of the memory map will
>> have struct pages that are not spanned by any node and zone.
>>
>> Update detection of node boundaries in get_pfn_range_for_nid() so that the
>> node range will be expanded to cover memory map section. Since zone spans
>> are derived from the node span, there always will be a zone that covers the
>> part of the memory map with unavailable pages.
>>
>> Interleave initialization of the unavailable pages with the normal
>> initialization of memory map, so that zone and node information will be
>> properly set on struct pages that are not backed by the actual memory.
> 
> I have to digest this but my first impression is that this is more heavy
> weight than it needs to. Pfn walkers should normally obey node range at
> least. The first pfn is usually excluded but I haven't seen real

We've seen examples where this is not sufficient. Simple example:

Have your physical memory end within a memory section. Easy via QEMU, 
just do a "-m 4000M". The remaining part of the last section has 
fake/wrong node/zone info.

Hotplug memory. The node/zone gets resized such that PFN walkers might 
stumble over it.

The basic idea is to make sure that any initialized/"online" pfn belongs 
to exactly one node/zone and that the node/zone spans that PFN.

> problems with that. The VM_BUG_ON blowing up is really bad but as said
> above we can simply make it less offensive in presence of reserved pages
> as those shouldn't reach that path AFAICS normally.

Andrea tried working around it via PG_reserved pages and it resulted in
quite some ugly code. Andrea also noted that we cannot rely on any random
page walker to do the right thing when it comes to messed-up node/zone
info.
Michal Hocko Feb. 12, 2021, 1:18 p.m. UTC | #10
On Fri 12-02-21 11:42:15, David Hildenbrand wrote:
> On 12.02.21 11:33, Michal Hocko wrote:
[...]
> > I have to digest this but my first impression is that this is more heavy
> > weight than it needs to. Pfn walkers should normally obey node range at
> > least. The first pfn is usually excluded but I haven't seen real
> 
> We've seen examples where this is not sufficient. Simple example:
> 
> Have your physical memory end within a memory section. Easy via QEMU, just
> do a "-m 4000M". The remaining part of the last section has fake/wrong
> node/zone info.

Does this really matter, though? If those pages are reserved then nobody
will touch them regardless of their node/zone ids.

> Hotplug memory. The node/zone gets resized such that PFN walkers might
> stumble over it.
> 
> The basic idea is to make sure that any initialized/"online" pfn belongs to
> exactly one node/zone and that the node/zone spans that PFN.

Yeah, this sounds like a good idea, but what is the proper node for a hole
between two ranges associated with different nodes/zones? This will
always be a random number. We should have a clear way to tell "do not
touch those pages" and PageReserved sounds like a good way to tell that.

> > problems with that. The VM_BUG_ON blowing up is really bad but as said
> > above we can simply make it less offensive in presence of reserved pages
> > as those shouldn't reach that path AFAICS normally.
> 
> Andrea tried tried working around if via PG_reserved pages and it resulted
> in quite some ugly code. Andrea also noted that we cannot rely on any random
> page walker to do the right think when it comes to messed up node/zone info.

I am sorry, I haven't followed previous discussions. Has the removal of
the VM_BUG_ON been considered as an immediate workaround?
Mike Rapoport Feb. 14, 2021, 5:29 p.m. UTC | #11
On Fri, Feb 12, 2021 at 10:56:19AM +0100, David Hildenbrand wrote:
> On 12.02.21 10:55, David Hildenbrand wrote:
> > On 08.02.21 12:08, Mike Rapoport wrote:
> > > +#ifdef CONFIG_SPARSEMEM
> > > +	/*
> > > +	 * Sections in the memory map may not match actual populated
> > > +	 * memory, extend the node span to cover the entire section.
> > > +	 */
> > > +	*start_pfn = round_down(*start_pfn, PAGES_PER_SECTION);
> > > +	*end_pfn = round_up(*end_pfn, PAGES_PER_SECTION);
> > 
> > Does that mean that we might create overlapping zones when one node
> 
> s/overlapping zones/overlapping nodes/
> 
> > starts in the middle of a section and the other one ends in the middle
> > of a section?
> 
> > Could it be a problem? (e.g., would we have to look at neighboring nodes
> > when making the decision to extend, and how far to extend?)

Having a node end/start in the middle of a section would be a problem, but
in this case I don't see a way to detect how a node should be extended :(

We can return to v4 [1] without the x86 modifications.
With that we'll have struct pages corresponding to a hole in the middle of a
zone with a correct zone link and a good guess for the node.

As for pfn 0 on x86, it'll remain outside any node and zone, but since
that has been the case since, like, forever, I think we can live with it.

[1] https://lore.kernel.org/lkml/20210130221035.4169-1-rppt@kernel.org
Mike Rapoport Feb. 14, 2021, 6 p.m. UTC | #12
On Fri, Feb 12, 2021 at 02:18:20PM +0100, Michal Hocko wrote:
> On Fri 12-02-21 11:42:15, David Hildenbrand wrote:
> > On 12.02.21 11:33, Michal Hocko wrote:
> [...]
> > > I have to digest this but my first impression is that this is more heavy
> > > weight than it needs to. Pfn walkers should normally obey node range at
> > > least. The first pfn is usually excluded but I haven't seen real
> > 
> > We've seen examples where this is not sufficient. Simple example:
> > 
> > Have your physical memory end within a memory section. Easy via QEMU, just
> > do a "-m 4000M". The remaining part of the last section has fake/wrong
> > node/zone info.
> 
> Does this really matter though. If those pages are reserved then nobody
> will touch them regardless of their node/zone ids.
>
> > Hotplug memory. The node/zone gets resized such that PFN walkers might
> > stumble over it.
> > 
> > The basic idea is to make sure that any initialized/"online" pfn belongs to
> > exactly one node/zone and that the node/zone spans that PFN.
> 
> Yeah, this sounds like a good idea but what is the poper node for hole
> between two ranges associated with a different nodes/zones? This will
> always be a random number. We should have a clear way to tell "do not
> touch those pages" and PageReserved sounds like a good way to tell that.
 
Nobody should touch reserved pages, but I don't think we can ensure that.

We can correctly set the zone links for the reserved pages in holes in the
middle of a zone based on the architecture constraints. Then only the holes
at the beginning/end of the memory will not be spanned by any node/zone,
which in practice does not seem to be a problem, as the VM_BUG_ON in
set_pfnblock_flags_mask() never triggered on pfn 0.

I believe that any improvement in memory map consistency is a step forward.

> > > problems with that. The VM_BUG_ON blowing up is really bad but as said
> > > above we can simply make it less offensive in presence of reserved pages
> > > as those shouldn't reach that path AFAICS normally.
> > 
> > Andrea tried tried working around if via PG_reserved pages and it resulted
> > in quite some ugly code. Andrea also noted that we cannot rely on any random
> > page walker to do the right think when it comes to messed up node/zone info.
> 
> I am sorry, I haven't followed previous discussions. Has the removal of
> the VM_BUG_ON been considered as an immediate workaround?

It was never discussed, but I'm not sure it's a good idea.

Judging by the commit message that introduced the VM_BUG_ON (commit
86051ca5eaf5 ("mm: fix usemap initialization")) there was yet another
inconsistency in the memory map that required special care.
David Hildenbrand Feb. 15, 2021, 8:45 a.m. UTC | #13
On 14.02.21 18:29, Mike Rapoport wrote:
> On Fri, Feb 12, 2021 at 10:56:19AM +0100, David Hildenbrand wrote:
>> On 12.02.21 10:55, David Hildenbrand wrote:
>>> On 08.02.21 12:08, Mike Rapoport wrote:
>>>> +#ifdef CONFIG_SPARSEMEM
>>>> +	/*
>>>> +	 * Sections in the memory map may not match actual populated
>>>> +	 * memory, extend the node span to cover the entire section.
>>>> +	 */
>>>> +	*start_pfn = round_down(*start_pfn, PAGES_PER_SECTION);
>>>> +	*end_pfn = round_up(*end_pfn, PAGES_PER_SECTION);
>>>
>>> Does that mean that we might create overlapping zones when one node
>>
>> s/overlapping zones/overlapping nodes/
>>
>>> starts in the middle of a section and the other one ends in the middle
>>> of a section?
>>
>>> Could it be a problem? (e.g., would we have to look at neighboring nodes
>>> when making the decision to extend, and how far to extend?)
> 
> Having a node end/start in a middle of a section would be a problem, but in
> this case I don't see a way to detect how a node should be extended :(

Running QEMU with something like:

...
     -m 8G \
     -smp sockets=2,cores=2 \
     -object memory-backend-ram,id=bmem0,size=4160M \
     -object memory-backend-ram,id=bmem1,size=4032M \
     -numa node,nodeid=0,cpus=0-1,memdev=bmem0 -numa node,nodeid=1,cpus=2-3,memdev=bmem1 \
...

Creates such a setup.

With an older kernel:

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000023fffffff] usable
[...]
[    0.002506] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[    0.002508] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
[    0.002509] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x143ffffff]
[    0.002510] ACPI: SRAT: Node 1 PXM 1 [mem 0x144000000-0x23fffffff]
[    0.002511] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x00000000-0xbfffffff]
[    0.002513] NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x143ffffff] -> [mem 0x00000000-0x143ffffff]
[    0.002519] NODE_DATA(0) allocated [mem 0x143fd5000-0x143ffffff]
[    0.002669] NODE_DATA(1) allocated [mem 0x23ffd2000-0x23fffcfff]
[    0.017947] memblock: reserved range [0x0000000000000000-0x0000000000001000] is not in memory
[    0.017953] memblock: reserved range [0x000000000009f000-0x0000000000100000] is not in memory
[    0.017956] Zone ranges:
[    0.017957]   DMA      [mem 0x0000000000000000-0x0000000000ffffff]
[    0.017958]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.017960]   Normal   [mem 0x0000000100000000-0x000000023fffffff]
[    0.017961]   Device   empty
[    0.017962] Movable zone start for each node
[    0.017964] Early memory node ranges
[    0.017965]   node   0: [mem 0x0000000000000000-0x00000000bffdffff]
[    0.017966]   node   0: [mem 0x0000000100000000-0x0000000143ffffff]
[    0.017967]   node   1: [mem 0x0000000144000000-0x000000023fffffff]
[    0.017969] Initmem setup node 0 [mem 0x0000000000000000-0x0000000143ffffff]
[    0.017971] On node 0 totalpages: 1064928
[    0.017972]   DMA zone: 64 pages used for memmap
[    0.017973]   DMA zone: 21 pages reserved
[    0.017974]   DMA zone: 4096 pages, LIFO batch:0
[    0.017994]   DMA32 zone: 12224 pages used for memmap
[    0.017995]   DMA32 zone: 782304 pages, LIFO batch:63
[    0.022281] DMA32: Zeroed struct page in unavailable ranges: 32
[    0.022286]   Normal zone: 4352 pages used for memmap
[    0.022287]   Normal zone: 278528 pages, LIFO batch:63
[    0.023769] Initmem setup node 1 [mem 0x0000000144000000-0x000000023fffffff]
[    0.023774] On node 1 totalpages: 1032192
[    0.023775]   Normal zone: 16128 pages used for memmap
[    0.023775]   Normal zone: 1032192 pages, LIFO batch:63


With current next/master:

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000023fffffff] usable
[...]
[    0.002419] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[    0.002421] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
[    0.002422] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x143ffffff]
[    0.002423] ACPI: SRAT: Node 1 PXM 1 [mem 0x144000000-0x23fffffff]
[    0.002424] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x00000000-0xbfffffff]
[    0.002426] NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x143ffffff] -> [mem 0x00000000-0x143ffffff]
[    0.002432] NODE_DATA(0) allocated [mem 0x143fd5000-0x143ffffff]
[    0.002583] NODE_DATA(1) allocated [mem 0x23ffd2000-0x23fffcfff]
[    0.017722] Zone ranges:
[    0.017726]   DMA      [mem 0x0000000000000000-0x0000000000ffffff]
[    0.017728]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.017729]   Normal   [mem 0x0000000100000000-0x000000023fffffff]
[    0.017731]   Device   empty
[    0.017732] Movable zone start for each node
[    0.017734] Early memory node ranges
[    0.017735]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.017736]   node   0: [mem 0x0000000000100000-0x00000000bffdffff]
[    0.017737]   node   0: [mem 0x0000000100000000-0x0000000143ffffff]
[    0.017738]   node   1: [mem 0x0000000144000000-0x000000023fffffff]
[    0.017741] Initmem setup node 0 [mem 0x0000000000000000-0x0000000147ffffff]
[    0.017742] On node 0 totalpages: 1064830
[    0.017743]   DMA zone: 64 pages used for memmap
[    0.017744]   DMA zone: 21 pages reserved
[    0.017745]   DMA zone: 3998 pages, LIFO batch:0
[    0.017765]   DMA zone: 98 pages in unavailable ranges
[    0.017766]   DMA32 zone: 12224 pages used for memmap
[    0.017766]   DMA32 zone: 782304 pages, LIFO batch:63
[    0.022042]   DMA32 zone: 32 pages in unavailable ranges
[    0.022046]   Normal zone: 4608 pages used for memmap
[    0.022047]   Normal zone: 278528 pages, LIFO batch:63
[    0.023601]   Normal zone: 16384 pages in unavailable ranges
[    0.023606] Initmem setup node 1 [mem 0x0000000140000000-0x000000023fffffff]
[    0.023608] On node 1 totalpages: 1032192
[    0.023609]   Normal zone: 16384 pages used for memmap
[    0.023609]   Normal zone: 1032192 pages, LIFO batch:63
[    0.029267]   Normal zone: 16384 pages in unavailable ranges


In this setup, one node ends in the middle of a section (+64MB), the
other one starts in the middle of the same section (+64MB).

After your patch, the nodes overlap (in one section)

I can spot that each node still has the same number of present pages and
that each node now has exactly 64MB unavailable pages (the extra ones spanned).


So at least here, it looks like the machinery is still doing the right thing?
Michal Hocko Feb. 15, 2021, 9 a.m. UTC | #14
On Sun 14-02-21 20:00:16, Mike Rapoport wrote:
> On Fri, Feb 12, 2021 at 02:18:20PM +0100, Michal Hocko wrote:
> > On Fri 12-02-21 11:42:15, David Hildenbrand wrote:
> > > On 12.02.21 11:33, Michal Hocko wrote:
> > [...]
> > > > I have to digest this but my first impression is that this is more heavy
> > > > weight than it needs to. Pfn walkers should normally obey node range at
> > > > least. The first pfn is usually excluded but I haven't seen real
> > > 
> > > We've seen examples where this is not sufficient. Simple example:
> > > 
> > > Have your physical memory end within a memory section. Easy via QEMU, just
> > > do a "-m 4000M". The remaining part of the last section has fake/wrong
> > > node/zone info.
> > 
> > Does this really matter though. If those pages are reserved then nobody
> > will touch them regardless of their node/zone ids.
> >
> > > Hotplug memory. The node/zone gets resized such that PFN walkers might
> > > stumble over it.
> > > 
> > > The basic idea is to make sure that any initialized/"online" pfn belongs to
> > > exactly one node/zone and that the node/zone spans that PFN.
> > 
> > Yeah, this sounds like a good idea but what is the poper node for hole
> > between two ranges associated with a different nodes/zones? This will
> > always be a random number. We should have a clear way to tell "do not
> > touch those pages" and PageReserved sounds like a good way to tell that.
>  
> Nobody should touch reserved pages, but I don't think we can ensure that.

Touching a reserved page which doesn't belong to you is a bug. Sure we
cannot enforce that rule by runtime checks. But incorrect/misleading zone/node 
association is the least of the problems when somebody already does that.

> We can correctly set the zone links for the reserved pages for holes in the
> middle of a zone based on the architecture constraints and with only the
> holes in the beginning/end of the memory will be not spanned by any
> node/zone which in practice does not seem to be a problem as the VM_BUG_ON
> in set_pfnblock_flags_mask() never triggered on pfn 0.

I really fail to see what you mean by correct zone/node for a memory
range which is not associated with any real node.
 
> I believe that any improvement in memory map consistency is a step forward.

I do agree but we are talking about a subtle bug (VM_BUG_ON) which would
be better off with a simplistic fix first. You can work on consistency
improvements on top of that.

> > > > problems with that. The VM_BUG_ON blowing up is really bad but as said
> > > > above we can simply make it less offensive in presence of reserved pages
> > > > as those shouldn't reach that path AFAICS normally.
> > > 
> > > Andrea tried tried working around if via PG_reserved pages and it resulted
> > > in quite some ugly code. Andrea also noted that we cannot rely on any random
> > > page walker to do the right think when it comes to messed up node/zone info.
> > 
> > I am sorry, I haven't followed previous discussions. Has the removal of
> > the VM_BUG_ON been considered as an immediate workaround?
> 
> It was never discussed, but I'm not sure it's a good idea.
> 
> Judging by the commit message that introduced the VM_BUG_ON (commit
> 86051ca5eaf5 ("mm: fix usemap initialization")) there was yet another
> inconsistency in the memory map that required a special care.

Can we actually explore that path before adding yet additional
complexity and potentially a very involved fix for a subtle problem?

Mel, who is the author of this code, might help us out. I have to say I do
not see the point of the VM_BUG_ON other than better debuggability. If
there is a real inconsistency problem as a result then we should be
handling that situation for non-debugging kernels as well.
David Hildenbrand Feb. 15, 2021, 9:05 a.m. UTC | #15
On 15.02.21 10:00, Michal Hocko wrote:
> On Sun 14-02-21 20:00:16, Mike Rapoport wrote:
>> On Fri, Feb 12, 2021 at 02:18:20PM +0100, Michal Hocko wrote:
>>> On Fri 12-02-21 11:42:15, David Hildenbrand wrote:
>>>> On 12.02.21 11:33, Michal Hocko wrote:
>>> [...]
>>>>> I have to digest this but my first impression is that this is more heavy
>>>>> weight than it needs to. Pfn walkers should normally obey node range at
>>>>> least. The first pfn is usually excluded but I haven't seen real
>>>>
>>>> We've seen examples where this is not sufficient. Simple example:
>>>>
>>>> Have your physical memory end within a memory section. Easy via QEMU, just
>>>> do a "-m 4000M". The remaining part of the last section has fake/wrong
>>>> node/zone info.
>>>
>>> Does this really matter though. If those pages are reserved then nobody
>>> will touch them regardless of their node/zone ids.
>>>
>>>> Hotplug memory. The node/zone gets resized such that PFN walkers might
>>>> stumble over it.
>>>>
>>>> The basic idea is to make sure that any initialized/"online" pfn belongs to
>>>> exactly one node/zone and that the node/zone spans that PFN.
>>>
>>> Yeah, this sounds like a good idea but what is the poper node for hole
>>> between two ranges associated with a different nodes/zones? This will
>>> always be a random number. We should have a clear way to tell "do not
>>> touch those pages" and PageReserved sounds like a good way to tell that.
>>   
>> Nobody should touch reserved pages, but I don't think we can ensure that.
> 
> Touching a reserved page which doesn't belong to you is a bug. Sure we
> cannot enforce that rule by runtime checks. But incorrect/misleading zone/node
> association is the least of the problem when somebody already does that.
> 
>> We can correctly set the zone links for the reserved pages for holes in the
>> middle of a zone based on the architecture constraints and with only the
>> holes in the beginning/end of the memory will be not spanned by any
>> node/zone which in practice does not seem to be a problem as the VM_BUG_ON
>> in set_pfnblock_flags_mask() never triggered on pfn 0.
> 
> I really fail to see what you mean by correct zone/node for a memory
> range which is not associated with any real node.
>   
>> I believe that any improvement in memory map consistency is a step forward.
> 
> I do agree but we are talking about a subtle bug (VM_BUG_ON) which would
> be better of with a simplistic fix first. You can work on consistency
> improvements on top of that.
> 
>>>>> problems with that. The VM_BUG_ON blowing up is really bad but as said
>>>>> above we can simply make it less offensive in presence of reserved pages
>>>>> as those shouldn't reach that path AFAICS normally.
>>>>
>>>> Andrea tried tried working around if via PG_reserved pages and it resulted
>>>> in quite some ugly code. Andrea also noted that we cannot rely on any random
>>>> page walker to do the right think when it comes to messed up node/zone info.
>>>
>>> I am sorry, I haven't followed previous discussions. Has the removal of
>>> the VM_BUG_ON been considered as an immediate workaround?
>>
>> It was never discussed, but I'm not sure it's a good idea.
>>
>> Judging by the commit message that introduced the VM_BUG_ON (commit
>> 86051ca5eaf5 ("mm: fix usemap initialization")) there was yet another
>> inconsistency in the memory map that required a special care.
> 
> Can we actually explore that path before adding yet additional
> complexity and potentially a very involved fix for a subtle problem?
> 
> Mel who is author of this code might help us out. I have to say I do not
> see the point for the VM_BUG_ON other than a better debuggability. If
> there is a real incosistency problem as a result then we should be
> handling that situation for non debugging kernels as well.
> 

I have no time to summarize, you can find the complete discussion (also 
involving Mel) at

https://lkml.kernel.org/r/20201121194506.13464-1-aarcange@redhat.com
Mike Rapoport Feb. 15, 2021, 9:24 p.m. UTC | #16
On Mon, Feb 15, 2021 at 10:00:31AM +0100, Michal Hocko wrote:
> On Sun 14-02-21 20:00:16, Mike Rapoport wrote:
> > On Fri, Feb 12, 2021 at 02:18:20PM +0100, Michal Hocko wrote:
> 
> > We can correctly set the zone links for the reserved pages for holes in the
> > middle of a zone based on the architecture constraints and with only the
> > holes in the beginning/end of the memory will be not spanned by any
> > node/zone which in practice does not seem to be a problem as the VM_BUG_ON
> > in set_pfnblock_flags_mask() never triggered on pfn 0.
> 
> I really fail to see what you mean by correct zone/node for a memory
> range which is not associated with any real node.

We know the architectural zone constraints, so we can always have a 1:1
match from pfn to zone. The node indeed will be a guess.
  
> > > I am sorry, I haven't followed previous discussions. Has the removal of
> > > the VM_BUG_ON been considered as an immediate workaround?
> > 
> > It was never discussed, but I'm not sure it's a good idea.
> > 
> > Judging by the commit message that introduced the VM_BUG_ON (commit
> > 86051ca5eaf5 ("mm: fix usemap initialization")) there was yet another
> > inconsistency in the memory map that required a special care.
> 
> Can we actually explore that path before adding yet additional
> complexity and potentially a very involved fix for a subtle problem?

This patch was intended as a fix for the inconsistency of the memory map
that is the root cause of triggering this VM_BUG_ON and other corner case
problems.

The previous version [1] is less involved as it does not extend node/zone
spans.

[1] https://lore.kernel.org/lkml/20210130221035.4169-3-rppt@kernel.org
Michal Hocko Feb. 16, 2021, 8:33 a.m. UTC | #17
On Mon 15-02-21 23:24:40, Mike Rapoport wrote:
> On Mon, Feb 15, 2021 at 10:00:31AM +0100, Michal Hocko wrote:
> > On Sun 14-02-21 20:00:16, Mike Rapoport wrote:
> > > On Fri, Feb 12, 2021 at 02:18:20PM +0100, Michal Hocko wrote:
> > 
> > > We can correctly set the zone links for the reserved pages for holes in the
> > > middle of a zone based on the architecture constraints and with only the
> > > holes in the beginning/end of the memory will be not spanned by any
> > > node/zone which in practice does not seem to be a problem as the VM_BUG_ON
> > > in set_pfnblock_flags_mask() never triggered on pfn 0.
> > 
> > I really fail to see what you mean by correct zone/node for a memory
> > range which is not associated with any real node.
> 
> We know architectural zone constraints, so we can have always have 1:1
> match from pfn to zone. Node indeed will be a guess.

That is true only for some zones. Also we do require those to be correct
when the memory is managed by the page allocator. I believe we can live
with incorrect zones when they are in holes.

> > > > I am sorry, I haven't followed previous discussions. Has the removal of
> > > > the VM_BUG_ON been considered as an immediate workaround?
> > > 
> > > It was never discussed, but I'm not sure it's a good idea.
> > > 
> > > Judging by the commit message that introduced the VM_BUG_ON (commit
> > > 86051ca5eaf5 ("mm: fix usemap initialization")) there was yet another
> > > inconsistency in the memory map that required a special care.
> > 
> > Can we actually explore that path before adding yet additional
> > complexity and potentially a very involved fix for a subtle problem?
> 
> This patch was intended as a fix for inconsistency of the memory map that
> is the root cause for triggering this VM_BUG_ON and other corner case
> problems. 
> 
> The previous version [1] is less involved as it does not extend node/zone
> spans.

I do understand that. And I am not objecting to the patch. I have to
confess I haven't digested it yet. Any changes to early memory
initialization have turned out to be subtle and corner cases only pop up
later. This is almost impossible to review just by reading the code.
That's why I am asking whether we want to address the specific VM_BUG_ON
first with something much less tricky and actually reviewable. And
that's why I am asking whether dropping the bug_on itself is safe to do
and use as a hot fix which should be easier to backport.

Longterm I am definitely supporting any change which will lead to a
fully initialized state. Whatever that means. One option would be to
simply never allow partial page blocks or even memory sections. This
would waste some memory but from what I have seen so far this would be
quite a small amount on very rare setups. So it might turn out to be a
much easier and more maintainable way forward.

> [1] https://lore.kernel.org/lkml/20210130221035.4169-3-rppt@kernel.org
> -- 
> Sincerely yours,
> Mike.
Mike Rapoport Feb. 16, 2021, 11:01 a.m. UTC | #18
On Tue, Feb 16, 2021 at 09:33:20AM +0100, Michal Hocko wrote:
> On Mon 15-02-21 23:24:40, Mike Rapoport wrote:
> > On Mon, Feb 15, 2021 at 10:00:31AM +0100, Michal Hocko wrote:
> > > On Sun 14-02-21 20:00:16, Mike Rapoport wrote:
> > > > On Fri, Feb 12, 2021 at 02:18:20PM +0100, Michal Hocko wrote:
> > > 
> > > > We can correctly set the zone links for the reserved pages for holes in the
> > > > middle of a zone based on the architecture constraints and with only the
> > > > holes in the beginning/end of the memory will be not spanned by any
> > > > node/zone which in practice does not seem to be a problem as the VM_BUG_ON
> > > > in set_pfnblock_flags_mask() never triggered on pfn 0.
> > > 
> > > I really fail to see what you mean by correct zone/node for a memory
> > > range which is not associated with any real node.
> > 
> > We know architectural zone constraints, so we can have always have 1:1
> > match from pfn to zone. Node indeed will be a guess.
> 
> That is true only for some zones.

Hmm, and when is it not true?

> Also we do require those to be correct when the memory is managed by the
> page allocator. I believe we can live with incorrect zones when they are
> in holes.
 
Note that the holes Andrea reported in the first place are not ranges that
are not populated, but rather ranges that the arch didn't report as usable,
i.e. there is physical memory and it perfectly fits into an existing node
and zone.

> > > > > I am sorry, I haven't followed previous discussions. Has the removal of
> > > > > the VM_BUG_ON been considered as an immediate workaround?
> > > > 
> > > > It was never discussed, but I'm not sure it's a good idea.
> > > > 
> > > > Judging by the commit message that introduced the VM_BUG_ON (commit
> > > > 86051ca5eaf5 ("mm: fix usemap initialization")) there was yet another
> > > > inconsistency in the memory map that required a special care.
> > > 
> > > Can we actually explore that path before adding yet additional
> > > complexity and potentially a very involved fix for a subtle problem?
> > 
> > This patch was intended as a fix for inconsistency of the memory map that
> > is the root cause for triggering this VM_BUG_ON and other corner case
> > problems. 
> > 
> > The previous version [1] is less involved as it does not extend node/zone
> > spans.
> 
> I do understand that. And I am not objecting to the patch. I have to
> confess I haven't digested it yet. Any changes to early memory
> intialization have turned out to be subtle and corner cases only pop up
> later. This is almost impossible to review just by reading the code.
> That's why I am asking whether we want to address the specific VM_BUG_ON
> first with something much less tricky and actually reviewable. And
> that's why I am asking whether dropping the bug_on itself is safe to do
> and use as a hot fix which should be easier to backport.

I can't say I'm familiar enough with migration and compaction code to say
if it's ok to remove that bug_on. It does point to an inconsistency in the
memmap, but probably it's not important.
Mike Rapoport Feb. 16, 2021, 11:13 a.m. UTC | #19
On Mon, Feb 15, 2021 at 09:45:30AM +0100, David Hildenbrand wrote:
> On 14.02.21 18:29, Mike Rapoport wrote:
> > On Fri, Feb 12, 2021 at 10:56:19AM +0100, David Hildenbrand wrote:
> > > On 12.02.21 10:55, David Hildenbrand wrote:
> > > > On 08.02.21 12:08, Mike Rapoport wrote:
> > > > > +#ifdef CONFIG_SPARSEMEM
> > > > > +	/*
> > > > > +	 * Sections in the memory map may not match actual populated
> > > > > +	 * memory, extend the node span to cover the entire section.
> > > > > +	 */
> > > > > +	*start_pfn = round_down(*start_pfn, PAGES_PER_SECTION);
> > > > > +	*end_pfn = round_up(*end_pfn, PAGES_PER_SECTION);
> > > > 
> > > > Does that mean that we might create overlapping zones when one node
> > > 
> > > s/overlapping zones/overlapping nodes/
> > > 
> > > > starts in the middle of a section and the other one ends in the middle
> > > > of a section?
> > > 
> > > > Could it be a problem? (e.g., would we have to look at neighboring nodes
> > > > when making the decision to extend, and how far to extend?)
> > 
> > Having a node end/start in a middle of a section would be a problem, but in
> > this case I don't see a way to detect how a node should be extended :(
> 
> Running QEMU with something like:
> 
> ...
>     -m 8G \
>     -smp sockets=2,cores=2 \
>     -object memory-backend-ram,id=bmem0,size=4160M \
>     -object memory-backend-ram,id=bmem1,size=4032M \

This is an interesting setup :)

TBH, I've tried to think what physical configuration would be problematic
for the implicit node extension, and I had concerns about arm64 with its
huge section size, but it entirely slipped my mind that a VM can have a
really weird memory configuration.

>     -numa node,nodeid=0,cpus=0-1,memdev=bmem0 -numa node,nodeid=1,cpus=2-3,memdev=bmem1 \
> ...
> 
> Creates such a setup.
> 
> With an older kernel:
> 
> [    0.000000] BIOS-provided physical RAM map:
> [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> [    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
> [    0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000023fffffff] usable
> [...]
> [    0.002506] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> [    0.002508] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
> [    0.002509] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x143ffffff]
> [    0.002510] ACPI: SRAT: Node 1 PXM 1 [mem 0x144000000-0x23fffffff]
> [    0.002511] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x00000000-0xbfffffff]
> [    0.002513] NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x143ffffff] -> [mem 0x00000000-0x143ffffff]
> [    0.002519] NODE_DATA(0) allocated [mem 0x143fd5000-0x143ffffff]
> [    0.002669] NODE_DATA(1) allocated [mem 0x23ffd2000-0x23fffcfff]
> [    0.017947] memblock: reserved range [0x0000000000000000-0x0000000000001000] is not in memory
> [    0.017953] memblock: reserved range [0x000000000009f000-0x0000000000100000] is not in memory
> [    0.017956] Zone ranges:
> [    0.017957]   DMA      [mem 0x0000000000000000-0x0000000000ffffff]
> [    0.017958]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
> [    0.017960]   Normal   [mem 0x0000000100000000-0x000000023fffffff]
> [    0.017961]   Device   empty
> [    0.017962] Movable zone start for each node
> [    0.017964] Early memory node ranges
> [    0.017965]   node   0: [mem 0x0000000000000000-0x00000000bffdffff]
> [    0.017966]   node   0: [mem 0x0000000100000000-0x0000000143ffffff]
> [    0.017967]   node   1: [mem 0x0000000144000000-0x000000023fffffff]
> [    0.017969] Initmem setup node 0 [mem 0x0000000000000000-0x0000000143ffffff]
> [    0.017971] On node 0 totalpages: 1064928
> [    0.017972]   DMA zone: 64 pages used for memmap
> [    0.017973]   DMA zone: 21 pages reserved
> [    0.017974]   DMA zone: 4096 pages, LIFO batch:0
> [    0.017994]   DMA32 zone: 12224 pages used for memmap
> [    0.017995]   DMA32 zone: 782304 pages, LIFO batch:63
> [    0.022281] DMA32: Zeroed struct page in unavailable ranges: 32
> [    0.022286]   Normal zone: 4352 pages used for memmap
> [    0.022287]   Normal zone: 278528 pages, LIFO batch:63
> [    0.023769] Initmem setup node 1 [mem 0x0000000144000000-0x000000023fffffff]
> [    0.023774] On node 1 totalpages: 1032192
> [    0.023775]   Normal zone: 16128 pages used for memmap
> [    0.023775]   Normal zone: 1032192 pages, LIFO batch:63
> 
> 
> With current next/master:
> 
> [    0.000000] BIOS-provided physical RAM map:
> [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
> [    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
> [    0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000023fffffff] usable
> [...]
> [    0.002419] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> [    0.002421] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
> [    0.002422] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x143ffffff]
> [    0.002423] ACPI: SRAT: Node 1 PXM 1 [mem 0x144000000-0x23fffffff]
> [    0.002424] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x00000000-0xbfffffff]
> [    0.002426] NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x143ffffff] -> [mem 0x00000000-0x143ffffff]
> [    0.002432] NODE_DATA(0) allocated [mem 0x143fd5000-0x143ffffff]
> [    0.002583] NODE_DATA(1) allocated [mem 0x23ffd2000-0x23fffcfff]
> [    0.017722] Zone ranges:
> [    0.017726]   DMA      [mem 0x0000000000000000-0x0000000000ffffff]
> [    0.017728]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
> [    0.017729]   Normal   [mem 0x0000000100000000-0x000000023fffffff]
> [    0.017731]   Device   empty
> [    0.017732] Movable zone start for each node
> [    0.017734] Early memory node ranges
> [    0.017735]   node   0: [mem 0x0000000000001000-0x000000000009efff]
> [    0.017736]   node   0: [mem 0x0000000000100000-0x00000000bffdffff]
> [    0.017737]   node   0: [mem 0x0000000100000000-0x0000000143ffffff]
> [    0.017738]   node   1: [mem 0x0000000144000000-0x000000023fffffff]
> [    0.017741] Initmem setup node 0 [mem 0x0000000000000000-0x0000000147ffffff]
> [    0.017742] On node 0 totalpages: 1064830
> [    0.017743]   DMA zone: 64 pages used for memmap
> [    0.017744]   DMA zone: 21 pages reserved
> [    0.017745]   DMA zone: 3998 pages, LIFO batch:0
> [    0.017765]   DMA zone: 98 pages in unavailable ranges
> [    0.017766]   DMA32 zone: 12224 pages used for memmap
> [    0.017766]   DMA32 zone: 782304 pages, LIFO batch:63
> [    0.022042]   DMA32 zone: 32 pages in unavailable ranges
> [    0.022046]   Normal zone: 4608 pages used for memmap
> [    0.022047]   Normal zone: 278528 pages, LIFO batch:63
> [    0.023601]   Normal zone: 16384 pages in unavailable ranges
> [    0.023606] Initmem setup node 1 [mem 0x0000000140000000-0x000000023fffffff]
> [    0.023608] On node 1 totalpages: 1032192
> [    0.023609]   Normal zone: 16384 pages used for memmap
> [    0.023609]   Normal zone: 1032192 pages, LIFO batch:63
> [    0.029267]   Normal zone: 16384 pages in unavailable ranges
> 
> 
> In this setup, one node ends in the middle of a section (+64MB), the
> other one starts in the middle of the same section (+64MB).
> 
> After your patch, the nodes overlap (in one section)
> 
> I can spot that each node still has the same number of present pages and
> that each node now has exactly 64MB unavailable pages (the extra ones spanned).
> 
> So at least here, it looks like the machinery is still doing the right thing?

So in this setup we'll have pages in the overlapping section initialized twice,
and they will end up linked to node 1, which is not exactly correct, but we care
less about the nodes than about the zones. Well, at least we don't have
VM_BUG_ON(!node_spans_pfn()) :)
Michal Hocko Feb. 16, 2021, 11:39 a.m. UTC | #20
On Tue 16-02-21 13:01:54, Mike Rapoport wrote:
> On Tue, Feb 16, 2021 at 09:33:20AM +0100, Michal Hocko wrote:
> > On Mon 15-02-21 23:24:40, Mike Rapoport wrote:
> > > On Mon, Feb 15, 2021 at 10:00:31AM +0100, Michal Hocko wrote:
> > > > On Sun 14-02-21 20:00:16, Mike Rapoport wrote:
> > > > > On Fri, Feb 12, 2021 at 02:18:20PM +0100, Michal Hocko wrote:
> > > > 
> > > > > We can correctly set the zone links for the reserved pages in holes in the
> > > > > middle of a zone based on the architecture constraints, and then only the
> > > > > holes at the beginning/end of the memory will not be spanned by any
> > > > > node/zone. In practice this does not seem to be a problem, as the VM_BUG_ON
> > > > > in set_pfnblock_flags_mask() never triggered on pfn 0.
> > > > 
> > > > I really fail to see what you mean by correct zone/node for a memory
> > > > range which is not associated with any real node.
> > > 
> > > We know the architectural zone constraints, so we can always have a 1:1
> > > match from pfn to zone. The node indeed will be a guess.
> > 
> > That is true only for some zones.
> 
> Hmm, and when is it not true?

ZONE_MOVABLE, ZONE_NORMAL and ZONE_DEVICE.
 
> > Also we do require those to be correct when the memory is managed by the
> > page allocator. I believe we can live with incorrect zones when they are
> > in holes.
>  
> Note that the holes Andrea reported in the first place are not ranges that
> are not populated, but rather ranges that the arch didn't report as usable,
> i.e. there is physical memory and it perfectly fits into an existing node
> and zone.

Except those are not usable so they have no clear meaning, right?

> > > > > > I am sorry, I haven't followed previous discussions. Has the removal of
> > > > > > the VM_BUG_ON been considered as an immediate workaround?
> > > > > 
> > > > > It was never discussed, but I'm not sure it's a good idea.
> > > > > 
> > > > > Judging by the commit message that introduced the VM_BUG_ON (commit
> > > > > 86051ca5eaf5 ("mm: fix usemap initialization")) there was yet another
> > > > > inconsistency in the memory map that required a special care.
> > > > 
> > > > Can we actually explore that path before adding yet additional
> > > > complexity and potentially a very involved fix for a subtle problem?
> > > 
> > > This patch was intended as a fix for the inconsistency of the memory map that
> > > is the root cause of triggering this VM_BUG_ON and other corner case
> > > problems.
> > > 
> > > The previous version [1] is less involved as it does not extend node/zone
> > > spans.
> > 
> > I do understand that. And I am not objecting to the patch. I have to
> > confess I haven't digested it yet. Any changes to early memory
> > initialization have turned out to be subtle and corner cases only pop up
> > later. This is almost impossible to review just by reading the code.
> > That's why I am asking whether we want to address the specific VM_BUG_ON
> > first with something much less tricky and actually reviewable. And
> > that's why I am asking whether dropping the bug_on itself is safe to do
> > and use as a hot fix which should be easier to backport.
> 
> I can't say I'm familiar enough with migration and compaction code to say
> if it's ok to remove that bug_on. It does point to inconsistency in the
> memmap, but probably it's not important.

Yeah, I am not really clear on that myself TBH. On one hand this cannot
be really critical because it is a conditional assert depending on the
debugging mode. Most production users do not enable it, so they wouldn't
learn about that inconsistency and maybe never notice any difference. This
has led me to think of it as something mostly focused on debugging.
If that is an incorrect assumption then the VM_BUG_ON should be changed
to something else - either a graceful failure or a real BUG_ON if that
is really worth it.
Vlastimil Babka Feb. 16, 2021, 12:34 p.m. UTC | #21
On 2/16/21 12:01 PM, Mike Rapoport wrote:
>> 
>> I do understand that. And I am not objecting to the patch. I have to
>> confess I haven't digested it yet. Any changes to early memory
>> initialization have turned out to be subtle and corner cases only pop up
>> later. This is almost impossible to review just by reading the code.
>> That's why I am asking whether we want to address the specific VM_BUG_ON
>> first with something much less tricky and actually reviewable. And
>> that's why I am asking whether dropping the bug_on itself is safe to do
>> and use as a hot fix which should be easier to backport.
> 
> I can't say I'm familiar enough with migration and compaction code to say
> if it's ok to remove that bug_on. It does point to inconsistency in the
> memmap, but probably it's not important.

On closer look, removing the VM_BUG_ON_PAGE() in set_pfnblock_flags_mask() is
not safe. If we violate the zone_spans_pfn condition, it means we will write
outside of the pageblock bitmap for the zone, and corrupt something. Actually
a similar thing can happen in __get_pfnblock_flags_mask() where there's no
VM_BUG_ON; there we can't corrupt memory, but could we theoretically fault
when accessing some unmapped range?

So the checks would have to become unconditional (not DEBUG_VM only) and return
instead of causing a BUG. Or we could go back one level and add some checks to
fast_isolate_around() to detect a page from a zone that doesn't match cc->zone.
The question is whether there is other code that will break if page_zone()
suddenly changes e.g. in the middle of a pageblock - __pageblock_pfn_to_page()
assumes that if the first and last page are from the same zone, so are all
pages in between, and the rest relies on that. But maybe if Andrea's
fast_isolate_around() issue is fixed, that's enough for a stable backport.
Vlastimil Babka Feb. 16, 2021, 12:59 p.m. UTC | #22
On 2/16/21 1:34 PM, Vlastimil Babka wrote:
> On 2/16/21 12:01 PM, Mike Rapoport wrote:
>>> 
>>> I do understand that. And I am not objecting to the patch. I have to
>>> confess I haven't digested it yet. Any changes to early memory
>>> initialization have turned out to be subtle and corner cases only pop up
>>> later. This is almost impossible to review just by reading the code.
>>> That's why I am asking whether we want to address the specific VM_BUG_ON
>>> first with something much less tricky and actually reviewable. And
>>> that's why I am asking whether dropping the bug_on itself is safe to do
>>> and use as a hot fix which should be easier to backport.
>> 
>> I can't say I'm familiar enough with migration and compaction code to say
>> if it's ok to remove that bug_on. It does point to inconsistency in the
>> memmap, but probably it's not important.
> 
> On closer look, removing the VM_BUG_ON_PAGE() in set_pfnblock_flags_mask() is
> not safe. If we violate the zone_spans_pfn condition, it means we will write
> outside of the pageblock bitmap for the zone, and corrupt something. Actually

Clarification: this is true only for !CONFIG_SPARSEMEM, which in practice is
unlikely to produce the configurations that trigger this issue. So we can
remove the VM_BUG_ON_PAGE().

> similar thing can happen in __get_pfnblock_flags_mask() where there's no
> VM_BUG_ON, but there we can't corrupt memory. But we could theoretically fault
> to do accessing some unmapped range?
> 
> So the checks would have to become unconditional !DEBUG_VM and return instead of
> causing a BUG. Or we could go back one level and add some checks to
> fast_isolate_around() to detect a page from zone that doesn't match cc->zone.
> The question is if there is another code that will break if a page_zone()
> suddenly changes e.g. in the middle of the pageblock - __pageblock_pfn_to_page()
> assumes that if first and last page is from the same zone, so are all pages in
> between, and the rest relies on that. But maybe if Andrea's
> fast_isolate_around() issue is fixed, that's enough for stable backport.
Michal Hocko Feb. 16, 2021, 1:11 p.m. UTC | #23
On Tue 16-02-21 13:34:56, Vlastimil Babka wrote:
> On 2/16/21 12:01 PM, Mike Rapoport wrote:
> >> 
> >> I do understand that. And I am not objecting to the patch. I have to
> >> confess I haven't digested it yet. Any changes to early memory
> >> initialization have turned out to be subtle and corner cases only pop up
> >> later. This is almost impossible to review just by reading the code.
> >> That's why I am asking whether we want to address the specific VM_BUG_ON
> >> first with something much less tricky and actually reviewable. And
> >> that's why I am asking whether dropping the bug_on itself is safe to do
> >> and use as a hot fix which should be easier to backport.
> > 
> > I can't say I'm familiar enough with migration and compaction code to say
> > if it's ok to remove that bug_on. It does point to inconsistency in the
> > memmap, but probably it's not important.
> 
> On closer look, removing the VM_BUG_ON_PAGE() in set_pfnblock_flags_mask() is
> not safe. If we violate the zone_spans_pfn condition, it means we will write
> outside of the pageblock bitmap for the zone, and corrupt something.

Isn't it enough that at least some pfn from the pageblock belongs to the
zone in order to have the bitmap allocated for the whole page block
(even if it partially belongs to a different zone)?

> Actually
> similar thing can happen in __get_pfnblock_flags_mask() where there's no
> VM_BUG_ON, but there we can't corrupt memory. But we could theoretically fault
> to do accessing some unmapped range?
> 
> So the checks would have to become unconditional !DEBUG_VM and return instead of
> causing a BUG. Or we could go back one level and add some checks to
> fast_isolate_around() to detect a page from zone that doesn't match cc->zone.

Thanks for looking deeper into that. This sounds like a much more
targeted fix to me.

> The question is if there is another code that will break if a page_zone()
> suddenly changes e.g. in the middle of the pageblock - __pageblock_pfn_to_page()
> assumes that if first and last page is from the same zone, so are all pages in
> between, and the rest relies on that. But maybe if Andrea's
> fast_isolate_around() issue is fixed, that's enough for stable backport.

There might be some other cases, but I think it would be better to have a
single fix for this particular issue, have it fixed properly, and only
then build something more robust on top.
Vlastimil Babka Feb. 16, 2021, 4:39 p.m. UTC | #24
On 2/16/21 2:11 PM, Michal Hocko wrote:
> On Tue 16-02-21 13:34:56, Vlastimil Babka wrote:
>> On 2/16/21 12:01 PM, Mike Rapoport wrote:
>> >> 
>> >> I do understand that. And I am not objecting to the patch. I have to
>> >> confess I haven't digested it yet. Any changes to early memory
>> >> initialization have turned out to be subtle and corner cases only pop up
>> >> later. This is almost impossible to review just by reading the code.
>> >> That's why I am asking whether we want to address the specific VM_BUG_ON
>> >> first with something much less tricky and actually reviewable. And
>> >> that's why I am asking whether dropping the bug_on itself is safe to do
>> >> and use as a hot fix which should be easier to backport.
>> > 
>> > I can't say I'm familiar enough with migration and compaction code to say
>> > if it's ok to remove that bug_on. It does point to inconsistency in the
>> > memmap, but probably it's not important.
>> 
>> On closer look, removing the VM_BUG_ON_PAGE() in set_pfnblock_flags_mask() is
>> not safe. If we violate the zone_spans_pfn condition, it means we will write
>> outside of the pageblock bitmap for the zone, and corrupt something.
> 
> Isn't it enough that at least some pfn from the pageblock belongs to the
> zone in order to have the bitmap allocated for the whole page block
> (even if it partially belongs to a different zone)?
> 
>> Actually
>> similar thing can happen in __get_pfnblock_flags_mask() where there's no
>> VM_BUG_ON, but there we can't corrupt memory. But we could theoretically fault
>> to do accessing some unmapped range?
>> 
>> So the checks would have to become unconditional !DEBUG_VM and return instead of
>> causing a BUG. Or we could go back one level and add some checks to
>> fast_isolate_around() to detect a page from zone that doesn't match cc->zone.
> 
> Thanks for looking deeper into that. This sounds like a much more
> targeted fix to me.

So, Andrea, could you please check if this fixes the original
fast_isolate_around() issue for you? With the VM_BUG_ON not removed, DEBUG_VM
enabled, and no changes to struct page initialization...
It relies on pageblock_pfn_to_page() like the rest of the compaction code.

Thanks!

----8<----
From f5c8d7bc77d2ec0b4cfec44820ce6f602fdb3a86 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Tue, 16 Feb 2021 17:32:34 +0100
Subject: [PATCH] mm, compaction: make fast_isolate_around() robust against
 pfns from a wrong zone

TBD

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/compaction.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 190ccdaa6c19..b75645e4678d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1288,7 +1288,7 @@ static void
 fast_isolate_around(struct compact_control *cc, unsigned long pfn, unsigned long nr_isolated)
 {
 	unsigned long start_pfn, end_pfn;
-	struct page *page = pfn_to_page(pfn);
+	struct page *page;
 
 	/* Do not search around if there are enough pages already */
 	if (cc->nr_freepages >= cc->nr_migratepages)
@@ -1300,7 +1300,11 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn, unsigned long
 
 	/* Pageblock boundaries */
 	start_pfn = pageblock_start_pfn(pfn);
-	end_pfn = min(pageblock_end_pfn(pfn), zone_end_pfn(cc->zone)) - 1;
+	end_pfn = min(pageblock_end_pfn(pfn), zone_end_pfn(cc->zone));
+
+	page = pageblock_pfn_to_page(start_pfn, end_pfn, cc->zone);
+	if (!page)
+		return;
 
 	/* Scan before */
 	if (start_pfn != pfn) {
@@ -1486,7 +1490,7 @@ fast_isolate_freepages(struct compact_control *cc)
 	}
 
 	cc->total_free_scanned += nr_scanned;
-	if (!page)
+	if (!page || page_zone(page) != cc->zone)
 		return cc->free_pfn;
 
 	low_pfn = page_to_pfn(page);
Mike Rapoport Feb. 16, 2021, 5:49 p.m. UTC | #25
Hi Vlastimil,

On Tue, Feb 16, 2021 at 05:39:12PM +0100, Vlastimil Babka wrote:
> 
> 
> So, Andrea could you please check if this fixes the original
> fast_isolate_around() issue for you? With the VM_BUG_ON not removed, DEBUG_VM
> enabled, no changes to struct page initialization...
> It relies on pageblock_pfn_to_page as the rest of the compaction code.

Pardon my ignorance of compaction internals, but does this mean that with
your patch we'll never call set_pfnblock_flags_mask() for a pfn in a hole?
 
> Thanks!
> 
> ----8<----
> From f5c8d7bc77d2ec0b4cfec44820ce6f602fdb3a86 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Tue, 16 Feb 2021 17:32:34 +0100
> Subject: [PATCH] mm, compaction: make fast_isolate_around() robust against
>  pfns from a wrong zone
> 
> TBD
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/compaction.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 190ccdaa6c19..b75645e4678d 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1288,7 +1288,7 @@ static void
>  fast_isolate_around(struct compact_control *cc, unsigned long pfn, unsigned long nr_isolated)
>  {
>  	unsigned long start_pfn, end_pfn;
> -	struct page *page = pfn_to_page(pfn);
> +	struct page *page;
>  
>  	/* Do not search around if there are enough pages already */
>  	if (cc->nr_freepages >= cc->nr_migratepages)
> @@ -1300,7 +1300,11 @@ fast_isolate_around(struct compact_control *cc, unsigned long pfn, unsigned long
>  
>  	/* Pageblock boundaries */
>  	start_pfn = pageblock_start_pfn(pfn);
> -	end_pfn = min(pageblock_end_pfn(pfn), zone_end_pfn(cc->zone)) - 1;
> +	end_pfn = min(pageblock_end_pfn(pfn), zone_end_pfn(cc->zone));
> +
> +	page = pageblock_pfn_to_page(start_pfn, end_pfn, cc->zone);
> +	if (!page)
> +		return;
>  
>  	/* Scan before */
>  	if (start_pfn != pfn) {
> @@ -1486,7 +1490,7 @@ fast_isolate_freepages(struct compact_control *cc)
>  	}
>  
>  	cc->total_free_scanned += nr_scanned;
> -	if (!page)
> +	if (!page || page_zone(page) != cc->zone)
>  		return cc->free_pfn;
>  
>  	low_pfn = page_to_pfn(page);
> -- 
> 2.30.0
>
Vlastimil Babka Feb. 17, 2021, 12:27 p.m. UTC | #26
On 2/16/21 6:49 PM, Mike Rapoport wrote:
> Hi Vlastimil,
> 
> On Tue, Feb 16, 2021 at 05:39:12PM +0100, Vlastimil Babka wrote:
>> 
>> 
>> So, Andrea could you please check if this fixes the original
>> fast_isolate_around() issue for you? With the VM_BUG_ON not removed, DEBUG_VM
>> enabled, no changes to struct page initialization...
>> It relies on pageblock_pfn_to_page as the rest of the compaction code.
> 
> Pardon my ignorance of compaction internals, but does this mean that with
> your patch we'll never call set_pfnblock_flags_mask() for a pfn in a hole?

No, it doesn't mean that kind of guarantee. But (if my patch is correct) we
will no longer call it from a path which we currently know is doing that and
triggering the VM_BUG_ON. So that's a targeted fix that matches the stable
backport criteria. It doesn't contradict your patch as a way to improve
mainline; I still agree it's best long-term if we initialize the struct pages
without such surprises. But I also agree with Michal that there's a risk of
replacing one corner case with another, and thus we shouldn't do that as a
stable fix.

Patch
diff mbox series

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6446778cbc6b..1c3f7521028f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6257,22 +6257,84 @@  static void __meminit zone_init_free_lists(struct zone *zone)
 	}
 }
 
+#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
+/*
+ * Only struct pages that correspond to ranges defined by memblock.memory
+ * are zeroed and initialized by going through __init_single_page() during
+ * memmap_init_zone().
+ *
+ * But, there could be struct pages that correspond to holes in
+ * memblock.memory. This can happen because of the following reasons:
+ * - physical memory bank size is not necessarily the exact multiple of the
+ *   arbitrary section size
+ * - early reserved memory may not be listed in memblock.memory
+ * - memory layouts defined with memmap= kernel parameter may not align
+ *   nicely with memmap sections
+ *
+ * Explicitly initialize those struct pages so that:
+ * - PG_Reserved is set
+ * - zone and node links point to zone and node that span the page
+ */
+static u64 __meminit init_unavailable_range(unsigned long spfn,
+					    unsigned long epfn,
+					    int zone, int node)
+{
+	unsigned long pfn;
+	u64 pgcnt = 0;
+
+	for (pfn = spfn; pfn < epfn; pfn++) {
+		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
+			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
+				+ pageblock_nr_pages - 1;
+			continue;
+		}
+		__init_single_page(pfn_to_page(pfn), pfn, zone, node);
+		__SetPageReserved(pfn_to_page(pfn));
+		pgcnt++;
+	}
+
+	return pgcnt;
+}
+#else
+static inline u64 init_unavailable_range(unsigned long spfn, unsigned long epfn,
+					 int zone, int node)
+{
+	return 0;
+}
+#endif
+
 void __meminit __weak memmap_init_zone(struct zone *zone)
 {
 	unsigned long zone_start_pfn = zone->zone_start_pfn;
 	unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages;
 	int i, nid = zone_to_nid(zone), zone_id = zone_idx(zone);
 	unsigned long start_pfn, end_pfn;
+	unsigned long hole_pfn = 0;
+	u64 pgcnt = 0;
 
 	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
 		start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn);
 		end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn);
+		hole_pfn = clamp(hole_pfn, zone_start_pfn, zone_end_pfn);
 
 		if (end_pfn > start_pfn)
 			memmap_init_range(end_pfn - start_pfn, nid,
 					zone_id, start_pfn, zone_end_pfn,
 					MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
+
+		if (hole_pfn < start_pfn)
+			pgcnt += init_unavailable_range(hole_pfn, start_pfn,
+							zone_id, nid);
+		hole_pfn = end_pfn;
 	}
+
+	if (hole_pfn < zone_end_pfn)
+		pgcnt += init_unavailable_range(hole_pfn, zone_end_pfn,
+						zone_id, nid);
+
+	if (pgcnt)
+		pr_info("  %s zone: %lld pages in unavailable ranges\n",
+			zone->name, pgcnt);
 }
 
 static int zone_batchsize(struct zone *zone)
@@ -6519,8 +6581,19 @@  void __init get_pfn_range_for_nid(unsigned int nid,
 		*end_pfn = max(*end_pfn, this_end_pfn);
 	}
 
-	if (*start_pfn == -1UL)
+	if (*start_pfn == -1UL) {
 		*start_pfn = 0;
+		return;
+	}
+
+#ifdef CONFIG_SPARSEMEM
+	/*
+	 * Sections in the memory map may not match actual populated
+	 * memory, extend the node span to cover the entire section.
+	 */
+	*start_pfn = round_down(*start_pfn, PAGES_PER_SECTION);
+	*end_pfn = round_up(*end_pfn, PAGES_PER_SECTION);
+#endif
 }
 
 /*
@@ -7069,88 +7142,6 @@  void __init free_area_init_memoryless_node(int nid)
 	free_area_init_node(nid);
 }
 
-#if !defined(CONFIG_FLAT_NODE_MEM_MAP)
-/*
- * Initialize all valid struct pages in the range [spfn, epfn) and mark them
- * PageReserved(). Return the number of struct pages that were initialized.
- */
-static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
-{
-	unsigned long pfn;
-	u64 pgcnt = 0;
-
-	for (pfn = spfn; pfn < epfn; pfn++) {
-		if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
-			pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
-				+ pageblock_nr_pages - 1;
-			continue;
-		}
-		/*
-		 * Use a fake node/zone (0) for now. Some of these pages
-		 * (in memblock.reserved but not in memblock.memory) will
-		 * get re-initialized via reserve_bootmem_region() later.
-		 */
-		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
-		__SetPageReserved(pfn_to_page(pfn));
-		pgcnt++;
-	}
-
-	return pgcnt;
-}
-
-/*
- * Only struct pages that are backed by physical memory are zeroed and
- * initialized by going through __init_single_page(). But, there are some
- * struct pages which are reserved in memblock allocator and their fields
- * may be accessed (for example page_to_pfn() on some configuration accesses
- * flags). We must explicitly initialize those struct pages.
- *
- * This function also addresses a similar issue where struct pages are left
- * uninitialized because the physical address range is not covered by
- * memblock.memory or memblock.reserved. That could happen when memblock
- * layout is manually configured via memmap=, or when the highest physical
- * address (max_pfn) does not end on a section boundary.
- */
-static void __init init_unavailable_mem(void)
-{
-	phys_addr_t start, end;
-	u64 i, pgcnt;
-	phys_addr_t next = 0;
-
-	/*
-	 * Loop through unavailable ranges not covered by memblock.memory.
-	 */
-	pgcnt = 0;
-	for_each_mem_range(i, &start, &end) {
-		if (next < start)
-			pgcnt += init_unavailable_range(PFN_DOWN(next),
-							PFN_UP(start));
-		next = end;
-	}
-
-	/*
-	 * Early sections always have a fully populated memmap for the whole
-	 * section - see pfn_valid(). If the last section has holes at the
-	 * end and that section is marked "online", the memmap will be
-	 * considered initialized. Make sure that memmap has a well defined
-	 * state.
-	 */
-	pgcnt += init_unavailable_range(PFN_DOWN(next),
-					round_up(max_pfn, PAGES_PER_SECTION));
-
-	/*
-	 * Struct pages that do not have backing memory. This could be because
-	 * firmware is using some of this memory, or for some other reasons.
-	 */
-	if (pgcnt)
-		pr_info("Zeroed struct page in unavailable ranges: %lld pages", pgcnt);
-}
-#else
-static inline void __init init_unavailable_mem(void)
-{
-}
-#endif /* !CONFIG_FLAT_NODE_MEM_MAP */
-
 #if MAX_NUMNODES > 1
 /*
  * Figure out the number of possible node ids.
@@ -7510,7 +7501,7 @@  void __init free_area_init(unsigned long *max_zone_pfn)
 	memset(arch_zone_highest_possible_pfn, 0,
 				sizeof(arch_zone_highest_possible_pfn));
 
-	start_pfn = find_min_pfn_with_active_regions();
+	start_pfn = 0;
 	descending = arch_has_descending_max_zone_pfns();
 
 	for (i = 0; i < MAX_NR_ZONES; i++) {
@@ -7574,7 +7565,6 @@  void __init free_area_init(unsigned long *max_zone_pfn)
 	/* Initialise every node */
 	mminit_verify_pageflags_layout();
 	setup_nr_node_ids();
-	init_unavailable_mem();
 	for_each_online_node(nid) {
 		pg_data_t *pgdat = NODE_DATA(nid);
 		free_area_init_node(nid);