mm/compaction: Fix the incorrect hole in fast_isolate_freepages()

Message ID 20200521014407.29690-1-bhe@redhat.com
State In Next
Commit 22d2a9e534cc11caba91675c89c0bbfbc06887da

Commit Message

Baoquan He May 21, 2020, 1:44 a.m. UTC
Qian reported that a crash happened in compaction.
http://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw

LTP: starting swapping01 (swapping01 -i 5)
page:ffffea0000aa0000 refcount:1 mapcount:0 mapping:000000002243743b index:0x0
flags: 0x1fffe000001000(reserved)
raw: 001fffe000001000 ffffea0000aa0008 ffffea0000aa0008 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn))
page_owner info is not present (never set?)
------------[ cut here ]------------
kernel BUG at mm/page_alloc.c:533!
...
CPU: 17 PID: 218 Comm: kcompactd0 Not tainted 5.7.0-rc2-next-20200423+ #7
...
RIP: 0010:set_pfnblock_flags_mask+0x150/0x210
...
RSP: 0018:ffffc900042ff858 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff9a002382
RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8884535b8e6c
RBP: ffffc900042ff8b8 R08: ffffed108a6b8459 R09: ffffed108a6b8459
R10: ffff8884535c22c7 R11: ffffed108a6b8458 R12: 000000000002a800
R13: ffffea0000aa0000 R14: ffff88847fff3000 R15: ffff88847fff3040
FS:  0000000000000000(0000) GS:ffff888453580000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd1eb4a1000 CR3: 000000083154c000 CR4: 00000000003406e0
Call Trace:
 isolate_freepages+0xb20/0x1140
 ? isolate_freepages_block+0x730/0x730
 ? mark_held_locks+0x34/0xb0
 ? free_unref_page+0x7d/0x90
 ? free_unref_page+0x7d/0x90
 ? check_flags.part.28+0x86/0x220
 compaction_alloc+0xdd/0x100
 migrate_pages+0x304/0x17e0
 ? __ClearPageMovable+0x100/0x100
 ? isolate_freepages+0x1140/0x1140
 compact_zone+0x1249/0x1e90
 ? compaction_suitable+0x260/0x260
 kcompactd_do_work+0x231/0x650
 ? sysfs_compact_node+0x80/0x80
 ? finish_wait+0xe6/0x110
 kcompactd+0x162/0x490
 ? kcompactd_do_work+0x650/0x650
 ? finish_wait+0x110/0x110
 ? __kasan_check_read+0x11/0x20
 ? __kthread_parkme+0xd4/0xf0
 ? kcompactd_do_work+0x650/0x650
 kthread+0x1f7/0x220
 ? kthread_create_worker_on_cpu+0xc0/0xc0
 ret_from_fork+0x27/0x50

After investigation, it turns out that this is introduced by a commit in
linux-next whose subject is:
  "mm: memmap_init: iterate over memblock regions rather that check each PFN".
(Its commit ID, f6edbdb71877 at the time of writing, is not stable while
the patch sits in linux-next.)

Qian added debugging code. The debugging log shows that the faulting page
is 0x2a800000. From the system e820 map, pasted at the bottom, the page is
in an e820 reserved range:
	BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved
and it is in section [0x28000000, 0x2fffffff]. In that section, there are
several usable ranges and some e820 reserved ranges.

This kind of e820 reserved range is not added to the memblock allocator.
However, init_unavailable_mem() initializes the pages in such ranges as
belonging to node 0, zone 0. Before that commit, memmap_init() would later
add e820 reserved ranges into the zone that contains them, because they
pass the early_pfn_valid() and early_pfn_in_nid() checks. In this case,
the e820 reserved range containing the faulting page 0x2a800000 was added
into the DMA32 zone. After that commit, the e820 reserved regions are kept
in node 0, zone 0: since memmap_init() now iterates over memblock regions
instead, their node and zone are never updated.
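For illustration only (this is not the kernel's real code, and page_zone_id
is a hypothetical stand-in for each struct page's zone link), the difference
between the two initialization schemes can be sketched as:

```c
#include <stddef.h>

#define HOLE_ZONE 0	/* node 0 / zone 0, as left by init_unavailable_mem() */

struct region { unsigned long start, end; };	/* [start, end) PFN range */

/* Old scheme: memmap_init() visited every PFN spanned by the zone, so
 * reserved ranges and holes inside the span also got the zone's ID. */
static void memmap_init_per_pfn(int *page_zone_id,
				unsigned long zone_start, unsigned long zone_end,
				int zone_id)
{
	unsigned long pfn;

	for (pfn = zone_start; pfn < zone_end; pfn++)
		page_zone_id[pfn] = zone_id;
}

/* New scheme: only PFNs inside memblock.memory regions are visited;
 * PFNs in holes keep the node 0 / zone 0 values set earlier. */
static void memmap_init_regions(int *page_zone_id,
				const struct region *mem, size_t nr,
				unsigned long zone_start, unsigned long zone_end,
				int zone_id)
{
	size_t i;
	unsigned long pfn, s, e;

	for (i = 0; i < nr; i++) {
		s = mem[i].start > zone_start ? mem[i].start : zone_start;
		e = mem[i].end < zone_end ? mem[i].end : zone_end;
		for (pfn = s; pfn < e; pfn++)
			page_zone_id[pfn] = zone_id;
	}
}
```

With a hole between two memory regions, the region-based walk leaves the
hole's pages at HOLE_ZONE, which is exactly the stale linkage this crash
exposed.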

Now, fast_isolate_freepages() will use the min mark directly as the
migration target if no page is found in the buddy allocator. However, the
min mark is not checked carefully; it could lie in an e820 reserved range
and trigger VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)) when we
try to mark it as skipped.

Here, let's call pageblock_pfn_to_page() to get the page of min_pfn,
since it performs careful checks and returns NULL if the pfn is not
qualified.
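As a sketch of why this guard helps (hypothetical, heavily simplified types
rather than the kernel's real struct zone and struct page): a pageblock is
only returned when both of its ends lie inside the target zone and the
page's own zone link agrees, so a min_pfn that landed in an unavailable
range still carrying a node 0 / zone 0 link comes back as NULL instead of
being dereferenced.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel's types. */
struct zone { unsigned long zone_start_pfn, zone_end_pfn; };
struct page { const struct zone *zone; };	/* page's zone link */

static bool zone_spans_pfn(const struct zone *z, unsigned long pfn)
{
	return pfn >= z->zone_start_pfn && pfn < z->zone_end_pfn;
}

/* Mimics the guard pageblock_pfn_to_page() provides: the block's first
 * and last PFN must both lie in @zone, and the first page's zone link
 * must match; otherwise NULL is returned instead of a bogus page. */
static struct page *pageblock_pfn_to_page_sketch(unsigned long start_pfn,
						 unsigned long end_pfn,
						 const struct zone *zone,
						 struct page *mem_map)
{
	if (!zone_spans_pfn(zone, start_pfn) ||
	    !zone_spans_pfn(zone, end_pfn - 1))
		return NULL;
	if (mem_map[start_pfn].zone != zone)	/* stale node 0 / zone 0 link */
		return NULL;
	return &mem_map[start_pfn];
}
```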

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000008bfff] usable
[    0.000000] BIOS-e820: [mem 0x000000000008c000-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000028328fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000028329000-0x0000000028568fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000028569000-0x0000000028d85fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000028d86000-0x0000000028ee5fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000028ee6000-0x0000000029a04fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029a05000-0x0000000029a08fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000029a09000-0x0000000029ee4fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029ee5000-0x0000000029ef2fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000029ef3000-0x0000000029f22fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029f23000-0x0000000029f23fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000029f24000-0x0000000029f24fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029f25000-0x0000000029f28fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000029f29000-0x0000000029f29fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000029f2a000-0x0000000029f2bfff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000029f2c000-0x0000000029f2cfff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000029f2d000-0x0000000029f2ffff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000029f30000-0x0000000029ffdfff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved
[    0.000000] BIOS-e820: [mem 0x000000002a80b000-0x000000002a80efff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000002a80f000-0x000000002a81ffff] reserved
[    0.000000] BIOS-e820: [mem 0x000000002a820000-0x000000002a822fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000002a823000-0x0000000033a22fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000033a23000-0x0000000033a32fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000033a33000-0x0000000033a42fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000033a43000-0x0000000033a52fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000033a53000-0x0000000077ffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000078000000-0x000000008fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000087effffff] usable
[    0.000000] BIOS-e820: [mem 0x000000087f000000-0x000000087fffffff] reserved

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Baoquan He <bhe@redhat.com>
---
 mm/compaction.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Mike Rapoport May 21, 2020, 9:26 a.m. UTC | #1
Hi Baoquan,

On Thu, May 21, 2020 at 09:44:07AM +0800, Baoquan He wrote:
> Qian reported that a crash happened in compaction.
> http://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
> 
> LTP: starting swapping01 (swapping01 -i 5)
> page:ffffea0000aa0000 refcount:1 mapcount:0 mapping:000000002243743b index:0x0
> flags: 0x1fffe000001000(reserved)
> raw: 001fffe000001000 ffffea0000aa0008 ffffea0000aa0008 0000000000000000
> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> page dumped because: VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn))
> page_owner info is not present (never set?)
> ------------[ cut here ]------------
> kernel BUG at mm/page_alloc.c:533!

...

> After investigation, it turns out that this is introduced by commit of
> linux-next: commit f6edbdb71877 ("mm: memmap_init: iterate over memblock
> regions rather that check each PFN").
> 
> After investigation, it turns out that this is introduced by commit of
> linux-next, the patch subject is:
>   "mm: memmap_init: iterate over memblock regions rather that check each PFN".
> 
> Qian added debugging code. The debugging log shows that the fault page is
> 0x2a800000. From the system e820 map which is pasted at bottom, the page
> is in e820 reserved range:
> 	BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved
> And it's in section [0x28000000, 0x2fffffff]. In that section, there are
> several usable ranges and some e820 reserved ranges.
> 
> For this kind of e820 reserved range, it won't be added to memblock allocator.
> However, init_unavailable_mem() will initialize to add them into node 0,
> zone 0. Before that commit, later, memmap_init() will add e820 reserved
> ranges into the zone where they are contained, because it can pass
> the checking of early_pfn_valid() and early_pfn_in_nid(). In this case,
> the e820 reserved range where fault page 0x2a800000 is located is added
> into DMA32 zone. After that commit, the e820 reserved regions are kept
> in node 0, zone 0, since we iterate over memblock regions to initialize
> in memmap_init() instead, their node and zone won't be changed.

I wonder why we wouldn't add the reserved memory to memblock from the
very beginning...
I've tried to understand why this was not done, but I couldn't find the
reasoning behind it.

Can you please try the below patch and see if it helps or, on the
contrary, breaks anything else :)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index c5399e80c59c..b0940c618ed9 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void)
 		if (end != (resource_size_t)end)
 			continue;
 
-		if (entry->type == E820_TYPE_SOFT_RESERVED)
+		if (entry->type == E820_TYPE_SOFT_RESERVED ||
+		    entry->type == E820_TYPE_RESERVED) {
+			memblock_add(entry->addr, entry->size);
 			memblock_reserve(entry->addr, entry->size);
+		}
 
 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
 			continue;

> Now, fast_isolate_freepages() will use min mark directly as the migration
> target if no page is found from buddy. However, the min mark is not checked
> carefully, it could be in e820 reserved range, and trigger the
> VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)) when trying to mark it
> as skip.
> 
> Here, let's call pageblock_pfn_to_page() to get page of min_pfn, since it
> will do careful checks and return NULL if the pfn is not qualified.
> 
> [    0.000000] BIOS-provided physical RAM map:
> [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000008bfff] usable
> [    0.000000] BIOS-e820: [mem 0x000000000008c000-0x000000000009ffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000028328fff] usable
> [    0.000000] BIOS-e820: [mem 0x0000000028329000-0x0000000028568fff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000028569000-0x0000000028d85fff] usable
> [    0.000000] BIOS-e820: [mem 0x0000000028d86000-0x0000000028ee5fff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000028ee6000-0x0000000029a04fff] usable
> [    0.000000] BIOS-e820: [mem 0x0000000029a05000-0x0000000029a08fff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000029a09000-0x0000000029ee4fff] usable
> [    0.000000] BIOS-e820: [mem 0x0000000029ee5000-0x0000000029ef2fff] ACPI data
> [    0.000000] BIOS-e820: [mem 0x0000000029ef3000-0x0000000029f22fff] usable
> [    0.000000] BIOS-e820: [mem 0x0000000029f23000-0x0000000029f23fff] ACPI data
> [    0.000000] BIOS-e820: [mem 0x0000000029f24000-0x0000000029f24fff] usable
> [    0.000000] BIOS-e820: [mem 0x0000000029f25000-0x0000000029f28fff] ACPI data
> [    0.000000] BIOS-e820: [mem 0x0000000029f29000-0x0000000029f29fff] ACPI NVS
> [    0.000000] BIOS-e820: [mem 0x0000000029f2a000-0x0000000029f2bfff] ACPI data
> [    0.000000] BIOS-e820: [mem 0x0000000029f2c000-0x0000000029f2cfff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000029f2d000-0x0000000029f2ffff] ACPI data
> [    0.000000] BIOS-e820: [mem 0x0000000029f30000-0x0000000029ffdfff] usable
> [    0.000000] BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved
> [    0.000000] BIOS-e820: [mem 0x000000002a80b000-0x000000002a80efff] ACPI NVS
> [    0.000000] BIOS-e820: [mem 0x000000002a80f000-0x000000002a81ffff] reserved
> [    0.000000] BIOS-e820: [mem 0x000000002a820000-0x000000002a822fff] ACPI NVS
> [    0.000000] BIOS-e820: [mem 0x000000002a823000-0x0000000033a22fff] usable
> [    0.000000] BIOS-e820: [mem 0x0000000033a23000-0x0000000033a32fff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000033a33000-0x0000000033a42fff] ACPI NVS
> [    0.000000] BIOS-e820: [mem 0x0000000033a43000-0x0000000033a52fff] ACPI data
> [    0.000000] BIOS-e820: [mem 0x0000000033a53000-0x0000000077ffffff] usable
> [    0.000000] BIOS-e820: [mem 0x0000000078000000-0x000000008fffffff] reserved
> [    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
> [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000087effffff] usable
> [    0.000000] BIOS-e820: [mem 0x000000087f000000-0x000000087fffffff] reserved
> 
> Reported-by: Qian Cai <cai@lca.pw>
> Signed-off-by: Baoquan He <bhe@redhat.com>
> ---
>  mm/compaction.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 67fd317f78db..9ce4cff4d407 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1418,7 +1418,9 @@ fast_isolate_freepages(struct compact_control *cc)
>  				cc->free_pfn = highest;
>  			} else {
>  				if (cc->direct_compaction && pfn_valid(min_pfn)) {
> -					page = pfn_to_page(min_pfn);
> +					page = pageblock_pfn_to_page(min_pfn,
> +						pageblock_end_pfn(min_pfn),
> +						cc->zone);
>  					cc->free_pfn = min_pfn;
>  				}
>  			}
> -- 
> 2.17.2
>
Mel Gorman May 21, 2020, 9:36 a.m. UTC | #2
On Thu, May 21, 2020 at 09:44:07AM +0800, Baoquan He wrote:
> After investigation, it turns out that this is introduced by commit of
> linux-next: commit f6edbdb71877 ("mm: memmap_init: iterate over memblock
> regions rather that check each PFN").
> 
> After investigation, it turns out that this is introduced by commit of
> linux-next, the patch subject is:
>   "mm: memmap_init: iterate over memblock regions rather that check each PFN".
> 

Some repetition here. I assume it's because the commit ID is not stable
because it's in linux-next.

> Qian added debugging code. The debugging log shows that the fault page is
> 0x2a800000. From the system e820 map which is pasted at bottom, the page
> is in e820 reserved range:
> 	BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved
> > And it's in section [0x28000000, 0x2fffffff]. In that section, there are
> several usable ranges and some e820 reserved ranges.
> 
> For this kind of e820 reserved range, it won't be added to memblock allocator.
> However, init_unavailable_mem() will initialize to add them into node 0,
> zone 0.

Why is it appropriate for init_unavailable_mem() to add an e820 reserved
range to the zone at all? The bug being triggered indicates there is a
mismatch between the zone of a struct page and the PFN passed in.

> Before that commit, later, memmap_init() will add e820 reserved
> ranges into the zone where they are contained, because it can pass
> the checking of early_pfn_valid() and early_pfn_in_nid(). In this case,
> the e820 reserved range where fault page 0x2a800000 is located is added
> > into DMA32 zone. After that commit, the e820 reserved regions are kept
> > in node 0, zone 0, since we iterate over memblock regions to initialize
> in memmap_init() instead, their node and zone won't be changed.
> 

This implies that we have struct pages that should never be used (e820
reserved) but that nevertheless exist in a zone range with broken linkages
to their node and zone.

> Reported-by: Qian Cai <cai@lca.pw>
> Signed-off-by: Baoquan He <bhe@redhat.com>
> ---
>  mm/compaction.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 67fd317f78db..9ce4cff4d407 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1418,7 +1418,9 @@ fast_isolate_freepages(struct compact_control *cc)
>  				cc->free_pfn = highest;
>  			} else {
>  				if (cc->direct_compaction && pfn_valid(min_pfn)) {
> -					page = pfn_to_page(min_pfn);
> +					page = pageblock_pfn_to_page(min_pfn,
> +						pageblock_end_pfn(min_pfn),
> +						cc->zone);
>  					cc->free_pfn = min_pfn;
>  				}
>  			}

Why is the correct fix not to avoid creating struct pages for e820
ranges and make sure that struct pages that are reachable have proper
node and zone linkages?
Mike Rapoport May 21, 2020, 1:38 p.m. UTC | #3
On Thu, May 21, 2020 at 10:36:06AM +0100, Mel Gorman wrote:
> On Thu, May 21, 2020 at 09:44:07AM +0800, Baoquan He wrote:
> > After investigation, it turns out that this is introduced by commit of
> > linux-next: commit f6edbdb71877 ("mm: memmap_init: iterate over memblock
> > regions rather that check each PFN").
> > 
> > After investigation, it turns out that this is introduced by commit of
> > linux-next, the patch subject is:
> >   "mm: memmap_init: iterate over memblock regions rather that check each PFN".
> > 
> 
> Some repetition here. I assume it's because the commit ID is not stable
> because it's in linux-next.
> 
> > Qian added debugging code. The debugging log shows that the fault page is
> > 0x2a800000. From the system e820 map which is pasted at bottom, the page
> > is in e820 reserved range:
> > 	BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved
> > And it's in section [0x28000000, 0x2fffffff]. In that section, there are
> > several usable ranges and some e820 reserved ranges.
> > 
> > For this kind of e820 reserved range, it won't be added to memblock allocator.
> > However, init_unavailable_mem() will initialize to add them into node 0,
> > zone 0.
> 
> Why is it appropriate for init_unavailable_mem to add a e820 reserved
> range to the zone at all? The bug being triggered indicates there is a
> mismatch between the zone of a struct page and the PFN passed in.
> 
> > Before that commit, later, memmap_init() will add e820 reserved
> > ranges into the zone where they are contained, because it can pass
> > the checking of early_pfn_valid() and early_pfn_in_nid(). In this case,
> > the e820 reserved range where fault page 0x2a800000 is located is added
> > into DMA32 zone. After that commit, the e820 reserved regions are kept
> > in node 0, zone 0, since we iterate over memblock regions to initialize
> > in memmap_init() instead, their node and zone won't be changed.
> > 
> 
> This implies that we have struct pages that should never be used (e820
> reserved) but exist somehow in a zone range but with broken linkages to
> their node and zone.

Unless I misread Baoquan's explanation, we do have struct pages for e820
reserved memory, but they are properly linked to their node and zone and
marked as reserved.

Before the commit, memmap_init_zone() looped over all PFNs in the node
and zone, and the reserved pages were passed to __init_single_page() with
the node and zone their PFN belongs to.

Afterwards, free_low_memory_core_early() reserved them because they were
present in memblock.reserved, and from here on nothing would touch them.

After the commit, we skip over the holes in memblock.memory during
memmap_init_zone(), which leaves the zone and node links zeroed for e820
reserved pages, and that eventually triggers
VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)).

> > Reported-by: Qian Cai <cai@lca.pw>
> > Signed-off-by: Baoquan He <bhe@redhat.com>
> > ---
> >  mm/compaction.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 67fd317f78db..9ce4cff4d407 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -1418,7 +1418,9 @@ fast_isolate_freepages(struct compact_control *cc)
> >  				cc->free_pfn = highest;
> >  			} else {
> >  				if (cc->direct_compaction && pfn_valid(min_pfn)) {
> > -					page = pfn_to_page(min_pfn);
> > +					page = pageblock_pfn_to_page(min_pfn,
> > +						pageblock_end_pfn(min_pfn),
> > +						cc->zone);
> >  					cc->free_pfn = min_pfn;
> >  				}
> >  			}
> 
> Why is the correct fix not to avoid creating struct pages for e820
> ranges and make sure that struct pages that are reachable have proper
> node and zone linkages?

I think that a simpler fix would be to have the e820 reserved ranges
registered as memory in memblock, not only as reserved. If these regions
are present in both memblock.memory and memblock.reserved, the struct
pages for them will be properly initialized as reserved.
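As a rough model of that suggestion (simplified, not the real memblock
implementation; the range lists and pfn_state struct here are invented for
illustration): once a range sits in both the "memory" and "reserved" lists,
the region-based memmap walk covers it, so its pages get the right zone ID
and a reserved flag instead of being left at node 0 / zone 0.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PFN 32

struct range { unsigned long start, end; };	/* [start, end) */

struct pfn_state {
	int zone_id;	/* 0 == "never initialized for a real zone" */
	bool reserved;
};

/* Region-based walk over the "memory" list: every covered PFN gets the
 * zone ID; PFNs also present in the "reserved" list are marked reserved. */
static void init_pages(struct pfn_state *pages,
		       const struct range *memory, size_t nr_mem,
		       const struct range *reserved, size_t nr_res,
		       int zone_id)
{
	size_t i, j;
	unsigned long pfn;

	for (i = 0; i < nr_mem; i++) {
		for (pfn = memory[i].start; pfn < memory[i].end; pfn++) {
			pages[pfn].zone_id = zone_id;
			for (j = 0; j < nr_res; j++)
				if (pfn >= reserved[j].start &&
				    pfn < reserved[j].end)
					pages[pfn].reserved = true;
		}
	}
}
```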

> -- 
> Mel Gorman
> SUSE Labs
Baoquan He May 21, 2020, 3:41 p.m. UTC | #4
On 05/21/20 at 10:36am, Mel Gorman wrote:
> On Thu, May 21, 2020 at 09:44:07AM +0800, Baoquan He wrote:
> > After investigation, it turns out that this is introduced by commit of
> > linux-next: commit f6edbdb71877 ("mm: memmap_init: iterate over memblock
> > regions rather that check each PFN").
> > 
> > After investigation, it turns out that this is introduced by commit of
> > linux-next, the patch subject is:
> >   "mm: memmap_init: iterate over memblock regions rather that check each PFN".
> > 
> 
> Some repetition here. I assume it's because the commit ID is not stable
> because it's in linux-next.

Yes, I planned to remove the temporary linux-next commit ID, but forgot
to clean it up.

> 
> > Qian added debugging code. The debugging log shows that the fault page is
> > 0x2a800000. From the system e820 map which is pasted at bottom, the page
> > is in e820 reserved range:
> > 	BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved
> > And it's in section [0x28000000, 0x2fffffff]. In that section, there are
> > several usable ranges and some e820 reserved ranges.
> > 
> > For this kind of e820 reserved range, it won't be added to memblock allocator.
> > However, init_unavailable_mem() will initialize to add them into node 0,
> > zone 0.
> 
> Why is it appropriate for init_unavailable_mem to add a e820 reserved
> range to the zone at all? The bug being triggered indicates there is a
> mismatch between the zone of a struct page and the PFN passed in.

Reading the patch log again and the review comments, I realize I may have
made the log misleading with an inaccurate explanation. The root cause is
not only about e820 reserved ranges; it is about any unavailable range
inside a boot memory section. The unavailable ranges include firmware
reserved ranges and holes. Before Mike's patchset below, we called
init_unavailable_mem() to initialize the unavailable ranges into node 0,
zone 0, and to mark the pages inside them as Reserved. Later, in
memmap_init_zone(), we would add the unavailable ranges into the zone
whose start and end cover them, because early_pfn_valid() and
early_pfn_in_nid() incorrectly return true for these unavailable pages.
This looks unreasonable, right? But it doesn't cause a problem, because
those pages are never added to the buddy allocator and are marked
Reserved.

After Mike's patchset is applied (specifically patch 15/21), we iterate
over memblock regions instead. These unavailable ranges are not added to
the memblock allocator, so they are no longer added into the zone whose
start and end cover them, but are kept in node 0, zone 0. This issue is
then triggered by VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)).

Note that init_unavailable_mem() and memmap_init() are both in generic
MM code, not in any particular arch. So I think the fix in this patch is
needed regardless of whether we also fix the issue that unavailable
ranges have struct pages and are added into node 0, zone 0.

[PATCH 00/21] mm: rework free_area_init*() funcitons
http://lkml.kernel.org/r/20200412194859.12663-1-rppt@kernel.org

[PATCH 15/21] mm: memmap_init: iterate over memblock regions rather that check each PFN

> 
> > Before that commit, later, memmap_init() will add e820 reserved
> > ranges into the zone where they are contained, because it can pass
> > the checking of early_pfn_valid() and early_pfn_in_nid(). In this case,
> > the e820 reserved range where fault page 0x2a800000 is located is added
> > into DMA32 zone. After that commit, the e820 reserved regions are kept
> > in node 0, zone 0, since we iterate over memblock regions to initialize
> > in memmap_init() instead, their node and zone won't be changed.
> > 
> 
> This implies that we have struct pages that should never be used (e820
> reserved) but exist somehow in a zone range but with broken linkages to
> their node and zone.

Yes. If one boot memory section includes usable RAM as well as
unavailable ranges, such as firmware reserved ranges and holes, those
unavailable ranges are added into node 0, zone 0, and have struct pages,
even though they are never used by the system.

> 
> > Reported-by: Qian Cai <cai@lca.pw>
> > Signed-off-by: Baoquan He <bhe@redhat.com>
> > ---
> >  mm/compaction.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 67fd317f78db..9ce4cff4d407 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -1418,7 +1418,9 @@ fast_isolate_freepages(struct compact_control *cc)
> >  				cc->free_pfn = highest;
> >  			} else {
> >  				if (cc->direct_compaction && pfn_valid(min_pfn)) {
> > -					page = pfn_to_page(min_pfn);
> > +					page = pageblock_pfn_to_page(min_pfn,
> > +						pageblock_end_pfn(min_pfn),
> > +						cc->zone);
> >  					cc->free_pfn = min_pfn;
> >  				}
> >  			}
> 
> Why is the correct fix not to avoid creating struct pages for e820
> ranges and make sure that struct pages that are reachable have proper
> node and zone linkages?

It seems we can't avoid creating struct pages for them: as I explained
above, they are not only e820 reserved ranges; there are also holes and
other firmware reserved ranges.
Baoquan He May 21, 2020, 3:52 p.m. UTC | #5
On 05/21/20 at 12:26pm, Mike Rapoport wrote:
> > For this kind of e820 reserved range, it won't be added to memblock allocator.
> > However, init_unavailable_mem() will initialize to add them into node 0,
> > zone 0. Before that commit, later, memmap_init() will add e820 reserved
> > ranges into the zone where they are contained, because it can pass
> > the checking of early_pfn_valid() and early_pfn_in_nid(). In this case,
> > the e820 reserved range where fault page 0x2a800000 is located is added
> > into DMA32 zone. After that commit, the e820 reserved regions are kept
> > in node 0, zone 0, since we iterate over memblock regions to initialize
> > in memmap_init() instead, their node and zone won't be changed.
> 
> I wonder why woudn't we add the reserved memory to memblock from the
> very beginning...
> I've tried to undestand why this was not done, but I couldn't find the
> reasoning behind that.

I added some explanation when replying to Mel. Please check it in that
thread.

As I said, the unavailable ranges include firmware reserved ranges and
holes inside a boot memory section, when that section contains usable
memory alongside firmware reserved ranges and holes. Adding them all into
memblock seems a little unreasonable, since they are never used by the
system through memblock, buddy, or any higher-level memory allocator. But
I can see that adding them into memblock may have the same effect as the
old code before your patchset was applied. Let's see whether Mel or other
people have an opinion. I personally would not suggest doing it this way,
though.

> 
> Can you please try the below patch and see if it helps or, on the
> contrary, breaks anything else :)
> 
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index c5399e80c59c..b0940c618ed9 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void)
>  		if (end != (resource_size_t)end)
>  			continue;
>  
> -		if (entry->type == E820_TYPE_SOFT_RESERVED)
> +		if (entry->type == E820_TYPE_SOFT_RESERVED ||
> +		    entry->type == E820_TYPE_RESERVED) {
> +			memblock_add(entry->addr, entry->size);
>  			memblock_reserve(entry->addr, entry->size);
> +		}
>  
>  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
>  			continue;
> 
> > Now, fast_isolate_freepages() will use min mark directly as the migration
> > target if no page is found from buddy. However, the min mark is not checked
> > carefully, it could be in e820 reserved range, and trigger the
> > VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)) when trying to mark it
> > as skip.
> > 
> > Here, let's call pageblock_pfn_to_page() to get page of min_pfn, since it
> > will do careful checks and return NULL if the pfn is not qualified.
> > 
> > [    0.000000] BIOS-provided physical RAM map:
> > [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000008bfff] usable
> > [    0.000000] BIOS-e820: [mem 0x000000000008c000-0x000000000009ffff] reserved
> > [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
> > [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000028328fff] usable
> > [    0.000000] BIOS-e820: [mem 0x0000000028329000-0x0000000028568fff] reserved
> > [    0.000000] BIOS-e820: [mem 0x0000000028569000-0x0000000028d85fff] usable
> > [    0.000000] BIOS-e820: [mem 0x0000000028d86000-0x0000000028ee5fff] reserved
> > [    0.000000] BIOS-e820: [mem 0x0000000028ee6000-0x0000000029a04fff] usable
> > [    0.000000] BIOS-e820: [mem 0x0000000029a05000-0x0000000029a08fff] reserved
> > [    0.000000] BIOS-e820: [mem 0x0000000029a09000-0x0000000029ee4fff] usable
> > [    0.000000] BIOS-e820: [mem 0x0000000029ee5000-0x0000000029ef2fff] ACPI data
> > [    0.000000] BIOS-e820: [mem 0x0000000029ef3000-0x0000000029f22fff] usable
> > [    0.000000] BIOS-e820: [mem 0x0000000029f23000-0x0000000029f23fff] ACPI data
> > [    0.000000] BIOS-e820: [mem 0x0000000029f24000-0x0000000029f24fff] usable
> > [    0.000000] BIOS-e820: [mem 0x0000000029f25000-0x0000000029f28fff] ACPI data
> > [    0.000000] BIOS-e820: [mem 0x0000000029f29000-0x0000000029f29fff] ACPI NVS
> > [    0.000000] BIOS-e820: [mem 0x0000000029f2a000-0x0000000029f2bfff] ACPI data
> > [    0.000000] BIOS-e820: [mem 0x0000000029f2c000-0x0000000029f2cfff] reserved
> > [    0.000000] BIOS-e820: [mem 0x0000000029f2d000-0x0000000029f2ffff] ACPI data
> > [    0.000000] BIOS-e820: [mem 0x0000000029f30000-0x0000000029ffdfff] usable
> > [    0.000000] BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved
> > [    0.000000] BIOS-e820: [mem 0x000000002a80b000-0x000000002a80efff] ACPI NVS
> > [    0.000000] BIOS-e820: [mem 0x000000002a80f000-0x000000002a81ffff] reserved
> > [    0.000000] BIOS-e820: [mem 0x000000002a820000-0x000000002a822fff] ACPI NVS
> > [    0.000000] BIOS-e820: [mem 0x000000002a823000-0x0000000033a22fff] usable
> > [    0.000000] BIOS-e820: [mem 0x0000000033a23000-0x0000000033a32fff] reserved
> > [    0.000000] BIOS-e820: [mem 0x0000000033a33000-0x0000000033a42fff] ACPI NVS
> > [    0.000000] BIOS-e820: [mem 0x0000000033a43000-0x0000000033a52fff] ACPI data
> > [    0.000000] BIOS-e820: [mem 0x0000000033a53000-0x0000000077ffffff] usable
> > [    0.000000] BIOS-e820: [mem 0x0000000078000000-0x000000008fffffff] reserved
> > [    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
> > [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000087effffff] usable
> > [    0.000000] BIOS-e820: [mem 0x000000087f000000-0x000000087fffffff] reserved
> > 
> > Reported-by: Qian Cai <cai@lca.pw>
> > Signed-off-by: Baoquan He <bhe@redhat.com>
> > ---
> >  mm/compaction.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 67fd317f78db..9ce4cff4d407 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -1418,7 +1418,9 @@ fast_isolate_freepages(struct compact_control *cc)
> >  				cc->free_pfn = highest;
> >  			} else {
> >  				if (cc->direct_compaction && pfn_valid(min_pfn)) {
> > -					page = pfn_to_page(min_pfn);
> > +					page = pageblock_pfn_to_page(min_pfn,
> > +						pageblock_end_pfn(min_pfn),
> > +						cc->zone);
> >  					cc->free_pfn = min_pfn;
> >  				}
> >  			}
> > -- 
> > 2.17.2
> > 
> 
> -- 
> Sincerely yours,
> Mike.
>
Mike Rapoport May 21, 2020, 5:18 p.m. UTC | #6
On Thu, May 21, 2020 at 11:52:25PM +0800, Baoquan He wrote:
> On 05/21/20 at 12:26pm, Mike Rapoport wrote:
> > > For this kind of e820 reserved range, it won't be added to the memblock
> > > allocator. However, init_unavailable_mem() will initialize it and add it
> > > into node 0, zone 0. Before that commit, memmap_init() would later add
> > > the e820 reserved ranges into the zone where they are contained, because
> > > they pass the early_pfn_valid() and early_pfn_in_nid() checks. In this
> > > case, the e820 reserved range where the faulting page 0x2a800000 is
> > > located was added into the DMA32 zone. After that commit, the e820
> > > reserved regions are kept in node 0, zone 0, since we iterate over
> > > memblock regions to initialize in memmap_init() instead, so their node
> > > and zone are never changed.
> > 
> > I wonder why wouldn't we add the reserved memory to memblock from the
> > very beginning...
> > I've tried to understand why this was not done, but I couldn't find the
> > reasoning behind that.
> 
> I have added some explanation when reply to Mel. Please check that in
> that thread.

What I meant was that I've tried to find an explanation in the kernel logs
for why the reserved areas are not added to memblock.memory on x86. I
didn't mean that your patch description lacks this explanation :)

> As I said, the unavailable ranges include firmware reserved ranges and
> holes inside a boot memory section, when that section contains a mix of
> usable memory, firmware reserved ranges, and holes. Adding them all into
> memblock seems a little unreasonable, since they are never used by the
> system in memblock, the buddy allocator, or any higher-level memory
> allocator. But I can see that adding them into memblock may have the same
> effect as the old code from before your patchset was applied. Let's see
> if Mel or other people have an opinion. I personally would not suggest
> doing it this way, though.

Adding reserved regions to memblock.memory will not have the same effect
as the old code. We anyway have to initialize struct page for these
areas, but unlike the old code we don't need to run them through the
early_pfn_in_nid() checks, and we still get rid of the
CONFIG_NODES_SPAN_OTHER_NODES option.

Besides, we'll have more consistency among NUMA-enabled architectures
this way.

> > 
> > Can you please try the below patch and see if it helps or, on the
> > contrary, breaks anything else :)
> > 
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index c5399e80c59c..b0940c618ed9 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void)
> >  		if (end != (resource_size_t)end)
> >  			continue;
> >  
> > -		if (entry->type == E820_TYPE_SOFT_RESERVED)
> > +		if (entry->type == E820_TYPE_SOFT_RESERVED ||
> > +		    entry->type == E820_TYPE_RESERVED) {
> > +			memblock_add(entry->addr, entry->size);
> >  			memblock_reserve(entry->addr, entry->size);
> > +		}
> >  
> >  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> >  			continue;
> > 
> > > Now, fast_isolate_freepages() will use the min mark directly as the
> > > migration target if no page is found in the buddy allocator. However,
> > > the min mark is not checked carefully; it could be in an e820 reserved
> > > range, and trigger
> > > VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)) when we try to
> > > mark it as skipped.
> > > 
> > > Here, let's call pageblock_pfn_to_page() to get the page of min_pfn,
> > > since it will do careful checks and return NULL if the pfn is not
> > > qualified.
> > > 
> > > [    0.000000] BIOS-provided physical RAM map:
> > > [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000008bfff] usable
> > > [    0.000000] BIOS-e820: [mem 0x000000000008c000-0x000000000009ffff] reserved
> > > [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
> > > [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000028328fff] usable
> > > [    0.000000] BIOS-e820: [mem 0x0000000028329000-0x0000000028568fff] reserved
> > > [    0.000000] BIOS-e820: [mem 0x0000000028569000-0x0000000028d85fff] usable
> > > [    0.000000] BIOS-e820: [mem 0x0000000028d86000-0x0000000028ee5fff] reserved
> > > [    0.000000] BIOS-e820: [mem 0x0000000028ee6000-0x0000000029a04fff] usable
> > > [    0.000000] BIOS-e820: [mem 0x0000000029a05000-0x0000000029a08fff] reserved
> > > [    0.000000] BIOS-e820: [mem 0x0000000029a09000-0x0000000029ee4fff] usable
> > > [    0.000000] BIOS-e820: [mem 0x0000000029ee5000-0x0000000029ef2fff] ACPI data
> > > [    0.000000] BIOS-e820: [mem 0x0000000029ef3000-0x0000000029f22fff] usable
> > > [    0.000000] BIOS-e820: [mem 0x0000000029f23000-0x0000000029f23fff] ACPI data
> > > [    0.000000] BIOS-e820: [mem 0x0000000029f24000-0x0000000029f24fff] usable
> > > [    0.000000] BIOS-e820: [mem 0x0000000029f25000-0x0000000029f28fff] ACPI data
> > > [    0.000000] BIOS-e820: [mem 0x0000000029f29000-0x0000000029f29fff] ACPI NVS
> > > [    0.000000] BIOS-e820: [mem 0x0000000029f2a000-0x0000000029f2bfff] ACPI data
> > > [    0.000000] BIOS-e820: [mem 0x0000000029f2c000-0x0000000029f2cfff] reserved
> > > [    0.000000] BIOS-e820: [mem 0x0000000029f2d000-0x0000000029f2ffff] ACPI data
> > > [    0.000000] BIOS-e820: [mem 0x0000000029f30000-0x0000000029ffdfff] usable
> > > [    0.000000] BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved
> > > [    0.000000] BIOS-e820: [mem 0x000000002a80b000-0x000000002a80efff] ACPI NVS
> > > [    0.000000] BIOS-e820: [mem 0x000000002a80f000-0x000000002a81ffff] reserved
> > > [    0.000000] BIOS-e820: [mem 0x000000002a820000-0x000000002a822fff] ACPI NVS
> > > [    0.000000] BIOS-e820: [mem 0x000000002a823000-0x0000000033a22fff] usable
> > > [    0.000000] BIOS-e820: [mem 0x0000000033a23000-0x0000000033a32fff] reserved
> > > [    0.000000] BIOS-e820: [mem 0x0000000033a33000-0x0000000033a42fff] ACPI NVS
> > > [    0.000000] BIOS-e820: [mem 0x0000000033a43000-0x0000000033a52fff] ACPI data
> > > [    0.000000] BIOS-e820: [mem 0x0000000033a53000-0x0000000077ffffff] usable
> > > [    0.000000] BIOS-e820: [mem 0x0000000078000000-0x000000008fffffff] reserved
> > > [    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
> > > [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000087effffff] usable
> > > [    0.000000] BIOS-e820: [mem 0x000000087f000000-0x000000087fffffff] reserved
> > > 
> > > Reported-by: Qian Cai <cai@lca.pw>
> > > Signed-off-by: Baoquan He <bhe@redhat.com>
> > > ---
> > >  mm/compaction.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/compaction.c b/mm/compaction.c
> > > index 67fd317f78db..9ce4cff4d407 100644
> > > --- a/mm/compaction.c
> > > +++ b/mm/compaction.c
> > > @@ -1418,7 +1418,9 @@ fast_isolate_freepages(struct compact_control *cc)
> > >  				cc->free_pfn = highest;
> > >  			} else {
> > >  				if (cc->direct_compaction && pfn_valid(min_pfn)) {
> > > -					page = pfn_to_page(min_pfn);
> > > +					page = pageblock_pfn_to_page(min_pfn,
> > > +						pageblock_end_pfn(min_pfn),
> > > +						cc->zone);
> > >  					cc->free_pfn = min_pfn;
> > >  				}
> > >  			}
> > > -- 
> > > 2.17.2
> > > 
> > 
> > -- 
> > Sincerely yours,
> > Mike.
> > 
>
Baoquan He May 22, 2020, 7:01 a.m. UTC | #7
On 05/21/20 at 08:18pm, Mike Rapoport wrote:
> On Thu, May 21, 2020 at 11:52:25PM +0800, Baoquan He wrote:
> > On 05/21/20 at 12:26pm, Mike Rapoport wrote:
> > > > For this kind of e820 reserved range, it won't be added to the memblock
> > > > allocator. However, init_unavailable_mem() will initialize it and add it
> > > > into node 0, zone 0. Before that commit, memmap_init() would later add
> > > > the e820 reserved ranges into the zone where they are contained, because
> > > > they pass the early_pfn_valid() and early_pfn_in_nid() checks. In this
> > > > case, the e820 reserved range where the faulting page 0x2a800000 is
> > > > located was added into the DMA32 zone. After that commit, the e820
> > > > reserved regions are kept in node 0, zone 0, since we iterate over
> > > > memblock regions to initialize in memmap_init() instead, so their node
> > > > and zone are never changed.
> > > 
> > > I wonder why wouldn't we add the reserved memory to memblock from the
> > > very beginning...
> > > I've tried to understand why this was not done, but I couldn't find the
> > > reasoning behind that.
> > 
> > I have added some explanation when reply to Mel. Please check that in
> > that thread.
> 
> What I meant was that I've tried to find an explanation in the kernel logs
> for why the reserved areas are not added to memblock.memory on x86. I
> didn't mean that your patch description lacks this explanation :)

Ah, I misunderstood it, sorry about that.

> 
> > As I said, the unavailable ranges include firmware reserved ranges and
> > holes inside a boot memory section, when that section contains a mix of
> > usable memory, firmware reserved ranges, and holes. Adding them all into
> > memblock seems a little unreasonable, since they are never used by the
> > system in memblock, the buddy allocator, or any higher-level memory
> > allocator. But I can see that adding them into memblock may have the same
> > effect as the old code from before your patchset was applied. Let's see
> > if Mel or other people have an opinion. I personally would not suggest
> > doing it this way, though.
> 
> Adding reserved regions to memblock.memory will not have the same effect
> as the old code. We anyway have to initialize struct page for these
> areas, but unlike the old code we don't need to run them through the
> early_pfn_in_nid() checks, and we still get rid of the
> CONFIG_NODES_SPAN_OTHER_NODES option.

Hmm, I mean adding them to memblock will give us the same result: they
are added into the node and zone where they should be, and marked as
reserved, just as the old code did.

Rethinking this, adding them into memblock seems doable. But we may not
need to add them from the e820 reserved ranges, since that would skip
hole ranges which share the same section with a usable range, and may
require changing code in different arches. How about this:

We add them into memblock in init_unavailable_range(); memmap_init() will
add them into the right node and zone, and reserve_bootmem_region() will
initialize them and mark them as Reserved.


From d019d0f9e7c958542dfcb142f93d07fcce6c7c22 Mon Sep 17 00:00:00 2001
From: Baoquan He <bhe@redhat.com>
Date: Fri, 22 May 2020 14:36:13 +0800
Subject: [PATCH] mm/page_alloc.c: Add unavailable ranges into memblock

These unavailable ranges share the same section with the usable ranges
in boot memory, e.g. the firmware reserved ranges and holes.

Previously, they were added into node 0, zone 0 in
init_unavailable_range() and marked as Reserved. Later, in
memmap_init(), they would be added to the appropriate node and zone,
the ones covering them.

However, after the patchset ("mm: rework free_area_init*() functions")
is applied, we iterate over memblock regions instead. These unavailable
ranges are skipped, so the node and zone adjustment is no longer done
as in the old code. This causes a crash in compaction, triggered by
VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)).

So let's add these unavailable ranges into memblock and reserve them
in init_unavailable_range() instead. With this change, they will be
added into the appropriate node and zone in memmap_init(), and
initialized in reserve_bootmem_region() just like any other memblock
reserved region.

Signed-off-by: Baoquan He <bhe@redhat.com>
---
 mm/page_alloc.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 603187800628..3973b5fdfe3f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6925,7 +6925,7 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
 static void __init init_unavailable_mem(void)
 {
 	phys_addr_t start, end;
-	u64 i, pgcnt;
+	u64 i, pgcnt, size;
 	phys_addr_t next = 0;
 
 	/*
@@ -6934,9 +6934,11 @@ static void __init init_unavailable_mem(void)
 	pgcnt = 0;
 	for_each_mem_range(i, &memblock.memory, NULL,
 			NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) {
-		if (next < start)
-			pgcnt += init_unavailable_range(PFN_DOWN(next),
-							PFN_UP(start));
+		if (next < start) {
+			size = PFN_UP(start) - PFN_DOWN(next);
+			memblock_add(PFN_DOWN(next), size);
+			memblock_reserve(PFN_DOWN(next), size);
+		}
 		next = end;
 	}
 
@@ -6947,8 +6949,11 @@ static void __init init_unavailable_mem(void)
 	 * considered initialized. Make sure that memmap has a well defined
 	 * state.
 	 */
-	pgcnt += init_unavailable_range(PFN_DOWN(next),
-					round_up(max_pfn, PAGES_PER_SECTION));
+	size = round_up(max_pfn, PAGES_PER_SECTION) - PFN_DOWN(next);
+	if (size) {
+		memblock_add(PFN_DOWN(next), size);
+		memblock_reserve(PFN_DOWN(next), size);
+	}
 
 	/*
 	 * Struct pages that do not have backing memory. This could be because
Baoquan He May 22, 2020, 7:25 a.m. UTC | #8
On 05/22/20 at 03:01pm, Baoquan He wrote:
> > > As I said, the unavailable ranges include firmware reserved ranges and
> > > holes inside a boot memory section, when that section contains a mix of
> > > usable memory, firmware reserved ranges, and holes. Adding them all into
> > > memblock seems a little unreasonable, since they are never used by the
> > > system in memblock, the buddy allocator, or any higher-level memory
> > > allocator. But I can see that adding them into memblock may have the same
> > > effect as the old code from before your patchset was applied. Let's see
> > > if Mel or other people have an opinion. I personally would not suggest
> > > doing it this way, though.
> > 
> > Adding reserved regions to memblock.memory will not have the same effect
> > as the old code. We anyway have to initialize struct page for these
> > areas, but unlike the old code we don't need to run them through the
> > early_pfn_in_nid() checks, and we still get rid of the
> > CONFIG_NODES_SPAN_OTHER_NODES option.
> 
> Hmm, I mean adding them to memblock will give us the same result: they
> are added into the node and zone where they should be, and marked as
> reserved, just as the old code did.
> 
> Rethinking this, adding them into memblock seems doable. But we may not
> need to add them from the e820 reserved ranges, since that would skip
> hole ranges which share the same section with a usable range, and may
> require changing code in different arches. How about this:
> 
> We add them into memblock in init_unavailable_range(); memmap_init() will
> add them into the right node and zone, and reserve_bootmem_region() will
> initialize them and mark them as Reserved.
> 
> 
> From d019d0f9e7c958542dfcb142f93d07fcce6c7c22 Mon Sep 17 00:00:00 2001
> From: Baoquan He <bhe@redhat.com>
> Date: Fri, 22 May 2020 14:36:13 +0800
> Subject: [PATCH] mm/page_alloc.c: Add unavailable ranges into memblock
> 
> These unavailable ranges share the same section with the usable ranges
> in boot memory, e.g. the firmware reserved ranges and holes.
> 
> Previously, they were added into node 0, zone 0 in
> init_unavailable_range() and marked as Reserved. Later, in
> memmap_init(), they would be added to the appropriate node and zone,
> the ones covering them.
> 
> However, after the patchset ("mm: rework free_area_init*() functions")
> is applied, we iterate over memblock regions instead. These unavailable
> ranges are skipped, so the node and zone adjustment is no longer done
> as in the old code. This causes a crash in compaction, triggered by
> VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)).
> 
> So let's add these unavailable ranges into memblock and reserve them
> in init_unavailable_range() instead. With this change, they will be
> added into the appropriate node and zone in memmap_init(), and
> initialized in reserve_bootmem_region() just like any other memblock
> reserved region.

Seems this is not right. They can't get a nid in init_unavailable_range().
Adding the e820 ranges may let them get a nid, but the hole ranges won't
be added to memblock and will still have the issue.

Nack this one for now, still considering.

> 
> Signed-off-by: Baoquan He <bhe@redhat.com>
> ---
>  mm/page_alloc.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 603187800628..3973b5fdfe3f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6925,7 +6925,7 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
>  static void __init init_unavailable_mem(void)
>  {
>  	phys_addr_t start, end;
> -	u64 i, pgcnt;
> +	u64 i, pgcnt, size;
>  	phys_addr_t next = 0;
>  
>  	/*
> @@ -6934,9 +6934,11 @@ static void __init init_unavailable_mem(void)
>  	pgcnt = 0;
>  	for_each_mem_range(i, &memblock.memory, NULL,
>  			NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) {
> -		if (next < start)
> -			pgcnt += init_unavailable_range(PFN_DOWN(next),
> -							PFN_UP(start));
> +		if (next < start) {
> +			size = PFN_UP(start) - PFN_DOWN(next);
> +			memblock_add(PFN_DOWN(next), size);
> +			memblock_reserve(PFN_DOWN(next), size);
> +		}
>  		next = end;
>  	}
>  
> @@ -6947,8 +6949,11 @@ static void __init init_unavailable_mem(void)
>  	 * considered initialized. Make sure that memmap has a well defined
>  	 * state.
>  	 */
> -	pgcnt += init_unavailable_range(PFN_DOWN(next),
> -					round_up(max_pfn, PAGES_PER_SECTION));
> +	size = round_up(max_pfn, PAGES_PER_SECTION) - PFN_DOWN(next);
> +	if (size) {
> +		memblock_add(PFN_DOWN(next), size);
> +		memblock_reserve(PFN_DOWN(next), size);
> +	}
>  
>  	/*
>  	 * Struct pages that do not have backing memory. This could be because
> -- 
> 2.17.2
>
Mike Rapoport May 22, 2020, 2:20 p.m. UTC | #9
Hello Baoquan,

On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote:
> On 05/22/20 at 03:01pm, Baoquan He wrote:
> > 
> > So let's add these unavailable ranges into memblock and reserve them
> > in init_unavailable_range() instead. With this change, they will be
> > added into the appropriate node and zone in memmap_init(), and
> > initialized in reserve_bootmem_region() just like any other memblock
> > reserved region.
> 
> Seems this is not right. They can't get a nid in init_unavailable_range().
> Adding the e820 ranges may let them get a nid, but the hole ranges won't
> be added to memblock and will still have the issue.
> 
> Nack this one for now, still considering.

Why won't we add the e820 reserved ranges to memblock.memory during
early boot as I suggested?

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index c5399e80c59c..b0940c618ed9 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void)
 		if (end != (resource_size_t)end)
 			continue;
 
-		if (entry->type == E820_TYPE_SOFT_RESERVED)
+		if (entry->type == E820_TYPE_SOFT_RESERVED ||
+		    entry->type == E820_TYPE_RESERVED) {
+			memblock_add(entry->addr, entry->size);
 			memblock_reserve(entry->addr, entry->size);
+		}
 
 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
 			continue;

The node setting later in numa_init() will assign the proper node
for these regions as it does for the usable memory.

> > 
> > Signed-off-by: Baoquan He <bhe@redhat.com>
> > ---
> >  mm/page_alloc.c | 17 +++++++++++------
> >  1 file changed, 11 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 603187800628..3973b5fdfe3f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6925,7 +6925,7 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
> >  static void __init init_unavailable_mem(void)
> >  {
> >  	phys_addr_t start, end;
> > -	u64 i, pgcnt;
> > +	u64 i, pgcnt, size;
> >  	phys_addr_t next = 0;
> >  
> >  	/*
> > @@ -6934,9 +6934,11 @@ static void __init init_unavailable_mem(void)
> >  	pgcnt = 0;
> >  	for_each_mem_range(i, &memblock.memory, NULL,
> >  			NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) {
> > -		if (next < start)
> > -			pgcnt += init_unavailable_range(PFN_DOWN(next),
> > -							PFN_UP(start));
> > +		if (next < start) {
> > +			size = PFN_UP(start) - PFN_DOWN(next);
> > +			memblock_add(PFN_DOWN(next), size);
> > +			memblock_reserve(PFN_DOWN(next), size);
> > +		}
> >  		next = end;
> >  	}
> >  
> > @@ -6947,8 +6949,11 @@ static void __init init_unavailable_mem(void)
> >  	 * considered initialized. Make sure that memmap has a well defined
> >  	 * state.
> >  	 */
> > -	pgcnt += init_unavailable_range(PFN_DOWN(next),
> > -					round_up(max_pfn, PAGES_PER_SECTION));
> > +	size = round_up(max_pfn, PAGES_PER_SECTION) - PFN_DOWN(next);
> > +	if (size) {
> > +		memblock_add(PFN_DOWN(next), size);
> > +		memblock_reserve(PFN_DOWN(next), size);
> > +	}
> >  
> >  	/*
> >  	 * Struct pages that do not have backing memory. This could be because
> > -- 
> > 2.17.2
> > 
>
Baoquan He May 26, 2020, 8:45 a.m. UTC | #10
On 05/22/20 at 05:20pm, Mike Rapoport wrote:
> Hello Baoquan,
> 
> On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote:
> > On 05/22/20 at 03:01pm, Baoquan He wrote:
> > > 
> > > So let's add these unavailable ranges into memblock and reserve them
> > > in init_unavailable_range() instead. With this change, they will be
> > > added into the appropriate node and zone in memmap_init(), and
> > > initialized in reserve_bootmem_region() just like any other memblock
> > > reserved region.
> > 
> > > Seems this is not right. They can't get a nid in init_unavailable_range().
> > > Adding the e820 ranges may let them get a nid, but the hole ranges won't
> > > be added to memblock and will still have the issue.
> > > 
> > > Nack this one for now, still considering.
> 
> Why won't we add  the e820 reserved ranges to memblock.memory during
> early boot as I suggested?
> 
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index c5399e80c59c..b0940c618ed9 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void)
>  		if (end != (resource_size_t)end)
>  			continue;
>  
> -		if (entry->type == E820_TYPE_SOFT_RESERVED)
> +		if (entry->type == E820_TYPE_SOFT_RESERVED ||
> +		    entry->type == E820_TYPE_RESERVED) {
> +			memblock_add(entry->addr, entry->size);
>  			memblock_reserve(entry->addr, entry->size);
> +		}
>  
>  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
>  			continue;
> 
> The node setting later in numa_init() will assign the proper node
> for these regions as it does for the usable memory.

Yes, if it were only related to the e820 reserved regions, this would
truly work.

However, there are also ACPI table regions. That's why I later changed
to calling the problematic areas firmware reserved ranges.

Besides, as you can see in the line below, there's another reserved
region which occupies only one page of a memory section. If it is added
to memblock.memory, we will also build a struct mem_section and the
relevant struct pages for the whole section. Then the holes around that
page will be added and initialized in init_unavailable_mem(). numa_init()
will assign the proper node for memblock.memory and memblock.reserved,
but won't assign the proper node for the holes.

~~~
[    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
~~~

So I still think we should not add the firmware reserved ranges into
memblock to fix this issue.

Also, the fix in the original patch seems necessary. As you can see in
the compaction code, the migration source is chosen from LRU pages or
movable pages, while the migration target has to come from the buddy
allocator. However, min_pfn in fast_isolate_freepages() is only
calculated from the distance between cc->free_pfn and cc->migrate_pfn,
so we can't guarantee it's safe to use as the target to handle.

Hi David,

Meanwhile, I checked the history of init_unavailable_mem(). Since the
commit below from David, the unavailable ranges began to be added to
zone 0 and node 0. Before that, we only zeroed the struct pages of
unavailable ranges and marked them as Reserved. Am wondering if we have
to add them to node 0 and zone 0; from the commit below, I don't get
why. Could you help clarify so that I get what I missed?
commit 4b094b7851bf4bf551ad456195d3f26e1c03bd74
Author: David Hildenbrand <david@redhat.com>
Date:   Mon Feb 3 17:33:55 2020 -0800

    mm/page_alloc.c: initialize memmap of unavailable memory directly

I have sent a mail to Naoya trying to clarify the question I have, to
see if we can do something to clean up the mess here.


[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000008bfff] usable
[    0.000000] BIOS-e820: [mem 0x000000000008c000-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000028328fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000028329000-0x0000000028568fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000028569000-0x0000000028d85fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000028d86000-0x0000000028ee5fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000028ee6000-0x0000000029a04fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029a05000-0x0000000029a08fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000029a09000-0x0000000029ee4fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029ee5000-0x0000000029ef2fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000029ef3000-0x0000000029f22fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029f23000-0x0000000029f23fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000029f24000-0x0000000029f24fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029f25000-0x0000000029f28fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000029f29000-0x0000000029f29fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000029f2a000-0x0000000029f2bfff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000029f2c000-0x0000000029f2cfff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000029f2d000-0x0000000029f2ffff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000029f30000-0x0000000029ffdfff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved
[    0.000000] BIOS-e820: [mem 0x000000002a80b000-0x000000002a80efff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000002a80f000-0x000000002a81ffff] reserved
[    0.000000] BIOS-e820: [mem 0x000000002a820000-0x000000002a822fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000002a823000-0x0000000033a22fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000033a23000-0x0000000033a32fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000033a33000-0x0000000033a42fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000033a43000-0x0000000033a52fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000033a53000-0x0000000077ffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000078000000-0x000000008fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000087effffff] usable
[    0.000000] BIOS-e820: [mem 0x000000087f000000-0x000000087fffffff] reserved
David Hildenbrand May 26, 2020, 8:55 a.m. UTC | #11
On 26.05.20 10:45, Baoquan He wrote:
> On 05/22/20 at 05:20pm, Mike Rapoport wrote:
>> Hello Baoquan,
>>
>> On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote:
>>> On 05/22/20 at 03:01pm, Baoquan He wrote:
>>>>
>>>> So let's add these unavailable ranges into memblock and reserve them
>>>> in init_unavailable_range() instead. With this change, they will be
>>>> added into the appropriate node and zone in memmap_init(), and
>>>> initialized in reserve_bootmem_region() just like any other memblock
>>>> reserved region.
>>>
>>> Seems this is not right. They can't get a nid in init_unavailable_range().
>>> Adding the e820 ranges may let them get a nid, but the hole ranges won't
>>> be added to memblock and will still have the issue.
>>> 
>>> Nack this one for now, still considering.
>>
>> Why won't we add  the e820 reserved ranges to memblock.memory during
>> early boot as I suggested?
>>
>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>> index c5399e80c59c..b0940c618ed9 100644
>> --- a/arch/x86/kernel/e820.c
>> +++ b/arch/x86/kernel/e820.c
>> @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void)
>>  		if (end != (resource_size_t)end)
>>  			continue;
>>  
>> -		if (entry->type == E820_TYPE_SOFT_RESERVED)
>> +		if (entry->type == E820_TYPE_SOFT_RESERVED ||
>> +		    entry->type == E820_TYPE_RESERVED) {
>> +			memblock_add(entry->addr, entry->size);
>>  			memblock_reserve(entry->addr, entry->size);
>> +		}
>>  
>>  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
>>  			continue;
>>
>> The setting of node later  in numa_init() will assign the proper node
>> for these regions as it does for the usable memory.
> 
> Yes, if it were only related to the e820 reserved regions, this would
> truly work.
> 
> However, there are also ACPI table regions. That's why I later changed
> to calling the problematic areas firmware reserved ranges.
> 
> Besides, as you can see in the line below, there's another reserved
> region which occupies only one page of a memory section. If it is added
> to memblock.memory, we will also build a struct mem_section and the
> relevant struct pages for the whole section. Then the holes around that
> page will be added and initialized in init_unavailable_mem(). numa_init()
> will assign the proper node for memblock.memory and memblock.reserved,
> but won't assign the proper node for the holes.
> 
> ~~~
> [    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
> ~~~
> 
> So I still think we should not add the firmware reserved ranges into
> memblock to fix this issue.
> 
> Also, the fix in the original patch seems necessary. As you can see in
> the compaction code, the migration source is chosen from LRU pages or
> movable pages, while the migration target has to come from the buddy
> allocator. However, min_pfn in fast_isolate_freepages() is only
> calculated from the distance between cc->free_pfn and cc->migrate_pfn,
> so we can't guarantee it's safe to use as the target to handle.
> 
> Hi David,
> 
> Meanwhile, I checked the history of init_unavailable_mem(). Since the
> commit below from David, the unavailable ranges began to be added to zone 0
> and node 0. Before that, we only zeroed the struct pages of unavailable
> ranges and marked them as Reserved. Am wondering if we have to add it to

Nope, before, not all pages were marked reserved. See the patch description.

> node 0 and zone 0. From the commit below, I don't get why. Could you help
> clarify so that I get what I missed?

Node 0 / zone 0 is just like zeroing it IIRC - no change in that regard,
my patch just called that out explicitly.
Mike Rapoport May 26, 2020, 11:32 a.m. UTC | #12
Hello Baoquan,

On Tue, May 26, 2020 at 04:45:43PM +0800, Baoquan He wrote:
> On 05/22/20 at 05:20pm, Mike Rapoport wrote:
> > Hello Baoquan,
> > 
> > On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote:
> > > On 05/22/20 at 03:01pm, Baoquan He wrote:
> > > > 
> > > > So let's add these unavailable ranges into memblock and reserve them
> > > > in init_unavailable_range() instead. With this change, they will be
> > > > added into the appropriate node and zone in memmap_init(), and
> > > > initialized in reserve_bootmem_region() just like any other memblock
> > > > reserved region.
> > > 
> > > Seems this is not right. They can't get nid in init_unavailable_range().
> > > Adding e820 ranges may let them get nid. But the hole range won't be
> > > added to memblock, and still has the issue.
> > > 
> > > Nack this one for now, still considering.
> > 
> > Why won't we add  the e820 reserved ranges to memblock.memory during
> > early boot as I suggested?
> > 
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index c5399e80c59c..b0940c618ed9 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void)
> >  		if (end != (resource_size_t)end)
> >  			continue;
> >  
> > -		if (entry->type == E820_TYPE_SOFT_RESERVED)
> > +		if (entry->type == E820_TYPE_SOFT_RESERVED ||
> > +		    entry->type == E820_TYPE_RESERVED) {
> > +			memblock_add(entry->addr, entry->size);
> >  			memblock_reserve(entry->addr, entry->size);
> > +		}
> >  
> >  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> >  			continue;
> > 
> > The setting of node later  in numa_init() will assign the proper node
> > for these regions as it does for the usable memory.
> 
> Yes, if it's only related to e820 reserved region, this truly works.
> 
> However, it also has ACPI table regions. That's why I changed to call
> the problematic area as firmware reserved ranges later.
> 
> Bisides, you can see below line, there's another reserved region which only
> occupies one page in one memory seciton. If adding to memblock.memory, we also
> will build struct mem_section and the relevant struct pages for the whole
> section. And then the holes around that page will be added and initialized in
> init_unavailable_mem(). numa_init() will assign proper node for memblock.memory
> and memblock.reserved, but won't assign proper node for the holes.
> 
> ~~~
> [    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
> ~~~
> 
> So I still think we should not add firmware reserved range into
> memblock for fixing this issue.
> 
> And, the fix in the original patch seems necessary. You can see in
> compaction code, the migration source is chosen from LRU pages or
> movable pages, the migration target has to be got from Buddy. However,
> only the min_pfn in fast_isolate_freepages(), it's calculated by
> distance between cc->free_pfn - cc->migrate_pfn, we can't guarantee it's
> safe, then use it as the target to handle.

I do not object to your original fix with careful check for pfn validity.

But I still think that the memory reserved by the firmware is still
memory and it should be added to memblock.memory. This way the memory
map will be properly initialized from the very beginning and we won't
need init_unavailable_mem() and similar workarounds. Obviously, the patch
above is not enough, but it's a small step in this direction.

I believe that improving the early memory initialization would make many
things simpler and more robust, but that's a different story :)
David Hildenbrand May 26, 2020, 11:49 a.m. UTC | #13
On 26.05.20 13:32, Mike Rapoport wrote:
> Hello Baoquan,
> 
> On Tue, May 26, 2020 at 04:45:43PM +0800, Baoquan He wrote:
>> On 05/22/20 at 05:20pm, Mike Rapoport wrote:
>>> Hello Baoquan,
>>>
>>> On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote:
>>>> On 05/22/20 at 03:01pm, Baoquan He wrote:
>>>>>
>>>>> So let's add these unavailable ranges into memblock and reserve them
>>>>> in init_unavailable_range() instead. With this change, they will be added
>>>>> into appropriate node and zone in memmap_init(), and initialized in
>>>>> reserve_bootmem_region() just like any other memblock reserved regions.
>>>>
>>>> Seems this is not right. They can't get nid in init_unavailable_range().
>>>> Adding e820 ranges may let them get nid. But the hole range won't be
>>>> added to memblock, and still has the issue.
>>>>
>>>> Nack this one for now, still considering.
>>>
>>> Why won't we add  the e820 reserved ranges to memblock.memory during
>>> early boot as I suggested?
>>>
>>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>>> index c5399e80c59c..b0940c618ed9 100644
>>> --- a/arch/x86/kernel/e820.c
>>> +++ b/arch/x86/kernel/e820.c
>>> @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void)
>>>  		if (end != (resource_size_t)end)
>>>  			continue;
>>>  
>>> -		if (entry->type == E820_TYPE_SOFT_RESERVED)
>>> +		if (entry->type == E820_TYPE_SOFT_RESERVED ||
>>> +		    entry->type == E820_TYPE_RESERVED) {
>>> +			memblock_add(entry->addr, entry->size);
>>>  			memblock_reserve(entry->addr, entry->size);
>>> +		}
>>>  
>>>  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
>>>  			continue;
>>>
>>> The setting of node later  in numa_init() will assign the proper node
>>> for these regions as it does for the usable memory.
>>
>> Yes, if it's only related to e820 reserved region, this truly works.
>>
>> However, it also has ACPI table regions. That's why I changed to call
>> the problematic area as firmware reserved ranges later.
>>
>> Bisides, you can see below line, there's another reserved region which only
>> occupies one page in one memory seciton. If adding to memblock.memory, we also
>> will build struct mem_section and the relevant struct pages for the whole
>> section. And then the holes around that page will be added and initialized in
>> init_unavailable_mem(). numa_init() will assign proper node for memblock.memory
>> and memblock.reserved, but won't assign proper node for the holes.
>>
>> ~~~
>> [    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
>> ~~~
>>
>> So I still think we should not add firmware reserved range into
>> memblock for fixing this issue.
>>
>> And, the fix in the original patch seems necessary. You can see in
>> compaction code, the migration source is chosen from LRU pages or
>> movable pages, the migration target has to be got from Buddy. However,
>> only the min_pfn in fast_isolate_freepages(), it's calculated by
>> distance between cc->free_pfn - cc->migrate_pfn, we can't guarantee it's
>> safe, then use it as the target to handle.
> 
> I do not object to your original fix with careful check for pfn validity.
> 
> But I still think that the memory reserved by the firmware is still
> memory and it should be added to memblock.memory. This way the memory

If it's really memory that could be read/written, I think I agree.

> map will be properly initialized from the very beginning and we won't
> need init_unavailable_mem() and alike workarounds and. Obviously, the patch

I remember init_unavailable_mem() is necessary for holes within
sections, where we actually *don't* have memory, but we still have
a valid memmap (full section) that we have to initialize.

See the example from 4b094b7851bf ("mm/page_alloc.c: initialize memmap
of unavailable memory directly"). Our main memory ends within a section,
so we have to initialize the remaining parts because the whole section
will be marked valid/online.

Any way to improve this handling is appreciated. In that patch I also
spelled out that we might want to mark such holes via a new page type,
e.g., PageHole(). Such a page is a memory hole, but has a valid memmap.
Any content in the memmap (zone/node) should be ignored.
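The PageHole() idea could look roughly like the following. This is a purely illustrative user-space model: PageHole() does not exist in the kernel, `struct fake_page` is a stand-in for `struct page`, and a real implementation would presumably use the page->page_type mechanism like PageOffline() does:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for struct page; only models the "hole" concept. */
struct fake_page {
	unsigned int page_type;
	int node;	/* garbage for hole pages -- must be ignored */
	int zone;	/* likewise */
};

#define PG_HOLE_BIT (1u << 0)

static bool PageHole(const struct fake_page *page)
{
	return page->page_type & PG_HOLE_BIT;
}

static void SetPageHole(struct fake_page *page)
{
	page->page_type |= PG_HOLE_BIT;
}

/* Node/zone-aware code would bail out on hole pages instead of
 * trusting page->node / page->zone. */
static int page_effective_node(const struct fake_page *page)
{
	if (PageHole(page))
		return -1;	/* NUMA_NO_NODE-style sentinel */
	return page->node;
}
```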

But it's all quite confusing, especially across architectures and ...

> above is not enough, but it's a small step in this direction.
> 
> I believe that improving the early memory initialization would make many
> things simpler and more robust, but that's a different story :)

... I second that.
Baoquan He May 28, 2020, 9:07 a.m. UTC | #14
On 05/26/20 at 01:49pm, David Hildenbrand wrote:
> On 26.05.20 13:32, Mike Rapoport wrote:
> > Hello Baoquan,
> > 
> > On Tue, May 26, 2020 at 04:45:43PM +0800, Baoquan He wrote:
> >> On 05/22/20 at 05:20pm, Mike Rapoport wrote:
> >>> Hello Baoquan,
> >>>
> >>> On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote:
> >>>> On 05/22/20 at 03:01pm, Baoquan He wrote:
> >>>>>
> >>>>> So let's add these unavailable ranges into memblock and reserve them
> >>>>> in init_unavailable_range() instead. With this change, they will be added
> >>>>> into appropriate node and zone in memmap_init(), and initialized in
> >>>>> reserve_bootmem_region() just like any other memblock reserved regions.
> >>>>
> >>>> Seems this is not right. They can't get nid in init_unavailable_range().
> >>>> Adding e820 ranges may let them get nid. But the hole range won't be
> >>>> added to memblock, and still has the issue.
> >>>>
> >>>> Nack this one for now, still considering.
> >>>
> >>> Why won't we add  the e820 reserved ranges to memblock.memory during
> >>> early boot as I suggested?
> >>>
> >>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> >>> index c5399e80c59c..b0940c618ed9 100644
> >>> --- a/arch/x86/kernel/e820.c
> >>> +++ b/arch/x86/kernel/e820.c
> >>> @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void)
> >>>  		if (end != (resource_size_t)end)
> >>>  			continue;
> >>>  
> >>> -		if (entry->type == E820_TYPE_SOFT_RESERVED)
> >>> +		if (entry->type == E820_TYPE_SOFT_RESERVED ||
> >>> +		    entry->type == E820_TYPE_RESERVED) {
> >>> +			memblock_add(entry->addr, entry->size);
> >>>  			memblock_reserve(entry->addr, entry->size);
> >>> +		}
> >>>  
> >>>  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> >>>  			continue;
> >>>
> >>> The setting of node later  in numa_init() will assign the proper node
> >>> for these regions as it does for the usable memory.
> >>
> >> Yes, if it's only related to e820 reserved region, this truly works.
> >>
> >> However, it also has ACPI table regions. That's why I changed to call
> >> the problematic area as firmware reserved ranges later.
> >>
> >> Bisides, you can see below line, there's another reserved region which only
> >> occupies one page in one memory seciton. If adding to memblock.memory, we also
> >> will build struct mem_section and the relevant struct pages for the whole
> >> section. And then the holes around that page will be added and initialized in
> >> init_unavailable_mem(). numa_init() will assign proper node for memblock.memory
> >> and memblock.reserved, but won't assign proper node for the holes.
> >>
> >> ~~~
> >> [    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
> >> ~~~
> >>
> >> So I still think we should not add firmware reserved range into
> >> memblock for fixing this issue.
> >>
> >> And, the fix in the original patch seems necessary. You can see in
> >> compaction code, the migration source is chosen from LRU pages or
> >> movable pages, the migration target has to be got from Buddy. However,
> >> only the min_pfn in fast_isolate_freepages(), it's calculated by
> >> distance between cc->free_pfn - cc->migrate_pfn, we can't guarantee it's
> >> safe, then use it as the target to handle.
> > 
> > I do not object to your original fix with careful check for pfn validity.
> > 
> > But I still think that the memory reserved by the firmware is still
> > memory and it should be added to memblock.memory. This way the memory
> 
> If it's really memory that could be read/written, I think I agree.

I would say some of them may not be allowed to be read or written, if I
understand it correctly. I roughly went through the x86 init code; there
are some places where a memory region is marked as E820_TYPE_RESERVED so
that it is not touched after initialization. E.g.:

1) pfn 0
In trim_bios_range(), we set pfn 0 as E820_TYPE_RESERVED. As the code
comment there says, this is a BIOS-owned area, not kernel RAM.

2) GART reserved region
In early_gart_iommu_check(), the GART IOMMU firmware reserves a region
into which the firmware designer won't map system RAM.

The same goes for intel_graphics_stolen() and arch_rmrr_sanity_check():
these regions are not backed by system RAM, and reading from or writing
into them may cause errors.

Furthermore, there's a KASLR bug found by HPE; its trigger and root
cause are described in the commit log below. You can see that speculative
CPU accesses to a firmware reserved region caused the BIOS to halt the
system.

commit 2aa85f246c181b1fa89f27e8e20c5636426be624
Author: Steve Wahl <steve.wahl@hpe.com>
Date:   Tue Sep 24 16:03:55 2019 -0500

    x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area

    Our hardware (UV aka Superdome Flex) has address ranges marked
    reserved by the BIOS. Access to these ranges is caught as an error,
    causing the BIOS to halt the system.

> 
> > map will be properly initialized from the very beginning and we won't
> > need init_unavailable_mem() and alike workarounds and. Obviously, the patch
> 
> I remember init_unavailable_mem() is necessary for holes within
> sections, where we actually *don't* have memory, but we still have have
> a valid memmap (full section) that we have to initialize.
> 
> See the example from 4b094b7851bf ("mm/page_alloc.c: initialize memmap
> of unavailable memory directly"). Our main memory ends within a section,
> so we have to initialize the remaining parts because the whole section
> will be marked valid/online.

Yes, memory holes need to be handled in init_unavailable_mem(). Since we
have created struct pages for them, we need to initialize them. We can't
discard init_unavailable_mem() for now.

> 
> Any way to improve this handling is appreciated. In that patch I also
> spelled out that we might want to mark such holes via a new page type,
> e.g., PageHole(). Such a page is a memory hole, but has a valid memmap.
> Any content in the memmap (zone/node) should be ignored.

As I said above, I am a little conservative about adding all those
E820_TYPE_RESERVED regions into memblock.memory and memblock.reserved,
because most of them are firmware reserved regions that may not be backed
by normal RAM.

I was thinking of stepping back to use mm_zero_struct_page() inside
init_unavailable_range(), as below. It doesn't differ much from
__init_single_page(), except for the _refcount and _mapcount.
Zeroing a struct page amounts to putting it into node 0, zone 0.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3973b5fdfe3f..4e4b72cf5283 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6901,7 +6901,7 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
                 * (in memblock.reserved but not in memblock.memory) will
                 * get re-initialized via reserve_bootmem_region() later.
                 */
-               __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
+               mm_zero_struct_page(pfn_to_page(pfn));
                __SetPageReserved(pfn_to_page(pfn));
                pgcnt++;
        }
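The claim above that a zeroed struct page lands in node 0, zone 0 follows from how the kernel packs the node and zone into the upper bits of page->flags: an all-zero flags word decodes to node 0, zone 0. A toy model of that encoding (the field widths here are illustrative, not the kernel's actual config-dependent layout):

```c
#include <assert.h>

/* Toy layout: node in bits 63-62, zone in bits 61-60, rest flag bits. */
#define ZONES_SHIFT 60
#define NODES_SHIFT 62
#define ZONES_MASK  0x3ULL
#define NODES_MASK  0x3ULL

static unsigned long long encode_flags(unsigned long long nid,
				       unsigned long long zone)
{
	return (nid << NODES_SHIFT) | (zone << ZONES_SHIFT);
}

static unsigned long long flags_to_nid(unsigned long long flags)
{
	return (flags >> NODES_SHIFT) & NODES_MASK;
}

static unsigned long long flags_to_zonenum(unsigned long long flags)
{
	return (flags >> ZONES_SHIFT) & ZONES_MASK;
}
```

With this model, a flags word of 0 (as mm_zero_struct_page() would leave it) reads back as node 0, zone 0, which is exactly what __init_single_page(page, pfn, 0, 0) sets explicitly.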

About adding these unavailable ranges into a node/zone: the old code
just happened to add them into the expected node/zone. You can see in
early_pfn_in_nid() that if no nid is found from memblock, the returned
'-1' ironically makes the check pass. But that is not to say the bad
behavior always got a good result. If the last zone of node 0 is the
DMA32 zone, the deferred init will skip the only chance to add some of
the unavailable ranges into the expected node/zone. That means they were
not always added into the appropriate node/zone before, so the change of
iterating memblock.memory in memmap_init() doesn't introduce a regression.

static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
{
        int nid;

        nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
        if (nid >= 0 && nid != node)
                return false;
        return true;
}
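The quirk in that check can be demonstrated in isolation: a nid of -1 ("no node found") fails the `nid >= 0` test, so the function falls through to `return true` for whatever node was asked about. A minimal standalone sketch (the stub replacing `__early_pfn_to_nid()` is hypothetical, hard-wired to the "not found" case):

```c
#include <assert.h>
#include <stdbool.h>

/* Stub: pretend the pfn was not found in any memblock region. */
static int early_pfn_to_nid_stub(unsigned long pfn)
{
	(void)pfn;
	return -1;	/* "no nid found" */
}

static bool early_pfn_in_nid(unsigned long pfn, int node)
{
	int nid = early_pfn_to_nid_stub(pfn);

	/* With nid == -1 this condition is false for every node, so the
	 * pfn is reported as belonging to *any* node it is tested against. */
	if (nid >= 0 && nid != node)
		return false;
	return true;
}
```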

So if nobody needs to access them after boot, not adding them into any
node/zone sounds better. Otherwise, it's better to add them to the
appropriate node/zone.

> 
> But it's all quite confusing, especially across architectures and ...
> 
> > above is not enough, but it's a small step in this direction.
> > 
> > I believe that improving the early memory initialization would make many
> > things simpler and more robust, but that's a different story :)
> 
> ... I second that.
> 
> -- 
> Thanks,
> 
> David / dhildenb
David Hildenbrand May 28, 2020, 9:08 a.m. UTC | #15
On 28.05.20 11:07, Baoquan He wrote:
> On 05/26/20 at 01:49pm, David Hildenbrand wrote:
>> On 26.05.20 13:32, Mike Rapoport wrote:
>>> Hello Baoquan,
>>>
>>> On Tue, May 26, 2020 at 04:45:43PM +0800, Baoquan He wrote:
>>>> On 05/22/20 at 05:20pm, Mike Rapoport wrote:
>>>>> Hello Baoquan,
>>>>>
>>>>> On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote:
>>>>>> On 05/22/20 at 03:01pm, Baoquan He wrote:
>>>>>>>
>>>>>>> So let's add these unavailable ranges into memblock and reserve them
>>>>>>> in init_unavailable_range() instead. With this change, they will be added
>>>>>>> into appropriate node and zone in memmap_init(), and initialized in
>>>>>>> reserve_bootmem_region() just like any other memblock reserved regions.
>>>>>>
>>>>>> Seems this is not right. They can't get nid in init_unavailable_range().
>>>>>> Adding e820 ranges may let them get nid. But the hole range won't be
>>>>>> added to memblock, and still has the issue.
>>>>>>
>>>>>> Nack this one for now, still considering.
>>>>>
>>>>> Why won't we add  the e820 reserved ranges to memblock.memory during
>>>>> early boot as I suggested?
>>>>>
>>>>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>>>>> index c5399e80c59c..b0940c618ed9 100644
>>>>> --- a/arch/x86/kernel/e820.c
>>>>> +++ b/arch/x86/kernel/e820.c
>>>>> @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void)
>>>>>  		if (end != (resource_size_t)end)
>>>>>  			continue;
>>>>>  
>>>>> -		if (entry->type == E820_TYPE_SOFT_RESERVED)
>>>>> +		if (entry->type == E820_TYPE_SOFT_RESERVED ||
>>>>> +		    entry->type == E820_TYPE_RESERVED) {
>>>>> +			memblock_add(entry->addr, entry->size);
>>>>>  			memblock_reserve(entry->addr, entry->size);
>>>>> +		}
>>>>>  
>>>>>  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
>>>>>  			continue;
>>>>>
>>>>> The setting of node later  in numa_init() will assign the proper node
>>>>> for these regions as it does for the usable memory.
>>>>
>>>> Yes, if it's only related to e820 reserved region, this truly works.
>>>>
>>>> However, it also has ACPI table regions. That's why I changed to call
>>>> the problematic area as firmware reserved ranges later.
>>>>
>>>> Bisides, you can see below line, there's another reserved region which only
>>>> occupies one page in one memory seciton. If adding to memblock.memory, we also
>>>> will build struct mem_section and the relevant struct pages for the whole
>>>> section. And then the holes around that page will be added and initialized in
>>>> init_unavailable_mem(). numa_init() will assign proper node for memblock.memory
>>>> and memblock.reserved, but won't assign proper node for the holes.
>>>>
>>>> ~~~
>>>> [    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
>>>> ~~~
>>>>
>>>> So I still think we should not add firmware reserved range into
>>>> memblock for fixing this issue.
>>>>
>>>> And, the fix in the original patch seems necessary. You can see in
>>>> compaction code, the migration source is chosen from LRU pages or
>>>> movable pages, the migration target has to be got from Buddy. However,
>>>> only the min_pfn in fast_isolate_freepages(), it's calculated by
>>>> distance between cc->free_pfn - cc->migrate_pfn, we can't guarantee it's
>>>> safe, then use it as the target to handle.
>>>
>>> I do not object to your original fix with careful check for pfn validity.
>>>
>>> But I still think that the memory reserved by the firmware is still
>>> memory and it should be added to memblock.memory. This way the memory
>>
>> If it's really memory that could be read/written, I think I agree.
> 
> I would say some of them may not be allowed to be read/written, if I
> understand it correctly. I roughly went through the x86 init code, there
> are some places where mem region is marked as E820_TYPE_RESERVED so that
> they are not touched after initialization. E.g:
> 
> 1) pfn 0
> In trim_bios_range(), we set the pfn 0 as E820_TYPE_RESERVED. You can
> see the code comment, this is a BIOS owned area, but not kernel RAM.
> 
> 2)GART reserved region
> In early_gart_iommu_check(), GART IOMMU firmware will reserve a region
> in an area, firmware designer won't map system RAM into that area.
> 
> And also intel_graphics_stolen(), arch_rmrr_sanity_check(), these
> regions are not system RAM backed area, reading from or writting into
> these area may cause error.
> 
> Futhermore, there's a KASLR bug found by HPE, its triggering and root
> cause are written into below commit log. You can see that accessing to
> firmware reserved region caused BIOS to halt system when cpu doing
> speculative.
> 
> commit 2aa85f246c181b1fa89f27e8e20c5636426be624
> Author: Steve Wahl <steve.wahl@hpe.com>
> Date:   Tue Sep 24 16:03:55 2019 -0500
> 
>     x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area
> 
>     Our hardware (UV aka Superdome Flex) has address ranges marked
>     reserved by the BIOS. Access to these ranges is caught as an error,
>     causing the BIOS to halt the system.
> 
>>
>>> map will be properly initialized from the very beginning and we won't
>>> need init_unavailable_mem() and alike workarounds and. Obviously, the patch
>>
>> I remember init_unavailable_mem() is necessary for holes within
>> sections, where we actually *don't* have memory, but we still have have
>> a valid memmap (full section) that we have to initialize.
>>
>> See the example from 4b094b7851bf ("mm/page_alloc.c: initialize memmap
>> of unavailable memory directly"). Our main memory ends within a section,
>> so we have to initialize the remaining parts because the whole section
>> will be marked valid/online.
> 
> Yes, memory hole need be handled in init_unavailable_mem(). Since we
> have created struct page for them, need initialize them. We can't
> discard init_unavailable_mem() for now.
> 
>>
>> Any way to improve this handling is appreciated. In that patch I also
>> spelled out that we might want to mark such holes via a new page type,
>> e.g., PageHole(). Such a page is a memory hole, but has a valid memmap.
>> Any content in the memmap (zone/node) should be ignored.
> 
> As I said at above, I am a little conservative to add all those regions of
> E820_TYPE_RESERVED into memblock.memory and memblock.reserved, because
> most of them are firmware reserved region, they may be not backed by normal
> RAM.
> 
> I was thinking to step back to use mm_zero_struct_page() inside
> init_unavailable_range() as below. But it doesn't differ much
> from __init_single_page(), except of the _refcount and mapcount.
> Zeroing struct page equals to putting them into node 0, zero 0.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3973b5fdfe3f..4e4b72cf5283 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6901,7 +6901,7 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
>                  * (in memblock.reserved but not in memblock.memory) will
>                  * get re-initialized via reserve_bootmem_region() later.
>                  */
> -               __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> +               mm_zero_struct_page(pfn_to_page(pfn));
>                 __SetPageReserved(pfn_to_page(pfn));
>                 pgcnt++;
>         }
> 
> About adding these unavailable ranges into node/zone, in the old code,
> it just happened to add them into expected node/zone. You can see in
> early_pfn_in_nid(), if no nid found from memblock, the returned '-1'
> will make it true ironically. But that is not saying the bad thing
> always got good result. If the last zone of node 0 is DMA32 zone, the
> deferred init will skip the only chance to add some of unavailable
> rnages into expected node/zone. Means they were not always added into
> appropriate node/zone before, the change of iterating memblock.memory in
> memmap_init() dones't introduce regression.
> 
> static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
> {
>         int nid;
> 
>         nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
>         if (nid >= 0 && nid != node)
>                 return false;
>         return true;
> }
> 
> So if no anybody need access them after boot, not adding them into any
> node/zone sounds better. Otherwise, better add them in the appropriate
> node/zone.

Yes, the node/zone is just completely irrelevant for these pages I'd say.

As I said, maybe we can flag these memmaps somehow as "while this is an
initialized memmap, the node/zone is garbage and this memmap should just
be ignored completely in any kind of node/zone aware code".
Steve Wahl May 28, 2020, 3:15 p.m. UTC | #16
On Thu, May 28, 2020 at 05:07:31PM +0800, Baoquan He wrote:
> On 05/26/20 at 01:49pm, David Hildenbrand wrote:
> > On 26.05.20 13:32, Mike Rapoport wrote:
> > > Hello Baoquan,
> > > 
> > > On Tue, May 26, 2020 at 04:45:43PM +0800, Baoquan He wrote:
> > >> On 05/22/20 at 05:20pm, Mike Rapoport wrote:
> > >>> Hello Baoquan,
> > >>>
> > >>> On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote:
> > >>>> On 05/22/20 at 03:01pm, Baoquan He wrote:
> > >>>>>
> > >>>>> So let's add these unavailable ranges into memblock and reserve them
> > >>>>> in init_unavailable_range() instead. With this change, they will be added
> > >>>>> into appropriate node and zone in memmap_init(), and initialized in
> > >>>>> reserve_bootmem_region() just like any other memblock reserved regions.
> > >>>>
> > >>>> Seems this is not right. They can't get nid in init_unavailable_range().
> > >>>> Adding e820 ranges may let them get nid. But the hole range won't be
> > >>>> added to memblock, and still has the issue.
> > >>>>
> > >>>> Nack this one for now, still considering.
> > >>>
> > >>> Why won't we add  the e820 reserved ranges to memblock.memory during
> > >>> early boot as I suggested?
> > >>>
> > >>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > >>> index c5399e80c59c..b0940c618ed9 100644
> > >>> --- a/arch/x86/kernel/e820.c
> > >>> +++ b/arch/x86/kernel/e820.c
> > >>> @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void)
> > >>>  		if (end != (resource_size_t)end)
> > >>>  			continue;
> > >>>  
> > >>> -		if (entry->type == E820_TYPE_SOFT_RESERVED)
> > >>> +		if (entry->type == E820_TYPE_SOFT_RESERVED ||
> > >>> +		    entry->type == E820_TYPE_RESERVED) {
> > >>> +			memblock_add(entry->addr, entry->size);
> > >>>  			memblock_reserve(entry->addr, entry->size);
> > >>> +		}
> > >>>  
> > >>>  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> > >>>  			continue;
> > >>>
> > >>> The setting of node later  in numa_init() will assign the proper node
> > >>> for these regions as it does for the usable memory.
> > >>
> > >> Yes, if it's only related to e820 reserved region, this truly works.
> > >>
> > >> However, it also has ACPI table regions. That's why I changed to call
> > >> the problematic area as firmware reserved ranges later.
> > >>
> > >> Bisides, you can see below line, there's another reserved region which only
> > >> occupies one page in one memory seciton. If adding to memblock.memory, we also
> > >> will build struct mem_section and the relevant struct pages for the whole
> > >> section. And then the holes around that page will be added and initialized in
> > >> init_unavailable_mem(). numa_init() will assign proper node for memblock.memory
> > >> and memblock.reserved, but won't assign proper node for the holes.
> > >>
> > >> ~~~
> > >> [    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
> > >> ~~~
> > >>
> > >> So I still think we should not add firmware reserved range into
> > >> memblock for fixing this issue.
> > >>
> > >> And, the fix in the original patch seems necessary. You can see in
> > >> compaction code, the migration source is chosen from LRU pages or
> > >> movable pages, the migration target has to be got from Buddy. However,
> > >> only the min_pfn in fast_isolate_freepages(), it's calculated by
> > >> distance between cc->free_pfn - cc->migrate_pfn, we can't guarantee it's
> > >> safe, then use it as the target to handle.
> > > 
> > > I do not object to your original fix with careful check for pfn validity.
> > > 
> > > But I still think that the memory reserved by the firmware is still
> > > memory and it should be added to memblock.memory. This way the memory
> > 
> > If it's really memory that could be read/written, I think I agree.
> 
> I would say some of them may not be allowed to be read/written, if I
> understand it correctly. I roughly went through the x86 init code, there
> are some places where mem region is marked as E820_TYPE_RESERVED so that
> they are not touched after initialization. E.g:
> 
> 1) pfn 0
> In trim_bios_range(), we set the pfn 0 as E820_TYPE_RESERVED. You can
> see the code comment, this is a BIOS owned area, but not kernel RAM.
> 
> 2)GART reserved region
> In early_gart_iommu_check(), GART IOMMU firmware will reserve a region
> in an area, firmware designer won't map system RAM into that area.
> 
> And also intel_graphics_stolen(), arch_rmrr_sanity_check(), these
> regions are not system RAM backed area, reading from or writting into
> these area may cause error.
> 
> Futhermore, there's a KASLR bug found by HPE, its triggering and root
> cause are written into below commit log. You can see that accessing to
> firmware reserved region caused BIOS to halt system when cpu doing
> speculative.
> 
> commit 2aa85f246c181b1fa89f27e8e20c5636426be624
> Author: Steve Wahl <steve.wahl@hpe.com>
> Date:   Tue Sep 24 16:03:55 2019 -0500
> 
>     x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area
> 
>     Our hardware (UV aka Superdome Flex) has address ranges marked
>     reserved by the BIOS. Access to these ranges is caught as an error,
>     causing the BIOS to halt the system.

Thank you for CC'ing me.  Coming into the middle of the conversation,
I am trying to understand the implications of what's being discussed
here, but not being very successful at it.

For the HPE UV hardware, the addresses listed in the reserved range
must never be accessed, or (as we discovered) even be reachable by an
active page table entry.  Any active page table entry covering the
region allows the processor to issue speculative accesses to the
region, resulting in the BIOS halt mentioned.

I'm not sure if the discussion above about adding the region to
memblock.memory would result in an active mapping of the region or
not.  If it did, we'd have problems.

Thanks,

Steve Wahl, HPE


> > 
> > > map will be properly initialized from the very beginning and we won't
> > > need init_unavailable_mem() and alike workarounds and. Obviously, the patch
> > 
> > I remember init_unavailable_mem() is necessary for holes within
> > sections, where we actually *don't* have memory, but we still have have
> > a valid memmap (full section) that we have to initialize.
> > 
> > See the example from 4b094b7851bf ("mm/page_alloc.c: initialize memmap
> > of unavailable memory directly"). Our main memory ends within a section,
> > so we have to initialize the remaining parts because the whole section
> > will be marked valid/online.
> 
> Yes, memory hole need be handled in init_unavailable_mem(). Since we
> have created struct page for them, need initialize them. We can't
> discard init_unavailable_mem() for now.
> 
> > 
> > Any way to improve this handling is appreciated. In that patch I also
> > spelled out that we might want to mark such holes via a new page type,
> > e.g., PageHole(). Such a page is a memory hole, but has a valid memmap.
> > Any content in the memmap (zone/node) should be ignored.
> 
> As I said above, I am a little reluctant to add all those
> E820_TYPE_RESERVED regions into memblock.memory and memblock.reserved,
> because most of them are firmware-reserved regions which may not be
> backed by normal RAM.
> 
> I was thinking of stepping back to using mm_zero_struct_page() inside
> init_unavailable_range() as below. But it doesn't differ much
> from __init_single_page(), except for _refcount and _mapcount.
> Zeroing a struct page is equivalent to putting it into node 0, zone 0.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3973b5fdfe3f..4e4b72cf5283 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6901,7 +6901,7 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
>                  * (in memblock.reserved but not in memblock.memory) will
>                  * get re-initialized via reserve_bootmem_region() later.
>                  */
> -               __init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> +               mm_zero_struct_page(pfn_to_page(pfn));
>                 __SetPageReserved(pfn_to_page(pfn));
>                 pgcnt++;
>         }
> 
> About adding these unavailable ranges into a node/zone: in the old
> code, they just happened to be added into the expected node/zone. You
> can see in early_pfn_in_nid() that if no nid is found from memblock,
> the returned '-1' ironically makes it return true. But that is not to
> say the broken logic always got a good result. If the last zone of
> node 0 is the DMA32 zone, deferred init will skip the only chance to
> add some of the unavailable ranges into the expected node/zone. That
> means they were not always added into the appropriate node/zone
> before, so the change of iterating memblock.memory in memmap_init()
> doesn't introduce a regression.
> 
> static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
> {
>         int nid;
> 
>         nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
>         if (nid >= 0 && nid != node)
>                 return false;
>         return true;
> }
> 
> So if nobody needs to access them after boot, not adding them into any
> node/zone sounds better. Otherwise, we'd better add them into the
> appropriate node/zone.
> 
> > 
> > But it's all quite confusing, especially across architectures and ...
> > 
> > > above is not enough, but it's a small step in this direction.
> > > 
> > > I believe that improving the early memory initialization would make many
> > > things simpler and more robust, but that's a different story :)
> > 
> > ... I second that.
> > 
> > -- 
> > Thanks,
> > 
> > David / dhildenb
>
Mike Rapoport June 1, 2020, 11:42 a.m. UTC | #17
On Thu, May 28, 2020 at 10:15:10AM -0500, Steve Wahl wrote:
> On Thu, May 28, 2020 at 05:07:31PM +0800, Baoquan He wrote:
> > On 05/26/20 at 01:49pm, David Hildenbrand wrote:
> > > On 26.05.20 13:32, Mike Rapoport wrote:
> > > > Hello Baoquan,
> > > > 
> > > > I do not object to your original fix with careful check for pfn validity.
> > > > 
> > > > But I still think that the memory reserved by the firmware is still
> > > > memory and it should be added to memblock.memory. This way the memory
> > > 
> > > If it's really memory that could be read/written, I think I agree.
> > 
> > I would say some of them may not be allowed to be read/written, if I
> > understand it correctly. I roughly went through the x86 init code, there
> > are some places where mem region is marked as E820_TYPE_RESERVED so that
> > they are not touched after initialization. E.g:
> > 
> > 1) pfn 0
> > In trim_bios_range(), we set the pfn 0 as E820_TYPE_RESERVED. You can
> > see the code comment, this is a BIOS owned area, but not kernel RAM.
> > 
> > 2)GART reserved region
> > In early_gart_iommu_check(), GART IOMMU firmware will reserve a region
> > in an area, firmware designer won't map system RAM into that area.
> > 
> > And also intel_graphics_stolen() and arch_rmrr_sanity_check(): these
> > regions are not system-RAM-backed areas, and reading from or writing
> > into these areas may cause errors.
> > 
> > Furthermore, there's a KASLR bug found by HPE; its trigger and root
> > cause are described in the commit log below. You can see that access
> > to a firmware-reserved region caused the BIOS to halt the system when
> > the CPU performed speculative accesses.
> > 
> > commit 2aa85f246c181b1fa89f27e8e20c5636426be624
> > Author: Steve Wahl <steve.wahl@hpe.com>
> > Date:   Tue Sep 24 16:03:55 2019 -0500
> > 
> >     x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area
> > 
> >     Our hardware (UV aka Superdome Flex) has address ranges marked
> >     reserved by the BIOS. Access to these ranges is caught as an error,
> >     causing the BIOS to halt the system.
> 
> Thank you for CC'ing me.  Coming into the middle of the conversation,
> I am trying to understand the implications of what's being discussed
> here, but not being very successful at it.
> 
> For the HPE UV hardware, the addresses listed in the reserved range
> must never be accessed, or (as we discovered) even be reachable by an
> active page table entry.  Any active page table entry covering the
> region allows the processor to issue speculative accesses to the
> region, resulting in the BIOS halt mentioned.
> 
> I'm not sure if the discussion above about adding the region to
> memblock.memory would result in an active mapping of the region or
> not.  If it did, we'd have problems.

The discussion is whether regions marked as E820_RESERVED should be
considered as RAM or not.

For the hardware that cannot tolerate accesses to these areas, like HPE
UV, it should not :)

I still think that keeping them only in memblock.reserved creates more
problems than it solves, but simply adding E820_RESERVED areas to
memblock.memory just won't work.

I'll try to come up with something better Really Soon (c).

> Thanks,
> 
> Steve Wahl, HPE
Baoquan He June 1, 2020, 1:31 p.m. UTC | #18
On 06/01/20 at 02:42pm, Mike Rapoport wrote:
> On Thu, May 28, 2020 at 10:15:10AM -0500, Steve Wahl wrote:
> > On Thu, May 28, 2020 at 05:07:31PM +0800, Baoquan He wrote:
> > > On 05/26/20 at 01:49pm, David Hildenbrand wrote:
> > > > On 26.05.20 13:32, Mike Rapoport wrote:
> > > > > Hello Baoquan,
> > > > > 
> > > > > I do not object to your original fix with careful check for pfn validity.
> > > > > 
> > > > > But I still think that the memory reserved by the firmware is still
> > > > > memory and it should be added to memblock.memory. This way the memory
> > > > 
> > > > If it's really memory that could be read/written, I think I agree.
> > > 
> > > I would say some of them may not be allowed to be read/written, if I
> > > understand it correctly. I roughly went through the x86 init code, there
> > > are some places where mem region is marked as E820_TYPE_RESERVED so that
> > > they are not touched after initialization. E.g:
> > > 
> > > 1) pfn 0
> > > In trim_bios_range(), we set the pfn 0 as E820_TYPE_RESERVED. You can
> > > see the code comment, this is a BIOS owned area, but not kernel RAM.
> > > 
> > > 2)GART reserved region
> > > In early_gart_iommu_check(), GART IOMMU firmware will reserve a region
> > > in an area, firmware designer won't map system RAM into that area.
> > > 
> > > And also intel_graphics_stolen() and arch_rmrr_sanity_check(): these
> > > regions are not system-RAM-backed areas, and reading from or writing
> > > into these areas may cause errors.
> > > 
> > > Furthermore, there's a KASLR bug found by HPE; its trigger and root
> > > cause are described in the commit log below. You can see that access
> > > to a firmware-reserved region caused the BIOS to halt the system when
> > > the CPU performed speculative accesses.
> > > 
> > > commit 2aa85f246c181b1fa89f27e8e20c5636426be624
> > > Author: Steve Wahl <steve.wahl@hpe.com>
> > > Date:   Tue Sep 24 16:03:55 2019 -0500
> > > 
> > >     x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area
> > > 
> > >     Our hardware (UV aka Superdome Flex) has address ranges marked
> > >     reserved by the BIOS. Access to these ranges is caught as an error,
> > >     causing the BIOS to halt the system.
> > 
> > Thank you for CC'ing me.  Coming into the middle of the conversation,
> > I am trying to understand the implications of what's being discussed
> > here, but not being very successful at it.
> > 
> > For the HPE UV hardware, the addresses listed in the reserved range
> > must never be accessed, or (as we discovered) even be reachable by an
> > active page table entry.  Any active page table entry covering the
> > region allows the processor to issue speculative accesses to the
> > region, resulting in the BIOS halt mentioned.
> > 
> > I'm not sure if the discussion above about adding the region to
> > memblock.memory would result in an active mapping of the region or
> > not.  If it did, we'd have problems.
> 
> The discussion is whether regions marked as E820_RESERVED should be
> considered as RAM or not.

IMHO, the discussion is about whether firmware reserved regions like
E820_TYPE_RESERVED should be added into memory allocators.

> 
> For the hardware that cannot tolerate accesses to these areas, like HPE
> UV, it should not :)
> 
> I still think that keeping them only in memblock.reserved creates more
> problems than it solves, but simply adding E820_RESERVED areas to
> memblock.memory just won't work.

Currently, we hardly ever add E820_RESERVED regions into memblock,
either memblock.memory or memblock.reserved. But in several places,
people add them to memblock.reserved after marking them as
E820_TYPE_RESERVED.

I agree with David that these pages aren't available to any later
memory allocator. "the node/zone is just completely irrelevant for
these pages". Keeping them out of the memory management system is safer
so that nobody will touch them. I tried to mark them out with a new
page type PageUnavail as David suggested, but am still struggling to
figure out where I should detect and exclude them.

Below is a draft patch of my attempt to mark out unavailable pages.
Any comments or suggestions are welcome. Of course, for any better idea
with a code change, I would be glad to test and review it. Hope we can
finally fix this issue now that it has been exposed.

From 25ca041258a40bb315f94a23f0465b510b1614a0 Mon Sep 17 00:00:00 2001
From: Baoquan He <bhe@redhat.com>
Date: Fri, 29 May 2020 20:23:13 +0800
Subject: [RFC PATCH] mm: Add a new page type to mark unavailable pages

Firstly, the last user of early_pfn_valid() is memmap_init_zone(), but that code
has been optimized away. Let's remove it.

Secondly, add a new page type to mark unavailable pages.

Thirdly, add a new early_pfn_valid() so that init_unavailable_range()
can initialize those pages in memblock.reserved.

Signed-off-by: Baoquan He <bhe@redhat.com>
---
 arch/x86/include/asm/mmzone_32.h |  2 --
 include/linux/mmzone.h           | 18 ++++++++++--------
 include/linux/page-flags.h       |  8 ++++++++
 mm/page_alloc.c                  |  3 ++-
 4 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/mmzone_32.h b/arch/x86/include/asm/mmzone_32.h
index 73d8dd14dda2..4da45eff2942 100644
--- a/arch/x86/include/asm/mmzone_32.h
+++ b/arch/x86/include/asm/mmzone_32.h
@@ -49,8 +49,6 @@ static inline int pfn_valid(int pfn)
 	return 0;
 }
 
-#define early_pfn_valid(pfn)	pfn_valid((pfn))
-
 #endif /* CONFIG_DISCONTIGMEM */
 
 #endif /* _ASM_X86_MMZONE_32_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e8d7d0b0acf4..47d09be73dca 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1315,20 +1315,23 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
 #endif
 
 #ifndef CONFIG_HAVE_ARCH_PFN_VALID
-static inline int pfn_valid(unsigned long pfn)
+static inline int early_pfn_valid(unsigned long pfn)
 {
-	struct mem_section *ms;
-
 	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
 		return 0;
-	ms = __nr_to_section(pfn_to_section_nr(pfn));
-	if (!valid_section(ms))
+	return valid_section(__pfn_to_section(pfn));
+}
+
+static inline int pfn_valid(unsigned long pfn)
+{
+	if (!early_pfn_valid(pfn))
 		return 0;
 	/*
 	 * Traditionally early sections always returned pfn_valid() for
 	 * the entire section-sized span.
 	 */
-	return early_section(ms) || pfn_section_valid(ms, pfn);
+	struct mem_section *ms = __pfn_to_section(pfn);
+
+	return (early_section(ms) && !PageUnavail(pfn_to_page(pfn))) ||
+	       pfn_section_valid(ms, pfn);
 }
 #endif
 
@@ -1364,7 +1367,6 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr)
 #define pfn_to_nid(pfn)		(0)
 #endif
 
-#define early_pfn_valid(pfn)	pfn_valid(pfn)
 void sparse_init(void);
 #else
 #define sparse_init()	do {} while (0)
@@ -1385,7 +1387,7 @@ struct mminit_pfnnid_cache {
 };
 
 #ifndef early_pfn_valid
-#define early_pfn_valid(pfn)	(1)
+#define early_pfn_valid(pfn)	pfn_valid(pfn)
 #endif
 
 void memory_present(int nid, unsigned long start, unsigned long end);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 222f6f7b2bb3..1c011008100c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -740,6 +740,7 @@ PAGEFLAG_FALSE(DoubleMap)
 #define PG_kmemcg	0x00000200
 #define PG_table	0x00000400
 #define PG_guard	0x00000800
+#define PG_unavail	0x00001000
 
 #define PageType(page, flag)						\
 	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -796,6 +797,13 @@ PAGE_TYPE_OPS(Table, table)
  */
 PAGE_TYPE_OPS(Guard, guard)
 
+/*
+ *  Mark pages which are unavailable to memory allocators after boot.
+ *  They include pages reserved by firmware before or during boot, and
+ *  holes which share the same section with usable RAM.
+ */
+PAGE_TYPE_OPS(Unavail, unavail)
+
 extern bool is_free_buddy_page(struct page *page);
 
 __PAGEFLAG(Isolated, isolated, PF_ANY);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 940cdce96864..e9ebef6ffbec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6972,7 +6972,8 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
 		 * (in memblock.reserved but not in memblock.memory) will
 		 * get re-initialized via reserve_bootmem_region() later.
 		 */
-		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
+		mm_zero_struct_page(pfn_to_page(pfn));
+		__SetPageUnavail(pfn_to_page(pfn));
 		__SetPageReserved(pfn_to_page(pfn));
 		pgcnt++;
 	}

Patch
diff mbox series

diff --git a/mm/compaction.c b/mm/compaction.c
index 67fd317f78db..9ce4cff4d407 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1418,7 +1418,9 @@  fast_isolate_freepages(struct compact_control *cc)
 				cc->free_pfn = highest;
 			} else {
 				if (cc->direct_compaction && pfn_valid(min_pfn)) {
-					page = pfn_to_page(min_pfn);
+					page = pageblock_pfn_to_page(min_pfn,
+						pageblock_end_pfn(min_pfn),
+						cc->zone);
 					cc->free_pfn = min_pfn;
 				}
 			}