* [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() @ 2020-05-21 1:44 Baoquan He 2020-05-21 9:26 ` Mike Rapoport 2020-05-21 9:36 ` Mel Gorman 0 siblings, 2 replies; 19+ messages in thread From: Baoquan He @ 2020-05-21 1:44 UTC (permalink / raw) To: linux-kernel; +Cc: linux-mm, akpm, cai, mgorman, mhocko, rppt, bhe Qian reported that a crash happened in compaction. http://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw LTP: starting swapping01 (swapping01 -i 5) page:ffffea0000aa0000 refcount:1 mapcount:0 mapping:000000002243743b index:0x0 flags: 0x1fffe000001000(reserved) raw: 001fffe000001000 ffffea0000aa0008 ffffea0000aa0008 0000000000000000 raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)) page_owner info is not present (never set?) ------------[ cut here ]------------ kernel BUG at mm/page_alloc.c:533! ... CPU: 17 PID: 218 Comm: kcompactd0 Not tainted 5.7.0-rc2-next-20200423+ #7 ... RIP: 0010:set_pfnblock_flags_mask+0x150/0x210 ... RSP: 0018:ffffc900042ff858 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff9a002382 RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8884535b8e6c RBP: ffffc900042ff8b8 R08: ffffed108a6b8459 R09: ffffed108a6b8459 R10: ffff8884535c22c7 R11: ffffed108a6b8458 R12: 000000000002a800 R13: ffffea0000aa0000 R14: ffff88847fff3000 R15: ffff88847fff3040 FS: 0000000000000000(0000) GS:ffff888453580000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd1eb4a1000 CR3: 000000083154c000 CR4: 00000000003406e0 Call Trace: isolate_freepages+0xb20/0x1140 ? isolate_freepages_block+0x730/0x730 ? mark_held_locks+0x34/0xb0 ? free_unref_page+0x7d/0x90 ? free_unref_page+0x7d/0x90 ? check_flags.part.28+0x86/0x220 compaction_alloc+0xdd/0x100 migrate_pages+0x304/0x17e0 ? __ClearPageMovable+0x100/0x100 ? isolate_freepages+0x1140/0x1140 compact_zone+0x1249/0x1e90 ? compaction_suitable+0x260/0x260 kcompactd_do_work+0x231/0x650 ? sysfs_compact_node+0x80/0x80 ? finish_wait+0xe6/0x110 kcompactd+0x162/0x490 ? kcompactd_do_work+0x650/0x650 ? finish_wait+0x110/0x110 ? __kasan_check_read+0x11/0x20 ? __kthread_parkme+0xd4/0xf0 ? kcompactd_do_work+0x650/0x650 kthread+0x1f7/0x220 ? kthread_create_worker_on_cpu+0xc0/0xc0 ret_from_fork+0x27/0x50 After investigation, it turns out that this is introduced by commit of linux-next: commit f6edbdb71877 ("mm: memmap_init: iterate over memblock regions rather that check each PFN"). After investigation, it turns out that this is introduced by commit of linux-next, the patch subject is: "mm: memmap_init: iterate over memblock regions rather that check each PFN". Qian added debugging code. The debugging log shows that the fault page is 0x2a800000. From the system e820 map which is pasted at bottom, the page is in e820 reserved range: BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved And it's in section [0x28000000, 0x2fffffff]. In that secion, there are several usable ranges and some e820 reserved ranges. For this kind of e820 reserved range, it won't be added to memblock allocator. However, init_unavailable_mem() will initialize to add them into node 0, zone 0. Before that commit, later, memmap_init() will add e820 reserved ranges into the zone where they are contained, because it can pass the checking of early_pfn_valid() and early_pfn_in_nid(). In this case, the e820 reserved range where fault page 0x2a800000 is located is added into DMA32 zone. After that commit, the e820 reserved rgions are kept in node 0, zone 0, since we iterate over memblock regions to iniatialize in memmap_init() instead, their node and zone won't be changed. Now, fast_isolate_freepages() will use min mark directly as the migration target if no page is found from buddy. However, the min mark is not checked carefully, it could be in e820 reserved range, and trigger the VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)) when try to mark it as skip. Here, let's call pageblock_pfn_to_page() to get page of min_pfn, since it will do careful checks and return NULL if the pfn is not qualified. [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000008bfff] usable [ 0.000000] BIOS-e820: [mem 0x000000000008c000-0x000000000009ffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000028328fff] usable [ 0.000000] BIOS-e820: [mem 0x0000000028329000-0x0000000028568fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000028569000-0x0000000028d85fff] usable [ 0.000000] BIOS-e820: [mem 0x0000000028d86000-0x0000000028ee5fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000028ee6000-0x0000000029a04fff] usable [ 0.000000] BIOS-e820: [mem 0x0000000029a05000-0x0000000029a08fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000029a09000-0x0000000029ee4fff] usable [ 0.000000] BIOS-e820: [mem 0x0000000029ee5000-0x0000000029ef2fff] ACPI data [ 0.000000] BIOS-e820: [mem 0x0000000029ef3000-0x0000000029f22fff] usable [ 0.000000] BIOS-e820: [mem 0x0000000029f23000-0x0000000029f23fff] ACPI data [ 0.000000] BIOS-e820: [mem 0x0000000029f24000-0x0000000029f24fff] usable [ 0.000000] BIOS-e820: [mem 0x0000000029f25000-0x0000000029f28fff] ACPI data [ 0.000000] BIOS-e820: [mem 0x0000000029f29000-0x0000000029f29fff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x0000000029f2a000-0x0000000029f2bfff] ACPI data [ 0.000000] BIOS-e820: [mem 0x0000000029f2c000-0x0000000029f2cfff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000029f2d000-0x0000000029f2ffff] ACPI data [ 0.000000] BIOS-e820: [mem 0x0000000029f30000-0x0000000029ffdfff] usable [ 0.000000] BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved [ 0.000000] BIOS-e820: [mem 0x000000002a80b000-0x000000002a80efff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x000000002a80f000-0x000000002a81ffff] reserved [ 0.000000] BIOS-e820: [mem 0x000000002a820000-0x000000002a822fff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x000000002a823000-0x0000000033a22fff] usable [ 0.000000] BIOS-e820: [mem 0x0000000033a23000-0x0000000033a32fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000033a33000-0x0000000033a42fff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x0000000033a43000-0x0000000033a52fff] ACPI data [ 0.000000] BIOS-e820: [mem 0x0000000033a53000-0x0000000077ffffff] usable [ 0.000000] BIOS-e820: [mem 0x0000000078000000-0x000000008fffffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000087effffff] usable [ 0.000000] BIOS-e820: [mem 0x000000087f000000-0x000000087fffffff] reserved Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Baoquan He <bhe@redhat.com> --- mm/compaction.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/compaction.c b/mm/compaction.c index 67fd317f78db..9ce4cff4d407 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1418,7 +1418,9 @@ fast_isolate_freepages(struct compact_control *cc) cc->free_pfn = highest; } else { if (cc->direct_compaction && pfn_valid(min_pfn)) { - page = pfn_to_page(min_pfn); + page = pageblock_pfn_to_page(min_pfn, + pageblock_end_pfn(min_pfn), + cc->zone); cc->free_pfn = min_pfn; } } -- 2.17.2 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() 2020-05-21 1:44 [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() Baoquan He @ 2020-05-21 9:26 ` Mike Rapoport 2020-05-21 15:52 ` Baoquan He 2020-05-21 9:36 ` Mel Gorman 1 sibling, 1 reply; 19+ messages in thread From: Mike Rapoport @ 2020-05-21 9:26 UTC (permalink / raw) To: Baoquan He; +Cc: linux-kernel, linux-mm, akpm, cai, mgorman, mhocko Hi Baoquan, On Thu, May 21, 2020 at 09:44:07AM +0800, Baoquan He wrote: > Qian reported that a crash happened in compaction. > http://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw > V> LTP: starting swapping01 (swapping01 -i 5) > page:ffffea0000aa0000 refcount:1 mapcount:0 mapping:000000002243743b index:0x0 > flags: 0x1fffe000001000(reserved) > raw: 001fffe000001000 ffffea0000aa0008 ffffea0000aa0008 0000000000000000 > raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 > page dumped because: VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)) > page_owner info is not present (never set?) > ------------[ cut here ]------------ > kernel BUG at mm/page_alloc.c:533! ... > After investigation, it turns out that this is introduced by commit of > linux-next: commit f6edbdb71877 ("mm: memmap_init: iterate over memblock > regions rather that check each PFN"). > > After investigation, it turns out that this is introduced by commit of > linux-next, the patch subject is: > "mm: memmap_init: iterate over memblock regions rather that check each PFN". > > Qian added debugging code. The debugging log shows that the fault page is > 0x2a800000. From the system e820 map which is pasted at bottom, the page > is in e820 reserved range: > BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved > And it's in section [0x28000000, 0x2fffffff]. In that secion, there are > several usable ranges and some e820 reserved ranges. > > For this kind of e820 reserved range, it won't be added to memblock allocator. > However, init_unavailable_mem() will initialize to add them into node 0, > zone 0. Before that commit, later, memmap_init() will add e820 reserved > ranges into the zone where they are contained, because it can pass > the checking of early_pfn_valid() and early_pfn_in_nid(). In this case, > the e820 reserved range where fault page 0x2a800000 is located is added > into DMA32 zone. After that commit, the e820 reserved rgions are kept > in node 0, zone 0, since we iterate over memblock regions to iniatialize > in memmap_init() instead, their node and zone won't be changed. I wonder why woudn't we add the reserved memory to memblock from the very beginning... I've tried to undestand why this was not done, but I couldn't find the reasoning behind that. Can you please try the below patch and see if it helps or, on the contrary, breaks anything else :) diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index c5399e80c59c..b0940c618ed9 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void) if (end != (resource_size_t)end) continue; - if (entry->type == E820_TYPE_SOFT_RESERVED) + if (entry->type == E820_TYPE_SOFT_RESERVED || + entry->type == E820_TYPE_RESERVED) { + memblock_add(entry->addr, entry->size); memblock_reserve(entry->addr, entry->size); + } if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) continue; > Now, fast_isolate_freepages() will use min mark directly as the migration > target if no page is found from buddy. However, the min mark is not checked > carefully, it could be in e820 reserved range, and trigger the > VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)) when try to mark it > as skip. > > Here, let's call pageblock_pfn_to_page() to get page of min_pfn, since it > will do careful checks and return NULL if the pfn is not qualified. > > [ 0.000000] BIOS-provided physical RAM map: > [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000008bfff] usable > [ 0.000000] BIOS-e820: [mem 0x000000000008c000-0x000000000009ffff] reserved > [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved > [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000028328fff] usable > [ 0.000000] BIOS-e820: [mem 0x0000000028329000-0x0000000028568fff] reserved > [ 0.000000] BIOS-e820: [mem 0x0000000028569000-0x0000000028d85fff] usable > [ 0.000000] BIOS-e820: [mem 0x0000000028d86000-0x0000000028ee5fff] reserved > [ 0.000000] BIOS-e820: [mem 0x0000000028ee6000-0x0000000029a04fff] usable > [ 0.000000] BIOS-e820: [mem 0x0000000029a05000-0x0000000029a08fff] reserved > [ 0.000000] BIOS-e820: [mem 0x0000000029a09000-0x0000000029ee4fff] usable > [ 0.000000] BIOS-e820: [mem 0x0000000029ee5000-0x0000000029ef2fff] ACPI data > [ 0.000000] BIOS-e820: [mem 0x0000000029ef3000-0x0000000029f22fff] usable > [ 0.000000] BIOS-e820: [mem 0x0000000029f23000-0x0000000029f23fff] ACPI data > [ 0.000000] BIOS-e820: [mem 0x0000000029f24000-0x0000000029f24fff] usable > [ 0.000000] BIOS-e820: [mem 0x0000000029f25000-0x0000000029f28fff] ACPI data > [ 0.000000] BIOS-e820: [mem 0x0000000029f29000-0x0000000029f29fff] ACPI NVS > [ 0.000000] BIOS-e820: [mem 0x0000000029f2a000-0x0000000029f2bfff] ACPI data > [ 0.000000] BIOS-e820: [mem 0x0000000029f2c000-0x0000000029f2cfff] reserved > [ 0.000000] BIOS-e820: [mem 0x0000000029f2d000-0x0000000029f2ffff] ACPI data > [ 0.000000] BIOS-e820: [mem 0x0000000029f30000-0x0000000029ffdfff] usable > [ 0.000000] BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved > [ 0.000000] BIOS-e820: [mem 0x000000002a80b000-0x000000002a80efff] ACPI NVS > [ 0.000000] BIOS-e820: [mem 0x000000002a80f000-0x000000002a81ffff] reserved > [ 0.000000] BIOS-e820: [mem 0x000000002a820000-0x000000002a822fff] ACPI NVS > [ 0.000000] BIOS-e820: [mem 0x000000002a823000-0x0000000033a22fff] usable > [ 0.000000] BIOS-e820: [mem 0x0000000033a23000-0x0000000033a32fff] reserved > [ 0.000000] BIOS-e820: [mem 0x0000000033a33000-0x0000000033a42fff] ACPI NVS > [ 0.000000] BIOS-e820: [mem 0x0000000033a43000-0x0000000033a52fff] ACPI data > [ 0.000000] BIOS-e820: [mem 0x0000000033a53000-0x0000000077ffffff] usable > [ 0.000000] BIOS-e820: [mem 0x0000000078000000-0x000000008fffffff] reserved > [ 0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved > [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000087effffff] usable > [ 0.000000] BIOS-e820: [mem 0x000000087f000000-0x000000087fffffff] reserved > > Reported-by: Qian Cai <cai@lca.pw> > Signed-off-by: Baoquan He <bhe@redhat.com> > --- > mm/compaction.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/mm/compaction.c b/mm/compaction.c > index 67fd317f78db..9ce4cff4d407 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -1418,7 +1418,9 @@ fast_isolate_freepages(struct compact_control *cc) > cc->free_pfn = highest; > } else { > if (cc->direct_compaction && pfn_valid(min_pfn)) { > - page = pfn_to_page(min_pfn); > + page = pageblock_pfn_to_page(min_pfn, > + pageblock_end_pfn(min_pfn), > + cc->zone); > cc->free_pfn = min_pfn; > } > } > -- > 2.17.2 > -- Sincerely yours, Mike. ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() 2020-05-21 9:26 ` Mike Rapoport @ 2020-05-21 15:52 ` Baoquan He [not found] ` <20200521171836.GU1059226@linux.ibm.com> 0 siblings, 1 reply; 19+ messages in thread From: Baoquan He @ 2020-05-21 15:52 UTC (permalink / raw) To: Mike Rapoport; +Cc: linux-kernel, linux-mm, akpm, cai, mgorman, mhocko On 05/21/20 at 12:26pm, Mike Rapoport wrote: > > For this kind of e820 reserved range, it won't be added to memblock allocator. > > However, init_unavailable_mem() will initialize to add them into node 0, > > zone 0. Before that commit, later, memmap_init() will add e820 reserved > > ranges into the zone where they are contained, because it can pass > > the checking of early_pfn_valid() and early_pfn_in_nid(). In this case, > > the e820 reserved range where fault page 0x2a800000 is located is added > > into DMA32 zone. After that commit, the e820 reserved rgions are kept > > in node 0, zone 0, since we iterate over memblock regions to iniatialize > > in memmap_init() instead, their node and zone won't be changed. > > I wonder why woudn't we add the reserved memory to memblock from the > very beginning... > I've tried to undestand why this was not done, but I couldn't find the > reasoning behind that. I have added some explanation when reply to Mel. Please check that in that thread. As I said, the unavailable range includes firmware reserved ranges, and holes inside one boot memory section, if that boot memory section haves useable memory range, and firmware reserved ranges, and holes. Adding them all into memblock seems a little unreasonable, since they are never used by system in memblock, buddy or high level memory allocator. But I can see that adding them into memblock may have the same effect as the old code which is beofre your your patchset applied. Let's see if Mel or other people have some saying. I pesonally would not suggest doing it like this though. > > Can you please try the below patch and see if it helps or, on the > contrary, breaks anything else :) > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c > index c5399e80c59c..b0940c618ed9 100644 > --- a/arch/x86/kernel/e820.c > +++ b/arch/x86/kernel/e820.c > @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void) > if (end != (resource_size_t)end) > continue; > > - if (entry->type == E820_TYPE_SOFT_RESERVED) > + if (entry->type == E820_TYPE_SOFT_RESERVED || > + entry->type == E820_TYPE_RESERVED) { > + memblock_add(entry->addr, entry->size); > memblock_reserve(entry->addr, entry->size); > + } > > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) > continue; > > > Now, fast_isolate_freepages() will use min mark directly as the migration > > target if no page is found from buddy. However, the min mark is not checked > > carefully, it could be in e820 reserved range, and trigger the > > VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)) when try to mark it > > as skip. > > > > Here, let's call pageblock_pfn_to_page() to get page of min_pfn, since it > > will do careful checks and return NULL if the pfn is not qualified. > > > > [ 0.000000] BIOS-provided physical RAM map: > > [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000008bfff] usable > > [ 0.000000] BIOS-e820: [mem 0x000000000008c000-0x000000000009ffff] reserved > > [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved > > [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000028328fff] usable > > [ 0.000000] BIOS-e820: [mem 0x0000000028329000-0x0000000028568fff] reserved > > [ 0.000000] BIOS-e820: [mem 0x0000000028569000-0x0000000028d85fff] usable > > [ 0.000000] BIOS-e820: [mem 0x0000000028d86000-0x0000000028ee5fff] reserved > > [ 0.000000] BIOS-e820: [mem 0x0000000028ee6000-0x0000000029a04fff] usable > > [ 0.000000] BIOS-e820: [mem 0x0000000029a05000-0x0000000029a08fff] reserved > > [ 0.000000] BIOS-e820: [mem 0x0000000029a09000-0x0000000029ee4fff] usable > > [ 0.000000] BIOS-e820: [mem 0x0000000029ee5000-0x0000000029ef2fff] ACPI data > > [ 0.000000] BIOS-e820: [mem 0x0000000029ef3000-0x0000000029f22fff] usable > > [ 0.000000] BIOS-e820: [mem 0x0000000029f23000-0x0000000029f23fff] ACPI data > > [ 0.000000] BIOS-e820: [mem 0x0000000029f24000-0x0000000029f24fff] usable > > [ 0.000000] BIOS-e820: [mem 0x0000000029f25000-0x0000000029f28fff] ACPI data > > [ 0.000000] BIOS-e820: [mem 0x0000000029f29000-0x0000000029f29fff] ACPI NVS > > [ 0.000000] BIOS-e820: [mem 0x0000000029f2a000-0x0000000029f2bfff] ACPI data > > [ 0.000000] BIOS-e820: [mem 0x0000000029f2c000-0x0000000029f2cfff] reserved > > [ 0.000000] BIOS-e820: [mem 0x0000000029f2d000-0x0000000029f2ffff] ACPI data > > [ 0.000000] BIOS-e820: [mem 0x0000000029f30000-0x0000000029ffdfff] usable > > [ 0.000000] BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved > > [ 0.000000] BIOS-e820: [mem 0x000000002a80b000-0x000000002a80efff] ACPI NVS > > [ 0.000000] BIOS-e820: [mem 0x000000002a80f000-0x000000002a81ffff] reserved > > [ 0.000000] BIOS-e820: [mem 0x000000002a820000-0x000000002a822fff] ACPI NVS > > [ 0.000000] BIOS-e820: [mem 0x000000002a823000-0x0000000033a22fff] usable > > [ 0.000000] BIOS-e820: [mem 0x0000000033a23000-0x0000000033a32fff] reserved > > [ 0.000000] BIOS-e820: [mem 0x0000000033a33000-0x0000000033a42fff] ACPI NVS > > [ 0.000000] BIOS-e820: [mem 0x0000000033a43000-0x0000000033a52fff] ACPI data > > [ 0.000000] BIOS-e820: [mem 0x0000000033a53000-0x0000000077ffffff] usable > > [ 0.000000] BIOS-e820: [mem 0x0000000078000000-0x000000008fffffff] reserved > > [ 0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved > > [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000087effffff] usable > > [ 0.000000] BIOS-e820: [mem 0x000000087f000000-0x000000087fffffff] reserved > > > > Reported-by: Qian Cai <cai@lca.pw> > > Signed-off-by: Baoquan He <bhe@redhat.com> > > --- > > mm/compaction.c | 4 +++- > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > diff --git a/mm/compaction.c b/mm/compaction.c > > index 67fd317f78db..9ce4cff4d407 100644 > > --- a/mm/compaction.c > > +++ b/mm/compaction.c > > @@ -1418,7 +1418,9 @@ fast_isolate_freepages(struct compact_control *cc) > > cc->free_pfn = highest; > > } else { > > if (cc->direct_compaction && pfn_valid(min_pfn)) { > > - page = pfn_to_page(min_pfn); > > + page = pageblock_pfn_to_page(min_pfn, > > + pageblock_end_pfn(min_pfn), > > + cc->zone); > > cc->free_pfn = min_pfn; > > } > > } > > -- > > 2.17.2 > > > > -- > Sincerely yours, > Mike. > ^ permalink raw reply [flat|nested] 19+ messages in thread
[parent not found: <20200521171836.GU1059226@linux.ibm.com>]
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() [not found] ` <20200521171836.GU1059226@linux.ibm.com> @ 2020-05-22 7:01 ` Baoquan He 2020-05-22 7:25 ` Baoquan He 0 siblings, 1 reply; 19+ messages in thread From: Baoquan He @ 2020-05-22 7:01 UTC (permalink / raw) To: Mike Rapoport, mgorman; +Cc: linux-kernel, linux-mm, akpm, cai, mhocko On 05/21/20 at 08:18pm, Mike Rapoport wrote: > On Thu, May 21, 2020 at 11:52:25PM +0800, Baoquan He wrote: > > On 05/21/20 at 12:26pm, Mike Rapoport wrote: > > > > For this kind of e820 reserved range, it won't be added to memblock allocator. > > > > However, init_unavailable_mem() will initialize to add them into node 0, > > > > zone 0. Before that commit, later, memmap_init() will add e820 reserved > > > > ranges into the zone where they are contained, because it can pass > > > > the checking of early_pfn_valid() and early_pfn_in_nid(). In this case, > > > > the e820 reserved range where fault page 0x2a800000 is located is added > > > > into DMA32 zone. After that commit, the e820 reserved rgions are kept > > > > in node 0, zone 0, since we iterate over memblock regions to iniatialize > > > > in memmap_init() instead, their node and zone won't be changed. > > > > > > I wonder why woudn't we add the reserved memory to memblock from the > > > very beginning... > > > I've tried to undestand why this was not done, but I couldn't find the > > > reasoning behind that. > > > > I have added some explanation when reply to Mel. Please check that in > > that thread. > > What I meant was that I've tried to find an explanation in the kernel logs > for why the reserved areas are not added to memblock.memory on x86. I > didn't mean that the your patch description lacks this explanation :) Ah, I misunderstood it, sorry about that. > > > As I said, the unavailable range includes firmware reserved ranges, and > > holes inside one boot memory section, if that boot memory section haves > > useable memory range, and firmware reserved ranges, and holes. Adding > > them all into memblock seems a little unreasonable, since they are never > > used by system in memblock, buddy or high level memory allocator. But I > > can see that adding them into memblock may have the same effect as the > > old code which is beofre your your patchset applied. Let's see if Mel or > > other people have some saying. I pesonally would not suggest doing it > > like this though. > > Adding reserved regions to memblock.memory will not have the same effect > as the old code. We anyway have to initialize struct page for these > areas, but unlike the old code we don't need to run them by the > early_pfn_in_nid() checks and we still get rid the > CONFIG_NODES_SPAN_OTHER_NODES option. Hmm, I mean adding them to memblock will let us have the same result, they are added into the node, zone where they should be, and marked as reserved, just as the old code did. Rethink about this, seems adding them into memblock is doable. But we may not need to add them from e820 reserved range, since that will skip hole range which share the same section with usable range, and may need to change code in different ARCHes. How about this: We add them into memblock in init_unavailable_range(), memmap_init() will add them into the right node and zone, reserve_bootmem_region() will initialize them and mark them as Reserved. From d019d0f9e7c958542dfcb142f93d07fcce6c7c22 Mon Sep 17 00:00:00 2001 From: Baoquan He <bhe@redhat.com> Date: Fri, 22 May 2020 14:36:13 +0800 Subject: [PATCH] mm/page_alloc.c: Add unavailable ranges into memblock These unavailable ranges shares the same section with the usable range in boot memory, e.g the firmware reserved ranges, and holes. Previously, they are added into node 0, zone 0 in function init_unavailable_range(), and marked as Reserved. Later, in function memmap_init(), they will be added to appropriate node and zone, where they are covered. However, after the patchset ("mm: rework free_area_init*() funcitons") is applied, we change to iterate over memblock regions. These unavailable ranges are skipped, and the node and zone adjustment won't be done any more as the old code did. This cause a crash in compaction which is triggered by VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)). So let's add these unavailable ranges into memblock and reserve them in init_unavailable_range() instead. With this change, they will be added into appropriate node and zone in memmap_init(), and initialized in reserve_bootmem_region() just like any other memblock reserved regions. Signed-off-by: Baoquan He <bhe@redhat.com> --- mm/page_alloc.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 603187800628..3973b5fdfe3f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6925,7 +6925,7 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn) static void __init init_unavailable_mem(void) { phys_addr_t start, end; - u64 i, pgcnt; + u64 i, pgcnt, size; phys_addr_t next = 0; /* @@ -6934,9 +6934,11 @@ static void __init init_unavailable_mem(void) pgcnt = 0; for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) { - if (next < start) - pgcnt += init_unavailable_range(PFN_DOWN(next), - PFN_UP(start)); + if (next < start) { + size = PFN_UP(start) - PFN_DOWN(next); + memblock_add(PFN_DOWN(next), size); + memblock_reserve(PFN_DOWN(next), size); + } next = end; } @@ -6947,8 +6949,11 @@ static void __init init_unavailable_mem(void) * considered initialized. Make sure that memmap has a well defined * state. */ - pgcnt += init_unavailable_range(PFN_DOWN(next), - round_up(max_pfn, PAGES_PER_SECTION)); + size = round_up(max_pfn, PAGES_PER_SECTION) - PFN_DOWN(next); + if (size) { + memblock_add(PFN_DOWN(next), size); + memblock_reserve(PFN_DOWN(next), size); + } /* * Struct pages that do not have backing memory. This could be because -- 2.17.2 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() 2020-05-22 7:01 ` Baoquan He @ 2020-05-22 7:25 ` Baoquan He 2020-05-22 14:20 ` Mike Rapoport 0 siblings, 1 reply; 19+ messages in thread From: Baoquan He @ 2020-05-22 7:25 UTC (permalink / raw) To: Mike Rapoport, mgorman; +Cc: linux-kernel, linux-mm, akpm, cai, mhocko On 05/22/20 at 03:01pm, Baoquan He wrote: > > > As I said, the unavailable range includes firmware reserved ranges, and > > > holes inside one boot memory section, if that boot memory section haves > > > useable memory range, and firmware reserved ranges, and holes. Adding > > > them all into memblock seems a little unreasonable, since they are never > > > used by system in memblock, buddy or high level memory allocator. But I > > > can see that adding them into memblock may have the same effect as the > > > old code which is beofre your your patchset applied. Let's see if Mel or > > > other people have some saying. I pesonally would not suggest doing it > > > like this though. > > > > Adding reserved regions to memblock.memory will not have the same effect > > as the old code. We anyway have to initialize struct page for these > > areas, but unlike the old code we don't need to run them by the > > early_pfn_in_nid() checks and we still get rid the > > CONFIG_NODES_SPAN_OTHER_NODES option. > > Hmm, I mean adding them to memblock will let us have the same result, > they are added into the node, zone where they should be, and marked as > reserved, just as the old code did. > > Rethink about this, seems adding them into memblock is doable. But > we may not need to add them from e820 reserved range, since that will > skip hole range which share the same section with usable range, and may > need to change code in different ARCHes. How about this: > > We add them into memblock in init_unavailable_range(), memmap_init() will > add them into the right node and zone, reserve_bootmem_region() will > initialize them and mark them as Reserved. > > > From d019d0f9e7c958542dfcb142f93d07fcce6c7c22 Mon Sep 17 00:00:00 2001 > From: Baoquan He <bhe@redhat.com> > Date: Fri, 22 May 2020 14:36:13 +0800 > Subject: [PATCH] mm/page_alloc.c: Add unavailable ranges into memblock > > These unavailable ranges shares the same section with the usable range > in boot memory, e.g the firmware reserved ranges, and holes. > > Previously, they are added into node 0, zone 0 in function > init_unavailable_range(), and marked as Reserved. Later, in function > memmap_init(), they will be added to appropriate node and zone, where > they are covered. > > However, after the patchset ("mm: rework free_area_init*() funcitons") > is applied, we change to iterate over memblock regions. These unavailable > ranges are skipped, and the node and zone adjustment won't be done any > more as the old code did. This cause a crash in compaction which is triggered > by VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)). > > So let's add these unavailable ranges into memblock and reserve them > in init_unavailable_range() instead. With this change, they will be added > into appropriate node and zone in memmap_init(), and initialized in > reserve_bootmem_region() just like any other memblock reserved regions. Seems this is not right. They can't get nid in init_unavailable_range(). Adding e820 ranges may let them get nid. But the hole range won't be added to memblock, and still has the issue. Nack this one for now, still considering. > > Signed-off-by: Baoquan He <bhe@redhat.com> > --- > mm/page_alloc.c | 17 +++++++++++------ > 1 file changed, 11 insertions(+), 6 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 603187800628..3973b5fdfe3f 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6925,7 +6925,7 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn) > static void __init init_unavailable_mem(void) > { > phys_addr_t start, end; > - u64 i, pgcnt; > + u64 i, pgcnt, size; > phys_addr_t next = 0; > > /* > @@ -6934,9 +6934,11 @@ static void __init init_unavailable_mem(void) > pgcnt = 0; > for_each_mem_range(i, &memblock.memory, NULL, > NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) { > - if (next < start) > - pgcnt += init_unavailable_range(PFN_DOWN(next), > - PFN_UP(start)); > + if (next < start) { > + size = PFN_UP(start) - PFN_DOWN(next); > + memblock_add(PFN_DOWN(next), size); > + memblock_reserve(PFN_DOWN(next), size); > + } > next = end; > } > > @@ -6947,8 +6949,11 @@ static void __init init_unavailable_mem(void) > * considered initialized. Make sure that memmap has a well defined > * state. > */ > - pgcnt += init_unavailable_range(PFN_DOWN(next), > - round_up(max_pfn, PAGES_PER_SECTION)); > + size = round_up(max_pfn, PAGES_PER_SECTION) - PFN_DOWN(next); > + if (size) { > + memblock_add(PFN_DOWN(next), size); > + memblock_reserve(PFN_DOWN(next), size); > + } > > /* > * Struct pages that do not have backing memory. This could be because > -- > 2.17.2 > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() 2020-05-22 7:25 ` Baoquan He @ 2020-05-22 14:20 ` Mike Rapoport 2020-05-26 8:45 ` Baoquan He 0 siblings, 1 reply; 19+ messages in thread From: Mike Rapoport @ 2020-05-22 14:20 UTC (permalink / raw) To: Baoquan He; +Cc: mgorman, linux-kernel, linux-mm, akpm, cai, mhocko Hello Baoquan, On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote: > On 05/22/20 at 03:01pm, Baoquan He wrote: > > > > So let's add these unavailable ranges into memblock and reserve them > > in init_unavailable_range() instead. With this change, they will be added > > into appropriate node and zone in memmap_init(), and initialized in > > reserve_bootmem_region() just like any other memblock reserved regions. > > Seems this is not right. They can't get nid in init_unavailable_range(). > Adding e820 ranges may let them get nid. But the hole range won't be > added to memblock, and still has the issue. > > Nack this one for now, still considering. Why won't we add the e820 reserved ranges to memblock.memory during early boot as I suggested? diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index c5399e80c59c..b0940c618ed9 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void) if (end != (resource_size_t)end) continue; - if (entry->type == E820_TYPE_SOFT_RESERVED) + if (entry->type == E820_TYPE_SOFT_RESERVED || + entry->type == E820_TYPE_RESERVED) { + memblock_add(entry->addr, entry->size); memblock_reserve(entry->addr, entry->size); + } if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) continue; The setting of node later in numa_init() will assign the proper node for these regions as it does for the usable memory. > > > > Signed-off-by: Baoquan He <bhe@redhat.com> > > --- > > mm/page_alloc.c | 17 +++++++++++------ > > 1 file changed, 11 insertions(+), 6 deletions(-) > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 603187800628..3973b5fdfe3f 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -6925,7 +6925,7 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn) > > static void __init init_unavailable_mem(void) > > { > > phys_addr_t start, end; > > - u64 i, pgcnt; > > + u64 i, pgcnt, size; > > phys_addr_t next = 0; > > > > /* > > @@ -6934,9 +6934,11 @@ static void __init init_unavailable_mem(void) > > pgcnt = 0; > > for_each_mem_range(i, &memblock.memory, NULL, > > NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, NULL) { > > - if (next < start) > > - pgcnt += init_unavailable_range(PFN_DOWN(next), > > - PFN_UP(start)); > > + if (next < start) { > > + size = PFN_UP(start) - PFN_DOWN(next); > > + memblock_add(PFN_DOWN(next), size); > > + memblock_reserve(PFN_DOWN(next), size); > > + } > > next = end; > > } > > > > @@ -6947,8 +6949,11 @@ static void __init init_unavailable_mem(void) > > * considered initialized. Make sure that memmap has a well defined > > * state. > > */ > > - pgcnt += init_unavailable_range(PFN_DOWN(next), > > - round_up(max_pfn, PAGES_PER_SECTION)); > > + size = round_up(max_pfn, PAGES_PER_SECTION) - PFN_DOWN(next); > > + if (size) { > > + memblock_add(PFN_DOWN(next), size); > > + memblock_reserve(PFN_DOWN(next), size); > > + } > > > > /* > > * Struct pages that do not have backing memory. This could be because > > -- > > 2.17.2 > > > -- Sincerely yours, Mike. ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() 2020-05-22 14:20 ` Mike Rapoport @ 2020-05-26 8:45 ` Baoquan He 2020-05-26 8:55 ` David Hildenbrand 2020-05-26 11:32 ` Mike Rapoport 0 siblings, 2 replies; 19+ messages in thread From: Baoquan He @ 2020-05-26 8:45 UTC (permalink / raw) To: Mike Rapoport, mgorman, david; +Cc: linux-kernel, linux-mm, akpm, cai, mhocko On 05/22/20 at 05:20pm, Mike Rapoport wrote: > Hello Baoquan, > > On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote: > > On 05/22/20 at 03:01pm, Baoquan He wrote: > > > > > > So let's add these unavailable ranges into memblock and reserve them > > > in init_unavailable_range() instead. With this change, they will be added > > > into appropriate node and zone in memmap_init(), and initialized in > > > reserve_bootmem_region() just like any other memblock reserved regions. > > > > Seems this is not right. They can't get nid in init_unavailable_range(). > > Adding e820 ranges may let them get nid. But the hole range won't be > > added to memblock, and still has the issue. > > > > Nack this one for now, still considering. > > Why won't we add the e820 reserved ranges to memblock.memory during > early boot as I suggested? > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c > index c5399e80c59c..b0940c618ed9 100644 > --- a/arch/x86/kernel/e820.c > +++ b/arch/x86/kernel/e820.c > @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void) > if (end != (resource_size_t)end) > continue; > > - if (entry->type == E820_TYPE_SOFT_RESERVED) > + if (entry->type == E820_TYPE_SOFT_RESERVED || > + entry->type == E820_TYPE_RESERVED) { > + memblock_add(entry->addr, entry->size); > memblock_reserve(entry->addr, entry->size); > + } > > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) > continue; > > The setting of node later in numa_init() will assign the proper node > for these regions as it does for the usable memory. Yes, if it's only related to e820 reserved region, this truly works. However, it also has ACPI table regions. That's why I changed to call the problematic area as firmware reserved ranges later. Bisides, you can see below line, there's another reserved region which only occupies one page in one memory seciton. If adding to memblock.memory, we also will build struct mem_section and the relevant struct pages for the whole section. And then the holes around that page will be added and initialized in init_unavailable_mem(). numa_init() will assign proper node for memblock.memory and memblock.reserved, but won't assign proper node for the holes. ~~~ [ 0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved ~~~ So I still think we should not add firmware reserved range into memblock for fixing this issue. And, the fix in the original patch seems necessary. You can see in compaction code, the migration source is chosen from LRU pages or movable pages, the migration target has to be got from Buddy. However, only the min_pfn in fast_isolate_freepages(), it's calculated by distance between cc->free_pfn - cc->migrate_pfn, we can't guarantee it's safe, then use it as the target to handle. Hi David, Meanwhile, I checked the history of init_unavailable_mem(). Till below commit From David, the unavailable ranges began to be added to zone 0 and node 0. Before that, we only zero the struct page of unavailable ranges and mark it as Reserved. Am wondering if we have to add it to node 0 and zone 0. From below commit, I don't get why. Could you help clarify so that I get what I missed? commit 4b094b7851bf4bf551ad456195d3f26e1c03bd74 Author: David Hildenbrand <david@redhat.com> Date: Mon Feb 3 17:33:55 2020 -0800 mm/page_alloc.c: initialize memmap of unavailable memory directly I have sent mail to Naoya trying to make clear question I have, see if we can do something to clean up the mess here. [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000008bfff] usable [ 0.000000] BIOS-e820: [mem 0x000000000008c000-0x000000000009ffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000028328fff] usable [ 0.000000] BIOS-e820: [mem 0x0000000028329000-0x0000000028568fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000028569000-0x0000000028d85fff] usable [ 0.000000] BIOS-e820: [mem 0x0000000028d86000-0x0000000028ee5fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000028ee6000-0x0000000029a04fff] usable [ 0.000000] BIOS-e820: [mem 0x0000000029a05000-0x0000000029a08fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000029a09000-0x0000000029ee4fff] usable [ 0.000000] BIOS-e820: [mem 0x0000000029ee5000-0x0000000029ef2fff] ACPI data [ 0.000000] BIOS-e820: [mem 0x0000000029ef3000-0x0000000029f22fff] usable [ 0.000000] BIOS-e820: [mem 0x0000000029f23000-0x0000000029f23fff] ACPI data [ 0.000000] BIOS-e820: [mem 0x0000000029f24000-0x0000000029f24fff] usable [ 0.000000] BIOS-e820: [mem 0x0000000029f25000-0x0000000029f28fff] ACPI data [ 0.000000] BIOS-e820: [mem 0x0000000029f29000-0x0000000029f29fff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x0000000029f2a000-0x0000000029f2bfff] ACPI data [ 0.000000] BIOS-e820: [mem 0x0000000029f2c000-0x0000000029f2cfff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000029f2d000-0x0000000029f2ffff] ACPI data [ 0.000000] BIOS-e820: [mem 0x0000000029f30000-0x0000000029ffdfff] usable [ 0.000000] BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved [ 0.000000] BIOS-e820: [mem 0x000000002a80b000-0x000000002a80efff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x000000002a80f000-0x000000002a81ffff] reserved [ 0.000000] BIOS-e820: [mem 0x000000002a820000-0x000000002a822fff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x000000002a823000-0x0000000033a22fff] usable [ 0.000000] BIOS-e820: [mem 0x0000000033a23000-0x0000000033a32fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000033a33000-0x0000000033a42fff] ACPI NVS [ 0.000000] BIOS-e820: [mem 0x0000000033a43000-0x0000000033a52fff] ACPI data [ 0.000000] BIOS-e820: [mem 0x0000000033a53000-0x0000000077ffffff] usable [ 0.000000] BIOS-e820: [mem 0x0000000078000000-0x000000008fffffff] reserved [ 0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved [ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000087effffff] usable [ 0.000000] BIOS-e820: [mem 0x000000087f000000-0x000000087fffffff] reserved ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() 2020-05-26 8:45 ` Baoquan He @ 2020-05-26 8:55 ` David Hildenbrand 2020-05-26 11:32 ` Mike Rapoport 1 sibling, 0 replies; 19+ messages in thread From: David Hildenbrand @ 2020-05-26 8:55 UTC (permalink / raw) To: Baoquan He, Mike Rapoport, mgorman Cc: linux-kernel, linux-mm, akpm, cai, mhocko On 26.05.20 10:45, Baoquan He wrote: > On 05/22/20 at 05:20pm, Mike Rapoport wrote: >> Hello Baoquan, >> >> On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote: >>> On 05/22/20 at 03:01pm, Baoquan He wrote: >>>> >>>> So let's add these unavailable ranges into memblock and reserve them >>>> in init_unavailable_range() instead. With this change, they will be added >>>> into appropriate node and zone in memmap_init(), and initialized in >>>> reserve_bootmem_region() just like any other memblock reserved regions. >>> >>> Seems this is not right. They can't get nid in init_unavailable_range(). >>> Adding e820 ranges may let them get nid. But the hole range won't be >>> added to memblock, and still has the issue. >>> >>> Nack this one for now, still considering. >> >> Why won't we add the e820 reserved ranges to memblock.memory during >> early boot as I suggested? >> >> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c >> index c5399e80c59c..b0940c618ed9 100644 >> --- a/arch/x86/kernel/e820.c >> +++ b/arch/x86/kernel/e820.c >> @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void) >> if (end != (resource_size_t)end) >> continue; >> >> - if (entry->type == E820_TYPE_SOFT_RESERVED) >> + if (entry->type == E820_TYPE_SOFT_RESERVED || >> + entry->type == E820_TYPE_RESERVED) { >> + memblock_add(entry->addr, entry->size); >> memblock_reserve(entry->addr, entry->size); >> + } >> >> if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) >> continue; >> >> The setting of node later in numa_init() will assign the proper node >> for these regions as it does for the usable memory. > > Yes, if it's only related to e820 reserved region, this truly works. > > However, it also has ACPI table regions. That's why I changed to call > the problematic area as firmware reserved ranges later. > > Bisides, you can see below line, there's another reserved region which only > occupies one page in one memory seciton. If adding to memblock.memory, we also > will build struct mem_section and the relevant struct pages for the whole > section. And then the holes around that page will be added and initialized in > init_unavailable_mem(). numa_init() will assign proper node for memblock.memory > and memblock.reserved, but won't assign proper node for the holes. > > ~~~ > [ 0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved > ~~~ > > So I still think we should not add firmware reserved range into > memblock for fixing this issue. > > And, the fix in the original patch seems necessary. You can see in > compaction code, the migration source is chosen from LRU pages or > movable pages, the migration target has to be got from Buddy. However, > only the min_pfn in fast_isolate_freepages(), it's calculated by > distance between cc->free_pfn - cc->migrate_pfn, we can't guarantee it's > safe, then use it as the target to handle. > > Hi David, > > Meanwhile, I checked the history of init_unavailable_mem(). Till below > commit From David, the unavailable ranges began to be added to zone 0 > and node 0. Before that, we only zero the struct page of unavailable > ranges and mark it as Reserved. Am wondering if we have to add it to Nope, before, not all pages were marked reserved. See the patch description. > node 0 and zone 0. From below commit, I don't get why. Could you help > clarify so that I get what I missed? Node 0 / zone 0 is just like zeroing it IIRC - no change in that regard, my patch just called that out explicitly. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() 2020-05-26 8:45 ` Baoquan He 2020-05-26 8:55 ` David Hildenbrand @ 2020-05-26 11:32 ` Mike Rapoport 2020-05-26 11:49 ` David Hildenbrand 1 sibling, 1 reply; 19+ messages in thread From: Mike Rapoport @ 2020-05-26 11:32 UTC (permalink / raw) To: Baoquan He; +Cc: mgorman, david, linux-kernel, linux-mm, akpm, cai, mhocko Hello Baoquan, On Tue, May 26, 2020 at 04:45:43PM +0800, Baoquan He wrote: > On 05/22/20 at 05:20pm, Mike Rapoport wrote: > > Hello Baoquan, > > > > On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote: > > > On 05/22/20 at 03:01pm, Baoquan He wrote: > > > > > > > > So let's add these unavailable ranges into memblock and reserve them > > > > in init_unavailable_range() instead. With this change, they will be added > > > > into appropriate node and zone in memmap_init(), and initialized in > > > > reserve_bootmem_region() just like any other memblock reserved regions. > > > > > > Seems this is not right. They can't get nid in init_unavailable_range(). > > > Adding e820 ranges may let them get nid. But the hole range won't be > > > added to memblock, and still has the issue. > > > > > > Nack this one for now, still considering. > > > > Why won't we add the e820 reserved ranges to memblock.memory during > > early boot as I suggested? > > > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c > > index c5399e80c59c..b0940c618ed9 100644 > > --- a/arch/x86/kernel/e820.c > > +++ b/arch/x86/kernel/e820.c > > @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void) > > if (end != (resource_size_t)end) > > continue; > > > > - if (entry->type == E820_TYPE_SOFT_RESERVED) > > + if (entry->type == E820_TYPE_SOFT_RESERVED || > > + entry->type == E820_TYPE_RESERVED) { > > + memblock_add(entry->addr, entry->size); > > memblock_reserve(entry->addr, entry->size); > > + } > > > > if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) > > continue; > > > > The setting of node later in numa_init() will assign the proper node > > for these regions as it does for the usable memory. > > Yes, if it's only related to e820 reserved region, this truly works. > > However, it also has ACPI table regions. That's why I changed to call > the problematic area as firmware reserved ranges later. > > Bisides, you can see below line, there's another reserved region which only > occupies one page in one memory seciton. If adding to memblock.memory, we also > will build struct mem_section and the relevant struct pages for the whole > section. And then the holes around that page will be added and initialized in > init_unavailable_mem(). numa_init() will assign proper node for memblock.memory > and memblock.reserved, but won't assign proper node for the holes. > > ~~~ > [ 0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved > ~~~ > > So I still think we should not add firmware reserved range into > memblock for fixing this issue. > > And, the fix in the original patch seems necessary. You can see in > compaction code, the migration source is chosen from LRU pages or > movable pages, the migration target has to be got from Buddy. However, > only the min_pfn in fast_isolate_freepages(), it's calculated by > distance between cc->free_pfn - cc->migrate_pfn, we can't guarantee it's > safe, then use it as the target to handle. I do not object to your original fix with careful check for pfn validity. But I still think that the memory reserved by the firmware is still memory and it should be added to memblock.memory. This way the memory map will be properly initialized from the very beginning and we won't need init_unavailable_mem() and alike workarounds and. Obviously, the patch above is not enough, but it's a small step in this direction. I believe that improving the early memory initialization would make many things simpler and more robust, but that's a different story :) -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() 2020-05-26 11:32 ` Mike Rapoport @ 2020-05-26 11:49 ` David Hildenbrand 2020-05-28 9:07 ` Baoquan He 0 siblings, 1 reply; 19+ messages in thread From: David Hildenbrand @ 2020-05-26 11:49 UTC (permalink / raw) To: Mike Rapoport, Baoquan He Cc: mgorman, linux-kernel, linux-mm, akpm, cai, mhocko On 26.05.20 13:32, Mike Rapoport wrote: > Hello Baoquan, > > On Tue, May 26, 2020 at 04:45:43PM +0800, Baoquan He wrote: >> On 05/22/20 at 05:20pm, Mike Rapoport wrote: >>> Hello Baoquan, >>> >>> On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote: >>>> On 05/22/20 at 03:01pm, Baoquan He wrote: >>>>> >>>>> So let's add these unavailable ranges into memblock and reserve them >>>>> in init_unavailable_range() instead. With this change, they will be added >>>>> into appropriate node and zone in memmap_init(), and initialized in >>>>> reserve_bootmem_region() just like any other memblock reserved regions. >>>> >>>> Seems this is not right. They can't get nid in init_unavailable_range(). >>>> Adding e820 ranges may let them get nid. But the hole range won't be >>>> added to memblock, and still has the issue. >>>> >>>> Nack this one for now, still considering. >>> >>> Why won't we add the e820 reserved ranges to memblock.memory during >>> early boot as I suggested? >>> >>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c >>> index c5399e80c59c..b0940c618ed9 100644 >>> --- a/arch/x86/kernel/e820.c >>> +++ b/arch/x86/kernel/e820.c >>> @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void) >>> if (end != (resource_size_t)end) >>> continue; >>> >>> - if (entry->type == E820_TYPE_SOFT_RESERVED) >>> + if (entry->type == E820_TYPE_SOFT_RESERVED || >>> + entry->type == E820_TYPE_RESERVED) { >>> + memblock_add(entry->addr, entry->size); >>> memblock_reserve(entry->addr, entry->size); >>> + } >>> >>> if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) >>> continue; >>> >>> The setting of node later in numa_init() will assign the proper node >>> for these regions as it does for the usable memory. >> >> Yes, if it's only related to e820 reserved region, this truly works. >> >> However, it also has ACPI table regions. That's why I changed to call >> the problematic area as firmware reserved ranges later. >> >> Bisides, you can see below line, there's another reserved region which only >> occupies one page in one memory seciton. If adding to memblock.memory, we also >> will build struct mem_section and the relevant struct pages for the whole >> section. And then the holes around that page will be added and initialized in >> init_unavailable_mem(). numa_init() will assign proper node for memblock.memory >> and memblock.reserved, but won't assign proper node for the holes. >> >> ~~~ >> [ 0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved >> ~~~ >> >> So I still think we should not add firmware reserved range into >> memblock for fixing this issue. >> >> And, the fix in the original patch seems necessary. You can see in >> compaction code, the migration source is chosen from LRU pages or >> movable pages, the migration target has to be got from Buddy. However, >> only the min_pfn in fast_isolate_freepages(), it's calculated by >> distance between cc->free_pfn - cc->migrate_pfn, we can't guarantee it's >> safe, then use it as the target to handle. > > I do not object to your original fix with careful check for pfn validity. > > But I still think that the memory reserved by the firmware is still > memory and it should be added to memblock.memory. This way the memory If it's really memory that could be read/written, I think I agree. > map will be properly initialized from the very beginning and we won't > need init_unavailable_mem() and alike workarounds and. Obviously, the patch I remember init_unavailable_mem() is necessary for holes within sections, where we actually *don't* have memory, but we still have have a valid memmap (full section) that we have to initialize. See the example from 4b094b7851bf ("mm/page_alloc.c: initialize memmap of unavailable memory directly"). Our main memory ends within a section, so we have to initialize the remaining parts because the whole section will be marked valid/online. Any way to improve this handling is appreciated. In that patch I also spelled out that we might want to mark such holes via a new page type, e.g., PageHole(). Such a page is a memory hole, but has a valid memmap. Any content in the memmap (zone/node) should be ignored. But it's all quite confusing, especially across architectures and ... > above is not enough, but it's a small step in this direction. > > I believe that improving the early memory initialization would make many > things simpler and more robust, but that's a different story :) ... I second that. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() 2020-05-26 11:49 ` David Hildenbrand @ 2020-05-28 9:07 ` Baoquan He 2020-05-28 9:08 ` David Hildenbrand 2020-05-28 15:15 ` Steve Wahl 0 siblings, 2 replies; 19+ messages in thread From: Baoquan He @ 2020-05-28 9:07 UTC (permalink / raw) To: David Hildenbrand Cc: Mike Rapoport, mgorman, linux-kernel, linux-mm, akpm, cai, mhocko, steve.wahl On 05/26/20 at 01:49pm, David Hildenbrand wrote: > On 26.05.20 13:32, Mike Rapoport wrote: > > Hello Baoquan, > > > > On Tue, May 26, 2020 at 04:45:43PM +0800, Baoquan He wrote: > >> On 05/22/20 at 05:20pm, Mike Rapoport wrote: > >>> Hello Baoquan, > >>> > >>> On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote: > >>>> On 05/22/20 at 03:01pm, Baoquan He wrote: > >>>>> > >>>>> So let's add these unavailable ranges into memblock and reserve them > >>>>> in init_unavailable_range() instead. With this change, they will be added > >>>>> into appropriate node and zone in memmap_init(), and initialized in > >>>>> reserve_bootmem_region() just like any other memblock reserved regions. > >>>> > >>>> Seems this is not right. They can't get nid in init_unavailable_range(). > >>>> Adding e820 ranges may let them get nid. But the hole range won't be > >>>> added to memblock, and still has the issue. > >>>> > >>>> Nack this one for now, still considering. > >>> > >>> Why won't we add the e820 reserved ranges to memblock.memory during > >>> early boot as I suggested? > >>> > >>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c > >>> index c5399e80c59c..b0940c618ed9 100644 > >>> --- a/arch/x86/kernel/e820.c > >>> +++ b/arch/x86/kernel/e820.c > >>> @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void) > >>> if (end != (resource_size_t)end) > >>> continue; > >>> > >>> - if (entry->type == E820_TYPE_SOFT_RESERVED) > >>> + if (entry->type == E820_TYPE_SOFT_RESERVED || > >>> + entry->type == E820_TYPE_RESERVED) { > >>> + memblock_add(entry->addr, entry->size); > >>> memblock_reserve(entry->addr, entry->size); > >>> + } > >>> > >>> if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) > >>> continue; > >>> > >>> The setting of node later in numa_init() will assign the proper node > >>> for these regions as it does for the usable memory. > >> > >> Yes, if it's only related to e820 reserved region, this truly works. > >> > >> However, it also has ACPI table regions. That's why I changed to call > >> the problematic area as firmware reserved ranges later. > >> > >> Bisides, you can see below line, there's another reserved region which only > >> occupies one page in one memory seciton. If adding to memblock.memory, we also > >> will build struct mem_section and the relevant struct pages for the whole > >> section. And then the holes around that page will be added and initialized in > >> init_unavailable_mem(). numa_init() will assign proper node for memblock.memory > >> and memblock.reserved, but won't assign proper node for the holes. > >> > >> ~~~ > >> [ 0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved > >> ~~~ > >> > >> So I still think we should not add firmware reserved range into > >> memblock for fixing this issue. > >> > >> And, the fix in the original patch seems necessary. You can see in > >> compaction code, the migration source is chosen from LRU pages or > >> movable pages, the migration target has to be got from Buddy. However, > >> only the min_pfn in fast_isolate_freepages(), it's calculated by > >> distance between cc->free_pfn - cc->migrate_pfn, we can't guarantee it's > >> safe, then use it as the target to handle. > > > > I do not object to your original fix with careful check for pfn validity. > > > > But I still think that the memory reserved by the firmware is still > > memory and it should be added to memblock.memory. This way the memory > > If it's really memory that could be read/written, I think I agree. I would say some of them may not be allowed to be read/written, if I understand it correctly. I roughly went through the x86 init code, there are some places where mem region is marked as E820_TYPE_RESERVED so that they are not touched after initialization. E.g: 1) pfn 0 In trim_bios_range(), we set the pfn 0 as E820_TYPE_RESERVED. You can see the code comment, this is a BIOS owned area, but not kernel RAM. 2)GART reserved region In early_gart_iommu_check(), GART IOMMU firmware will reserve a region in an area, firmware designer won't map system RAM into that area. And also intel_graphics_stolen(), arch_rmrr_sanity_check(), these regions are not system RAM backed area, reading from or writting into these area may cause error. Futhermore, there's a KASLR bug found by HPE, its triggering and root cause are written into below commit log. You can see that accessing to firmware reserved region caused BIOS to halt system when cpu doing speculative. commit 2aa85f246c181b1fa89f27e8e20c5636426be624 Author: Steve Wahl <steve.wahl@hpe.com> Date: Tue Sep 24 16:03:55 2019 -0500 x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area Our hardware (UV aka Superdome Flex) has address ranges marked reserved by the BIOS. Access to these ranges is caught as an error, causing the BIOS to halt the system. > > > map will be properly initialized from the very beginning and we won't > > need init_unavailable_mem() and alike workarounds and. Obviously, the patch > > I remember init_unavailable_mem() is necessary for holes within > sections, where we actually *don't* have memory, but we still have have > a valid memmap (full section) that we have to initialize. > > See the example from 4b094b7851bf ("mm/page_alloc.c: initialize memmap > of unavailable memory directly"). Our main memory ends within a section, > so we have to initialize the remaining parts because the whole section > will be marked valid/online. Yes, memory hole need be handled in init_unavailable_mem(). Since we have created struct page for them, need initialize them. We can't discard init_unavailable_mem() for now. > > Any way to improve this handling is appreciated. In that patch I also > spelled out that we might want to mark such holes via a new page type, > e.g., PageHole(). Such a page is a memory hole, but has a valid memmap. > Any content in the memmap (zone/node) should be ignored. As I said at above, I am a little conservative to add all those regions of E820_TYPE_RESERVED into memblock.memory and memblock.reserved, because most of them are firmware reserved region, they may be not backed by normal RAM. I was thinking to step back to use mm_zero_struct_page() inside init_unavailable_range() as below. But it doesn't differ much from __init_single_page(), except of the _refcount and mapcount. Zeroing struct page equals to putting them into node 0, zero 0. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3973b5fdfe3f..4e4b72cf5283 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6901,7 +6901,7 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn) * (in memblock.reserved but not in memblock.memory) will * get re-initialized via reserve_bootmem_region() later. */ - __init_single_page(pfn_to_page(pfn), pfn, 0, 0); + mm_zero_struct_page(pfn_to_page(pfn)); __SetPageReserved(pfn_to_page(pfn)); pgcnt++; } About adding these unavailable ranges into node/zone, in the old code, it just happened to add them into expected node/zone. You can see in early_pfn_in_nid(), if no nid found from memblock, the returned '-1' will make it true ironically. But that is not saying the bad thing always got good result. If the last zone of node 0 is DMA32 zone, the deferred init will skip the only chance to add some of unavailable rnages into expected node/zone. Means they were not always added into appropriate node/zone before, the change of iterating memblock.memory in memmap_init() dones't introduce regression. static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node) { int nid; nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache); if (nid >= 0 && nid != node) return false; return true; } So if no anybody need access them after boot, not adding them into any node/zone sounds better. Otherwise, better add them in the appropriate node/zone. > > But it's all quite confusing, especially across architectures and ... > > > above is not enough, but it's a small step in this direction. > > > > I believe that improving the early memory initialization would make many > > things simpler and more robust, but that's a different story :) > > ... I second that. > > -- > Thanks, > > David / dhildenb ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() 2020-05-28 9:07 ` Baoquan He @ 2020-05-28 9:08 ` David Hildenbrand 2020-05-28 15:15 ` Steve Wahl 1 sibling, 0 replies; 19+ messages in thread From: David Hildenbrand @ 2020-05-28 9:08 UTC (permalink / raw) To: Baoquan He Cc: Mike Rapoport, mgorman, linux-kernel, linux-mm, akpm, cai, mhocko, steve.wahl On 28.05.20 11:07, Baoquan He wrote: > On 05/26/20 at 01:49pm, David Hildenbrand wrote: >> On 26.05.20 13:32, Mike Rapoport wrote: >>> Hello Baoquan, >>> >>> On Tue, May 26, 2020 at 04:45:43PM +0800, Baoquan He wrote: >>>> On 05/22/20 at 05:20pm, Mike Rapoport wrote: >>>>> Hello Baoquan, >>>>> >>>>> On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote: >>>>>> On 05/22/20 at 03:01pm, Baoquan He wrote: >>>>>>> >>>>>>> So let's add these unavailable ranges into memblock and reserve them >>>>>>> in init_unavailable_range() instead. With this change, they will be added >>>>>>> into appropriate node and zone in memmap_init(), and initialized in >>>>>>> reserve_bootmem_region() just like any other memblock reserved regions. >>>>>> >>>>>> Seems this is not right. They can't get nid in init_unavailable_range(). >>>>>> Adding e820 ranges may let them get nid. But the hole range won't be >>>>>> added to memblock, and still has the issue. >>>>>> >>>>>> Nack this one for now, still considering. >>>>> >>>>> Why won't we add the e820 reserved ranges to memblock.memory during >>>>> early boot as I suggested? >>>>> >>>>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c >>>>> index c5399e80c59c..b0940c618ed9 100644 >>>>> --- a/arch/x86/kernel/e820.c >>>>> +++ b/arch/x86/kernel/e820.c >>>>> @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void) >>>>> if (end != (resource_size_t)end) >>>>> continue; >>>>> >>>>> - if (entry->type == E820_TYPE_SOFT_RESERVED) >>>>> + if (entry->type == E820_TYPE_SOFT_RESERVED || >>>>> + entry->type == E820_TYPE_RESERVED) { >>>>> + memblock_add(entry->addr, entry->size); >>>>> memblock_reserve(entry->addr, entry->size); >>>>> + } >>>>> >>>>> if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) >>>>> continue; >>>>> >>>>> The setting of node later in numa_init() will assign the proper node >>>>> for these regions as it does for the usable memory. >>>> >>>> Yes, if it's only related to e820 reserved region, this truly works. >>>> >>>> However, it also has ACPI table regions. That's why I changed to call >>>> the problematic area as firmware reserved ranges later. >>>> >>>> Bisides, you can see below line, there's another reserved region which only >>>> occupies one page in one memory seciton. If adding to memblock.memory, we also >>>> will build struct mem_section and the relevant struct pages for the whole >>>> section. And then the holes around that page will be added and initialized in >>>> init_unavailable_mem(). numa_init() will assign proper node for memblock.memory >>>> and memblock.reserved, but won't assign proper node for the holes. >>>> >>>> ~~~ >>>> [ 0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved >>>> ~~~ >>>> >>>> So I still think we should not add firmware reserved range into >>>> memblock for fixing this issue. >>>> >>>> And, the fix in the original patch seems necessary. You can see in >>>> compaction code, the migration source is chosen from LRU pages or >>>> movable pages, the migration target has to be got from Buddy. However, >>>> only the min_pfn in fast_isolate_freepages(), it's calculated by >>>> distance between cc->free_pfn - cc->migrate_pfn, we can't guarantee it's >>>> safe, then use it as the target to handle. >>> >>> I do not object to your original fix with careful check for pfn validity. >>> >>> But I still think that the memory reserved by the firmware is still >>> memory and it should be added to memblock.memory. This way the memory >> >> If it's really memory that could be read/written, I think I agree. > > I would say some of them may not be allowed to be read/written, if I > understand it correctly. I roughly went through the x86 init code, there > are some places where mem region is marked as E820_TYPE_RESERVED so that > they are not touched after initialization. E.g: > > 1) pfn 0 > In trim_bios_range(), we set the pfn 0 as E820_TYPE_RESERVED. You can > see the code comment, this is a BIOS owned area, but not kernel RAM. > > 2)GART reserved region > In early_gart_iommu_check(), GART IOMMU firmware will reserve a region > in an area, firmware designer won't map system RAM into that area. > > And also intel_graphics_stolen(), arch_rmrr_sanity_check(), these > regions are not system RAM backed area, reading from or writting into > these area may cause error. > > Futhermore, there's a KASLR bug found by HPE, its triggering and root > cause are written into below commit log. You can see that accessing to > firmware reserved region caused BIOS to halt system when cpu doing > speculative. > > commit 2aa85f246c181b1fa89f27e8e20c5636426be624 > Author: Steve Wahl <steve.wahl@hpe.com> > Date: Tue Sep 24 16:03:55 2019 -0500 > > x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area > > Our hardware (UV aka Superdome Flex) has address ranges marked > reserved by the BIOS. Access to these ranges is caught as an error, > causing the BIOS to halt the system. > >> >>> map will be properly initialized from the very beginning and we won't >>> need init_unavailable_mem() and alike workarounds and. Obviously, the patch >> >> I remember init_unavailable_mem() is necessary for holes within >> sections, where we actually *don't* have memory, but we still have have >> a valid memmap (full section) that we have to initialize. >> >> See the example from 4b094b7851bf ("mm/page_alloc.c: initialize memmap >> of unavailable memory directly"). Our main memory ends within a section, >> so we have to initialize the remaining parts because the whole section >> will be marked valid/online. > > Yes, memory hole need be handled in init_unavailable_mem(). Since we > have created struct page for them, need initialize them. We can't > discard init_unavailable_mem() for now. > >> >> Any way to improve this handling is appreciated. In that patch I also >> spelled out that we might want to mark such holes via a new page type, >> e.g., PageHole(). Such a page is a memory hole, but has a valid memmap. >> Any content in the memmap (zone/node) should be ignored. > > As I said at above, I am a little conservative to add all those regions of > E820_TYPE_RESERVED into memblock.memory and memblock.reserved, because > most of them are firmware reserved region, they may be not backed by normal > RAM. > > I was thinking to step back to use mm_zero_struct_page() inside > init_unavailable_range() as below. But it doesn't differ much > from __init_single_page(), except of the _refcount and mapcount. > Zeroing struct page equals to putting them into node 0, zero 0. > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 3973b5fdfe3f..4e4b72cf5283 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6901,7 +6901,7 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn) > * (in memblock.reserved but not in memblock.memory) will > * get re-initialized via reserve_bootmem_region() later. > */ > - __init_single_page(pfn_to_page(pfn), pfn, 0, 0); > + mm_zero_struct_page(pfn_to_page(pfn)); > __SetPageReserved(pfn_to_page(pfn)); > pgcnt++; > } > > About adding these unavailable ranges into node/zone, in the old code, > it just happened to add them into expected node/zone. You can see in > early_pfn_in_nid(), if no nid found from memblock, the returned '-1' > will make it true ironically. But that is not saying the bad thing > always got good result. If the last zone of node 0 is DMA32 zone, the > deferred init will skip the only chance to add some of unavailable > rnages into expected node/zone. Means they were not always added into > appropriate node/zone before, the change of iterating memblock.memory in > memmap_init() dones't introduce regression. > > static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node) > { > int nid; > > nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache); > if (nid >= 0 && nid != node) > return false; > return true; > } > > So if no anybody need access them after boot, not adding them into any > node/zone sounds better. Otherwise, better add them in the appropriate > node/zone. Yes, the node/zone is just completely irrelevant for these pages I'd say. As I said, maybe we can flag these memmaps somehow as "while this is an initialized memmap, the node/zone is garbage and this memmap should just be ignored completely in any kind of node/zone aware code". -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() 2020-05-28 9:07 ` Baoquan He 2020-05-28 9:08 ` David Hildenbrand @ 2020-05-28 15:15 ` Steve Wahl 2020-06-01 11:42 ` Mike Rapoport 1 sibling, 1 reply; 19+ messages in thread From: Steve Wahl @ 2020-05-28 15:15 UTC (permalink / raw) To: Baoquan He Cc: David Hildenbrand, Mike Rapoport, mgorman, linux-kernel, linux-mm, akpm, cai, mhocko, steve.wahl On Thu, May 28, 2020 at 05:07:31PM +0800, Baoquan He wrote: > On 05/26/20 at 01:49pm, David Hildenbrand wrote: > > On 26.05.20 13:32, Mike Rapoport wrote: > > > Hello Baoquan, > > > > > > On Tue, May 26, 2020 at 04:45:43PM +0800, Baoquan He wrote: > > >> On 05/22/20 at 05:20pm, Mike Rapoport wrote: > > >>> Hello Baoquan, > > >>> > > >>> On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote: > > >>>> On 05/22/20 at 03:01pm, Baoquan He wrote: > > >>>>> > > >>>>> So let's add these unavailable ranges into memblock and reserve them > > >>>>> in init_unavailable_range() instead. With this change, they will be added > > >>>>> into appropriate node and zone in memmap_init(), and initialized in > > >>>>> reserve_bootmem_region() just like any other memblock reserved regions. > > >>>> > > >>>> Seems this is not right. They can't get nid in init_unavailable_range(). > > >>>> Adding e820 ranges may let them get nid. But the hole range won't be > > >>>> added to memblock, and still has the issue. > > >>>> > > >>>> Nack this one for now, still considering. > > >>> > > >>> Why won't we add the e820 reserved ranges to memblock.memory during > > >>> early boot as I suggested? > > >>> > > >>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c > > >>> index c5399e80c59c..b0940c618ed9 100644 > > >>> --- a/arch/x86/kernel/e820.c > > >>> +++ b/arch/x86/kernel/e820.c > > >>> @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void) > > >>> if (end != (resource_size_t)end) > > >>> continue; > > >>> > > >>> - if (entry->type == E820_TYPE_SOFT_RESERVED) > > >>> + if (entry->type == E820_TYPE_SOFT_RESERVED || > > >>> + entry->type == E820_TYPE_RESERVED) { > > >>> + memblock_add(entry->addr, entry->size); > > >>> memblock_reserve(entry->addr, entry->size); > > >>> + } > > >>> > > >>> if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) > > >>> continue; > > >>> > > >>> The setting of node later in numa_init() will assign the proper node > > >>> for these regions as it does for the usable memory. > > >> > > >> Yes, if it's only related to e820 reserved region, this truly works. > > >> > > >> However, it also has ACPI table regions. That's why I changed to call > > >> the problematic area as firmware reserved ranges later. > > >> > > >> Bisides, you can see below line, there's another reserved region which only > > >> occupies one page in one memory seciton. If adding to memblock.memory, we also > > >> will build struct mem_section and the relevant struct pages for the whole > > >> section. And then the holes around that page will be added and initialized in > > >> init_unavailable_mem(). numa_init() will assign proper node for memblock.memory > > >> and memblock.reserved, but won't assign proper node for the holes. > > >> > > >> ~~~ > > >> [ 0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved > > >> ~~~ > > >> > > >> So I still think we should not add firmware reserved range into > > >> memblock for fixing this issue. > > >> > > >> And, the fix in the original patch seems necessary. You can see in > > >> compaction code, the migration source is chosen from LRU pages or > > >> movable pages, the migration target has to be got from Buddy. However, > > >> only the min_pfn in fast_isolate_freepages(), it's calculated by > > >> distance between cc->free_pfn - cc->migrate_pfn, we can't guarantee it's > > >> safe, then use it as the target to handle. > > > > > > I do not object to your original fix with careful check for pfn validity. > > > > > > But I still think that the memory reserved by the firmware is still > > > memory and it should be added to memblock.memory. This way the memory > > > > If it's really memory that could be read/written, I think I agree. > > I would say some of them may not be allowed to be read/written, if I > understand it correctly. I roughly went through the x86 init code, there > are some places where mem region is marked as E820_TYPE_RESERVED so that > they are not touched after initialization. E.g: > > 1) pfn 0 > In trim_bios_range(), we set the pfn 0 as E820_TYPE_RESERVED. You can > see the code comment, this is a BIOS owned area, but not kernel RAM. > > 2)GART reserved region > In early_gart_iommu_check(), GART IOMMU firmware will reserve a region > in an area, firmware designer won't map system RAM into that area. > > And also intel_graphics_stolen(), arch_rmrr_sanity_check(), these > regions are not system RAM backed area, reading from or writting into > these area may cause error. > > Futhermore, there's a KASLR bug found by HPE, its triggering and root > cause are written into below commit log. You can see that accessing to > firmware reserved region caused BIOS to halt system when cpu doing > speculative. > > commit 2aa85f246c181b1fa89f27e8e20c5636426be624 > Author: Steve Wahl <steve.wahl@hpe.com> > Date: Tue Sep 24 16:03:55 2019 -0500 > > x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area > > Our hardware (UV aka Superdome Flex) has address ranges marked > reserved by the BIOS. Access to these ranges is caught as an error, > causing the BIOS to halt the system. Thank you for CC'ing me. Coming into the middle of the conversation, I am trying to understand the implications of what's being discussed here, but not being very successful at it. For the HPE UV hardware, the addresses listed in the reserved range must never be accessed, or (as we discovered) even be reachable by an active page table entry. Any active page table entry covering the region allows the processor to issue speculative accesses to the region, resulting in the BIOS halt mentioned. I'm not sure if the discussion above about adding the region to memblock.memory would result in an active mapping of the region or not. If it did, we'd have problems. Thanks, Steve Wahl, HPE > > > > > map will be properly initialized from the very beginning and we won't > > > need init_unavailable_mem() and alike workarounds and. Obviously, the patch > > > > I remember init_unavailable_mem() is necessary for holes within > > sections, where we actually *don't* have memory, but we still have have > > a valid memmap (full section) that we have to initialize. > > > > See the example from 4b094b7851bf ("mm/page_alloc.c: initialize memmap > > of unavailable memory directly"). Our main memory ends within a section, > > so we have to initialize the remaining parts because the whole section > > will be marked valid/online. > > Yes, memory hole need be handled in init_unavailable_mem(). Since we > have created struct page for them, need initialize them. We can't > discard init_unavailable_mem() for now. > > > > > Any way to improve this handling is appreciated. In that patch I also > > spelled out that we might want to mark such holes via a new page type, > > e.g., PageHole(). Such a page is a memory hole, but has a valid memmap. > > Any content in the memmap (zone/node) should be ignored. > > As I said at above, I am a little conservative to add all those regions of > E820_TYPE_RESERVED into memblock.memory and memblock.reserved, because > most of them are firmware reserved region, they may be not backed by normal > RAM. > > I was thinking to step back to use mm_zero_struct_page() inside > init_unavailable_range() as below. But it doesn't differ much > from __init_single_page(), except of the _refcount and mapcount. > Zeroing struct page equals to putting them into node 0, zero 0. > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 3973b5fdfe3f..4e4b72cf5283 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6901,7 +6901,7 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn) > * (in memblock.reserved but not in memblock.memory) will > * get re-initialized via reserve_bootmem_region() later. > */ > - __init_single_page(pfn_to_page(pfn), pfn, 0, 0); > + mm_zero_struct_page(pfn_to_page(pfn)); > __SetPageReserved(pfn_to_page(pfn)); > pgcnt++; > } > > About adding these unavailable ranges into node/zone, in the old code, > it just happened to add them into expected node/zone. You can see in > early_pfn_in_nid(), if no nid found from memblock, the returned '-1' > will make it true ironically. But that is not saying the bad thing > always got good result. If the last zone of node 0 is DMA32 zone, the > deferred init will skip the only chance to add some of unavailable > rnages into expected node/zone. Means they were not always added into > appropriate node/zone before, the change of iterating memblock.memory in > memmap_init() dones't introduce regression. > > static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node) > { > int nid; > > nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache); > if (nid >= 0 && nid != node) > return false; > return true; > } > > So if no anybody need access them after boot, not adding them into any > node/zone sounds better. Otherwise, better add them in the appropriate > node/zone. > > > > > But it's all quite confusing, especially across architectures and ... > > > > > above is not enough, but it's a small step in this direction. > > > > > > I believe that improving the early memory initialization would make many > > > things simpler and more robust, but that's a different story :) > > > > ... I second that. > > > > -- > > Thanks, > > > > David / dhildenb > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() 2020-05-28 15:15 ` Steve Wahl @ 2020-06-01 11:42 ` Mike Rapoport 2020-06-01 13:31 ` Baoquan He 0 siblings, 1 reply; 19+ messages in thread From: Mike Rapoport @ 2020-06-01 11:42 UTC (permalink / raw) To: Steve Wahl Cc: Baoquan He, David Hildenbrand, mgorman, linux-kernel, linux-mm, akpm, cai, mhocko On Thu, May 28, 2020 at 10:15:10AM -0500, Steve Wahl wrote: > On Thu, May 28, 2020 at 05:07:31PM +0800, Baoquan He wrote: > > On 05/26/20 at 01:49pm, David Hildenbrand wrote: > > > On 26.05.20 13:32, Mike Rapoport wrote: > > > > Hello Baoquan, > > > > > > > > I do not object to your original fix with careful check for pfn validity. > > > > > > > > But I still think that the memory reserved by the firmware is still > > > > memory and it should be added to memblock.memory. This way the memory > > > > > > If it's really memory that could be read/written, I think I agree. > > > > I would say some of them may not be allowed to be read/written, if I > > understand it correctly. I roughly went through the x86 init code, there > > are some places where mem region is marked as E820_TYPE_RESERVED so that > > they are not touched after initialization. E.g: > > > > 1) pfn 0 > > In trim_bios_range(), we set the pfn 0 as E820_TYPE_RESERVED. You can > > see the code comment, this is a BIOS owned area, but not kernel RAM. > > > > 2)GART reserved region > > In early_gart_iommu_check(), GART IOMMU firmware will reserve a region > > in an area, firmware designer won't map system RAM into that area. > > > > And also intel_graphics_stolen(), arch_rmrr_sanity_check(), these > > regions are not system RAM backed area, reading from or writting into > > these area may cause error. > > > > Futhermore, there's a KASLR bug found by HPE, its triggering and root > > cause are written into below commit log. You can see that accessing to > > firmware reserved region caused BIOS to halt system when cpu doing > > speculative. > > > > commit 2aa85f246c181b1fa89f27e8e20c5636426be624 > > Author: Steve Wahl <steve.wahl@hpe.com> > > Date: Tue Sep 24 16:03:55 2019 -0500 > > > > x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area > > > > Our hardware (UV aka Superdome Flex) has address ranges marked > > reserved by the BIOS. Access to these ranges is caught as an error, > > causing the BIOS to halt the system. > > Thank you for CC'ing me. Coming into the middle of the conversation, > I am trying to understand the implications of what's being discussed > here, but not being very successful at it. > > For the HPE UV hardware, the addresses listed in the reserved range > must never be accessed, or (as we discovered) even be reachable by an > active page table entry. Any active page table entry covering the > region allows the processor to issue speculative accesses to the > region, resulting in the BIOS halt mentioned. > > I'm not sure if the discussion above about adding the region to > memblock.memory would result in an active mapping of the region or > not. If it did, we'd have problems. The discussion is whether regions marked as E820_RESERVED should be considered as RAM or not. For the hardware that cannot tolerate acceses to these areas like HPE UV, it should not :) I still think that keeping them only in memblock.reserved creates more problems than it solves, but simply adding E820_RESERVED areas to memblock.memory just won't work. I'll try to come up with something better Really Soon (c). > Thanks, > > Steve Wahl, HPE -- Sincerely yours, Mike. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() 2020-06-01 11:42 ` Mike Rapoport @ 2020-06-01 13:31 ` Baoquan He 0 siblings, 0 replies; 19+ messages in thread From: Baoquan He @ 2020-06-01 13:31 UTC (permalink / raw) To: Mike Rapoport Cc: Steve Wahl, David Hildenbrand, mgorman, linux-kernel, linux-mm, akpm, cai, mhocko On 06/01/20 at 02:42pm, Mike Rapoport wrote: > On Thu, May 28, 2020 at 10:15:10AM -0500, Steve Wahl wrote: > > On Thu, May 28, 2020 at 05:07:31PM +0800, Baoquan He wrote: > > > On 05/26/20 at 01:49pm, David Hildenbrand wrote: > > > > On 26.05.20 13:32, Mike Rapoport wrote: > > > > > Hello Baoquan, > > > > > > > > > > I do not object to your original fix with careful check for pfn validity. > > > > > > > > > > But I still think that the memory reserved by the firmware is still > > > > > memory and it should be added to memblock.memory. This way the memory > > > > > > > > If it's really memory that could be read/written, I think I agree. > > > > > > I would say some of them may not be allowed to be read/written, if I > > > understand it correctly. I roughly went through the x86 init code, there > > > are some places where mem region is marked as E820_TYPE_RESERVED so that > > > they are not touched after initialization. E.g: > > > > > > 1) pfn 0 > > > In trim_bios_range(), we set the pfn 0 as E820_TYPE_RESERVED. You can > > > see the code comment, this is a BIOS owned area, but not kernel RAM. > > > > > > 2)GART reserved region > > > In early_gart_iommu_check(), GART IOMMU firmware will reserve a region > > > in an area, firmware designer won't map system RAM into that area. > > > > > > And also intel_graphics_stolen(), arch_rmrr_sanity_check(), these > > > regions are not system RAM backed area, reading from or writting into > > > these area may cause error. > > > > > > Futhermore, there's a KASLR bug found by HPE, its triggering and root > > > cause are written into below commit log. You can see that accessing to > > > firmware reserved region caused BIOS to halt system when cpu doing > > > speculative. > > > > > > commit 2aa85f246c181b1fa89f27e8e20c5636426be624 > > > Author: Steve Wahl <steve.wahl@hpe.com> > > > Date: Tue Sep 24 16:03:55 2019 -0500 > > > > > > x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area > > > > > > Our hardware (UV aka Superdome Flex) has address ranges marked > > > reserved by the BIOS. Access to these ranges is caught as an error, > > > causing the BIOS to halt the system. > > > > Thank you for CC'ing me. Coming into the middle of the conversation, > > I am trying to understand the implications of what's being discussed > > here, but not being very successful at it. > > > > For the HPE UV hardware, the addresses listed in the reserved range > > must never be accessed, or (as we discovered) even be reachable by an > > active page table entry. Any active page table entry covering the > > region allows the processor to issue speculative accesses to the > > region, resulting in the BIOS halt mentioned. > > > > I'm not sure if the discussion above about adding the region to > > memblock.memory would result in an active mapping of the region or > > not. If it did, we'd have problems. > > The discussion is whether regions marked as E820_RESERVED should be > considered as RAM or not. IMHO, the discussion is about whether firmware reserved regions like E820_TYPE_RESERVED should be added into memory allocators. > > For the hardware that cannot tolerate acceses to these areas like HPE > UV, it should not :) > > I still think that keeping them only in memblock.reserved creates more > problems than it solves, but simply adding E820_RESERVED areas to > memblock.memory just won't work. Currently, we hardly add E820_RESERVED into memblock, including memblock.memmory and memblock.reserved. But, in several places, people add them to memblock.reserved when they have marked them as E820_TYPE_RESERVED. I agree with David that these pages aren't available for any later memory allocator. "the node/zone is just completely irrelevant for these pages". Keeping them out of the memory management system is safer so that nobody won't touch them. I tried to mark them out with a new type PageUnavail as David suggested, am still struggling to think of where I should detect and exclude them. Below is a draft patch about my attempt on marking out unavailable pages. Any comment or suggestions are welcomed. Of course for any better idea with code change, I would like to test and review. Hope we can finally fix this issue after its exposure. From 25ca041258a40bb315f94a23f0465b510b1614a0 Mon Sep 17 00:00:00 2001 From: Baoquan He <bhe@redhat.com> Date: Fri, 29 May 2020 20:23:13 +0800 Subject: [RFC PATCH] mm: Add a new page type to mark unavailable pages Firstly, the last user of early_pfn_valid() is memmap_init_zone(), but that code has been optimized away. Let's remove it. Secondly, add a new page type to mark unavailale pages. Thirdly, add a new early_pfn_valid() so that init_unavailable_range() can initialize those pages in memblock.reserved. Signed-off-by: Baoquan He <bhe@redhat.com> --- arch/x86/include/asm/mmzone_32.h | 2 -- include/linux/mmzone.h | 18 ++++++++++-------- include/linux/page-flags.h | 8 ++++++++ mm/page_alloc.c | 3 ++- 4 files changed, 20 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/mmzone_32.h b/arch/x86/include/asm/mmzone_32.h index 73d8dd14dda2..4da45eff2942 100644 --- a/arch/x86/include/asm/mmzone_32.h +++ b/arch/x86/include/asm/mmzone_32.h @@ -49,8 +49,6 @@ static inline int pfn_valid(int pfn) return 0; } -#define early_pfn_valid(pfn) pfn_valid((pfn)) - #endif /* CONFIG_DISCONTIGMEM */ #endif /* _ASM_X86_MMZONE_32_H */ diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index e8d7d0b0acf4..47d09be73dca 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1315,20 +1315,23 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn) #endif #ifndef CONFIG_HAVE_ARCH_PFN_VALID -static inline int pfn_valid(unsigned long pfn) +static inline int early_pfn_valid(unsigned long pfn) { - struct mem_section *ms; - if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS) return 0; - ms = __nr_to_section(pfn_to_section_nr(pfn)); - if (!valid_section(ms)) + return valid_section(__pfn_to_section(pfn)) +} + +static inline int pfn_valid(unsigned long pfn) +{ + if (!early_pfn_valid(pfn)) return 0; /* * Traditionally early sections always returned pfn_valid() for * the entire section-sized span. */ - return early_section(ms) || pfn_section_valid(ms, pfn); + return (early_section(ms) && PageUnavail(pfn_to_page(pfn))) || + pfn_section_valid(ms, pfn); } #endif @@ -1364,7 +1367,6 @@ static inline unsigned long next_present_section_nr(unsigned long section_nr) #define pfn_to_nid(pfn) (0) #endif -#define early_pfn_valid(pfn) pfn_valid(pfn) void sparse_init(void); #else #define sparse_init() do {} while (0) @@ -1385,7 +1387,7 @@ struct mminit_pfnnid_cache { }; #ifndef early_pfn_valid -#define early_pfn_valid(pfn) (1) +#define early_pfn_valid(pfn) pfn_valid(pfn) #endif void memory_present(int nid, unsigned long start, unsigned long end); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 222f6f7b2bb3..1c011008100c 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -740,6 +740,7 @@ PAGEFLAG_FALSE(DoubleMap) #define PG_kmemcg 0x00000200 #define PG_table 0x00000400 #define PG_guard 0x00000800 +#define PG_unavail 0x00001000 #define PageType(page, flag) \ ((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) @@ -796,6 +797,13 @@ PAGE_TYPE_OPS(Table, table) */ PAGE_TYPE_OPS(Guard, guard) +/* + * Mark pages which are unavailable to memory allocators after boot. + * They includes pages reserved by firmware pre-boot or during boot, + * holes which share the same setcion with usable RAM. + */ +PAGE_TYPE_OPS(Unavail, unavail) + extern bool is_free_buddy_page(struct page *page); __PAGEFLAG(Isolated, isolated, PF_ANY); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 940cdce96864..e9ebef6ffbec 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6972,7 +6972,8 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn) * (in memblock.reserved but not in memblock.memory) will * get re-initialized via reserve_bootmem_region() later. */ - __init_single_page(pfn_to_page(pfn), pfn, 0, 0); + mm_zero_struct_page(pfn_to_page(pfn)); + __SetPageUnavail(pfn_to_page(pfn)); __SetPageReserved(pfn_to_page(pfn)); pgcnt++; } -- 2.17.2 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() 2020-05-21 1:44 [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() Baoquan He 2020-05-21 9:26 ` Mike Rapoport @ 2020-05-21 9:36 ` Mel Gorman 2020-05-21 15:41 ` Baoquan He 1 sibling, 1 reply; 19+ messages in thread From: Mel Gorman @ 2020-05-21 9:36 UTC (permalink / raw) To: Baoquan He; +Cc: linux-kernel, linux-mm, akpm, cai, mhocko, rppt On Thu, May 21, 2020 at 09:44:07AM +0800, Baoquan He wrote: > After investigation, it turns out that this is introduced by commit of > linux-next: commit f6edbdb71877 ("mm: memmap_init: iterate over memblock > regions rather that check each PFN"). > > After investigation, it turns out that this is introduced by commit of > linux-next, the patch subject is: > "mm: memmap_init: iterate over memblock regions rather that check each PFN". > Some repetition here. I assume it's because the commit ID is not stable because it's in linux-next. > Qian added debugging code. The debugging log shows that the fault page is > 0x2a800000. From the system e820 map which is pasted at bottom, the page > is in e820 reserved range: > BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved > And it's in section [0x28000000, 0x2fffffff]. In that secion, there are > several usable ranges and some e820 reserved ranges. > > For this kind of e820 reserved range, it won't be added to memblock allocator. > However, init_unavailable_mem() will initialize to add them into node 0, > zone 0. Why is it appropriate for init_unavailable_mem to add a e820 reserved range to the zone at all? The bug being triggered indicates there is a mismatch between the zone of a struct page and the PFN passed in. > Before that commit, later, memmap_init() will add e820 reserved > ranges into the zone where they are contained, because it can pass > the checking of early_pfn_valid() and early_pfn_in_nid(). In this case, > the e820 reserved range where fault page 0x2a800000 is located is added > into DMA32 zone. After that commit, the e820 reserved rgions are kept > in node 0, zone 0, since we iterate over memblock regions to iniatialize > in memmap_init() instead, their node and zone won't be changed. > This implies that we have struct pages that should never be used (e820 reserved) but exist somehow in a zone range but with broken linkages to their node and zone. > Reported-by: Qian Cai <cai@lca.pw> > Signed-off-by: Baoquan He <bhe@redhat.com> > --- > mm/compaction.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/mm/compaction.c b/mm/compaction.c > index 67fd317f78db..9ce4cff4d407 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -1418,7 +1418,9 @@ fast_isolate_freepages(struct compact_control *cc) > cc->free_pfn = highest; > } else { > if (cc->direct_compaction && pfn_valid(min_pfn)) { > - page = pfn_to_page(min_pfn); > + page = pageblock_pfn_to_page(min_pfn, > + pageblock_end_pfn(min_pfn), > + cc->zone); > cc->free_pfn = min_pfn; > } > } Why is the correct fix not to avoid creating struct pages for e820 ranges and make sure that struct pages that are reachable have proper node and zone linkages? -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() 2020-05-21 9:36 ` Mel Gorman @ 2020-05-21 15:41 ` Baoquan He 0 siblings, 0 replies; 19+ messages in thread From: Baoquan He @ 2020-05-21 15:41 UTC (permalink / raw) To: Mel Gorman; +Cc: linux-kernel, linux-mm, akpm, cai, mhocko, rppt On 05/21/20 at 10:36am, Mel Gorman wrote: > On Thu, May 21, 2020 at 09:44:07AM +0800, Baoquan He wrote: > > After investigation, it turns out that this is introduced by commit of > > linux-next: commit f6edbdb71877 ("mm: memmap_init: iterate over memblock > > regions rather that check each PFN"). > > > > After investigation, it turns out that this is introduced by commit of > > linux-next, the patch subject is: > > "mm: memmap_init: iterate over memblock regions rather that check each PFN". > > > > Some repetition here. I assume it's because the commit ID is not stable > because it's in linux-next. Yes, I plan to remove the temporary commit id of linux-next, but forgot cleaning up it. > > > Qian added debugging code. The debugging log shows that the fault page is > > 0x2a800000. From the system e820 map which is pasted at bottom, the page > > is in e820 reserved range: > > BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved > > And it's in section [0x28000000, 0x2fffffff]. In that secion, there are > > several usable ranges and some e820 reserved ranges. > > > > For this kind of e820 reserved range, it won't be added to memblock allocator. > > However, init_unavailable_mem() will initialize to add them into node 0, > > zone 0. > > Why is it appropriate for init_unavailable_mem to add a e820 reserved > range to the zone at all? The bug being triggered indicates there is a > mismatch between the zone of a struct page and the PFN passed in. Read the patch log again and reviewing comments, realize I could make the log misleading with inaccurate explanation. The root cause is not about e820 reserved only, it's about any unavailable range inside boot mem section. The unavailable ranges include firmware reserved ranges, and holes. Before Mike's below patchset, we call init_unavailable_mem() to initialize the unavailable ranges into node 0, zone 0, and mark pages inside the unvailable ranges as Reserved. Later in memmap_init_zone(), we will add the unvaiable ranges into zone which start and end cover them. Because the early_pfn_valid() and early_pfn_in_nid() incorrectly return true for these unavailable pages. This looks unreasonable, right? But it doesn't cause problem, because they are not added into buddy and is Reserved. After Mike's patchset applied, one of them is patch 15/21, it will iterate over memblock regions. These unavailable ranges are not added into memblock allocator, so they won't be added into zone which start and end cover them, but kept in node 0, zone 0. Then this issue is triggered by VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)). Note that init_unavailable_mem(), memmap_init(), they are all in generic MM code, not in a certain ARCH only. So I think the fix in this patch is needed, no matter whether we will fix the issue that unavailable rangs have struct page, and are added into node 0, zone 0. [PATCH 00/21] mm: rework free_area_init*() funcitons http://lkml.kernel.org/r/20200412194859.12663-1-rppt@kernel.org [PATCH 15/21] mm: memmap_init: iterate over memblock regions rather that check each PFN > > > Before that commit, later, memmap_init() will add e820 reserved > > ranges into the zone where they are contained, because it can pass > > the checking of early_pfn_valid() and early_pfn_in_nid(). In this case, > > the e820 reserved range where fault page 0x2a800000 is located is added > > into DMA32 zone. After that commit, the e820 reserved rgions are kept > > in node 0, zone 0, since we iterate over memblock regions to iniatialize > > in memmap_init() instead, their node and zone won't be changed. > > > > This implies that we have struct pages that should never be used (e820 > reserved) but exist somehow in a zone range but with broken linkages to > their node and zone. Yes, in one boot memory section, if it includes usable RAM, and unavailable ranges, like firmware reserved range, holes, these unavailable ranges will be added into node 0, zone 0, and have struct page, they are never used by system. > > > Reported-by: Qian Cai <cai@lca.pw> > > Signed-off-by: Baoquan He <bhe@redhat.com> > > --- > > mm/compaction.c | 4 +++- > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > diff --git a/mm/compaction.c b/mm/compaction.c > > index 67fd317f78db..9ce4cff4d407 100644 > > --- a/mm/compaction.c > > +++ b/mm/compaction.c > > @@ -1418,7 +1418,9 @@ fast_isolate_freepages(struct compact_control *cc) > > cc->free_pfn = highest; > > } else { > > if (cc->direct_compaction && pfn_valid(min_pfn)) { > > - page = pfn_to_page(min_pfn); > > + page = pageblock_pfn_to_page(min_pfn, > > + pageblock_end_pfn(min_pfn), > > + cc->zone); > > cc->free_pfn = min_pfn; > > } > > } > > Why is the correct fix not to avoid creating struct pages for e820 > ranges and make sure that struct pages that are reachable have proper > node and zone linkages? Seems we can't avoid to create struct page for them, surely as I explained at above, they are not only e820 reserved ranges, there are also holes and other firmware reserved ranges. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages()^[ @ 2020-05-28 8:59 Baoquan He 2020-05-28 9:08 ` Baoquan He 0 siblings, 1 reply; 19+ messages in thread From: Baoquan He @ 2020-05-28 8:59 UTC (permalink / raw) To: David Hildenbrand, rppt; +Cc: mgorman, linux-kernel, linux-mm akpm@linux-foundation.org, cai@lca.pw, mhocko@kernel.org, steve.wahl@hpe.com, Bcc: bhe@redhat.com Subject: Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() Reply-To: In-Reply-To: <01beec81-565f-d335-5eff-22693fc09c0e@redhat.com> On 05/26/20 at 01:49pm, David Hildenbrand wrote: > On 26.05.20 13:32, Mike Rapoport wrote: > > Hello Baoquan, > > > > On Tue, May 26, 2020 at 04:45:43PM +0800, Baoquan He wrote: > >> On 05/22/20 at 05:20pm, Mike Rapoport wrote: > >>> Hello Baoquan, > >>> > >>> On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote: > >>>> On 05/22/20 at 03:01pm, Baoquan He wrote: > >>>>> > >>>>> So let's add these unavailable ranges into memblock and reserve them > >>>>> in init_unavailable_range() instead. With this change, they will be added > >>>>> into appropriate node and zone in memmap_init(), and initialized in > >>>>> reserve_bootmem_region() just like any other memblock reserved regions. > >>>> > >>>> Seems this is not right. They can't get nid in init_unavailable_range(). > >>>> Adding e820 ranges may let them get nid. But the hole range won't be > >>>> added to memblock, and still has the issue. > >>>> > >>>> Nack this one for now, still considering. > >>> > >>> Why won't we add the e820 reserved ranges to memblock.memory during > >>> early boot as I suggested? > >>> > >>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c > >>> index c5399e80c59c..b0940c618ed9 100644 > >>> --- a/arch/x86/kernel/e820.c > >>> +++ b/arch/x86/kernel/e820.c > >>> @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void) > >>> if (end != (resource_size_t)end) > >>> continue; > >>> > >>> - if (entry->type == E820_TYPE_SOFT_RESERVED) > >>> + if (entry->type == E820_TYPE_SOFT_RESERVED || > >>> + entry->type == E820_TYPE_RESERVED) { > >>> + memblock_add(entry->addr, entry->size); > >>> memblock_reserve(entry->addr, entry->size); > >>> + } > >>> > >>> if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) > >>> continue; > >>> > >>> The setting of node later in numa_init() will assign the proper node > >>> for these regions as it does for the usable memory. > >> > >> Yes, if it's only related to e820 reserved region, this truly works. > >> > >> However, it also has ACPI table regions. That's why I changed to call > >> the problematic area as firmware reserved ranges later. > >> > >> Bisides, you can see below line, there's another reserved region which only > >> occupies one page in one memory seciton. If adding to memblock.memory, we also > >> will build struct mem_section and the relevant struct pages for the whole > >> section. And then the holes around that page will be added and initialized in > >> init_unavailable_mem(). numa_init() will assign proper node for memblock.memory > >> and memblock.reserved, but won't assign proper node for the holes. > >> > >> ~~~ > >> [ 0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved > >> ~~~ > >> > >> So I still think we should not add firmware reserved range into > >> memblock for fixing this issue. > >> > >> And, the fix in the original patch seems necessary. You can see in > >> compaction code, the migration source is chosen from LRU pages or > >> movable pages, the migration target has to be got from Buddy. However, > >> only the min_pfn in fast_isolate_freepages(), it's calculated by > >> distance between cc->free_pfn - cc->migrate_pfn, we can't guarantee it's > >> safe, then use it as the target to handle. > > > > I do not object to your original fix with careful check for pfn validity. > > > > But I still think that the memory reserved by the firmware is still > > memory and it should be added to memblock.memory. This way the memory > > If it's really memory that could be read/written, I think I agree. I would say some of them may not be allowed to be read/written, if I understand it correctly. I roughly went through the x86 init code, there are some places where mem region is marked as E820_TYPE_RESERVED so that they are not touched after initialization. E.g: 1) pfn 0 In trim_bios_range(), we set the pfn 0 as E820_TYPE_RESERVED. You can see the code comment, this is a BIOS owned area, but not kernel RAM. 2)GART reserved region In early_gart_iommu_check(), GART IOMMU firmware will reserve a region in an area, firmware designer won't map system RAM into that area. And also intel_graphics_stolen(), arch_rmrr_sanity_check(), these regions are not system RAM backed area, reading from or writting into these area may cause error. Futhermore, there's a KASLR bug found by HPE, its triggering and root cause are written into below commit log. You can see that accessing to firmware reserved region caused BIOS to halt system when cpu doing speculative. commit 2aa85f246c181b1fa89f27e8e20c5636426be624 Author: Steve Wahl <steve.wahl@hpe.com> Date: Tue Sep 24 16:03:55 2019 -0500 x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area Our hardware (UV aka Superdome Flex) has address ranges marked reserved by the BIOS. Access to these ranges is caught as an error, causing the BIOS to halt the system. > > > map will be properly initialized from the very beginning and we won't > > need init_unavailable_mem() and alike workarounds and. Obviously, the patch > > I remember init_unavailable_mem() is necessary for holes within > sections, where we actually *don't* have memory, but we still have have > a valid memmap (full section) that we have to initialize. > > See the example from 4b094b7851bf ("mm/page_alloc.c: initialize memmap > of unavailable memory directly"). Our main memory ends within a section, > so we have to initialize the remaining parts because the whole section > will be marked valid/online. Yes, memory hole need be handled in init_unavailable_mem(). Since we have created struct page for them, need initialize them. We can't discard init_unavailable_mem() for now. > > Any way to improve this handling is appreciated. In that patch I also > spelled out that we might want to mark such holes via a new page type, > e.g., PageHole(). Such a page is a memory hole, but has a valid memmap. > Any content in the memmap (zone/node) should be ignored. As I said at above, I am a little conservative to add all those regions of E820_TYPE_RESERVED into memblock.memory and memblock.reserved, because most of them are firmware reserved region, they may be not backed by normal RAM. I was thinking to step back to use mm_zero_struct_page() inside init_unavailable_range() as below. But it doesn't differ much from __init_single_page(), except of the _refcount and mapcount. Zeroing struct page equals to putting them into node 0, zero 0. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3973b5fdfe3f..4e4b72cf5283 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6901,7 +6901,7 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn) * (in memblock.reserved but not in memblock.memory) will * get re-initialized via reserve_bootmem_region() later. */ - __init_single_page(pfn_to_page(pfn), pfn, 0, 0); + mm_zero_struct_page(pfn_to_page(pfn)); __SetPageReserved(pfn_to_page(pfn)); pgcnt++; } About adding these unavailable ranges into node/zone, in the old code, it just happened to add them into expected node/zone. You can see in early_pfn_in_nid(), if no nid found from memblock, the returned '-1' will make it true ironically. But that is not saying the bad thing always got good result. If the last zone of node 0 is DMA32 zone, the deferred init will skip the only chance to add some of unavailable rnages into expected node/zone. Means they were not always added into appropriate node/zone before, the change of iterating memblock.memory in memmap_init() dones't introduce regression. static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node) { int nid; nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache); if (nid >= 0 && nid != node) return false; return true; } So if no anybody need access them after boot, not adding them into any node/zone sounds better. Otherwise, better add them in the appropriate node/zone. > > But it's all quite confusing, especially across architectures and ... > > > above is not enough, but it's a small step in this direction. > > > > I believe that improving the early memory initialization would make many > > things simpler and more robust, but that's a different story :) > > ... I second that. > > -- > Thanks, > > David / dhildenb ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages()^[ 2020-05-28 8:59 [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages()^[ Baoquan He @ 2020-05-28 9:08 ` Baoquan He 0 siblings, 0 replies; 19+ messages in thread From: Baoquan He @ 2020-05-28 9:08 UTC (permalink / raw) To: David Hildenbrand, rppt; +Cc: mgorman, linux-kernel, linux-mm On 05/28/20 at 04:59pm, Baoquan He wrote: > akpm@linux-foundation.org, cai@lca.pw, mhocko@kernel.org, steve.wahl@hpe.com, > Bcc: bhe@redhat.com > Subject: Re: [PATCH] mm/compaction: Fix the incorrect hole in > fast_isolate_freepages() > Reply-To: > In-Reply-To: <01beec81-565f-d335-5eff-22693fc09c0e@redhat.com> Sorry, mail client mess up the mail header, have resent one. Please ignore this one. > > On 05/26/20 at 01:49pm, David Hildenbrand wrote: > > On 26.05.20 13:32, Mike Rapoport wrote: > > > Hello Baoquan, > > > > > > On Tue, May 26, 2020 at 04:45:43PM +0800, Baoquan He wrote: > > >> On 05/22/20 at 05:20pm, Mike Rapoport wrote: > > >>> Hello Baoquan, > > >>> > > >>> On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote: > > >>>> On 05/22/20 at 03:01pm, Baoquan He wrote: > > >>>>> > > >>>>> So let's add these unavailable ranges into memblock and reserve them > > >>>>> in init_unavailable_range() instead. With this change, they will be added > > >>>>> into appropriate node and zone in memmap_init(), and initialized in > > >>>>> reserve_bootmem_region() just like any other memblock reserved regions. > > >>>> > > >>>> Seems this is not right. They can't get nid in init_unavailable_range(). > > >>>> Adding e820 ranges may let them get nid. But the hole range won't be > > >>>> added to memblock, and still has the issue. > > >>>> > > >>>> Nack this one for now, still considering. > > >>> > > >>> Why won't we add the e820 reserved ranges to memblock.memory during > > >>> early boot as I suggested? > > >>> > > >>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c > > >>> index c5399e80c59c..b0940c618ed9 100644 > > >>> --- a/arch/x86/kernel/e820.c > > >>> +++ b/arch/x86/kernel/e820.c > > >>> @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void) > > >>> if (end != (resource_size_t)end) > > >>> continue; > > >>> > > >>> - if (entry->type == E820_TYPE_SOFT_RESERVED) > > >>> + if (entry->type == E820_TYPE_SOFT_RESERVED || > > >>> + entry->type == E820_TYPE_RESERVED) { > > >>> + memblock_add(entry->addr, entry->size); > > >>> memblock_reserve(entry->addr, entry->size); > > >>> + } > > >>> > > >>> if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN) > > >>> continue; > > >>> > > >>> The setting of node later in numa_init() will assign the proper node > > >>> for these regions as it does for the usable memory. > > >> > > >> Yes, if it's only related to e820 reserved region, this truly works. > > >> > > >> However, it also has ACPI table regions. That's why I changed to call > > >> the problematic area as firmware reserved ranges later. > > >> > > >> Bisides, you can see below line, there's another reserved region which only > > >> occupies one page in one memory seciton. If adding to memblock.memory, we also > > >> will build struct mem_section and the relevant struct pages for the whole > > >> section. And then the holes around that page will be added and initialized in > > >> init_unavailable_mem(). numa_init() will assign proper node for memblock.memory > > >> and memblock.reserved, but won't assign proper node for the holes. > > >> > > >> ~~~ > > >> [ 0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved > > >> ~~~ > > >> > > >> So I still think we should not add firmware reserved range into > > >> memblock for fixing this issue. > > >> > > >> And, the fix in the original patch seems necessary. You can see in > > >> compaction code, the migration source is chosen from LRU pages or > > >> movable pages, the migration target has to be got from Buddy. However, > > >> only the min_pfn in fast_isolate_freepages(), it's calculated by > > >> distance between cc->free_pfn - cc->migrate_pfn, we can't guarantee it's > > >> safe, then use it as the target to handle. > > > > > > I do not object to your original fix with careful check for pfn validity. > > > > > > But I still think that the memory reserved by the firmware is still > > > memory and it should be added to memblock.memory. This way the memory > > > > If it's really memory that could be read/written, I think I agree. > > I would say some of them may not be allowed to be read/written, if I > understand it correctly. I roughly went through the x86 init code, there > are some places where mem region is marked as E820_TYPE_RESERVED so that > they are not touched after initialization. E.g: > > 1) pfn 0 > In trim_bios_range(), we set the pfn 0 as E820_TYPE_RESERVED. You can > see the code comment, this is a BIOS owned area, but not kernel RAM. > > 2)GART reserved region > In early_gart_iommu_check(), GART IOMMU firmware will reserve a region > in an area, firmware designer won't map system RAM into that area. > > And also intel_graphics_stolen(), arch_rmrr_sanity_check(), these > regions are not system RAM backed area, reading from or writting into > these area may cause error. > > Futhermore, there's a KASLR bug found by HPE, its triggering and root > cause are written into below commit log. You can see that accessing to > firmware reserved region caused BIOS to halt system when cpu doing > speculative. > > commit 2aa85f246c181b1fa89f27e8e20c5636426be624 > Author: Steve Wahl <steve.wahl@hpe.com> > Date: Tue Sep 24 16:03:55 2019 -0500 > > x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area > > Our hardware (UV aka Superdome Flex) has address ranges marked > reserved by the BIOS. Access to these ranges is caught as an error, > causing the BIOS to halt the system. > > > > > > map will be properly initialized from the very beginning and we won't > > > need init_unavailable_mem() and alike workarounds and. Obviously, the patch > > > > I remember init_unavailable_mem() is necessary for holes within > > sections, where we actually *don't* have memory, but we still have have > > a valid memmap (full section) that we have to initialize. > > > > See the example from 4b094b7851bf ("mm/page_alloc.c: initialize memmap > > of unavailable memory directly"). Our main memory ends within a section, > > so we have to initialize the remaining parts because the whole section > > will be marked valid/online. > > Yes, memory hole need be handled in init_unavailable_mem(). Since we > have created struct page for them, need initialize them. We can't > discard init_unavailable_mem() for now. > > > > > Any way to improve this handling is appreciated. In that patch I also > > spelled out that we might want to mark such holes via a new page type, > > e.g., PageHole(). Such a page is a memory hole, but has a valid memmap. > > Any content in the memmap (zone/node) should be ignored. > > As I said at above, I am a little conservative to add all those regions of > E820_TYPE_RESERVED into memblock.memory and memblock.reserved, because > most of them are firmware reserved region, they may be not backed by normal > RAM. > > I was thinking to step back to use mm_zero_struct_page() inside > init_unavailable_range() as below. But it doesn't differ much > from __init_single_page(), except of the _refcount and mapcount. > Zeroing struct page equals to putting them into node 0, zero 0. > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 3973b5fdfe3f..4e4b72cf5283 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6901,7 +6901,7 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn) > * (in memblock.reserved but not in memblock.memory) will > * get re-initialized via reserve_bootmem_region() later. > */ > - __init_single_page(pfn_to_page(pfn), pfn, 0, 0); > + mm_zero_struct_page(pfn_to_page(pfn)); > __SetPageReserved(pfn_to_page(pfn)); > pgcnt++; > } > > About adding these unavailable ranges into node/zone, in the old code, > it just happened to add them into expected node/zone. You can see in > early_pfn_in_nid(), if no nid found from memblock, the returned '-1' > will make it true ironically. But that is not saying the bad thing > always got good result. If the last zone of node 0 is DMA32 zone, the > deferred init will skip the only chance to add some of unavailable > rnages into expected node/zone. Means they were not always added into > appropriate node/zone before, the change of iterating memblock.memory in > memmap_init() dones't introduce regression. > > static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node) > { > int nid; > > nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache); > if (nid >= 0 && nid != node) > return false; > return true; > } > > So if no anybody need access them after boot, not adding them into any > node/zone sounds better. Otherwise, better add them in the appropriate > node/zone. > > > > > But it's all quite confusing, especially across architectures and ... > > > > > above is not enough, but it's a small step in this direction. > > > > > > I believe that improving the early memory initialization would make many > > > things simpler and more robust, but that's a different story :) > > > > ... I second that. > > > > -- > > Thanks, > > > > David / dhildenb > > ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2020-06-01 13:32 UTC | newest] Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2020-05-21 1:44 [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() Baoquan He 2020-05-21 9:26 ` Mike Rapoport 2020-05-21 15:52 ` Baoquan He [not found] ` <20200521171836.GU1059226@linux.ibm.com> 2020-05-22 7:01 ` Baoquan He 2020-05-22 7:25 ` Baoquan He 2020-05-22 14:20 ` Mike Rapoport 2020-05-26 8:45 ` Baoquan He 2020-05-26 8:55 ` David Hildenbrand 2020-05-26 11:32 ` Mike Rapoport 2020-05-26 11:49 ` David Hildenbrand 2020-05-28 9:07 ` Baoquan He 2020-05-28 9:08 ` David Hildenbrand 2020-05-28 15:15 ` Steve Wahl 2020-06-01 11:42 ` Mike Rapoport 2020-06-01 13:31 ` Baoquan He 2020-05-21 9:36 ` Mel Gorman 2020-05-21 15:41 ` Baoquan He 2020-05-28 8:59 [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages()^[ Baoquan He 2020-05-28 9:08 ` Baoquan He
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).