linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages()
@ 2020-05-21  1:44 Baoquan He
  2020-05-21  9:26 ` Mike Rapoport
  2020-05-21  9:36 ` Mel Gorman
  0 siblings, 2 replies; 21+ messages in thread
From: Baoquan He @ 2020-05-21  1:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, akpm, cai, mgorman, mhocko, rppt, bhe

Qian reported that a crash happened in compaction.
http://lkml.kernel.org/r/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw

LTP: starting swapping01 (swapping01 -i 5)
page:ffffea0000aa0000 refcount:1 mapcount:0 mapping:000000002243743b index:0x0
flags: 0x1fffe000001000(reserved)
raw: 001fffe000001000 ffffea0000aa0008 ffffea0000aa0008 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn))
page_owner info is not present (never set?)
------------[ cut here ]------------
kernel BUG at mm/page_alloc.c:533!
...
CPU: 17 PID: 218 Comm: kcompactd0 Not tainted 5.7.0-rc2-next-20200423+ #7
...
RIP: 0010:set_pfnblock_flags_mask+0x150/0x210
...
RSP: 0018:ffffc900042ff858 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffffffff9a002382
RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff8884535b8e6c
RBP: ffffc900042ff8b8 R08: ffffed108a6b8459 R09: ffffed108a6b8459
R10: ffff8884535c22c7 R11: ffffed108a6b8458 R12: 000000000002a800
R13: ffffea0000aa0000 R14: ffff88847fff3000 R15: ffff88847fff3040
FS:  0000000000000000(0000) GS:ffff888453580000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd1eb4a1000 CR3: 000000083154c000 CR4: 00000000003406e0
Call Trace:
 isolate_freepages+0xb20/0x1140
 ? isolate_freepages_block+0x730/0x730
 ? mark_held_locks+0x34/0xb0
 ? free_unref_page+0x7d/0x90
 ? free_unref_page+0x7d/0x90
 ? check_flags.part.28+0x86/0x220
 compaction_alloc+0xdd/0x100
 migrate_pages+0x304/0x17e0
 ? __ClearPageMovable+0x100/0x100
 ? isolate_freepages+0x1140/0x1140
 compact_zone+0x1249/0x1e90
 ? compaction_suitable+0x260/0x260
 kcompactd_do_work+0x231/0x650
 ? sysfs_compact_node+0x80/0x80
 ? finish_wait+0xe6/0x110
 kcompactd+0x162/0x490
 ? kcompactd_do_work+0x650/0x650
 ? finish_wait+0x110/0x110
 ? __kasan_check_read+0x11/0x20
 ? __kthread_parkme+0xd4/0xf0
 ? kcompactd_do_work+0x650/0x650
 kthread+0x1f7/0x220
 ? kthread_create_worker_on_cpu+0xc0/0xc0
 ret_from_fork+0x27/0x50

After investigation, it turns out that this is introduced by commit of
linux-next: commit f6edbdb71877 ("mm: memmap_init: iterate over memblock
regions rather that check each PFN").

After investigation, it turns out that this is introduced by commit of
linux-next, the patch subject is:
  "mm: memmap_init: iterate over memblock regions rather that check each PFN".

Qian added debugging code. The debugging log shows that the fault page is
0x2a800000. From the system e820 map which is pasted at bottom, the page
is in e820 reserved range:
	BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved
And it's in section [0x28000000, 0x2fffffff]. In that secion, there are
several usable ranges and some e820 reserved ranges.

For this kind of e820 reserved range, it won't be added to memblock allocator.
However, init_unavailable_mem() will initialize to add them into node 0,
zone 0. Before that commit, later, memmap_init() will add e820 reserved
ranges into the zone where they are contained, because it can pass
the checking of early_pfn_valid() and early_pfn_in_nid(). In this case,
the e820 reserved range where fault page 0x2a800000 is located is added
into DMA32 zone. After that commit, the e820 reserved rgions are kept
in node 0, zone 0, since we iterate over memblock regions to iniatialize
in memmap_init() instead, their node and zone won't be changed.

Now, fast_isolate_freepages() will use min mark directly as the migration
target if no page is found from buddy. However, the min mark is not checked
carefully, it could be in e820 reserved range, and trigger the
VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn)) when try to mark it
as skip.

Here, let's call pageblock_pfn_to_page() to get page of min_pfn, since it
will do careful checks and return NULL if the pfn is not qualified.

[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000008bfff] usable
[    0.000000] BIOS-e820: [mem 0x000000000008c000-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000028328fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000028329000-0x0000000028568fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000028569000-0x0000000028d85fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000028d86000-0x0000000028ee5fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000028ee6000-0x0000000029a04fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029a05000-0x0000000029a08fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000029a09000-0x0000000029ee4fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029ee5000-0x0000000029ef2fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000029ef3000-0x0000000029f22fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029f23000-0x0000000029f23fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000029f24000-0x0000000029f24fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029f25000-0x0000000029f28fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000029f29000-0x0000000029f29fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000029f2a000-0x0000000029f2bfff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000029f2c000-0x0000000029f2cfff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000029f2d000-0x0000000029f2ffff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000029f30000-0x0000000029ffdfff] usable
[    0.000000] BIOS-e820: [mem 0x0000000029ffe000-0x000000002a80afff] reserved
[    0.000000] BIOS-e820: [mem 0x000000002a80b000-0x000000002a80efff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000002a80f000-0x000000002a81ffff] reserved
[    0.000000] BIOS-e820: [mem 0x000000002a820000-0x000000002a822fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x000000002a823000-0x0000000033a22fff] usable
[    0.000000] BIOS-e820: [mem 0x0000000033a23000-0x0000000033a32fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000033a33000-0x0000000033a42fff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x0000000033a43000-0x0000000033a52fff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000033a53000-0x0000000077ffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000078000000-0x000000008fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000087effffff] usable
[    0.000000] BIOS-e820: [mem 0x000000087f000000-0x000000087fffffff] reserved

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Baoquan He <bhe@redhat.com>
---
 mm/compaction.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 67fd317f78db..9ce4cff4d407 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1418,7 +1418,9 @@ fast_isolate_freepages(struct compact_control *cc)
 				cc->free_pfn = highest;
 			} else {
 				if (cc->direct_compaction && pfn_valid(min_pfn)) {
-					page = pfn_to_page(min_pfn);
+					page = pageblock_pfn_to_page(min_pfn,
+						pageblock_end_pfn(min_pfn),
+						cc->zone);
 					cc->free_pfn = min_pfn;
 				}
 			}
-- 
2.17.2


^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() ^[
@ 2020-05-28  8:59 Baoquan He
  2020-05-28  9:08 ` Baoquan He
  0 siblings, 1 reply; 21+ messages in thread
From: Baoquan He @ 2020-05-28  8:59 UTC (permalink / raw)
  To: David Hildenbrand, rppt; +Cc: mgorman, linux-kernel, linux-mm

akpm@linux-foundation.org, cai@lca.pw, mhocko@kernel.org, steve.wahl@hpe.com, 
Bcc: bhe@redhat.com
Subject: Re: [PATCH] mm/compaction: Fix the incorrect hole in
 fast_isolate_freepages()
Reply-To: 
In-Reply-To: <01beec81-565f-d335-5eff-22693fc09c0e@redhat.com>

On 05/26/20 at 01:49pm, David Hildenbrand wrote:
> On 26.05.20 13:32, Mike Rapoport wrote:
> > Hello Baoquan,
> > 
> > On Tue, May 26, 2020 at 04:45:43PM +0800, Baoquan He wrote:
> >> On 05/22/20 at 05:20pm, Mike Rapoport wrote:
> >>> Hello Baoquan,
> >>>
> >>> On Fri, May 22, 2020 at 03:25:24PM +0800, Baoquan He wrote:
> >>>> On 05/22/20 at 03:01pm, Baoquan He wrote:
> >>>>>
> >>>>> So let's add these unavailable ranges into memblock and reserve them
> >>>>> in init_unavailable_range() instead. With this change, they will be added
> >>>>> into appropriate node and zone in memmap_init(), and initialized in
> >>>>> reserve_bootmem_region() just like any other memblock reserved regions.
> >>>>
> >>>> Seems this is not right. They can't get nid in init_unavailable_range().
> >>>> Adding e820 ranges may let them get nid. But the hole range won't be
> >>>> added to memblock, and still has the issue.
> >>>>
> >>>> Nack this one for now, still considering.
> >>>
> >>> Why won't we add  the e820 reserved ranges to memblock.memory during
> >>> early boot as I suggested?
> >>>
> >>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> >>> index c5399e80c59c..b0940c618ed9 100644
> >>> --- a/arch/x86/kernel/e820.c
> >>> +++ b/arch/x86/kernel/e820.c
> >>> @@ -1301,8 +1301,11 @@ void __init e820__memblock_setup(void)
> >>>  		if (end != (resource_size_t)end)
> >>>  			continue;
> >>>  
> >>> -		if (entry->type == E820_TYPE_SOFT_RESERVED)
> >>> +		if (entry->type == E820_TYPE_SOFT_RESERVED ||
> >>> +		    entry->type == E820_TYPE_RESERVED) {
> >>> +			memblock_add(entry->addr, entry->size);
> >>>  			memblock_reserve(entry->addr, entry->size);
> >>> +		}
> >>>  
> >>>  		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> >>>  			continue;
> >>>
> >>> The setting of node later  in numa_init() will assign the proper node
> >>> for these regions as it does for the usable memory.
> >>
> >> Yes, if it's only related to e820 reserved region, this truly works.
> >>
> >> However, it also has ACPI table regions. That's why I changed to call
> >> the problematic area as firmware reserved ranges later.
> >>
> >> Bisides, you can see below line, there's another reserved region which only
> >> occupies one page in one memory seciton. If adding to memblock.memory, we also
> >> will build struct mem_section and the relevant struct pages for the whole
> >> section. And then the holes around that page will be added and initialized in
> >> init_unavailable_mem(). numa_init() will assign proper node for memblock.memory
> >> and memblock.reserved, but won't assign proper node for the holes.
> >>
> >> ~~~
> >> [    0.000000] BIOS-e820: [mem 0x00000000fed80000-0x00000000fed80fff] reserved
> >> ~~~
> >>
> >> So I still think we should not add firmware reserved range into
> >> memblock for fixing this issue.
> >>
> >> And, the fix in the original patch seems necessary. You can see in
> >> compaction code, the migration source is chosen from LRU pages or
> >> movable pages, the migration target has to be got from Buddy. However,
> >> only the min_pfn in fast_isolate_freepages(), it's calculated by
> >> distance between cc->free_pfn - cc->migrate_pfn, we can't guarantee it's
> >> safe, then use it as the target to handle.
> > 
> > I do not object to your original fix with careful check for pfn validity.
> > 
> > But I still think that the memory reserved by the firmware is still
> > memory and it should be added to memblock.memory. This way the memory
> 
> If it's really memory that could be read/written, I think I agree.

I would say some of them may not be allowed to be read/written, if I
understand it correctly. I roughly went through the x86 init code, there
are some places where mem region is marked as E820_TYPE_RESERVED so that
they are not touched after initialization. E.g:

1) pfn 0
In trim_bios_range(), we set the pfn 0 as E820_TYPE_RESERVED. You can
see the code comment, this is a BIOS owned area, but not kernel RAM.

2)GART reserved region
In early_gart_iommu_check(), GART IOMMU firmware will reserve a region
in an area, firmware designer won't map system RAM into that area.

And also intel_graphics_stolen(), arch_rmrr_sanity_check(), these
regions are not system RAM backed area, reading from or writting into
these area may cause error. 

Futhermore, there's a KASLR bug found by HPE, its triggering and root
cause are written into below commit log. You can see that accessing to
firmware reserved region caused BIOS to halt system when cpu doing
speculative.

commit 2aa85f246c181b1fa89f27e8e20c5636426be624
Author: Steve Wahl <steve.wahl@hpe.com>
Date:   Tue Sep 24 16:03:55 2019 -0500

    x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area
    
    Our hardware (UV aka Superdome Flex) has address ranges marked
    reserved by the BIOS. Access to these ranges is caught as an error,
    causing the BIOS to halt the system.

> 
> > map will be properly initialized from the very beginning and we won't
> > need init_unavailable_mem() and alike workarounds and. Obviously, the patch
> 
> I remember init_unavailable_mem() is necessary for holes within
> sections, where we actually *don't* have memory, but we still have have
> a valid memmap (full section) that we have to initialize.
> 
> See the example from 4b094b7851bf ("mm/page_alloc.c: initialize memmap
> of unavailable memory directly"). Our main memory ends within a section,
> so we have to initialize the remaining parts because the whole section
> will be marked valid/online.

Yes, memory hole need be handled in init_unavailable_mem(). Since we
have created struct page for them, need initialize them. We can't
discard init_unavailable_mem() for now.

> 
> Any way to improve this handling is appreciated. In that patch I also
> spelled out that we might want to mark such holes via a new page type,
> e.g., PageHole(). Such a page is a memory hole, but has a valid memmap.
> Any content in the memmap (zone/node) should be ignored.

As I said at above, I am a little conservative to add all those regions of
E820_TYPE_RESERVED into memblock.memory and memblock.reserved, because
most of them are firmware reserved region, they may be not backed by normal
RAM.

I was thinking to step back to use mm_zero_struct_page() inside
init_unavailable_range() as below. But it doesn't differ much
from __init_single_page(), except of the _refcount and mapcount. 
Zeroing struct page equals to putting them into node 0, zero 0.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3973b5fdfe3f..4e4b72cf5283 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6901,7 +6901,7 @@ static u64 __init init_unavailable_range(unsigned long spfn, unsigned long epfn)
 		 * (in memblock.reserved but not in memblock.memory) will
 		 * get re-initialized via reserve_bootmem_region() later.
 		 */
-		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
+		mm_zero_struct_page(pfn_to_page(pfn));
 		__SetPageReserved(pfn_to_page(pfn));
 		pgcnt++;
 	}

About adding these unavailable ranges into node/zone, in the old code,
it just happened to add them into expected node/zone. You can see in
early_pfn_in_nid(), if no nid found from memblock, the returned '-1'
will make it true ironically. But that is not saying the bad thing
always got good result. If the last zone of node 0 is DMA32 zone, the
deferred init will skip the only chance to add some of unavailable
rnages into expected node/zone. Means they were not always added into 
appropriate node/zone before, the change of iterating memblock.memory in
memmap_init() dones't introduce regression.

static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
{
        int nid;

        nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
        if (nid >= 0 && nid != node)
                return false;
        return true;
}

So if no anybody need access them after boot, not adding them into any
node/zone sounds better. Otherwise, better add them in the appropriate
node/zone.

> 
> But it's all quite confusing, especially across architectures and ...
> 
> > above is not enough, but it's a small step in this direction.
> > 
> > I believe that improving the early memory initialization would make many
> > things simpler and more robust, but that's a different story :)
> 
> ... I second that.
> 
> -- 
> Thanks,
> 
> David / dhildenb


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-06-01 13:32 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-21  1:44 [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() Baoquan He
2020-05-21  9:26 ` Mike Rapoport
2020-05-21 15:52   ` Baoquan He
2020-05-21 17:18     ` Mike Rapoport
2020-05-22  7:01       ` Baoquan He
2020-05-22  7:25         ` Baoquan He
2020-05-22 14:20           ` Mike Rapoport
2020-05-26  8:45             ` Baoquan He
2020-05-26  8:55               ` David Hildenbrand
2020-05-26 11:32               ` Mike Rapoport
2020-05-26 11:49                 ` David Hildenbrand
2020-05-28  9:07                   ` Baoquan He
2020-05-28  9:08                     ` David Hildenbrand
2020-05-28 15:15                     ` Steve Wahl
2020-06-01 11:42                       ` Mike Rapoport
2020-06-01 13:31                         ` Baoquan He
2020-05-21  9:36 ` Mel Gorman
2020-05-21 13:38   ` Mike Rapoport
2020-05-21 15:41   ` Baoquan He
2020-05-28  8:59 [PATCH] mm/compaction: Fix the incorrect hole in fast_isolate_freepages() ^[ Baoquan He
2020-05-28  9:08 ` Baoquan He

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).