From: David Hildenbrand <david@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>, Mel Gorman <mgorman@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, Qian Cai <cai@lca.pw>,
	Michal Hocko <mhocko@kernel.org>,
	linux-kernel@vger.kernel.org, Mike Rapoport <rppt@linux.ibm.com>,
	Baoquan He <bhe@redhat.com>
Subject: Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
Date: Wed, 25 Nov 2020 20:27:21 +0100
Message-ID: <a4cc62ba-8066-3e9c-cead-98cd74d313dd@redhat.com>
In-Reply-To: <X76iatgBErQH5El4@redhat.com>

On 25.11.20 19:28, Andrea Arcangeli wrote:
> On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
>> Before that change, the memmap of memory holes were only zeroed
>> out. So the zones/nid was 0, however, pages were not reserved and
>> had a refcount of zero - resulting in other issues.
> 
> So maybe that "0,0" zoneid/nid was not actually the thing that
> introduced the regression? Note: I didn't bisect anything yet, it was
> just a guess.

I guess 0/0 is the issue, but that existed before when we had a simple
memset(0). The root issue should be what Mike said:

73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather
that check each PFN")


>> Most pfn walkers shouldn't mess with reserved pages and simply skip
>> them. That would be the right fix here.
> 
> What would then be the advantage of keeping wrong page->flags in
> reserved pages compared to actually setting them correctly?

"correct" is problematic. If you have an actual memory hole, there is
not always a right answer - unless I am missing something important.


Assume you have a layout like this

[  zone X ] [ hole ] [ zone Y ]

If X ends or Y starts within a memory section, part of the hole has a
valid memmap - but what would be the right node/zone for those pages?


Assume you have a layout like this

[ zone X ]

whereby X ends inside a memory section. Then you hotplug memory. Assume
it goes to X

[ zone X ][ hole in X ][ zone X ]

or it goes to Y

[ zone X ][ hole ][ zone Y ]

This can easily be reproduced by starting a VM in qemu with a memory
size not aligned to 128 MB (e.g., -m 4000) and hotplugging memory.

> 
> How do you plan to enforce that every VM bit will call PageReserved
> (as it should be introduced in pageblock_pfn_to_page in that case) if
> it's not possible to disambiguate a real DMA zone and nid 0 from the
> uninitialized value? How can we patch page_nodenum to enforce it won't
> read an uninitialized value because a PageReserved check was missing
> in the caller?

We can't. The general rule is (as I was once told by Michal IIRC) that
there is no trusting the memmap content when the reserved bit is set -
unless you know "why it was set" - e.g., to distinguish early boot
allocations from ordinary allocations.

See below for pointer to documentation.
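
To make that concrete, here is a minimal sketch of what "skip instead of
interpret" looks like in a pfn walker. The walker itself is made up for
illustration; only the helpers (pfn_valid(), pfn_to_page(), PageReserved(),
page_to_nid()) are the real kernel API:

#include <linux/mm.h>
#include <linux/page-flags.h>

/*
 * Hypothetical pfn walker: count usable pages per node in a range,
 * refusing to interpret the memmap of reserved pages.
 */
static void count_usable_pages(unsigned long start_pfn, unsigned long end_pfn,
			       unsigned long *pages_per_node)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page;

		if (!pfn_valid(pfn))
			continue;

		page = pfn_to_page(pfn);

		/*
		 * PG_reserved only says "don't touch"; it does not
		 * guarantee that the node/zone encoded in page->flags
		 * means anything, so skip rather than interpret.
		 */
		if (PageReserved(page))
			continue;

		pages_per_node[page_to_nid(page)]++;
	}
}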

> 
> I don't see this easily enforceable by code review, it'd be pretty
> flaky to leave a wrong 0,0 value in there with no way to verify whether a
> PageReserved check in the caller was missing.

I'm not rooting for "keep this at 0/0" - I'm saying that I think there
are corner cases where it might not be that easy.

> 
>>> It'd be preferable if the pfn_valid fails and the
>>> pfn_to_section_nr(pfn) returns an invalid section for the intermediate
>>> step. Even better memset 0xff over the whole page struct until the
>>> second stage comes around.
>>
>> I recently discussed two options with Baoquan:
>> 1. Using a new pagetype to mark memory holes
>> 2. Using a special nid/zid to mark memory holes
>>
>> Setting the memmap to 0xff would be even more dangerous - pfn_zone() might simply BUG_ON.
> 
> What would need to call pfn_zone in between first and second stage?
> 
> If something calls pfn_zone in between first and second stage isn't it
> a feature if it crashes the kernel at boot?
> 
> Note: I suggested 0xff kernel crashing "until the second stage comes
> around" during meminit at boot, not permanently.

Yes, then it makes sense - if we're able to come up with a way to
initialize any memmap we might have - including actual memory holes that
have a memmap.
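
For context on why 0xff would blow up so reliably: the zone is decoded
straight out of page->flags, roughly like this (simplified from
include/linux/mm.h):

static inline enum zone_type page_zonenum(const struct page *page)
{
	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}

With an all-ones memmap this decodes to an out-of-range zone index (and the
nid decodes to garbage as well), so page_zone() and friends immediately walk
off into nonsense - a loud, early failure during meminit, which is the point
of the poisoning idea, but fatal if any such page survives past the second
stage.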

> 
> 		/*
> 		 * Use a fake node/zone (0) for now. Some of these pages
> 		 * (in memblock.reserved but not in memblock.memory) will
> 		 * get re-initialized via reserve_bootmem_region() later.
> 		 */
> 
> Specifically I relied on the comment "get re-initialized via
> reserve_bootmem_region() later".

Yes, but there is a "Some of these" :)

Boot a VM with "-m 4000" and observe the memmap of the last section -
those pages won't get initialized a second time.
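
That is because the second stage only walks what memblock knows as reserved.
Roughly (simplified; see memblock_free_all() and reserve_bootmem_region()
for the real thing):

#include <linux/mm.h>
#include <linux/memblock.h>

/* Simplified sketch of the second-stage re-initialization. */
static void __init reinit_reserved_memmap(void)
{
	phys_addr_t start, end;
	u64 i;

	/* Only ranges in memblock.reserved are visited ... */
	for_each_reserved_mem_range(i, &start, &end)
		reserve_bootmem_region(start, end);

	/*
	 * ... so struct pages that sit in neither memblock.memory nor
	 * memblock.reserved (e.g. the tail of the last section in the
	 * 4000 MB example above) keep the fake node/zone 0 from the
	 * first stage.
	 */
}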

> 
> I assumed the second stage overwrites the 0,0 to the real zoneid/nid
> value, which is clearly not happening, hence it'd be preferable to get
> a crash at boot reliably.
> 
> Now I have CONFIG_DEFERRED_STRUCT_PAGE_INIT=n so the second stage
> calling init_reserved_page(start_pfn) won't do much with
> CONFIG_DEFERRED_STRUCT_PAGE_INIT=n but I already tried to enable
> CONFIG_DEFERRED_STRUCT_PAGE_INIT=y yesterday and it didn't help, the
> page->flags were still wrong for reserved pages in the "Unknown E820
> type" region.
> 
>>> Whenever pfn_valid is true, it's better that the zoneid/nid is correct
>>> at all times, otherwise if the second stage fails we end up in a bug with
>>> weird side effects.
>>
>> Memory holes with a valid memmap might not have a zone/nid. For now, skipping reserved pages should be good enough, no?
> 
> zoneid is always a pure abstract concept: not just the RAM holes, even
> the RAM itself doesn't have a zoneid at all.
> 
> Nid is hardware defined for RAM, but for reserved RAM holes it becomes
> a pure software construct like the zoneid.
> 
> So while it's right that they have no zone/nid in the hardware, they
> still have a very well defined zoneid/nid in the software implementation.
> 
> The page struct belongs to one and only one mem_map array, which has a
> specific nodeid and zoneid. So it can always be initialized right even
> for non-RAM.

AFAIK, the mem_map array might span multiple NIDs - the per-page nid is
set when initializing the zones.
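
Each page carries its own node/zone in page->flags; it gets written via
set_page_links() when the zones are initialized - roughly like this,
simplified from include/linux/mm.h:

static inline void set_page_links(struct page *page, enum zone_type zone,
				  unsigned long node, unsigned long pfn)
{
	set_page_zone(page, zone);	/* encode the zone index into page->flags */
	set_page_node(page, node);	/* encode the node id into page->flags */
#ifdef SECTION_IN_PAGE_FLAGS
	set_page_section(page, pfn_to_section_nr(pfn));
#endif
}

So the array itself has no single nid; each struct page in it does.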

> 
> Only if the pfn is !pfn_valid, then it has no page struct and then
> there's no zone/nid association, but in that case there's no reason to
> even worry about it since nobody can possibly get a wrong value out of
> the page struct because there's no page struct in that case.
> 
> Last but not least, RAM pages can be marked reserved and assigned
> to hardware, and so it'd be really messy if real reserved RAM pages
> given to hw behaved differently from non-RAM that accidentally got a
> struct page because of section alignment issues.

Well, "reserved" is not a good indication "what" something actually is.

I documented that a while ago in include/linux/page-flags.h

"PG_reserved is set for special pages. The "struct page" of such a page
 should in general not be touched (e.g. set dirty) except by its owner.
 Pages marked as PG_reserved include: [...]"

I suggest looking at that.

AFAIR, we have been setting *most* memmap in memory holes/non-ram
reserved for a long time - long before I added __init_single_page() -
see init_reserved_page() for example.

-- 
Thanks,

David / dhildenb

