From: David Hildenbrand <david@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>, Mel Gorman <mgorman@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, Qian Cai <cai@lca.pw>,
	Michal Hocko <mhocko@kernel.org>,
	linux-kernel@vger.kernel.org, Mike Rapoport <rppt@linux.ibm.com>,
	Baoquan He <bhe@redhat.com>
Subject: Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
Date: Wed, 25 Nov 2020 20:27:21 +0100
Message-ID: <a4cc62ba-8066-3e9c-cead-98cd74d313dd@redhat.com>
In-Reply-To: <X76iatgBErQH5El4@redhat.com>

On 25.11.20 19:28, Andrea Arcangeli wrote:
> On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
>> Before that change, the memmap of memory holes were only zeroed
>> out. So the zones/nid was 0, however, pages were not reserved and
>> had a refcount of zero - resulting in other issues.
> 
> So maybe that "0,0" zoneid/nid was not actually the thing that
> introduced the regression? Note: I didn't bisect anything yet, it was
> just a guess.

I guess 0/0 is the issue, but that existed before when we had a simple
memset(0). The root issue should be what Mike said:

73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather
that check each PFN")


>> Most pfn walkers shouldn't mess with reserved pages and simply skip
>> them. That would be the right fix here.
> 
> What would then be the advantage of keeping wrong page->flags in
> reserved pages compared to actually setting them correctly?

"correct" is problematic. If you have an actual memory hole, there is
not always a right answer - unless I am missing something important.


Assume you have a layout like this

[  zone X ] [ hole ] [ zone Y ]

If X ends or Y starts within a memory section, part of the hole has a
valid memmap - but what would be the right node/zone for those pages?


Assume you have a layout like this

[ zone X ]

whereby X ends inside a memory section. Then you hotplug memory. Assume
it goes to X

[ zone X ][ hole in X ][ zone X ]

or it goes to Y

[ zone X ][ hole ][ zone Y ]

This can easily be reproduced by starting a VM in qemu with a memory
size not aligned to 128 MB (e.g., -m 4000) and hotplugging memory.

> 
> How do you plan to enforce that every VM bit will call PageReserved
> (as it should be introduced in pageblock_pfn_to_page in that case) if
> it's not possible to disambiguate a real DMA zone and nid 0 from the
> uninitialized value? How can we patch page_nodenum to enforce it won't
> read an uninitialized value because a PageReserved check was missing
> in the caller?

We can't. The general rule is (as I was once told by Michal IIRC) that
there is no trusting the memmap content when the reserved bit is set -
unless you know "why it was set" - e.g., to distinguish early boot
allocations from ordinary allocations.

See below for pointer to documentation.
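
To make that concrete, here is a minimal sketch of what "skip instead of
interpret" looks like in a pfn walker. The walker itself is made up for
illustration; only the helpers (pfn_valid(), pfn_to_page(), PageReserved(),
page_to_nid()) are the real kernel API:

#include <linux/mm.h>
#include <linux/page-flags.h>

/*
 * Hypothetical pfn walker: count usable pages per node in a range,
 * refusing to interpret the memmap of reserved pages.
 */
static void count_usable_pages(unsigned long start_pfn, unsigned long end_pfn,
			       unsigned long *pages_per_node)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page;

		if (!pfn_valid(pfn))
			continue;

		page = pfn_to_page(pfn);

		/*
		 * PG_reserved only says "don't touch"; it does not
		 * guarantee that the node/zone encoded in page->flags
		 * means anything, so skip rather than interpret.
		 */
		if (PageReserved(page))
			continue;

		pages_per_node[page_to_nid(page)]++;
	}
}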

> 
> I don't see this easily enforceable by code review, it'd be pretty
> flaky to leave a wrong 0,0 value in there with no way to verify whether a
> PageReserved check in the caller was missing.

I'm not rooting for "keep this at 0/0" - I'm saying that I think there
are corner cases where it might not be that easy.

> 
>>> It'd be preferable if the pfn_valid fails and the
>>> pfn_to_section_nr(pfn) returns an invalid section for the intermediate
>>> step. Even better memset 0xff over the whole page struct until the
>>> second stage comes around.
>>
>> I recently discussed two options with Baoquan:
>> 1. Using a new pagetype to mark memory holes
>> 2. Using a special nid/zid to mark memory holes
>>
>> Setting the memmap to 0xff would be even more dangerous - pfn_zone() might simply BUG_ON.
> 
> What would need to call pfn_zone in between first and second stage?
> 
> If something calls pfn_zone in between first and second stage isn't it
> a feature if it crashes the kernel at boot?
> 
> Note: I suggested 0xff kernel crashing "until the second stage comes
> around" during meminit at boot, not permanently.

Yes, then it makes sense - if we're able to come up with a way to
initialize any memmap we might have - including actual memory holes that
have a memmap.
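
For context on why 0xff would blow up so reliably: the zone is decoded
straight out of page->flags, roughly like this (simplified from
include/linux/mm.h):

static inline enum zone_type page_zonenum(const struct page *page)
{
	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}

With an all-ones memmap this decodes to an out-of-range zone index (and the
nid decodes to garbage as well), so page_zone() and friends immediately walk
off into nonsense - a loud, early failure during meminit, which is the point
of the poisoning idea, but fatal if any such page survives past the second
stage.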

> 
> 		/*
> 		 * Use a fake node/zone (0) for now. Some of these pages
> 		 * (in memblock.reserved but not in memblock.memory) will
> 		 * get re-initialized via reserve_bootmem_region() later.
> 		 */
> 
> Specifically I relied on the comment "get re-initialized via
> reserve_bootmem_region() later".

Yes, but there is a "Some of these" :)

Boot a VM with "-m 4000" and observe the memmap of the last section -
those pages won't get initialized a second time.
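
That is because the second stage only walks what memblock knows as reserved.
Roughly (simplified; see memblock_free_all() and reserve_bootmem_region()
for the real thing):

#include <linux/mm.h>
#include <linux/memblock.h>

/* Simplified sketch of the second-stage re-initialization. */
static void __init reinit_reserved_memmap(void)
{
	phys_addr_t start, end;
	u64 i;

	/* Only ranges in memblock.reserved are visited ... */
	for_each_reserved_mem_range(i, &start, &end)
		reserve_bootmem_region(start, end);

	/*
	 * ... so struct pages that sit in neither memblock.memory nor
	 * memblock.reserved (e.g. the tail of the last section in the
	 * 4000 MB example above) keep the fake node/zone 0 from the
	 * first stage.
	 */
}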

> 
> I assumed the second stage overwrites the 0,0 to the real zoneid/nid
> value, which is clearly not happening, hence it'd be preferable to get
> a crash at boot reliably.
> 
> Now I have CONFIG_DEFERRED_STRUCT_PAGE_INIT=n so the second stage
> calling init_reserved_page(start_pfn) won't do much with
> CONFIG_DEFERRED_STRUCT_PAGE_INIT=n but I already tried to enable
> CONFIG_DEFERRED_STRUCT_PAGE_INIT=y yesterday and it didn't help, the
> page->flags were still wrong for reserved pages in the "Unknown E820
> type" region.
> 
>>> Whenever pfn_valid is true, it's better that the zoneid/nid is correct
>>> at all times, otherwise if the second stage fails we end up in a bug with
>>> weird side effects.
>>
>> Memory holes with a valid memmap might not have a zone/nid. For now, skipping reserved pages should be good enough, no?
> 
> zoneid is always a pure abstract concept: not just the RAM holes, even
> the RAM itself doesn't have a zoneid at all.
> 
> Nid is hardware defined for RAM, but for reserved RAM holes it becomes
> a pure software construct like the zoneid.
> 
> So while it's right that they have no zone/nid in the hardware, they
> still have a very well defined zoneid/nid in the software implementation.
> 
> The page struct belongs to one and only one mem_map array, which has a
> specific nodeid and zoneid. So it can always be initialized right even
> for non-RAM.

AFAIK, the mem_map array might span multiple NIDs - the per-page nid is
set when initializing the zones.
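
Each page carries its own node/zone in page->flags; it gets written via
set_page_links() when the zones are initialized - roughly like this,
simplified from include/linux/mm.h:

static inline void set_page_links(struct page *page, enum zone_type zone,
				  unsigned long node, unsigned long pfn)
{
	set_page_zone(page, zone);	/* encode the zone index into page->flags */
	set_page_node(page, node);	/* encode the node id into page->flags */
#ifdef SECTION_IN_PAGE_FLAGS
	set_page_section(page, pfn_to_section_nr(pfn));
#endif
}

So the array itself has no single nid; each struct page in it does.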

> 
> Only if the pfn is !pfn_valid, then it has no page struct and then
> there's no zone/nid association, but in that case there's no reason to
> even worry about it since nobody can possibly get a wrong value out of
> the page struct because there's no page struct in that case.
> 
> Last but not least, RAM pages can be marked reserved and assigned
> to hardware, and so it'd be really messy if real reserved RAM pages
> given to hw behaved differently from non-RAM that accidentally got a
> struct page because of section alignment issues.

Well, "reserved" is not a good indication "what" something actually is.

I documented that a while ago in include/linux/page-flags.h

"PG_reserved is set for special pages. The "struct page" of such a page
 should in general not be touched (e.g. set dirty) except by its owner.
 Pages marked as PG_reserved include: [...]"

I suggest looking at that.

AFAIR, we have been setting *most* memmap in memory holes/non-ram
reserved for a long time - long before I added __init_single_page() -
see init_reserved_page() for example.

-- 
Thanks,

David / dhildenb

