From: David Hildenbrand <david@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Hildenbrand <david@redhat.com>,
	Vlastimil Babka <vbabka@suse.cz>, Mel Gorman <mgorman@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, Qian Cai <cai@lca.pw>,
	Michal Hocko <mhocko@kernel.org>,
	linux-kernel@vger.kernel.org, Mike Rapoport <rppt@linux.ibm.com>,
	Baoquan He <bhe@redhat.com>
Subject: Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
Date: Wed, 25 Nov 2020 22:13:46 +0100
Message-ID: <89B17C54-671B-4363-B425-CCFE17DD8FDD@redhat.com>
In-Reply-To: <X77BgHiTR3R7biho@redhat.com>


> On 25.11.2020 at 21:41, Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> On Wed, Nov 25, 2020 at 08:27:21PM +0100, David Hildenbrand wrote:
>>> On 25.11.20 19:28, Andrea Arcangeli wrote:
>>> On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
>>>> Before that change, the memmap of memory holes was only zeroed
>>>> out. So the zone/nid was 0; however, the pages were not reserved and
>>>> had a refcount of zero - resulting in other issues.
>>> 
>>> So maybe that "0,0" zoneid/nid was not actually the thing that
>>> introduced the regression? Note: I didn't bisect anything yet, it was
>>> just a guess.
>> 
>> I guess 0/0 is the issue, but that existed before when we had a simple
>> memset(0). The root issue should be what Mike said:
> 
> Yes, the second stage must have stopped running somehow.
> 
> Is there anything we can do to induce a deterministic, reproducible
> kernel crash if the second stage doesn't run?
> 
> Why did we start doing a more graceful initialization in the first
> stage, instead of making it less graceful by setting the memmap to
> 0xff instead of 0x00?

I guess because we weren't aware of the issues we have :)
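
Just to make the idea concrete - a rough sketch only, not actual kernel code, and poison_memmap_range() is a name I made up: poisoning a hole's memmap with 0xff instead of leaving it zeroed would make any accidental user trip over obviously bogus values instead of silently reading nid/zone 0.

#include <linux/init.h>
#include <linux/mm.h>
#include <linux/string.h>

/*
 * Hypothetical sketch: fill the memmap of a pfn range with 0xff so that
 * code touching these uninitialized struct pages crashes on obviously
 * bogus values instead of silently using nid 0 / zone 0.
 */
static void __init poison_memmap_range(unsigned long start_pfn,
                                       unsigned long end_pfn)
{
        unsigned long pfn;

        for (pfn = start_pfn; pfn < end_pfn; pfn++)
                memset(pfn_to_page(pfn), 0xff, sizeof(struct page));
}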

> 
>> 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather
>> that check each PFN")
> 
> So if that's not intentional, are you suggesting the zoneid/nid being
> set to 0,0 for a valid non-RAM pfn was a bug?
> 

Depends on how we think checks for reserved pages should be performed. I'd rather indicate "this memmap is just garbage, skip it". If the reserved flag is not good enough, then via a special node/zone - as you also suggest below.

>> "correct" is problematic. If you have an actual memory hole, there is
>> not always a right answer - unless I am missing something important.
>> 
>> 
>> Assume you have a layout like this
>> 
>> [  zone X ] [ hole ] [ zone Y ]
>> 
>> If X and/or Y starts within a memory section, you have a valid
>> memmap for the hole - but what would be the right node/zone?
>> 
>> 
>> Assume you have a layout like this
>> 
>> [ zone X ]
>> 
>> whereby X ends inside a memory section. Then you hotplug memory. Assume
>> it goes to X
>> 
>> [ zone X ][ hole in X ][ zone X]
>> 
>> or it goes to Y
>> 
>> [ zone X ][ hole ][ zone Y ]
>> 
>> This can easily be reproduced by starting a VM in QEMU with a memory
>> size not aligned to 128 MB (e.g., -m 4000) and hotplugging memory.
> 
> I don't get what the problem is sorry.
> 
> You have a pfn; if pfn_valid() is true, pfn_to_page() returns a page
> deterministically.
> 
> It's up to the kernel to decide which page structure belongs to any pfn
> in the pfn_to_page function.
> 
> Now if the pfn_to_page(pfn) function returns a page whose nid/zone_id
> in page->flags points to a node/zone whose zone_start_pfn -
> end_zone_pfn range doesn't contain "pfn", that is a bug in
> page_alloc.c.
> 
> I don't see how it would not be possible to deterministically enforce
> that the above never happens. Only if that were impossible would it be
> true that there's not always a right answer.
> 
> Zones can overlap, but it can't be that you do pfn_to_page() on a
> valid pfn and obtain a page whose zone doesn't contain that pfn -
> which is exactly what is currently crashing compaction.
> 
> I don't see how this is an unsolvable problem and why we should accept
> to live with a bogus page->flags for reserved pages.
> 

I said it's problematic, not unsolvable. Using a special zone/node is certainly easier - but might reveal some issues we have to fix - I guess? Fair enough.
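
For reference, the invariant you are describing can be spelled out with the existing helpers. A minimal sketch - check_pfn_zone_consistency() is a made-up name, and this is only the condition, not a proposal for where to hook it:

#include <linux/mm.h>
#include <linux/mmzone.h>

/* For every valid pfn, the zone recorded in page->flags must span that pfn. */
static void check_pfn_zone_consistency(unsigned long pfn)
{
        struct page *page;

        if (!pfn_valid(pfn))
                return;

        page = pfn_to_page(pfn);
        /* This is the same condition the compaction VM_BUG_ON trips over. */
        VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
}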

>> We can't. The general rule is (as I was once told by Michal IIRC) that
> 
> The fact that we can't crash the kernel reliably when somebody uses
> the bogus 0,0 uninitialized value without adding an explicit
> PageReserved check is my primary concern with keeping those zoneid/nid
> values uninitialized but non-kernel-crashing - it already created this
> unreproducible bug.

Agreed.
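
To illustrate the kind of explicit check you mean - again just a sketch, pfn_usable_for_scan() is a made-up helper and not the actual patch: every memmap walker has to bail out on reserved pages explicitly, otherwise a hole's nid/zone-0 memmap gets consumed silently.

#include <linux/mm.h>
#include <linux/page-flags.h>

/* Skip pfns whose struct page must not be interpreted by a scanner. */
static bool pfn_usable_for_scan(unsigned long pfn)
{
        struct page *page;

        if (!pfn_valid(pfn))
                return false;

        page = pfn_to_page(pfn);
        if (PageReserved(page))
                return false;   /* don't trust page_zone()/page_to_nid() here */

        return true;
}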

> 
>> I'm not rooting for "keep this at 0/0" - I'm saying that I think there
>> are corner cases where it might not be that easy.
> 
> I'm not saying it's easy. What I don't see is why you wouldn't always
> have a right answer, and why it would be an unsolvable problem.

"Problematic" does not imply unsolvable.

> 
> It is certainly problematic and difficult to solve in the mem_map
> initialization logic, but to me, having pfn_valid() &&
> page_zone(pfn_to_page(pfn)) randomly return the DMA zone of the first
> node also looks problematic and difficult to handle across all the VM
> code, so overall it looks preferable to keep the complexity of the
> mem_map initialization self-contained and not let it spill over into
> the rest of the VM.
> 
>> Yes, but there is a "Some of these" :)
>> 
>> Boot a VM with "-m 4000" and observe the memmap in the last section -
>> it won't get initialized a second time.
> 
> Is the memmap beyond the end of the zone yet another case? I guess
> that's less likely to give us problems because it's beyond the end of
> the zone. Would pfn_valid return true for those pfns? If pfn_valid is not

Yes. This is especially exposed after memory hotplug, when the zone/nid span changes.

> true, it's not really a concern, but then again I'd prefer it if those
> struct pages beyond the end of the zone were set to the
> kernel-crashing 0xff.
> 
> In other words, I just don't see why we should ever prefer to leave
> some pages at a graceful but erroneous zoneid 0 / nid 0 that wouldn't
> easily induce a crash if used.

I agree.

> 
>> AFAIK, the mem_map array might have multiple NIDs - and it's set when
>> initializing the zones.
> 
> Well, because there's no mem_map array with SPARSEMEM, but it's not
> conceptually too different from having one. Even with flatmem there
> could be multiple page structs for each pfn; the disambiguation has to
> be handled by pfn_to_page regardless of SPARSEMEM or not.
> 
> The point is that if page_zone(pfn_to_page(pfn)) points to the DMA
> zone of the first node, and the pfn isn't part of the DMA zone of the
> first node, that looks like a bug - and it can be enforced that it
> doesn't happen.
> 
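
FWIW, with classic SPARSEMEM the disambiguation is done per section - roughly like this, simplified from the real __pfn_to_page(), with pfn_to_page_sparse() as a purely illustrative name:

#include <linux/mmzone.h>

/*
 * Each mem_section carries its own memmap chunk; pfn_to_page() resolves
 * the pfn through its section, so there is no single global mem_map array.
 */
static struct page *pfn_to_page_sparse(unsigned long pfn)
{
        struct mem_section *ms = __pfn_to_section(pfn);

        return __section_mem_map_addr(ms) + pfn;
}
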
>> Well, "reserved" is not a good indication "what" something actually is.
>> 
>> I documented that a while ago in include/linux/page-flags.h
>> 
>> "PG_reserved is set for special pages. The "struct page" of such a page
>> should in general not be touched (e.g. set dirty) except by its owner.
>> Pages marked as PG_reserved include:."
>> 
>> I suggest looking at that.
>> 
>> AFAIR, we have been setting *most* memmaps in memory holes/non-RAM to
>> reserved for a long time - long before I added __init_single_page() -
>> see init_reserved_page() for example.
> 
> Sure, non-RAM with a valid page struct has always been marked
> PG_reserved. I wasn't suggesting that it shouldn't be PG_reserved.
> 
> I was pointing out that RAM can also be marked PG_reserved by the
> kernel later, long after boot, as you mentioned for all the other
> cases of PG_reserved; the most notable are drivers setting PG_reserved
> after allocating RAM, either via vmalloc or via GART swapping RAM
> around at another alias physical address.
> 
> That is all born as RAM at boot; it gets page->flags set up right,
> with the right zoneid, and only becomes PG_reserved later.
> 
> So I was suggesting that non-RAM pfn ranges (be those holes within
> zones or in between zones, it doesn't matter) where pfn_valid returns
> true and pfn_to_page points deterministically to one and only one
> struct page should have that struct page initialized exactly the same
> as if it were RAM.
> 
> Either that, or we can define a new NO_ZONE/NO_ID id and crash in
> page_zonenum or page_to_nid if either is ever called on such a page
> struct.

I feel like that is easier and maybe cleaner: mark memmaps that exist but should be completely ignored. We could even check that in pfn_valid() and return "false" - might be expensive, though.
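
To make that concrete - purely a sketch, NID_NONE_SENTINEL and nid_of_page_checked() are made-up names, and whether a spare node-id encoding is actually available in page->flags is an assumption:

#include <linux/mm.h>
#include <linux/mmzone.h>

/* Assumption: one spare value of the node-id field can act as a "no node" sentinel. */
#define NID_NONE_SENTINEL       (MAX_NUMNODES - 1)

/* Crash loudly if anyone asks for the node of a "garbage" memmap entry. */
static inline int nid_of_page_checked(struct page *page)
{
        int nid = page_to_nid(page);

        VM_BUG_ON_PAGE(nid == NID_NONE_SENTINEL, page);
        return nid;
}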

Anyhow, I do agree that properly catching these problematic pages, bailing out and fixing them (however we decide) is the right approach.

> 
> Thanks,
> Andrea


