All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>, Linux MM <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions
Date: Tue, 12 Jan 2021 01:18:20 -0800	[thread overview]
Message-ID: <CAPcyv4gXaUswZS_a4-oiQZWVZ4QDJrKps4XGs=xxLE0Ls=PSmg@mail.gmail.com> (raw)
In-Reply-To: <75bb1429-d133-d303-a67a-be16c654ada8@redhat.com>

On Thu, Jan 7, 2021 at 1:16 AM David Hildenbrand <david@redhat.com> wrote:
>
> [...]
>
> >>> Well, I would love to have no surprises either. So far there was not
> >>> actual argument why the pmem reserved space cannot be fully initialized.
> >>
> >> Yes, I'm still hoping Dan can clarify that.
> >
> > Complexity and effective utility (once pfn_to_online_page() is fixed)
> > are the roadblocks in my mind. The altmap is there to allow for PMEM
> > capacity to be used as memmap space, so there would need to be code to
> > break that circular dependency and allocate a memmap for the metadata
> > space from DRAM and the rest of the memmap space for the data capacity
> > from pmem itself. That memmap-for-pmem-metadata will still represent
> > offline pages. So once pfn_to_online_page() is fixed, what pfn-walker
> > is going to be doing pfn_to_page() on PMEM metadata? Secondly, there
>
> Assume I do
>
> pgmap = get_dev_pagemap(pfn, NULL);
> if (pgmap)
>         return pfn_to_page(pfn);
> return NULL;
>
> on a random pfn because I want to inspect ZONE_DEVICE PFNs.

I keep getting hung up on the motivation to do random pfn inspection?

The problems we have found to date have required different solutions.
The KVM bug didn't use get_dev_pagemap() to inspect the pfn because it
could rely on the fact that the page already had an elevated reference
count. The get_user_pages() path only looks up ZONE_DEVICE pfns when
it see {pte,pmd,pud}_devmap set in the page table entry. pfn walkers
have been a problem, but with pfn_to_online_page() fixed what is the
remaining motivation to inspect ZONE_DEVICE pfns?

> IIUC, the memmap I get might usually be initialized, except we're
> hitting a PFN in the reserved altmap space. Correct?

The pagemap currently returns true for every pfn in the range
including those in the altmap.



>
> Do we expect this handling to not be valid - i.e., initialization to be
> of no utility? (to make it fly we would have to explicitly check for
> PFNs in the altmap reserved space, which wouldn't be too problematic)
>
> > is a PMEM namespace mode called "raw" that eschews DAX and 'struct
> > page' for pmem and just behaves like a memory-backed block device. The
> > end result is still that pfn walkers need to discover if a PMEM pfn
> > has a page, so I don't see what "sometimes there's an
> > memmap-for-pmem-metadata" buys us?
>
> Right, but that's as easy as doing a pfn_valid() first.
>
>
> Let me try to express what I (and I think Michal) mean:
>
> In pagemap_range(), we
>
> 1. move_pfn_range_to_zone()->memmap_init_zone(): initialize the memmap
> of the PMEM device used as memmap itself ("self host", confusing). We
> skip initializing the memmap for the the reserved altmap space.
>
> 2. memmap_init_zone_device(): initialize the memmap of everything
> outside of the altmap space.
>
> IIUC, the memmap of the reserved altmap space remains uninitialized.
> What speaks against just initializing that part in e.g., 2. or similarly
> after 1.?
>
>
> I'm planning on documenting the result of this discussion in the code,
> so people don't have to scratch their head whenever stumbling over the
> altmap reserved space.
>
> >
> >>
> >>> On the other hand making sure that pfn_to_online_page sounds like the
> >>> right thing to do. And having an explicit check for zone device there in
> >>> a slow path makes sense to me.
> >>
> >> As I said, I'd favor to simplify and just get rid of the special case,
> >> instead of coming up with increasingly complex ways to deal with it.
> >> pfn_to_online_page() used to be simple, essentially checking a single
> >> flag was sufficient in most setups.
> >
> > I think the logic to throw away System RAM that might collide with
> > PMEM and soft-reserved memory within a section is on the order of the
> > same code complexity as the patch proposed here, no? Certainly the
> > throw-away concept itself is easier to grasp, but I don't think that
> > would be reflected in the code patch to achieve it... willing to be
> > proved wrong with a patch.
>
> Well, at least it sounds easier to go over memblock holes and
> align-up/down some relevant PFNs to section boundaries, ending up with
> no affect to runtime performance later (e.g., pfn_to_online_page()). But
> I agree that most probably the devil is in the detail - e.g., handling
> different kind of holes (different flavors of "reserved") and syncing up
> other data structures (e.g., resource tree).
>
> I don't have time to look into that right now, but might look into it in
> the future. For now I'm fine with this approach, assuming we don't
> discover other corner cases that turn it even more complex. I'm happy
> that we finally talk about it and fix it!
>
>
> --
> Thanks,
>
> David / dhildenb
>

  reply	other threads:[~2021-01-12  9:19 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-06  4:07 [PATCH] mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions Dan Williams
2021-01-06  9:55 ` Michal Hocko
2021-01-12  9:15   ` Dan Williams
2021-01-12  9:15     ` Dan Williams
2021-01-06  9:56 ` David Hildenbrand
2021-01-06 10:42   ` Michal Hocko
2021-01-06 11:22     ` David Hildenbrand
2021-01-06 11:38       ` Michal Hocko
2021-01-06 20:02       ` Dan Williams
2021-01-06 20:02         ` Dan Williams
2021-01-07  9:15         ` David Hildenbrand
2021-01-12  9:18           ` Dan Williams [this message]
2021-01-12  9:18             ` Dan Williams
2021-01-12  9:44             ` David Hildenbrand
2021-01-12 20:52               ` Dan Williams
2021-01-12 20:52                 ` Dan Williams
2021-01-06 10:04 ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAPcyv4gXaUswZS_a4-oiQZWVZ4QDJrKps4XGs=xxLE0Ls=PSmg@mail.gmail.com' \
    --to=dan.j.williams@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.