linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Dan Williams <dan.j.williams@intel.com>, linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@suse.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions
Date: Wed, 6 Jan 2021 10:56:19 +0100	[thread overview]
Message-ID: <785b9095-eca4-8100-33ea-6ae84e02a92e@redhat.com> (raw)
In-Reply-To: <160990599013.2430134.11556277600719835946.stgit@dwillia2-desk3.amr.corp.intel.com>

On 06.01.21 05:07, Dan Williams wrote:
> While pfn_to_online_page() is able to determine pfn_valid() at
> subsection granularity it is not able to reliably determine if a given
> pfn is also online if the section is mixed with ZONE_DEVICE pfns.
> 
> Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a
> section that mixes ZONE_DEVICE pfns with other online pfns. With
> SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall
> back to a slow-path check for ZONE_DEVICE pfns in an online section.
> 
> With this implementation of pfn_to_online_page() pfn-walkers mostly only
> need to check section metadata to determine pfn validity. In the
> rare case of mixed-zone sections the pfn-walker will skip offline
> ZONE_DEVICE pfns as expected.
> 
> Other notes:
> 
> Because the collision case is rare, and for simplicity, the
> SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.
> 
> pfn_to_online_page() was already borderline too large to be a macro /
> inline function, but the additional logic certainly pushed it over that
> threshold, and so it is moved to an out-of-line helper.
> 
> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Reported-by: Michal Hocko <mhocko@suse.com>
> Reported-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

[...]

> +#define SECTION_MARKED_PRESENT		(1UL<<0)
> +#define SECTION_HAS_MEM_MAP		(1UL<<1)
> +#define SECTION_IS_ONLINE		(1UL<<2)
> +#define SECTION_IS_EARLY		(1UL<<3)
> +#define SECTION_TAINT_ZONE_DEVICE	(1UL<<4)
> +#define SECTION_MAP_LAST_BIT		(1UL<<5)
> +#define SECTION_MAP_MASK		(~(SECTION_MAP_LAST_BIT-1))
> +#define SECTION_NID_SHIFT		3
>  
>  static inline struct page *__section_mem_map_addr(struct mem_section *section)
>  {
> @@ -1318,6 +1319,13 @@ static inline int online_section(struct mem_section *section)
>  	return (section && (section->section_mem_map & SECTION_IS_ONLINE));
>  }
>  
> +static inline int online_device_section(struct mem_section *section)
> +{
> +	unsigned long flags = SECTION_IS_ONLINE | SECTION_TAINT_ZONE_DEVICE;
> +
> +	return section && ((section->section_mem_map & flags) == flags);
> +}
> +
>  static inline int online_section_nr(unsigned long nr)
>  {
>  	return online_section(__nr_to_section(nr));
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f9d57b9be8c7..9f36968e6188 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -300,6 +300,47 @@ static int check_hotplug_memory_addressable(unsigned long pfn,
>  	return 0;
>  }
>  
> +/*
> + * Return page for the valid pfn only if the page is online. All pfn
> + * walkers which rely on the fully initialized page->flags and others
> + * should use this rather than pfn_valid && pfn_to_page
> + */
> +struct page *pfn_to_online_page(unsigned long pfn)
> +{
> +	unsigned long nr = pfn_to_section_nr(pfn);
> +	struct dev_pagemap *pgmap;
> +	struct mem_section *ms;
> +
> +	if (nr >= NR_MEM_SECTIONS)
> +		return NULL;
> +
> +	ms = __nr_to_section(nr);
> +
> +	if (!online_section(ms))
> +		return NULL;
> +
> +	if (!pfn_valid_within(pfn))
> +		return NULL;
> +
> +	if (!online_device_section(ms))
> +		return pfn_to_page(pfn);
> +
> +	/*
> +	 * Slowpath: when ZONE_DEVICE collides with
> +	 * ZONE_{NORMAL,MOVABLE} within the same section some pfns in
> +	 * the section may be 'offline' but 'valid'. Only
> +	 * get_dev_pagemap() can determine sub-section online status.
> +	 */
> +	pgmap = get_dev_pagemap(pfn, NULL);
> +	put_dev_pagemap(pgmap);
> +
> +	/* The presence of a pgmap indicates ZONE_DEVICE offline pfn */
> +	if (pgmap)
> +		return NULL;
> +	return pfn_to_page(pfn);
> +}
> +EXPORT_SYMBOL_GPL(pfn_to_online_page);

Note that this is not sufficient in the general case. I already
mentioned that we effectively override an already initialized memmap.

---

[        SECTION             ]
Before:
[ ZONE_NORMAL ][    Hole     ]

The hole has some node/zone (currently 0/0, discussions ongoing on how
to optimize that to e.g., ZONE_NORMAL in this example) and is
PG_reserved - looks like an ordinary memory hole.

After memremap:
[ ZONE_NORMAL ][ ZONE_DEVICE ]

The already initialized memmap was converted to ZONE_DEVICE. Your
slowpath will work.

After memunmap (no poisioning):
[ ZONE_NORMAL ][ ZONE_DEVICE ]

The slow path is no longer working. pfn_to_online_page() might return
something that is ZONE_DEVICE.

After memunmap (poisioning):
[ ZONE_NORMAL ][ POISONED    ]

The slow path is no longer working. pfn_to_online_page() might return
something that will BUG_ON via page_to_nid() etc.

---

Reason is that pfn_to_online_page() does no care about sub-sections. And
for now, it didn't had to. If there was an online section, it either was

a) Completely present. The whole memmap is initialized to sane values.
b) Partially present. The whole memmap is initialized to sane values.

memremap/memunmap messes with case b)

Well have to further tweak pfn_to_online_page(). You'll have to also
check pfn_section_valid() *at least* on the slow path. Less-hacky would
be checking it also in the "somehwat-faster" path - that would cover
silently overriding a memmap that's visible via pfn_to_online_page().
Might slow down things a bit.


Not completely opposed to this, but I would certainly still prefer just
avoiding this corner case completely instead of patching around it. Thanks!

-- 
Thanks,

David / dhildenb


  parent reply	other threads:[~2021-01-06  9:57 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-06  4:07 [PATCH] mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions Dan Williams
2021-01-06  9:55 ` Michal Hocko
2021-01-12  9:15   ` Dan Williams
2021-01-06  9:56 ` David Hildenbrand [this message]
2021-01-06 10:42   ` Michal Hocko
2021-01-06 11:22     ` David Hildenbrand
2021-01-06 11:38       ` Michal Hocko
2021-01-06 20:02       ` Dan Williams
2021-01-07  9:15         ` David Hildenbrand
2021-01-12  9:18           ` Dan Williams
2021-01-12  9:44             ` David Hildenbrand
2021-01-12 20:52               ` Dan Williams
2021-01-06 10:04 ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=785b9095-eca4-8100-33ea-6ae84e02a92e@redhat.com \
    --to=david@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).