From: David Hildenbrand <david@redhat.com>
To: Vlastimil Babka <vbabka@suse.cz>,
Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Michal Hocko <mhocko@kernel.org>,
Pavel Tatashin <pasha.tatashin@soleen.com>,
Oscar Salvador <osalvador@suse.de>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Michal Hocko <mhocko@suse.com>
Subject: Re: [PATCH v3 7/7] mm, page_alloc: disable pcplists during memory offline
Date: Wed, 11 Nov 2020 18:58:35 +0100
Message-ID: <6fdaaeeb-154b-5de1-3008-e56a8be53a5a@redhat.com>
In-Reply-To: <20201111092812.11329-8-vbabka@suse.cz>
On 11.11.20 10:28, Vlastimil Babka wrote:
> Memory offlining relies on page isolation to guarantee forward
> progress because pages cannot be reused while they are isolated. But the
> page isolation itself doesn't prevent races while freed pages are
> stored on pcp lists and thus can be reused. This can be worked around by
> repeated draining of pcplists, as done by commit 968318261221
> ("mm/memory_hotplug: drain per-cpu pages again during memory offline").
>
> David and Michal would prefer that this race was closed in a way that callers
> of page isolation who need stronger guarantees don't need to repeatedly drain.
> David suggested disabling pcplists usage completely during page isolation,
> instead of repeatedly draining them.
>
> To achieve this without adding special cases in alloc/free fastpath, we can use
> the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
> will be immediately flushed.
>
> The race can thus be closed by setting pcp->high to 0 and draining pcplists
> once, before calling start_isolate_page_range(). The draining will serialize
> after processes that already disabled interrupts and read the old value of
> pcp->high in free_unref_page_commit(). Processes that have not yet disabled
> interrupts will observe pcp->high == 0 when they are rescheduled, and skip
> pcplists. This guarantees no stray pages on pcplists in zones where isolation
> happens.
>
> This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions that
> page isolation users can call before start_isolate_page_range() and after
> unisolating (or offlining) the isolated pages.
>
> Also, drain_all_pages() is optimized to only execute on cpus where pcplists are
> not empty. The check can however race with a free to pcplist that has not yet
> increased the pcp->count from 0 to 1. Thus make the drain optionally skip the
> racy check and drain on all cpus, and use this option in zone_pcp_disable().
>
> As we have to avoid external updates to high and batch while pcplists are
> disabled, we take pcp_batch_high_lock in zone_pcp_disable() and release it in
> zone_pcp_enable(). This also synchronizes multiple users of
> zone_pcp_disable()/enable().
>
> Currently the only user of this functionality is offline_pages().
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> mm/internal.h | 2 ++
> mm/memory_hotplug.c | 28 ++++++++----------
> mm/page_alloc.c | 69 +++++++++++++++++++++++++++++++++++----------
> mm/page_isolation.c | 6 ++--
> 4 files changed, 71 insertions(+), 34 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index c43ccdddb0f6..2966496680bc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -201,6 +201,8 @@ extern int user_min_free_kbytes;
>
> extern void zone_pcp_update(struct zone *zone);
> extern void zone_pcp_reset(struct zone *zone);
> +extern void zone_pcp_disable(struct zone *zone);
> +extern void zone_pcp_enable(struct zone *zone);
>
> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 3c494ab0d075..e0a561c550b3 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1491,17 +1491,21 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> }
> node = zone_to_nid(zone);
>
> + /*
> + * Disable pcplists so that page isolation cannot race with freeing
> + * in a way that pages from isolated pageblock are left on pcplists.
> + */
> + zone_pcp_disable(zone);
> +
> /* set above range as isolated */
> ret = start_isolate_page_range(start_pfn, end_pfn,
> MIGRATE_MOVABLE,
> MEMORY_OFFLINE | REPORT_FAILURE);
> if (ret) {
> reason = "failure to isolate range";
> - goto failed_removal;
> + goto failed_removal_pcplists_disabled;
> }
>
> - drain_all_pages(zone);
> -
> arg.start_pfn = start_pfn;
> arg.nr_pages = nr_pages;
> node_states_check_changes_offline(nr_pages, zone, &arg);
> @@ -1551,20 +1555,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> goto failed_removal_isolated;
> }
>
> - /*
> - * per-cpu pages are drained after start_isolate_page_range, but
> - * if there are still pages that are not free, make sure that we
> - * drain again, because when we isolated range we might have
> - * raced with another thread that was adding pages to pcp list.
> - *
> - * Forward progress should be still guaranteed because
> - * pages on the pcp list can only belong to MOVABLE_ZONE
> - * because has_unmovable_pages explicitly checks for
> - * PageBuddy on freed pages on other zones.
> - */
> ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
> - if (ret)
> - drain_all_pages(zone);
> +
Why are there two empty lines before the "} while (ret);"? (unless I'm
confused while looking at this diff)
[...]
> +void __drain_all_pages(struct zone *zone, bool force_all_cpus)
> {
> int cpu;
>
> @@ -3076,7 +3069,13 @@ void drain_all_pages(struct zone *zone)
> struct zone *z;
> bool has_pcps = false;
>
> - if (zone) {
> + if (force_all_cpus) {
> + /*
> + * The pcp.count check is racy, some callers need a
> + * guarantee that no cpu is missed.
While this comment is helpful, it doesn't tell the whole story. Who
exactly needs that guarantee, and in which situations?
> + */
> + has_pcps = true;
> + } else if (zone) {
> pcp = per_cpu_ptr(zone->pageset, cpu);
> if (pcp->pcp.count)
> has_pcps = true;
> @@ -3109,6 +3108,18 @@ void drain_all_pages(struct zone *zone)
> mutex_unlock(&pcpu_drain_mutex);
> }
>
> +/*
> + * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
> + *
> + * When zone parameter is non-NULL, spill just the single zone's pages.
> + *
> + * Note that this can be extremely slow as the draining happens in a workqueue.
> + */
> +void drain_all_pages(struct zone *zone)
> +{
> + __drain_all_pages(zone, false);
It's still somewhat unclear to me why we don't need "force_all_cpus"
here. Can you clarify that? (e.g., add a comment somewhere?)
[...]
> +void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
> + unsigned long batch)
> +{
> + struct per_cpu_pageset *p;
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + p = per_cpu_ptr(zone->pageset, cpu);
> + pageset_update(&p->pcp, high, batch);
> + }
> +}
> +
> /*
> * Calculate and set new high and batch values for all per-cpu pagesets of a
> * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
> @@ -6315,8 +6338,6 @@ static void pageset_init(struct per_cpu_pageset *p)
> static void zone_set_pageset_high_and_batch(struct zone *zone)
> {
> unsigned long new_high, new_batch;
> - struct per_cpu_pageset *p;
> - int cpu;
>
> if (percpu_pagelist_fraction) {
> new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
> @@ -6336,10 +6357,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
> zone->pageset_high = new_high;
> zone->pageset_batch = new_batch;
>
> - for_each_possible_cpu(cpu) {
> - p = per_cpu_ptr(zone->pageset, cpu);
> - pageset_update(&p->pcp, new_high, new_batch);
> - }
> + __zone_set_pageset_high_and_batch(zone, new_high, new_batch);
> }
These two hunks look like an unrelated cleanup, or am I missing something?
Thanks for looking into this!
--
Thanks,
David / dhildenb