From: David Hildenbrand <david@redhat.com>
To: Vlastimil Babka <vbabka@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Michal Hocko <mhocko@kernel.org>,
	Pavel Tatashin <pasha.tatashin@soleen.com>,
	Oscar Salvador <osalvador@suse.de>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Michal Hocko <mhocko@suse.com>
Subject: Re: [PATCH v3 7/7] mm, page_alloc: disable pcplists during memory offline
Date: Wed, 11 Nov 2020 18:58:35 +0100
Message-ID: <6fdaaeeb-154b-5de1-3008-e56a8be53a5a@redhat.com>
In-Reply-To: <20201111092812.11329-8-vbabka@suse.cz>

On 11.11.20 10:28, Vlastimil Babka wrote:
> Memory offlining relies on page isolation to guarantee a forward
> progress because pages cannot be reused while they are isolated. But the
> page isolation itself doesn't prevent from races while freed pages are
> stored on pcp lists and thus can be reused.  This can be worked around by
> repeated draining of pcplists, as done by commit 968318261221
> ("mm/memory_hotplug: drain per-cpu pages again during memory offline").
> 
> David and Michal would prefer that this race was closed in a way that callers
> of page isolation who need stronger guarantees don't need to repeatedly drain.
> David suggested disabling pcplists usage completely during page isolation,
> instead of repeatedly draining them.
> 
> To achieve this without adding special cases in alloc/free fastpath, we can use
> the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
> will be immediately flushed.
> 
> The race can thus be closed by setting pcp->high to 0 and draining pcplists
> once, before calling start_isolate_page_range(). The draining will serialize
> after processes that already disabled interrupts and read the old value of
> pcp->high in free_unref_page_commit(), and processes that have not yet disabled
> interrupts, will observe pcp->high == 0 when they are rescheduled, and skip
> pcplists. This guarantees no stray pages on pcplists in zones where isolation
> happens.
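
To make the pcp->high == 0 mechanism from the quoted changelog concrete, here is a minimal user-space sketch. It is not the kernel code: struct sketch_pcp, struct sketch_zone and every sketch_* function are hypothetical stand-ins, there is no locking or interrupt handling, and setting batch to 1 while disabled is an assumption made for the illustration.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for a per-cpu page list. */
struct sketch_pcp {
	int count;	/* pages currently held on the per-cpu list */
	int high;	/* flush threshold; 0 means the list is effectively disabled */
	int batch;	/* pages handed back to the buddy allocator per flush */
};

/* Hypothetical stand-in for a zone: only counts pages in the buddy allocator. */
struct sketch_zone {
	int buddy_free;
};

/* Move up to 'nr' pages from the per-cpu list back to the buddy allocator. */
static void sketch_free_pcppages_bulk(struct sketch_zone *zone,
				      struct sketch_pcp *pcp, int nr)
{
	if (nr > pcp->count)
		nr = pcp->count;
	pcp->count -= nr;
	zone->buddy_free += nr;
}

/* Free one page through the per-cpu list, flushing once count reaches high. */
static void sketch_free_page(struct sketch_zone *zone, struct sketch_pcp *pcp)
{
	pcp->count++;
	if (pcp->count >= pcp->high)
		sketch_free_pcppages_bulk(zone, pcp, pcp->batch);
}

/* Disable the list: set high to 0, then drain once what is already there. */
static void sketch_pcp_disable(struct sketch_zone *zone, struct sketch_pcp *pcp)
{
	pcp->high = 0;
	pcp->batch = 1;	/* assumption: flush one page at a time while disabled */
	sketch_free_pcppages_bulk(zone, pcp, pcp->count);
	assert(pcp->count == 0);
}
```

With high at 0, the "count >= high" test is true for every freed page, so each page goes straight back to the buddy allocator and nothing lingers on the per-cpu list after the single drain.
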
> 
> This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions that
> page isolation users can call before start_isolate_page_range() and after
> unisolating (or offlining) the isolated pages.
> 
> Also, drain_all_pages() is optimized to only execute on cpus where pcplists are
> not empty. The check can however race with a free to pcplist that has not yet
> increased the pcp->count from 0 to 1. Thus make the drain optionally skip the
> racy check and drain on all cpus, and use this option in zone_pcp_disable().
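
The effect of the optional "drain on all cpus" mode described in the quoted paragraph above can be sketched in user space as follows. This is only an illustration of the selection logic: SKETCH_NR_CPUS, the pcp_count array and sketch_select_cpus() are hypothetical names, and the sketch models neither the race itself nor the workqueue that does the actual draining.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS 4	/* hypothetical fixed CPU count for the sketch */

/*
 * Decide which CPUs get a drain scheduled and return how many were picked.
 * pcp_count[cpu] is a snapshot of that CPU's per-cpu list length; in the
 * real allocator this snapshot can be stale, which is why a caller that
 * needs a guarantee passes force_all_cpus = true to bypass the check.
 */
static int sketch_select_cpus(const int pcp_count[SKETCH_NR_CPUS],
			      bool force_all_cpus,
			      bool selected[SKETCH_NR_CPUS])
{
	int cpu, n = 0;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		bool has_pcps = false;

		if (force_all_cpus)
			has_pcps = true;	/* skip the racy check entirely */
		else if (pcp_count[cpu])
			has_pcps = true;	/* optimization: only non-empty lists */

		selected[cpu] = has_pcps;
		if (has_pcps)
			n++;
	}
	assert(n <= SKETCH_NR_CPUS);
	return n;
}
```

In the unforced mode a CPU whose snapshot reads 0 is skipped, even if a free on that CPU is just about to raise the count to 1; the forced mode trades the optimization for the guarantee that no CPU is missed.
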
> 
> As we have to avoid external updates to high and batch while pcplists are
> disabled, we take pcp_batch_high_lock in zone_pcp_disable() and release it in
> zone_pcp_enable(). This also synchronizes multiple users of
> zone_pcp_disable()/enable().
> 
> Currently the only user of this functionality is offline_pages().
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>   mm/internal.h       |  2 ++
>   mm/memory_hotplug.c | 28 ++++++++----------
>   mm/page_alloc.c     | 69 +++++++++++++++++++++++++++++++++++----------
>   mm/page_isolation.c |  6 ++--
>   4 files changed, 71 insertions(+), 34 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index c43ccdddb0f6..2966496680bc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -201,6 +201,8 @@ extern int user_min_free_kbytes;
>   
>   extern void zone_pcp_update(struct zone *zone);
>   extern void zone_pcp_reset(struct zone *zone);
> +extern void zone_pcp_disable(struct zone *zone);
> +extern void zone_pcp_enable(struct zone *zone);
>   
>   #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>   
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 3c494ab0d075..e0a561c550b3 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1491,17 +1491,21 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   	}
>   	node = zone_to_nid(zone);
>   
> +	/*
> +	 * Disable pcplists so that page isolation cannot race with freeing
> +	 * in a way that pages from isolated pageblock are left on pcplists.
> +	 */
> +	zone_pcp_disable(zone);
> +
>   	/* set above range as isolated */
>   	ret = start_isolate_page_range(start_pfn, end_pfn,
>   				       MIGRATE_MOVABLE,
>   				       MEMORY_OFFLINE | REPORT_FAILURE);
>   	if (ret) {
>   		reason = "failure to isolate range";
> -		goto failed_removal;
> +		goto failed_removal_pcplists_disabled;
>   	}
>   
> -	drain_all_pages(zone);
> -
>   	arg.start_pfn = start_pfn;
>   	arg.nr_pages = nr_pages;
>   	node_states_check_changes_offline(nr_pages, zone, &arg);
> @@ -1551,20 +1555,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>   			goto failed_removal_isolated;
>   		}
>   
> -		/*
> -		 * per-cpu pages are drained after start_isolate_page_range, but
> -		 * if there are still pages that are not free, make sure that we
> -		 * drain again, because when we isolated range we might have
> -		 * raced with another thread that was adding pages to pcp list.
> -		 *
> -		 * Forward progress should be still guaranteed because
> -		 * pages on the pcp list can only belong to MOVABLE_ZONE
> -		 * because has_unmovable_pages explicitly checks for
> -		 * PageBuddy on freed pages on other zones.
> -		 */
>   		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
> -		if (ret)
> -			drain_all_pages(zone);
> +

Why are there two empty lines before the "} while (ret);"? (Unless I'm
misreading the diff.)

[...]

> +void __drain_all_pages(struct zone *zone, bool force_all_cpus)
>   {
>   	int cpu;
>   
> @@ -3076,7 +3069,13 @@ void drain_all_pages(struct zone *zone)
>   		struct zone *z;
>   		bool has_pcps = false;
>   
> -		if (zone) {
> +		if (force_all_cpus) {
> +			/*
> +			 * The pcp.count check is racy, some callers need a
> +			 * guarantee that no cpu is missed.

How is this comment helpful? It doesn't tell the whole story: which
callers exactly, and in which situations?

> +			 */
> +			has_pcps = true;
> +		} else if (zone) {
>   			pcp = per_cpu_ptr(zone->pageset, cpu);
>   			if (pcp->pcp.count)
>   				has_pcps = true;
> @@ -3109,6 +3108,18 @@ void drain_all_pages(struct zone *zone)
>   	mutex_unlock(&pcpu_drain_mutex);
>   }
>   
> +/*
> + * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
> + *
> + * When zone parameter is non-NULL, spill just the single zone's pages.
> + *
> + * Note that this can be extremely slow as the draining happens in a workqueue.
> + */
> +void drain_all_pages(struct zone *zone)
> +{
> +	__drain_all_pages(zone, false);

It's still somewhat unclear to me why we don't need "force_all_cpus" 
here. Can you clarify that? (e.g., add a comment somewhere?)

[...]

> +void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
> +		unsigned long batch)
> +{
> +	struct per_cpu_pageset *p;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		p = per_cpu_ptr(zone->pageset, cpu);
> +		pageset_update(&p->pcp, high, batch);
> +	}
> +}
> +
>   /*
>    * Calculate and set new high and batch values for all per-cpu pagesets of a
>    * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
> @@ -6315,8 +6338,6 @@ static void pageset_init(struct per_cpu_pageset *p)
>   static void zone_set_pageset_high_and_batch(struct zone *zone)
>   {
>   	unsigned long new_high, new_batch;
> -	struct per_cpu_pageset *p;
> -	int cpu;
>   
>   	if (percpu_pagelist_fraction) {
>   		new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
> @@ -6336,10 +6357,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
>   	zone->pageset_high = new_high;
>   	zone->pageset_batch = new_batch;
>   
> -	for_each_possible_cpu(cpu) {
> -		p = per_cpu_ptr(zone->pageset, cpu);
> -		pageset_update(&p->pcp, new_high, new_batch);
> -	}
> +	__zone_set_pageset_high_and_batch(zone, new_high, new_batch);
>   }

These two hunks look like an unrelated cleanup, or am I missing something?

Thanks for looking into this!

-- 
Thanks,

David / dhildenb

