From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 8 Oct 2020 14:45:34 +0200
From: Michal Hocko <mhocko@suse.com>
To: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Pavel Tatashin <pasha.tatashin@soleen.com>,
	David Hildenbrand <david@redhat.com>,
	Oscar Salvador <osalvador@suse.de>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>
Subject: Re: [PATCH v2 7/7] mm, page_alloc: disable pcplists during memory
 offline
Message-ID: <20201008124534.GD4967@dhcp22.suse.cz>
References: <20201008114201.18824-1-vbabka@suse.cz>
 <20201008114201.18824-8-vbabka@suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20201008114201.18824-8-vbabka@suse.cz>
List-ID: <linux-mm.kvack.org>

On Thu 08-10-20 13:42:01, Vlastimil Babka wrote:
> Memory offline relies on page isolation can race with process freeing pages to
> pcplists in a way that a page from isolated pageblock can end up on pcplist.

"Memory offlining relies on page isolation to guarantee forward
progress because pages cannot be reused while they are isolated. But
page isolation itself doesn't prevent races in which freed pages are
stored on pcp lists and thus can be reused."

> This can be worked around by repeated draining of pcplists, as done by commit
> 968318261221 ("mm/memory_hotplug: drain per-cpu pages again during memory
> offline").
> 
> David and Michal would prefer that this race was closed in a way that callers
> of page isolation who need stronger guarantees don't need to repeatedly drain.
> David suggested disabling pcplists usage completely during page isolation,
> instead of repeatedly draining them.
> 
> To achieve this without adding special cases in alloc/free fastpath, we can use
> the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
> will be immediately flushed.
> 
> The race can thus be closed by setting pcp->high to 0 and draining pcplists
> once, before calling start_isolate_page_range(). The draining will serialize
> after processes that already disabled interrupts and read the old value of
> pcp->high in free_unref_page_commit(), and processes that have not yet disabled
> interrupts, will observe pcp->high == 0 when they are rescheduled, and skip
> pcplists. This guarantees no stray pages on pcplists in zones where isolation
> happens.
> 
> This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions that
> page isolation users can call before start_isolate_page_range() and after
> unisolating (or offlining) the isolated pages.
> 
> Also, drain_all_pages() is optimized to only execute on cpus where pcplists are
> not empty. The check can however race with a free to pcplist that has not yet
> increased the pcp->count from 0 to 1. Thus make the drain optionally skip the
> racy check and drain on all cpus, and use this option in zone_pcp_disable().
> 
> As we have to avoid external updates to high and batch while pcplists are
> disabled, we take pcp_batch_high_lock in zone_pcp_disable() and release it in
> zone_pcp_enable(). This also synchronizes multiple users of
> zone_pcp_disable()/enable().
> 
> Currently the only user of this functionality is offline_pages().

Thanks for simplifying the implementation!
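
For anyone following along, the pcp->high == 0 trick described above can
be modelled in a few lines of userspace C. This is only a sketch under
loud assumptions: every toy_* name is made up for illustration, there is
a single "CPU", and there is no locking or interrupt handling at all. It
only shows why an add-then-compare free path flushes immediately once
high is 0, which is what lets a single drain be sufficient.

```c
/* Userspace toy model, NOT kernel code. All toy_* names are hypothetical. */
#include <assert.h>

struct toy_pcp {
	int count;	/* pages currently parked on the per-cpu list */
	int high;	/* flush threshold; 0 effectively disables the list */
	int batch;	/* pages handed back to the buddy side per flush */
};

struct toy_zone {
	struct toy_pcp pcp;	/* one CPU only, for simplicity */
	int buddy_free;		/* pages held by the toy "buddy allocator" */
	int saved_high;		/* stands in for zone->pageset_high */
	int saved_batch;	/* stands in for zone->pageset_batch */
};

/* Move up to nr pages from the per-cpu list back to the buddy side. */
static void toy_flush(struct toy_zone *z, int nr)
{
	if (nr > z->pcp.count)
		nr = z->pcp.count;
	z->pcp.count -= nr;
	z->buddy_free += nr;
}

/* Loosely shaped like the free fastpath: add first, then compare to high. */
static void toy_free_page(struct toy_zone *z)
{
	z->pcp.count++;
	if (z->pcp.count >= z->pcp.high)
		toy_flush(z, z->pcp.batch);
}

/* Set high to 0 and batch to 1, then drain whatever is already parked. */
static void toy_pcp_disable(struct toy_zone *z)
{
	z->pcp.high = 0;
	z->pcp.batch = 1;
	toy_flush(z, z->pcp.count);
	assert(z->pcp.count == 0);
}

/* Restore the values that were in effect before the disable. */
static void toy_pcp_enable(struct toy_zone *z)
{
	z->pcp.high = z->saved_high;
	z->pcp.batch = z->saved_batch;
}
```

With high = 6 and batch = 2, three frees stay parked on the list. After
toy_pcp_disable() the list is empty, and every later free goes straight
to the buddy side until toy_pcp_enable() restores the saved values.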

> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>

Btw. I suspect this functionality might come in handy for hwpoisoning as
well. I haven't gotten around to checking the current state of the
implementation, but I believe they would appreciate a guarantee that
pages are not freed into pcp lists as well. Oscar will surely know better
though.
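
Whatever a second user ends up looking like, the important part is that
disable and enable stay paired on every exit path, the same way the new
failed_removal_pcplists_disabled label does it in offline_pages(). Below
is a rough userspace sketch of that caller shape only. It assumes nothing
about the actual hwpoison code: all sketch_* names are hypothetical, and
the counters merely stand in for the real pcplist and isolation state.

```c
/* Userspace sketch of the caller pattern, NOT kernel code. */
#include <errno.h>
#include <stdbool.h>

struct sketch_state {
	int pcp_disabled;	/* >0 while pcplists are "disabled" */
	int isolated;		/* >0 while the range is "isolated" */
};

static void sketch_pcp_disable(struct sketch_state *s) { s->pcp_disabled++; }
static void sketch_pcp_enable(struct sketch_state *s)  { s->pcp_disabled--; }

/* Stand-in for isolating a range; fails with -EBUSY when asked to. */
static int sketch_isolate(struct sketch_state *s, bool fail)
{
	if (fail)
		return -EBUSY;
	s->isolated++;
	return 0;
}

static void sketch_unisolate(struct sketch_state *s) { s->isolated--; }

/*
 * Disable pcplists before isolating, enable them again only after the
 * range has been unisolated, and unwind in reverse order on failure.
 */
static int sketch_handle_range(struct sketch_state *s, bool isolate_fails,
			       bool work_fails)
{
	int ret;

	sketch_pcp_disable(s);

	ret = sketch_isolate(s, isolate_fails);
	if (ret)
		goto out_enable;

	/* Stand-in for the real work done on the isolated range. */
	ret = work_fails ? -EBUSY : 0;

	sketch_unisolate(s);
out_enable:
	sketch_pcp_enable(s);
	return ret;
}
```

Both counters are back at zero after every call, whether isolation
fails, the later work fails, or everything succeeds. The real
offline_pages() differs on success, where the isolated pages are
offlined rather than unisolated before zone_pcp_enable() is called.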

Thanks!
> ---
>  mm/internal.h       |  2 ++
>  mm/memory_hotplug.c | 28 ++++++++----------
>  mm/page_alloc.c     | 69 +++++++++++++++++++++++++++++++++++----------
>  mm/page_isolation.c |  6 ++--
>  4 files changed, 71 insertions(+), 34 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index c43ccdddb0f6..2966496680bc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -201,6 +201,8 @@ extern int user_min_free_kbytes;
>  
>  extern void zone_pcp_update(struct zone *zone);
>  extern void zone_pcp_reset(struct zone *zone);
> +extern void zone_pcp_disable(struct zone *zone);
> +extern void zone_pcp_enable(struct zone *zone);
>  
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>  
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2e6ad899c55e..4382b585c76c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1510,17 +1510,21 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	}
>  	node = zone_to_nid(zone);
>  
> +	/*
> +	 * Disable pcplists so that page isolation cannot race with freeing
> +	 * in a way that pages from isolated pageblock are left on pcplists.
> +	 */
> +	zone_pcp_disable(zone);
> +
>  	/* set above range as isolated */
>  	ret = start_isolate_page_range(start_pfn, end_pfn,
>  				       MIGRATE_MOVABLE,
>  				       MEMORY_OFFLINE | REPORT_FAILURE);
>  	if (ret) {
>  		reason = "failure to isolate range";
> -		goto failed_removal;
> +		goto failed_removal_pcplists_disabled;
>  	}
>  
> -	drain_all_pages(zone);
> -
>  	arg.start_pfn = start_pfn;
>  	arg.nr_pages = nr_pages;
>  	node_states_check_changes_offline(nr_pages, zone, &arg);
> @@ -1570,20 +1574,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  			goto failed_removal_isolated;
>  		}
>  
> -		/*
> -		 * per-cpu pages are drained after start_isolate_page_range, but
> -		 * if there are still pages that are not free, make sure that we
> -		 * drain again, because when we isolated range we might have
> -		 * raced with another thread that was adding pages to pcp list.
> -		 *
> -		 * Forward progress should be still guaranteed because
> -		 * pages on the pcp list can only belong to MOVABLE_ZONE
> -		 * because has_unmovable_pages explicitly checks for
> -		 * PageBuddy on freed pages on other zones.
> -		 */
>  		ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
> -		if (ret)
> -			drain_all_pages(zone);
> +
>  	} while (ret);
>  
>  	/* Mark all sections offline and remove free pages from the buddy. */
> @@ -1599,6 +1591,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  	zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  
> +	zone_pcp_enable(zone);
> +
>  	/* removal success */
>  	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
>  	zone->present_pages -= nr_pages;
> @@ -1631,6 +1625,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  failed_removal_isolated:
>  	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>  	memory_notify(MEM_CANCEL_OFFLINE, &arg);
> +failed_removal_pcplists_disabled:
> +	zone_pcp_enable(zone);
>  failed_removal:
>  	pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
>  		 (unsigned long long) start_pfn << PAGE_SHIFT,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1f7108fe9a0b..366c516c9062 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3018,14 +3018,7 @@ static void drain_local_pages_wq(struct work_struct *work)
>  	preempt_enable();
>  }
>  
> -/*
> - * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
> - *
> - * When zone parameter is non-NULL, spill just the single zone's pages.
> - *
> - * Note that this can be extremely slow as the draining happens in a workqueue.
> - */
> -void drain_all_pages(struct zone *zone)
> +void __drain_all_pages(struct zone *zone, bool force_all_cpus)
>  {
>  	int cpu;
>  
> @@ -3064,7 +3057,13 @@ void drain_all_pages(struct zone *zone)
>  		struct zone *z;
>  		bool has_pcps = false;
>  
> -		if (zone) {
> +		if (force_all_cpus) {
> +			/*
> +			 * The pcp.count check is racy, some callers need a
> +			 * guarantee that no cpu is missed.
> +			 */
> +			has_pcps = true;
> +		} else if (zone) {
>  			pcp = per_cpu_ptr(zone->pageset, cpu);
>  			if (pcp->pcp.count)
>  				has_pcps = true;
> @@ -3097,6 +3096,18 @@ void drain_all_pages(struct zone *zone)
>  	mutex_unlock(&pcpu_drain_mutex);
>  }
>  
> +/*
> + * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
> + *
> + * When zone parameter is non-NULL, spill just the single zone's pages.
> + *
> + * Note that this can be extremely slow as the draining happens in a workqueue.
> + */
> +void drain_all_pages(struct zone *zone)
> +{
> +	__drain_all_pages(zone, false);
> +}
> +
>  #ifdef CONFIG_HIBERNATION
>  
>  /*
> @@ -6296,6 +6307,18 @@ static void pageset_init(struct per_cpu_pageset *p)
>  	pcp->batch = BOOT_PAGESET_BATCH;
>  }
>  
> +void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
> +		unsigned long batch)
> +{
> +	struct per_cpu_pageset *p;
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		p = per_cpu_ptr(zone->pageset, cpu);
> +		pageset_update(&p->pcp, high, batch);
> +	}
> +}
> +
>  /*
>   * Calculate and set new high and batch values for all per-cpu pagesets of a
>   * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
> @@ -6303,8 +6326,6 @@ static void pageset_init(struct per_cpu_pageset *p)
>  static void zone_set_pageset_high_and_batch(struct zone *zone)
>  {
>  	unsigned long new_high, new_batch;
> -	struct per_cpu_pageset *p;
> -	int cpu;
>  
>  	if (percpu_pagelist_fraction) {
>  		new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
> @@ -6325,10 +6346,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
>  		return;
>  	}
>  
> -	for_each_possible_cpu(cpu) {
> -		p = per_cpu_ptr(zone->pageset, cpu);
> -		pageset_update(&p->pcp, new_high, new_batch);
> -	}
> +	__zone_set_pageset_high_and_batch(zone, new_high, new_batch);
>  }
>  
>  void __meminit setup_zone_pageset(struct zone *zone)
> @@ -8723,6 +8741,27 @@ void __meminit zone_pcp_update(struct zone *zone)
>  	mutex_unlock(&pcp_batch_high_lock);
>  }
>  
> +/*
> + * Effectively disable pcplists for the zone by setting the high limit to 0
> + * and draining all cpus. A concurrent page freeing on another CPU that's about
> + * to put the page on pcplist will either finish before the drain and the page
> + * will be drained, or observe the new high limit and skip the pcplist.
> + *
> + * Must be paired with a call to zone_pcp_enable().
> + */
> +void zone_pcp_disable(struct zone *zone)
> +{
> +	mutex_lock(&pcp_batch_high_lock);
> +	__zone_set_pageset_high_and_batch(zone, 0, 1);
> +	__drain_all_pages(zone, true);
> +}
> +
> +void zone_pcp_enable(struct zone *zone)
> +{
> +	__zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
> +	mutex_unlock(&pcp_batch_high_lock);
> +}
> +
>  void zone_pcp_reset(struct zone *zone)
>  {
>  	unsigned long flags;
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index feab446d1982..a254e1f370a3 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -174,9 +174,9 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>   * A call to drain_all_pages() after isolation can flush most of them. However
>   * in some cases pages might still end up on pcp lists and that would allow
>   * for their allocation even when they are in fact isolated already. Depending
> - * on how strong of a guarantee the caller needs, further drain_all_pages()
> - * might be needed (e.g. __offline_pages will need to call it after check for
> - * isolated range for a next retry).
> + * on how strong of a guarantee the caller needs, zone_pcp_disable/enable()
> + * might be used to flush and disable pcplist before isolation and enable after
> + * unisolation.
>   *
>   * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
>   */
> -- 
> 2.28.0

-- 
Michal Hocko
SUSE Labs