From: Vlastimil Babka
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Michal Hocko, Pavel Tatashin,
    David Hildenbrand, Oscar Salvador, Joonsoo Kim, Vlastimil Babka,
    Michal Hocko
Subject: [PATCH v2 7/7] mm, page_alloc: disable pcplists during memory offline
Date: Thu, 8 Oct 2020 13:42:01 +0200
Message-Id: <20201008114201.18824-8-vbabka@suse.cz>
In-Reply-To: <20201008114201.18824-1-vbabka@suse.cz>
References: <20201008114201.18824-1-vbabka@suse.cz>

Memory offline relies on page isolation, which can race with processes freeing
pages to pcplists in a way that a page from an isolated pageblock can end up
on a pcplist. This can be worked around by repeated draining of pcplists, as
done by commit 968318261221 ("mm/memory_hotplug: drain per-cpu pages again
during memory offline").
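
As a simplified sketch (illustration only; it condenses the offline_pages()
retry code that this patch removes below), the workaround amounts to:

        do {
                /* ... migrate any still-used pages out of the range ... */
                ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
                /*
                 * A racing free may have left a page from the isolated
                 * range on a pcplist, so drain and check again.
                 */
                if (ret)
                        drain_all_pages(zone);
        } while (ret);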

David and Michal would prefer that this race be closed in a way that callers
of page isolation who need stronger guarantees don't need to repeatedly drain.
David suggested disabling pcplists usage completely during page isolation,
instead of repeatedly draining them.

To achieve this without adding special cases in alloc/free fastpath, we can
use the same approach as boot pagesets - when pcp->high is 0, any pcplist
addition will be immediately flushed.

The race can thus be closed by setting pcp->high to 0 and draining pcplists
once, before calling start_isolate_page_range(). The draining will serialize
after processes that have already disabled interrupts and read the old value
of pcp->high in free_unref_page_commit(); processes that have not yet disabled
interrupts will observe pcp->high == 0 when they are rescheduled and will skip
pcplists. This guarantees no stray pages on pcplists in zones where isolation
happens.

This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions that
page isolation users can call before start_isolate_page_range() and after
unisolating (or offlining) the isolated pages.

Also, drain_all_pages() is optimized to only execute on cpus where pcplists
are not empty. The check can however race with a free to pcplist that has not
yet increased the pcp->count from 0 to 1. Thus make the drain optionally skip
the racy check and drain on all cpus, and use this option in
zone_pcp_disable().

As we have to avoid external updates to high and batch while pcplists are
disabled, we take pcp_batch_high_lock in zone_pcp_disable() and release it in
zone_pcp_enable(). This also synchronizes multiple users of
zone_pcp_disable()/enable().

Currently the only user of this functionality is offline_pages().

Suggested-by: David Hildenbrand
Suggested-by: Michal Hocko
Signed-off-by: Vlastimil Babka
---
 mm/internal.h       |  2 ++
 mm/memory_hotplug.c | 28 ++++++++----------
 mm/page_alloc.c     | 69 +++++++++++++++++++++++++++++++++++----------
 mm/page_isolation.c |  6 ++--
 4 files changed, 71 insertions(+), 34 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index c43ccdddb0f6..2966496680bc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -201,6 +201,8 @@ extern int user_min_free_kbytes;
 
 extern void zone_pcp_update(struct zone *zone);
 extern void zone_pcp_reset(struct zone *zone);
+extern void zone_pcp_disable(struct zone *zone);
+extern void zone_pcp_enable(struct zone *zone);
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2e6ad899c55e..4382b585c76c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1510,17 +1510,21 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
         }
         node = zone_to_nid(zone);
 
+        /*
+         * Disable pcplists so that page isolation cannot race with freeing
+         * in a way that pages from isolated pageblock are left on pcplists.
+         */
+        zone_pcp_disable(zone);
+
         /* set above range as isolated */
         ret = start_isolate_page_range(start_pfn, end_pfn,
                                        MIGRATE_MOVABLE,
                                        MEMORY_OFFLINE | REPORT_FAILURE);
         if (ret) {
                 reason = "failure to isolate range";
-                goto failed_removal;
+                goto failed_removal_pcplists_disabled;
         }
 
-        drain_all_pages(zone);
-
         arg.start_pfn = start_pfn;
         arg.nr_pages = nr_pages;
         node_states_check_changes_offline(nr_pages, zone, &arg);
@@ -1570,20 +1574,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
                         goto failed_removal_isolated;
                 }
 
-                /*
-                 * per-cpu pages are drained after start_isolate_page_range, but
-                 * if there are still pages that are not free, make sure that we
-                 * drain again, because when we isolated range we might have
-                 * raced with another thread that was adding pages to pcp list.
-                 *
-                 * Forward progress should be still guaranteed because
-                 * pages on the pcp list can only belong to MOVABLE_ZONE
-                 * because has_unmovable_pages explicitly checks for
-                 * PageBuddy on freed pages on other zones.
-                 */
                 ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
-                if (ret)
-                        drain_all_pages(zone);
+
         } while (ret);
 
         /* Mark all sections offline and remove free pages from the buddy. */
@@ -1599,6 +1591,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
         zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
         spin_unlock_irqrestore(&zone->lock, flags);
 
+        zone_pcp_enable(zone);
+
         /* removal success */
         adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
         zone->present_pages -= nr_pages;
@@ -1631,6 +1625,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 failed_removal_isolated:
         undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
         memory_notify(MEM_CANCEL_OFFLINE, &arg);
+failed_removal_pcplists_disabled:
+        zone_pcp_enable(zone);
 failed_removal:
         pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
                  (unsigned long long) start_pfn << PAGE_SHIFT,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1f7108fe9a0b..366c516c9062 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3018,14 +3018,7 @@ static void drain_local_pages_wq(struct work_struct *work)
         preempt_enable();
 }
 
-/*
- * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
- *
- * When zone parameter is non-NULL, spill just the single zone's pages.
- *
- * Note that this can be extremely slow as the draining happens in a workqueue.
- */
-void drain_all_pages(struct zone *zone)
+void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 {
         int cpu;
 
@@ -3064,7 +3057,13 @@ void drain_all_pages(struct zone *zone)
                 struct zone *z;
                 bool has_pcps = false;
 
-                if (zone) {
+                if (force_all_cpus) {
+                        /*
+                         * The pcp.count check is racy, some callers need a
+                         * guarantee that no cpu is missed.
+                         */
+                        has_pcps = true;
+                } else if (zone) {
                         pcp = per_cpu_ptr(zone->pageset, cpu);
                         if (pcp->pcp.count)
                                 has_pcps = true;
@@ -3097,6 +3096,18 @@ void drain_all_pages(struct zone *zone)
         mutex_unlock(&pcpu_drain_mutex);
 }
 
+/*
+ * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
+ *
+ * When zone parameter is non-NULL, spill just the single zone's pages.
+ *
+ * Note that this can be extremely slow as the draining happens in a workqueue.
+ */
+void drain_all_pages(struct zone *zone)
+{
+        __drain_all_pages(zone, false);
+}
+
 #ifdef CONFIG_HIBERNATION
 
 /*
@@ -6296,6 +6307,18 @@ static void pageset_init(struct per_cpu_pageset *p)
         pcp->batch = BOOT_PAGESET_BATCH;
 }
 
+void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
+                unsigned long batch)
+{
+        struct per_cpu_pageset *p;
+        int cpu;
+
+        for_each_possible_cpu(cpu) {
+                p = per_cpu_ptr(zone->pageset, cpu);
+                pageset_update(&p->pcp, high, batch);
+        }
+}
+
 /*
  * Calculate and set new high and batch values for all per-cpu pagesets of a
  * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
@@ -6303,8 +6326,6 @@ static void pageset_init(struct per_cpu_pageset *p)
 static void zone_set_pageset_high_and_batch(struct zone *zone)
 {
         unsigned long new_high, new_batch;
-        struct per_cpu_pageset *p;
-        int cpu;
 
         if (percpu_pagelist_fraction) {
                 new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
@@ -6325,10 +6346,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
                 return;
         }
 
-        for_each_possible_cpu(cpu) {
-                p = per_cpu_ptr(zone->pageset, cpu);
-                pageset_update(&p->pcp, new_high, new_batch);
-        }
+        __zone_set_pageset_high_and_batch(zone, new_high, new_batch);
 }
 
 void __meminit setup_zone_pageset(struct zone *zone)
@@ -8723,6 +8741,27 @@ void __meminit zone_pcp_update(struct zone *zone)
         mutex_unlock(&pcp_batch_high_lock);
 }
 
+/*
+ * Effectively disable pcplists for the zone by setting the high limit to 0
+ * and draining all cpus. A concurrent page freeing on another CPU that's about
+ * to put the page on pcplist will either finish before the drain and the page
+ * will be drained, or observe the new high limit and skip the pcplist.
+ *
+ * Must be paired with a call to zone_pcp_enable().
+ */
+void zone_pcp_disable(struct zone *zone)
+{
+        mutex_lock(&pcp_batch_high_lock);
+        __zone_set_pageset_high_and_batch(zone, 0, 1);
+        __drain_all_pages(zone, true);
+}
+
+void zone_pcp_enable(struct zone *zone)
+{
+        __zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+        mutex_unlock(&pcp_batch_high_lock);
+}
+
 void zone_pcp_reset(struct zone *zone)
 {
         unsigned long flags;
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index feab446d1982..a254e1f370a3 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -174,9 +174,9 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * A call to drain_all_pages() after isolation can flush most of them. However
  * in some cases pages might still end up on pcp lists and that would allow
  * for their allocation even when they are in fact isolated already. Depending
- * on how strong of a guarantee the caller needs, further drain_all_pages()
- * might be needed (e.g. __offline_pages will need to call it after check for
- * isolated range for a next retry).
+ * on how strong of a guarantee the caller needs, zone_pcp_disable/enable()
+ * might be used to flush and disable pcplist before isolation and enable after
+ * unisolation.
  *
  * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
  */
-- 
2.28.0
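
A hypothetical usage sketch (not part of the patch): how another mm-internal
caller of page isolation would be expected to pair the new helpers with
start_isolate_page_range(), mirroring the offline_pages() changes above. The
function name isolate_range_pcp_disabled() is made up for illustration:

static int isolate_range_pcp_disabled(unsigned long start_pfn,
                                      unsigned long end_pfn)
{
        struct zone *zone = page_zone(pfn_to_page(start_pfn));
        int ret;

        /* Sets pcp->high to 0 and drains once; takes pcp_batch_high_lock. */
        zone_pcp_disable(zone);

        ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
                                       MEMORY_OFFLINE | REPORT_FAILURE);
        if (ret)
                goto out;

        /*
         * Work on the isolated range here. With pcp->high == 0, racing
         * frees bypass pcplists, so no stray pages from the isolated
         * pageblocks can linger there.
         */

        undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
out:
        /* Restores the saved pageset_high/batch and drops the lock. */
        zone_pcp_enable(zone);
        return ret;
}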