From: Vlastimil Babka
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Michal Hocko, Pavel Tatashin,
    David Hildenbrand, Oscar Salvador, Joonsoo Kim, Vlastimil Babka,
    Michal Hocko
Subject: [PATCH v2 7/7] mm, page_alloc: disable pcplists during memory offline
Date: Thu, 8 Oct 2020 13:42:01 +0200
Message-Id: <20201008114201.18824-8-vbabka@suse.cz>
In-Reply-To: <20201008114201.18824-1-vbabka@suse.cz>
References: <20201008114201.18824-1-vbabka@suse.cz>

Memory offline relies on page isolation, which can race with processes freeing
pages to pcplists in a way that a page from an isolated pageblock can end up
on a pcplist. This can be worked around by repeated draining of pcplists, as
done by commit 968318261221 ("mm/memory_hotplug: drain per-cpu pages again
during memory offline").
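
As a simplified sketch (illustration only; it condenses the offline_pages()
retry code that this patch removes below), the workaround amounts to:

        do {
                /* ... migrate any still-used pages out of the range ... */
                ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
                /*
                 * A racing free may have left a page from the isolated
                 * range on a pcplist, so drain and check again.
                 */
                if (ret)
                        drain_all_pages(zone);
        } while (ret);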

David and Michal would prefer that this race be closed in a way that callers
of page isolation who need stronger guarantees don't need to repeatedly drain.
David suggested disabling pcplists usage completely during page isolation,
instead of repeatedly draining them.

To achieve this without adding special cases in alloc/free fastpath, we can
use the same approach as boot pagesets - when pcp->high is 0, any pcplist
addition will be immediately flushed.

The race can thus be closed by setting pcp->high to 0 and draining pcplists
once, before calling start_isolate_page_range(). The draining will serialize
after processes that have already disabled interrupts and read the old value
of pcp->high in free_unref_page_commit(); processes that have not yet disabled
interrupts will observe pcp->high == 0 when they are rescheduled and will skip
pcplists. This guarantees no stray pages on pcplists in zones where isolation
happens.

This patch thus adds zone_pcp_disable() and zone_pcp_enable() functions that
page isolation users can call before start_isolate_page_range() and after
unisolating (or offlining) the isolated pages.

Also, drain_all_pages() is optimized to only execute on cpus where pcplists
are not empty. The check can however race with a free to pcplist that has not
yet increased the pcp->count from 0 to 1. Thus make the drain optionally skip
the racy check and drain on all cpus, and use this option in
zone_pcp_disable().

As we have to avoid external updates to high and batch while pcplists are
disabled, we take pcp_batch_high_lock in zone_pcp_disable() and release it in
zone_pcp_enable(). This also synchronizes multiple users of
zone_pcp_disable()/enable().

Currently the only user of this functionality is offline_pages().

Suggested-by: David Hildenbrand
Suggested-by: Michal Hocko
Signed-off-by: Vlastimil Babka
---
 mm/internal.h       |  2 ++
 mm/memory_hotplug.c | 28 ++++++++----------
 mm/page_alloc.c     | 69 +++++++++++++++++++++++++++++++++++----------
 mm/page_isolation.c |  6 ++--
 4 files changed, 71 insertions(+), 34 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index c43ccdddb0f6..2966496680bc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -201,6 +201,8 @@ extern int user_min_free_kbytes;
 
 extern void zone_pcp_update(struct zone *zone);
 extern void zone_pcp_reset(struct zone *zone);
+extern void zone_pcp_disable(struct zone *zone);
+extern void zone_pcp_enable(struct zone *zone);
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2e6ad899c55e..4382b585c76c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1510,17 +1510,21 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
         }
         node = zone_to_nid(zone);
 
+        /*
+         * Disable pcplists so that page isolation cannot race with freeing
+         * in a way that pages from isolated pageblock are left on pcplists.
+         */
+        zone_pcp_disable(zone);
+
         /* set above range as isolated */
         ret = start_isolate_page_range(start_pfn, end_pfn,
                                        MIGRATE_MOVABLE,
                                        MEMORY_OFFLINE | REPORT_FAILURE);
         if (ret) {
                 reason = "failure to isolate range";
-                goto failed_removal;
+                goto failed_removal_pcplists_disabled;
         }
 
-        drain_all_pages(zone);
-
         arg.start_pfn = start_pfn;
         arg.nr_pages = nr_pages;
         node_states_check_changes_offline(nr_pages, zone, &arg);
@@ -1570,20 +1574,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
                         goto failed_removal_isolated;
                 }
 
-                /*
-                 * per-cpu pages are drained after start_isolate_page_range, but
-                 * if there are still pages that are not free, make sure that we
-                 * drain again, because when we isolated range we might have
-                 * raced with another thread that was adding pages to pcp list.
-                 *
-                 * Forward progress should be still guaranteed because
-                 * pages on the pcp list can only belong to MOVABLE_ZONE
-                 * because has_unmovable_pages explicitly checks for
-                 * PageBuddy on freed pages on other zones.
-                 */
                 ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
-                if (ret)
-                        drain_all_pages(zone);
+
         } while (ret);
 
         /* Mark all sections offline and remove free pages from the buddy. */
@@ -1599,6 +1591,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
         zone->nr_isolate_pageblock -= nr_pages / pageblock_nr_pages;
         spin_unlock_irqrestore(&zone->lock, flags);
 
+        zone_pcp_enable(zone);
+
         /* removal success */
         adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
         zone->present_pages -= nr_pages;
@@ -1631,6 +1625,8 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 failed_removal_isolated:
         undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
         memory_notify(MEM_CANCEL_OFFLINE, &arg);
+failed_removal_pcplists_disabled:
+        zone_pcp_enable(zone);
 failed_removal:
         pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
                  (unsigned long long) start_pfn << PAGE_SHIFT,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1f7108fe9a0b..366c516c9062 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3018,14 +3018,7 @@ static void drain_local_pages_wq(struct work_struct *work)
         preempt_enable();
 }
 
-/*
- * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
- *
- * When zone parameter is non-NULL, spill just the single zone's pages.
- *
- * Note that this can be extremely slow as the draining happens in a workqueue.
- */
-void drain_all_pages(struct zone *zone)
+void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 {
         int cpu;
 
@@ -3064,7 +3057,13 @@ void drain_all_pages(struct zone *zone)
                 struct zone *z;
                 bool has_pcps = false;
 
-                if (zone) {
+                if (force_all_cpus) {
+                        /*
+                         * The pcp.count check is racy, some callers need a
+                         * guarantee that no cpu is missed.
+                         */
+                        has_pcps = true;
+                } else if (zone) {
                         pcp = per_cpu_ptr(zone->pageset, cpu);
                         if (pcp->pcp.count)
                                 has_pcps = true;
@@ -3097,6 +3096,18 @@ void drain_all_pages(struct zone *zone)
         mutex_unlock(&pcpu_drain_mutex);
 }
 
+/*
+ * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
+ *
+ * When zone parameter is non-NULL, spill just the single zone's pages.
+ *
+ * Note that this can be extremely slow as the draining happens in a workqueue.
+ */
+void drain_all_pages(struct zone *zone)
+{
+        __drain_all_pages(zone, false);
+}
+
 #ifdef CONFIG_HIBERNATION
 
 /*
@@ -6296,6 +6307,18 @@ static void pageset_init(struct per_cpu_pageset *p)
         pcp->batch = BOOT_PAGESET_BATCH;
 }
 
+void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
+                unsigned long batch)
+{
+        struct per_cpu_pageset *p;
+        int cpu;
+
+        for_each_possible_cpu(cpu) {
+                p = per_cpu_ptr(zone->pageset, cpu);
+                pageset_update(&p->pcp, high, batch);
+        }
+}
+
 /*
  * Calculate and set new high and batch values for all per-cpu pagesets of a
  * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
@@ -6303,8 +6326,6 @@ static void pageset_init(struct per_cpu_pageset *p)
 static void zone_set_pageset_high_and_batch(struct zone *zone)
 {
         unsigned long new_high, new_batch;
-        struct per_cpu_pageset *p;
-        int cpu;
 
         if (percpu_pagelist_fraction) {
                 new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
@@ -6325,10 +6346,7 @@ static void zone_set_pageset_high_and_batch(struct zone *zone)
                 return;
         }
 
-        for_each_possible_cpu(cpu) {
-                p = per_cpu_ptr(zone->pageset, cpu);
-                pageset_update(&p->pcp, new_high, new_batch);
-        }
+        __zone_set_pageset_high_and_batch(zone, new_high, new_batch);
 }
 
 void __meminit setup_zone_pageset(struct zone *zone)
@@ -8723,6 +8741,27 @@ void __meminit zone_pcp_update(struct zone *zone)
         mutex_unlock(&pcp_batch_high_lock);
 }
 
+/*
+ * Effectively disable pcplists for the zone by setting the high limit to 0
+ * and draining all cpus. A concurrent page freeing on another CPU that's about
+ * to put the page on pcplist will either finish before the drain and the page
+ * will be drained, or observe the new high limit and skip the pcplist.
+ *
+ * Must be paired with a call to zone_pcp_enable().
+ */
+void zone_pcp_disable(struct zone *zone)
+{
+        mutex_lock(&pcp_batch_high_lock);
+        __zone_set_pageset_high_and_batch(zone, 0, 1);
+        __drain_all_pages(zone, true);
+}
+
+void zone_pcp_enable(struct zone *zone)
+{
+        __zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+        mutex_unlock(&pcp_batch_high_lock);
+}
+
 void zone_pcp_reset(struct zone *zone)
 {
         unsigned long flags;
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index feab446d1982..a254e1f370a3 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -174,9 +174,9 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * A call to drain_all_pages() after isolation can flush most of them. However
  * in some cases pages might still end up on pcp lists and that would allow
  * for their allocation even when they are in fact isolated already. Depending
- * on how strong of a guarantee the caller needs, further drain_all_pages()
- * might be needed (e.g. __offline_pages will need to call it after check for
- * isolated range for a next retry).
+ * on how strong of a guarantee the caller needs, zone_pcp_disable/enable()
+ * might be used to flush and disable pcplist before isolation and enable after
+ * unisolation.
  *
  * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
  */
-- 
2.28.0
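
A hypothetical usage sketch (not part of the patch): how another mm-internal
caller of page isolation would be expected to pair the new helpers with
start_isolate_page_range(), mirroring the offline_pages() changes above. The
function name isolate_range_pcp_disabled() is made up for illustration:

static int isolate_range_pcp_disabled(unsigned long start_pfn,
                                      unsigned long end_pfn)
{
        struct zone *zone = page_zone(pfn_to_page(start_pfn));
        int ret;

        /* Sets pcp->high to 0 and drains once; takes pcp_batch_high_lock. */
        zone_pcp_disable(zone);

        ret = start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE,
                                       MEMORY_OFFLINE | REPORT_FAILURE);
        if (ret)
                goto out;

        /*
         * Work on the isolated range here. With pcp->high == 0, racing
         * frees bypass pcplists, so no stray pages from the isolated
         * pageblocks can linger there.
         */

        undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
out:
        /* Restores the saved pageset_high/batch and drops the lock. */
        zone_pcp_enable(zone);
        return ret;
}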