From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=uJng=PW=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 10534C433EF
	for <linux-mm@archiver.kernel.org>; Wed,  3 Nov 2021 17:05:28 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 9CFAA61156
	for <linux-mm@archiver.kernel.org>; Wed,  3 Nov 2021 17:05:27 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 9CFAA61156
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org
Received: by kanga.kvack.org (Postfix)
	id 1306F6B0072; Wed,  3 Nov 2021 13:05:27 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 092316B0073; Wed,  3 Nov 2021 13:05:27 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id DFEA46B0074; Wed,  3 Nov 2021 13:05:26 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0104.hostedemail.com [216.40.44.104])
	by kanga.kvack.org (Postfix) with ESMTP id CA7FD6B0072
	for <linux-mm@kvack.org>; Wed,  3 Nov 2021 13:05:26 -0400 (EDT)
Received: from smtpin02.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay02.hostedemail.com (Postfix) with ESMTP id 65CE270019
	for <linux-mm@kvack.org>; Wed,  3 Nov 2021 17:05:26 +0000 (UTC)
X-FDA: 78768245052.02.6DB20A5
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	by imf20.hostedemail.com (Postfix) with ESMTP id 57654D0000A2
	for <linux-mm@kvack.org>; Wed,  3 Nov 2021 17:05:17 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1635959125;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=W6hqjk3k1qOe1hlsoywI2rtd9aMmXRpPZ4qPXicNcNI=;
	b=g/v0l06Sv/GsPgo1K6x+LuE0+wKdBeXuIWgJUwEMNHRgsG1RB9wQQyGAWLITG8/EmTFYz0
	dUHYS8OopFLRTH+6hxsUJNtzPwTVwUbwo1cX6uumb5ehegtY8JX9Uonrj/UfNKy/dRlY3Y
	RwH6uyJ8+JyQIhcqHeqXJ9WUCRY882s=
Received: from mail-wm1-f70.google.com (mail-wm1-f70.google.com
 [209.85.128.70]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-550-ulIwwhf5PBu6PD-Vo4feZg-1; Wed, 03 Nov 2021 13:05:21 -0400
X-MC-Unique: ulIwwhf5PBu6PD-Vo4feZg-1
Received: by mail-wm1-f70.google.com with SMTP id 128-20020a1c0486000000b0030dcd45476aso1365396wme.0
        for <linux-mm@kvack.org>; Wed, 03 Nov 2021 10:05:21 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=W6hqjk3k1qOe1hlsoywI2rtd9aMmXRpPZ4qPXicNcNI=;
        b=djdkZsTzyzumNE2R6dvb7aS0xwhbnLp6W/T1QxfXUh0vAEPL49nx+O9/XfbRkhzD57
         EaFK64HXQCpWMjJspgyctIg5+E7sOSnQqgW24i/w1sG3DUstTbSzZ2qK6nVGCMZzjm/y
         tnQsksofSiKOQE+JMABdwYQZW87DJYqJsDbA7UOXPDTcubk678XeQlGSjEtAxobj+IK4
         eUlvGmGnKiecluLACz4MZV+81ZsXd6ssdLIRyvq5D3omV6ScoLZfiyF3z2tNRuZlyGPE
         0K6GEWRgvAMHpQ0kAWQ1+a4NeygX/i1YJTo6Wqvusnz8I6FCLHunWjus7jFTDCAz0eN6
         83ig==
X-Gm-Message-State: AOAM533wW39GoDNZvmuzr4g/BuaznXUeDQvMYaXp1sb+orIgf+saJ+HI
	OxFoXFjc0AGeEsD1t6I1UfoY3QU8h5YXCe25zkaNP1LRO438mmLrg2gOQDbpnnicqegMPokAfAj
	gXPAYQBrkUeM=
X-Received: by 2002:adf:fc88:: with SMTP id g8mr28007980wrr.334.1635959120479;
        Wed, 03 Nov 2021 10:05:20 -0700 (PDT)
X-Google-Smtp-Source: ABdhPJx/GN+uyyTaWBah9VkZ3ed2+0wkDjTtHrEFLj3fSAJzY+sm3o2ftXmvEq8M9d48axCCrj02dw==
X-Received: by 2002:adf:fc88:: with SMTP id g8mr28007937wrr.334.1635959120235;
        Wed, 03 Nov 2021 10:05:20 -0700 (PDT)
Received: from vian.redhat.com ([2a0c:5a80:3c10:3400:3c70:6643:6e71:7eae])
        by smtp.gmail.com with ESMTPSA id h22sm2900610wmq.14.2021.11.03.10.05.18
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Wed, 03 Nov 2021 10:05:19 -0700 (PDT)
From: Nicolas Saenz Julienne <nsaenzju@redhat.com>
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org,
	linux-mm@kvack.org,
	frederic@kernel.org,
	tglx@linutronix.de,
	peterz@infradead.org,
	mtosatti@redhat.com,
	nilal@redhat.com,
	mgorman@suse.de,
	linux-rt-users@vger.kernel.org,
	vbabka@suse.cz,
	cl@linux.com,
	ppandit@redhat.com,
	Nicolas Saenz Julienne <nsaenzju@redhat.com>
Subject: [PATCH v2 2/3] mm/page_alloc: Convert per-cpu lists' local locks to per-cpu spin locks
Date: Wed,  3 Nov 2021 18:05:11 +0100
Message-Id: <20211103170512.2745765-3-nsaenzju@redhat.com>
X-Mailer: git-send-email 2.33.1
In-Reply-To: <20211103170512.2745765-1-nsaenzju@redhat.com>
References: <20211103170512.2745765-1-nsaenzju@redhat.com>
MIME-Version: 1.0
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset="US-ASCII"
X-Rspamd-Server: rspam03
X-Rspamd-Queue-Id: 57654D0000A2
X-Stat-Signature: bwzec7xokxbysbahmkz7bjwqtm6t9iap
Authentication-Results: imf20.hostedemail.com;
	dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b="g/v0l06S";
	dmarc=pass (policy=none) header.from=redhat.com;
	spf=none (imf20.hostedemail.com: domain of nsaenzju@redhat.com has no SPF policy when checking 170.10.133.124) smtp.mailfrom=nsaenzju@redhat.com
X-HE-Tag: 1635959117-103306
Content-Transfer-Encoding: quoted-printable
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

page_alloc's per-cpu page lists are currently protected by local locks.
This forces any remote operation dependent on draining them to schedule
drain work on all CPUs. This doesn't play well with NOHZ_FULL CPUs,
which can't be bothered to run housekeeping tasks.

As a first step to mitigate this, convert the current locking scheme to
per-cpu spinlocks. The conversion also moves the actual lock into
'struct per_cpu_pages', which makes for nicer code and is essential to
couple access to the lock and the lists. One side effect is a more
complex free_unref_page_list(), as the patch tries to maintain the
function's previous optimizations[1]. Other than that, the conversion
itself is mostly trivial.
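
For illustration only, here is a minimal userspace sketch of the locking
pattern described above. It is not the kernel code: 'struct pcp_model',
'struct page_model', BATCH_MAX and all function names are hypothetical
stand-ins, a C11 atomic_flag plays the role of the spinlock, and IRQ
disabling is not modelled at all. It shows the lock living inside the
per-cpu structure, the lock being kept across consecutive pages that
target the same list, and a drain that only needs the target list's own
lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define BATCH_MAX 32 /* assumption: stand-in for SWAP_CLUSTER_MAX */

/* Hypothetical, simplified stand-in for struct per_cpu_pages. */
struct pcp_model {
	atomic_flag lock;      /* the lock lives next to the data it protects */
	int count;             /* pages currently held on this list */
	int lock_acquisitions; /* instrumentation for this sketch only */
};

/* Hypothetical stand-in for a page that knows its target list. */
struct page_model {
	struct pcp_model *pcp;
};

static void pcp_init(struct pcp_model *pcp)
{
	atomic_flag_clear(&pcp->lock);
	pcp->count = 0;
	pcp->lock_acquisitions = 0;
}

static void pcp_lock(struct pcp_model *pcp)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&pcp->lock,
						 memory_order_acquire))
		;
	pcp->lock_acquisitions++;
}

static void pcp_unlock(struct pcp_model *pcp)
{
	atomic_flag_clear_explicit(&pcp->lock, memory_order_release);
}

/* Free pages, keeping the lock while consecutive pages share a list. */
static void free_page_list_model(struct page_model *pages, size_t npages)
{
	struct pcp_model *held = NULL;
	int batch_count = 0;
	size_t i;

	for (i = 0; i < npages; i++) {
		struct pcp_model *pcp = pages[i].pcp;

		/* Release on a list change, or after BATCH_MAX iterations. */
		if (++batch_count == BATCH_MAX || (held && held != pcp)) {
			assert(held != NULL);
			pcp_unlock(held);
			batch_count = 0;
			held = NULL;
		}

		if (!held) {
			pcp_lock(pcp);
			held = pcp;
		}

		pcp->count++; /* models adding the page to the list */
	}

	if (held)
		pcp_unlock(held);
}

/* Any thread can drain a given slot: it only needs that slot's lock. */
static int drain_cpu_model(struct pcp_model *pcps, int cpu)
{
	struct pcp_model *pcp = &pcps[cpu];
	int drained;

	pcp_lock(pcp);
	drained = pcp->count;
	pcp->count = 0;
	pcp_unlock(pcp);
	return drained;
}
```

In this model, freeing three pages to list 0, two to list 1 and one more
to list 0 takes list 0's lock twice and list 1's lock once, which is the
saving the hand-over logic is after. The drain helper mirrors the shape
of drain_pages_zone() after the conversion, under the simplifying
assumptions stated above.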

The performance difference between local locks and uncontended per-cpu
spinlocks (which they happen to be during normal operation) is pretty
small.

On an Intel Xeon E5-2640 (x86_64) with 32GB of memory (mean
variation vs. vanilla runs, higher is worse):
   - netperf: -0.5% to 0.5% (no difference)
   - hackbench: -0.3% to 0.7% (almost no difference)
   - mmtests/sparsetruncate-tiny: -0.1% to 0.6%

On a Cavium ThunderX2 (arm64) with 64GB of memory:
   - netperf: 1.0% to 1.7%
   - hackbench: 0.8% to 1.5%
   - mmtests/sparsetruncate-tiny: 1.6% to 2.1%

arm64 is a bit more sensitive to the change, probably due to the effect
of the spinlock's memory barriers.

Note that the aim9 test suite was also run (through
mmtests/pagealloc-performance) but the test's own variance distorts the
results too much.

[1] See:
      - 9cca35d42eb61 ("mm, page_alloc: enable/disable IRQs once when
	freeing a list of pages")
      - c24ad77d962c3 ("mm/page_alloc.c: avoid excessive IRQ disabled
	times in free_unref_page_list()")

Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
---
 include/linux/mmzone.h |  1 +
 mm/page_alloc.c        | 87 ++++++++++++++++++++++--------------------
 2 files changed, 47 insertions(+), 41 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 58e744b78c2c..83c51036c756 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -376,6 +376,7 @@ struct per_cpu_pages {
=20
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
 	struct list_head lists[NR_PCP_LISTS];
+	spinlock_t lock;
 };
=20
 struct per_cpu_zonestat {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ef03dfb8f95..b332d5cc40f1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -122,13 +122,6 @@ typedef int __bitwise fpi_t;
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
=20
-struct pagesets {
-	local_lock_t lock;
-};
-static DEFINE_PER_CPU(struct pagesets, pagesets) =3D {
-	.lock =3D INIT_LOCAL_LOCK(lock),
-};
-
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DEFINE_PER_CPU(int, numa_node);
 EXPORT_PER_CPU_SYMBOL(numa_node);
@@ -1505,8 +1498,8 @@ static void free_pcppages_bulk(struct zone *zone, i=
nt count,
 	pcp->count -=3D nr_freed;
=20
 	/*
-	 * local_lock_irq held so equivalent to spin_lock_irqsave for
-	 * both PREEMPT_RT and non-PREEMPT_RT configurations.
+	 * spin_lock_irqsave(&pcp->lock) held so equivalent to
+	 * spin_lock_irqsave().
 	 */
 	spin_lock(&zone->lock);
 	isolated_pageblocks =3D has_isolate_pageblock(zone);
@@ -3011,8 +3004,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned=
 int order,
 	int i, allocated =3D 0;
=20
 	/*
-	 * local_lock_irq held so equivalent to spin_lock_irqsave for
-	 * both PREEMPT_RT and non-PREEMPT_RT configurations.
+	 * spin_lock_irqsave(&pcp->lock) held so equivalent to
+	 * spin_lock_irqsave().
 	 */
 	spin_lock(&zone->lock);
 	for (i =3D 0; i < count; ++i) {
@@ -3066,12 +3059,12 @@ void drain_zone_pages(struct zone *zone, struct p=
er_cpu_pages *pcp)
 	unsigned long flags;
 	int to_drain, batch;
=20
-	local_lock_irqsave(&pagesets.lock, flags);
+	spin_lock_irqsave(&pcp->lock, flags);
 	batch =3D READ_ONCE(pcp->batch);
 	to_drain =3D min(pcp->count, batch);
 	if (to_drain > 0)
 		free_pcppages_bulk(zone, to_drain, pcp);
-	local_unlock_irqrestore(&pagesets.lock, flags);
+	spin_unlock_irqrestore(&pcp->lock, flags);
 }
 #endif
=20
@@ -3087,13 +3080,11 @@ static void drain_pages_zone(unsigned int cpu, st=
ruct zone *zone)
 	unsigned long flags;
 	struct per_cpu_pages *pcp;
=20
-	local_lock_irqsave(&pagesets.lock, flags);
-
 	pcp =3D per_cpu_ptr(zone->per_cpu_pageset, cpu);
+	spin_lock_irqsave(&pcp->lock, flags);
 	if (pcp->count)
 		free_pcppages_bulk(zone, pcp->count, pcp);
-
-	local_unlock_irqrestore(&pagesets.lock, flags);
+	spin_unlock_irqrestore(&pcp->lock, flags);
 }
=20
 /*
@@ -3355,16 +3346,14 @@ static int nr_pcp_high(struct per_cpu_pages *pcp,=
 struct zone *zone)
 	return min(READ_ONCE(pcp->batch) << 2, high);
 }
=20
-static void free_unref_page_commit(struct page *page, int migratetype,
-				   unsigned int order)
+static void free_unref_page_commit(struct page *page, struct per_cpu_pag=
es *pcp,
+				   int migratetype, unsigned int order)
 {
 	struct zone *zone =3D page_zone(page);
-	struct per_cpu_pages *pcp;
 	int high;
 	int pindex;
=20
 	__count_vm_event(PGFREE);
-	pcp =3D this_cpu_ptr(zone->per_cpu_pageset);
 	pindex =3D order_to_pindex(migratetype, order);
 	list_add(&page->lru, &pcp->lists[pindex]);
 	pcp->count +=3D 1 << order;
@@ -3383,6 +3372,7 @@ void free_unref_page(struct page *page, unsigned in=
t order)
 {
 	unsigned long flags;
 	unsigned long pfn =3D page_to_pfn(page);
+	struct per_cpu_pages *pcp;
 	int migratetype;
=20
 	if (!free_unref_page_prepare(page, pfn, order))
@@ -3404,9 +3394,10 @@ void free_unref_page(struct page *page, unsigned i=
nt order)
 		migratetype =3D MIGRATE_MOVABLE;
 	}
=20
-	local_lock_irqsave(&pagesets.lock, flags);
-	free_unref_page_commit(page, migratetype, order);
-	local_unlock_irqrestore(&pagesets.lock, flags);
+	pcp =3D this_cpu_ptr(page_zone(page)->per_cpu_pageset);
+	spin_lock_irqsave(&pcp->lock, flags);
+	free_unref_page_commit(page, pcp, migratetype, order);
+	spin_unlock_irqrestore(&pcp->lock, flags);
 }
=20
 /*
@@ -3415,6 +3406,7 @@ void free_unref_page(struct page *page, unsigned in=
t order)
 void free_unref_page_list(struct list_head *list)
 {
 	struct page *page, *next;
+	spinlock_t *lock =3D NULL;
 	unsigned long flags;
 	int batch_count =3D 0;
 	int migratetype;
@@ -3422,6 +3414,7 @@ void free_unref_page_list(struct list_head *list)
 	/* Prepare pages for freeing */
 	list_for_each_entry_safe(page, next, list, lru) {
 		unsigned long pfn =3D page_to_pfn(page);
+
 		if (!free_unref_page_prepare(page, pfn, 0)) {
 			list_del(&page->lru);
 			continue;
@@ -3439,8 +3432,22 @@ void free_unref_page_list(struct list_head *list)
 		}
 	}
=20
-	local_lock_irqsave(&pagesets.lock, flags);
 	list_for_each_entry_safe(page, next, list, lru) {
+		struct per_cpu_pages *pcp =3D this_cpu_ptr(page_zone(page)->per_cpu_pa=
geset);
+
+		/*
+		 * As an optimization, release the previously held lock only if
+		 * the page belongs to a different zone. But also, guard
+		 * against excessive IRQ disabled times when we get a large
+		 * list of pages to free.
+		 */
+		if (++batch_count =3D=3D SWAP_CLUSTER_MAX ||
+		    (lock !=3D &pcp->lock && lock)) {
+			spin_unlock_irqrestore(lock, flags);
+			batch_count =3D 0;
+			lock =3D NULL;
+		}
+
 		/*
 		 * Non-isolated types over MIGRATE_PCPTYPES get added
 		 * to the MIGRATE_MOVABLE pcp list.
@@ -3450,19 +3457,17 @@ void free_unref_page_list(struct list_head *list)
 			migratetype =3D MIGRATE_MOVABLE;
=20
 		trace_mm_page_free_batched(page);
-		free_unref_page_commit(page, migratetype, 0);
=20
-		/*
-		 * Guard against excessive IRQ disabled times when we get
-		 * a large list of pages to free.
-		 */
-		if (++batch_count =3D=3D SWAP_CLUSTER_MAX) {
-			local_unlock_irqrestore(&pagesets.lock, flags);
-			batch_count =3D 0;
-			local_lock_irqsave(&pagesets.lock, flags);
+		if (!lock) {
+			spin_lock_irqsave(&pcp->lock, flags);
+			lock =3D &pcp->lock;
 		}
+
+		free_unref_page_commit(page, pcp, migratetype, 0);
 	}
-	local_unlock_irqrestore(&pagesets.lock, flags);
+
+	if (lock)
+		spin_unlock_irqrestore(lock, flags);
 }
=20
 /*
@@ -3636,18 +3641,17 @@ static struct page *rmqueue_pcplist(struct zone *=
preferred_zone,
 	struct page *page;
 	unsigned long flags;
=20
-	local_lock_irqsave(&pagesets.lock, flags);
-
 	/*
 	 * On allocation, reduce the number of pages that are batch freed.
 	 * See nr_pcp_free() where free_factor is increased for subsequent
 	 * frees.
 	 */
 	pcp =3D this_cpu_ptr(zone->per_cpu_pageset);
+	spin_lock_irqsave(&pcp->lock, flags);
 	pcp->free_factor >>=3D 1;
 	list =3D &pcp->lists[order_to_pindex(migratetype, order)];
 	page =3D __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, =
list);
-	local_unlock_irqrestore(&pagesets.lock, flags);
+	spin_unlock_irqrestore(&pcp->lock, flags);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
 		zone_statistics(preferred_zone, zone, 1);
@@ -5265,8 +5269,8 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int pre=
ferred_nid,
 		goto failed;
=20
 	/* Attempt the batch allocation */
-	local_lock_irqsave(&pagesets.lock, flags);
 	pcp =3D this_cpu_ptr(zone->per_cpu_pageset);
+	spin_lock_irqsave(&pcp->lock, flags);
 	pcp_list =3D &pcp->lists[order_to_pindex(ac.migratetype, 0)];
=20
 	while (nr_populated < nr_pages) {
@@ -5295,7 +5299,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int pre=
ferred_nid,
 		nr_populated++;
 	}
=20
-	local_unlock_irqrestore(&pagesets.lock, flags);
+	spin_unlock_irqrestore(&pcp->lock, flags);
=20
 	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
 	zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);
@@ -5304,7 +5308,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int pre=
ferred_nid,
 	return nr_populated;
=20
 failed_irq:
-	local_unlock_irqrestore(&pagesets.lock, flags);
+	spin_unlock_irqrestore(&pcp->lock, flags);
=20
 failed:
 	page =3D __alloc_pages(gfp, 0, preferred_nid, nodemask);
@@ -6947,6 +6951,7 @@ void __meminit setup_zone_pageset(struct zone *zone=
)
 		struct per_cpu_zonestat *pzstats;
=20
 		pcp =3D per_cpu_ptr(zone->per_cpu_pageset, cpu);
+		spin_lock_init(&pcp->lock);
 		pzstats =3D per_cpu_ptr(zone->per_cpu_zonestats, cpu);
 		per_cpu_pages_init(pcp, pzstats);
 	}
--=20
2.33.1