linux-mm.kvack.org archive mirror
* [RFC 0/3] mm/page_alloc: Remote per-cpu lists drain support
@ 2021-10-08 16:19 Nicolas Saenz Julienne
  2021-10-08 16:19 ` [RFC 1/3] mm/page_alloc: Simplify __rmqueue_pcplist()'s arguments Nicolas Saenz Julienne
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Nicolas Saenz Julienne @ 2021-10-08 16:19 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, frederic, tglx, peterz, mtosatti, nilal,
	mgorman, linux-rt-users, vbabka, cl, paulmck, ppandit,
	Nicolas Saenz Julienne

This series replaces mm/page_alloc's per-cpu lists drain mechanism so that it
can be run remotely. Currently, only the local CPU is permitted to
change its per-cpu lists, and it's expected to do so, on-demand, whenever a
process demands it (by means of queueing a drain task on the local CPU). Most
systems will handle this promptly, but it'll cause problems for NOHZ_FULL CPUs
that can't take any sort of interruption without breaking their functional
guarantees (latency, bandwidth, etc...). Having a way for these processes to
remotely drain the lists themselves will make co-existing with isolated CPUs
possible, and comes with minimal performance[1]/memory cost to other users.

The new algorithm will atomically switch the pointer to the per-cpu lists and
use RCU to make sure it's not being used before draining them. 
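
In a nutshell, the drain path of patch 3 boils down to the sketch below
(heavily simplified: the real code filters by zone, skips CPUs whose lists are
empty and handles the force_all_cpus case; the function name is made up for
illustration):

	/* Rough sketch of the new drain, see __drain_all_pages() in patch 3. */
	static void remote_drain_sketch(struct zone *zone)
	{
		int cpu;

		mutex_lock(&pcpu_drain_mutex);

		/* 1) Swap each CPU's live lists ('lp') with its empty spare set. */
		for_each_online_cpu(cpu) {
			struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);

			pcp->drain = rcu_replace_pointer(pcp->lp, pcp->drain,
						mutex_is_locked(&pcpu_drain_mutex));
		}

		/* 2) Wait for all concurrent users of the old pointers to move on. */
		synchronize_rcu_expedited();

		/* 3) Free whatever is left on the now-unreferenced old lists. */
		for_each_online_cpu(cpu) {
			struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);

			if (pcp->drain->count)
				free_pcppages_bulk(zone, pcp->drain->count, pcp, pcp->drain);
		}

		mutex_unlock(&pcpu_drain_mutex);
	}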

I'm interested in any sort of feedback, but especially validating that the
approach is acceptable, and any tests/benchmarks you'd like to see run against
it. For now, I've been testing this successfully on both arm64 and x86_64
systems while forcing high memory pressure (i.e. forcing the
page_alloc's slow path).

Patches 1-2 serve as cleanups/preparation to make patch 3 easier to follow.

Here's my previous attempt at fixing this:
https://lkml.org/lkml/2021/9/21/599

[1] Proper performance numbers will be provided if the approach is deemed
    acceptable. That said, mm/page_alloc.c's fast paths only grow by an extra
    pointer indirection and a compiler barrier, which I think is unlikely to be
    measurable.

---

Nicolas Saenz Julienne (3):
  mm/page_alloc: Simplify __rmqueue_pcplist()'s arguments
  mm/page_alloc: Access lists in 'struct per_cpu_pages' indirectly
  mm/page_alloc: Add remote draining support to per-cpu lists

 include/linux/mmzone.h |  24 +++++-
 mm/page_alloc.c        | 173 +++++++++++++++++++++--------------------
 mm/vmstat.c            |   6 +-
 3 files changed, 114 insertions(+), 89 deletions(-)

-- 
2.31.1



^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC 1/3] mm/page_alloc: Simplify __rmqueue_pcplist()'s arguments
  2021-10-08 16:19 [RFC 0/3] mm/page_alloc: Remote per-cpu lists drain support Nicolas Saenz Julienne
@ 2021-10-08 16:19 ` Nicolas Saenz Julienne
  2021-10-08 16:19 ` [RFC 2/3] mm/page_alloc: Access lists in 'struct per_cpu_pages' indirectly Nicolas Saenz Julienne
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Nicolas Saenz Julienne @ 2021-10-08 16:19 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, frederic, tglx, peterz, mtosatti, nilal,
	mgorman, linux-rt-users, vbabka, cl, paulmck, ppandit,
	Nicolas Saenz Julienne

Both users of __rmqueue_pcplist() use the same means to extract the
right list from their per-cpu lists: calculate the index based on the
page's migratetype and order. This data is already being passed to
__rmqueue_pcplist(), so centralize the list extraction process inside
the function.

Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
---
 mm/page_alloc.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c274cf..dd89933503b4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3600,11 +3600,13 @@ static inline
 struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 			int migratetype,
 			unsigned int alloc_flags,
-			struct per_cpu_pages *pcp,
-			struct list_head *list)
+			struct per_cpu_pages *pcp)
 {
+	struct list_head *list;
 	struct page *page;
 
+	list = &pcp->lists[order_to_pindex(migratetype, order)];
+
 	do {
 		if (list_empty(list)) {
 			int batch = READ_ONCE(pcp->batch);
@@ -3643,7 +3645,6 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 			unsigned int alloc_flags)
 {
 	struct per_cpu_pages *pcp;
-	struct list_head *list;
 	struct page *page;
 	unsigned long flags;
 
@@ -3656,8 +3657,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	 */
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
 	pcp->free_factor >>= 1;
-	list = &pcp->lists[order_to_pindex(migratetype, order)];
-	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
+	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp);
 	local_unlock_irqrestore(&pagesets.lock, flags);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
@@ -5202,7 +5202,6 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	struct zone *zone;
 	struct zoneref *z;
 	struct per_cpu_pages *pcp;
-	struct list_head *pcp_list;
 	struct alloc_context ac;
 	gfp_t alloc_gfp;
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
@@ -5278,7 +5277,6 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	/* Attempt the batch allocation */
 	local_lock_irqsave(&pagesets.lock, flags);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
-	pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
 
 	while (nr_populated < nr_pages) {
 
@@ -5288,8 +5286,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 			continue;
 		}
 
-		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
-								pcp, pcp_list);
+		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags, pcp);
 		if (unlikely(!page)) {
 			/* Try and get at least one page */
 			if (!nr_populated)
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC 2/3] mm/page_alloc: Access lists in 'struct per_cpu_pages' indirectly
  2021-10-08 16:19 [RFC 0/3] mm/page_alloc: Remote per-cpu lists drain support Nicolas Saenz Julienne
  2021-10-08 16:19 ` [RFC 1/3] mm/page_alloc: Simplify __rmqueue_pcplist()'s arguments Nicolas Saenz Julienne
@ 2021-10-08 16:19 ` Nicolas Saenz Julienne
  2021-10-08 16:19 ` [RFC 3/3] mm/page_alloc: Add remote draining support to per-cpu lists Nicolas Saenz Julienne
  2021-10-12 15:45 ` [RFC 0/3] mm/page_alloc: Remote per-cpu lists drain support Vlastimil Babka
  3 siblings, 0 replies; 7+ messages in thread
From: Nicolas Saenz Julienne @ 2021-10-08 16:19 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, frederic, tglx, peterz, mtosatti, nilal,
	mgorman, linux-rt-users, vbabka, cl, paulmck, ppandit,
	Nicolas Saenz Julienne

In preparation for adding remote pcplists drain support, let's bundle
'struct per_cpu_pages' list heads and page count into a new structure,
'struct pcplists', and have all code access it indirectly through a
pointer. It'll be used by upcoming patches, which will maintain multiple
versions of pcplists and switch the pointer atomically.

free_pcppages_bulk() also gains a new argument, since we want to avoid
dereferencing the pcplists pointer twice per critical section (delimited
by the pagesets local lock).

'struct pcplists' data is marked as __private, to make sure nobody
accesses it directly, except for the initialization code. Note that
'struct per_cpu_pages' is used during boot, when no allocation is
possible, which is why the lists are embedded in it rather than
allocated separately.

Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
---
 include/linux/mmzone.h | 10 +++++--
 mm/page_alloc.c        | 66 +++++++++++++++++++++++++-----------------
 mm/vmstat.c            |  6 ++--
 3 files changed, 49 insertions(+), 33 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a1d79d84675..fb023da9a181 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -358,7 +358,6 @@ enum zone_watermarks {
 
 /* Fields and list protected by pagesets local_lock in page_alloc.c */
 struct per_cpu_pages {
-	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
 	short free_factor;	/* batch scaling factor during free */
@@ -366,8 +365,13 @@ struct per_cpu_pages {
 	short expire;		/* When 0, remote pagesets are drained */
 #endif
 
-	/* Lists of pages, one per migrate type stored on the pcp-lists */
-	struct list_head lists[NR_PCP_LISTS];
+	struct pcplists *lp;
+	struct pcplists {
+		/* Number of pages in the lists */
+		int count;
+		/* Lists of pages, one per migrate type stored on the pcp-lists */
+		struct list_head lists[NR_PCP_LISTS];
+	} __private pcplists;
 };
 
 struct per_cpu_zonestat {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd89933503b4..842816f269da 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1438,7 +1438,8 @@ static inline void prefetch_buddy(struct page *page)
  * pinned" detection logic.
  */
 static void free_pcppages_bulk(struct zone *zone, int count,
-					struct per_cpu_pages *pcp)
+			       struct per_cpu_pages *pcp,
+			       struct pcplists *lp)
 {
 	int pindex = 0;
 	int batch_free = 0;
@@ -1453,7 +1454,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	 * Ensure proper count is passed which otherwise would stuck in the
 	 * below while (list_empty(list)) loop.
 	 */
-	count = min(pcp->count, count);
+	count = min(lp->count, count);
 	while (count > 0) {
 		struct list_head *list;
 
@@ -1468,7 +1469,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			batch_free++;
 			if (++pindex == NR_PCP_LISTS)
 				pindex = 0;
-			list = &pcp->lists[pindex];
+			list = &lp->lists[pindex];
 		} while (list_empty(list));
 
 		/* This is the only non-empty list. Free them all. */
@@ -1508,7 +1509,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			}
 		} while (count > 0 && --batch_free && !list_empty(list));
 	}
-	pcp->count -= nr_freed;
+	lp->count -= nr_freed;
 
 	/*
 	 * local_lock_irq held so equivalent to spin_lock_irqsave for
@@ -3069,14 +3070,16 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
  */
 void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 {
+	struct pcplists *lp;
 	unsigned long flags;
 	int to_drain, batch;
 
 	local_lock_irqsave(&pagesets.lock, flags);
 	batch = READ_ONCE(pcp->batch);
-	to_drain = min(pcp->count, batch);
+	lp = pcp->lp;
+	to_drain = min(lp->count, batch);
 	if (to_drain > 0)
-		free_pcppages_bulk(zone, to_drain, pcp);
+		free_pcppages_bulk(zone, to_drain, pcp, lp);
 	local_unlock_irqrestore(&pagesets.lock, flags);
 }
 #endif
@@ -3092,12 +3095,14 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 {
 	unsigned long flags;
 	struct per_cpu_pages *pcp;
+	struct pcplists *lp;
 
 	local_lock_irqsave(&pagesets.lock, flags);
 
 	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
-	if (pcp->count)
-		free_pcppages_bulk(zone, pcp->count, pcp);
+	lp = pcp->lp;
+	if (lp->count)
+		free_pcppages_bulk(zone, lp->count, pcp, lp);
 
 	local_unlock_irqrestore(&pagesets.lock, flags);
 }
@@ -3158,7 +3163,7 @@ static void drain_local_pages_wq(struct work_struct *work)
  *
  * drain_all_pages() is optimized to only execute on cpus where pcplists are
  * not empty. The check for non-emptiness can however race with a free to
- * pcplist that has not yet increased the pcp->count from 0 to 1. Callers
+ * pcplist that has not yet increased the lp->count from 0 to 1. Callers
  * that need the guarantee that every CPU has drained can disable the
  * optimizing racy check.
  */
@@ -3200,21 +3205,22 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 		struct per_cpu_pages *pcp;
 		struct zone *z;
 		bool has_pcps = false;
+		struct pcplists *lp;
 
 		if (force_all_cpus) {
 			/*
-			 * The pcp.count check is racy, some callers need a
+			 * The lp->count check is racy, some callers need a
 			 * guarantee that no cpu is missed.
 			 */
 			has_pcps = true;
 		} else if (zone) {
-			pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
-			if (pcp->count)
+			lp = per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp;
+			if (lp->count)
 				has_pcps = true;
 		} else {
 			for_each_populated_zone(z) {
-				pcp = per_cpu_ptr(z->per_cpu_pageset, cpu);
-				if (pcp->count) {
+				lp = per_cpu_ptr(z->per_cpu_pageset, cpu)->lp;
+				if (lp->count) {
 					has_pcps = true;
 					break;
 				}
@@ -3366,19 +3372,21 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn,
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
+	struct pcplists *lp;
 	int high;
 	int pindex;
 
 	__count_vm_event(PGFREE);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
+	lp = pcp->lp;
 	pindex = order_to_pindex(migratetype, order);
-	list_add(&page->lru, &pcp->lists[pindex]);
-	pcp->count += 1 << order;
+	list_add(&page->lru, &lp->lists[pindex]);
+	lp->count += 1 << order;
 	high = nr_pcp_high(pcp, zone);
-	if (pcp->count >= high) {
+	if (lp->count >= high) {
 		int batch = READ_ONCE(pcp->batch);
 
-		free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), pcp);
+		free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), pcp, lp);
 	}
 }
 
@@ -3603,9 +3611,11 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 			struct per_cpu_pages *pcp)
 {
 	struct list_head *list;
+	struct pcplists *lp;
 	struct page *page;
 
-	list = &pcp->lists[order_to_pindex(migratetype, order)];
+	lp = pcp->lp;
+	list = &lp->lists[order_to_pindex(migratetype, order)];
 
 	do {
 		if (list_empty(list)) {
@@ -3625,14 +3635,14 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 					batch, list,
 					migratetype, alloc_flags);
 
-			pcp->count += alloced << order;
+			lp->count += alloced << order;
 			if (unlikely(list_empty(list)))
 				return NULL;
 		}
 
 		page = list_first_entry(list, struct page, lru);
 		list_del(&page->lru);
-		pcp->count -= 1 << order;
+		lp->count -= 1 << order;
 	} while (check_new_pcp(page));
 
 	return page;
@@ -5877,7 +5887,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			continue;
 
 		for_each_online_cpu(cpu)
-			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
+			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp->count;
 	}
 
 	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
@@ -5971,7 +5981,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 
 		free_pcp = 0;
 		for_each_online_cpu(cpu)
-			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
+			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp->count;
 
 		show_node(zone);
 		printk(KERN_CONT
@@ -6012,7 +6022,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(zone_page_state(zone, NR_MLOCK)),
 			K(zone_page_state(zone, NR_BOUNCE)),
 			K(free_pcp),
-			K(this_cpu_read(zone->per_cpu_pageset->count)),
+			K(this_cpu_read(zone->per_cpu_pageset)->lp->count),
 			K(zone_page_state(zone, NR_FREE_CMA_PAGES)));
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
@@ -6848,7 +6858,7 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
 
 /*
  * pcp->high and pcp->batch values are related and generally batch is lower
- * than high. They are also related to pcp->count such that count is lower
+ * than high. They are also related to pcp->lp->count such that count is lower
  * than high, and as soon as it reaches high, the pcplist is flushed.
  *
  * However, guaranteeing these relations at all times would require e.g. write
@@ -6856,7 +6866,7 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
  * thus be prone to error and bad for performance. Thus the update only prevents
  * store tearing. Any new users of pcp->batch and pcp->high should ensure they
  * can cope with those fields changing asynchronously, and fully trust only the
- * pcp->count field on the local CPU with interrupts disabled.
+ * pcp->lp->count field on the local CPU with interrupts disabled.
  *
  * mutex_is_locked(&pcp_batch_high_lock) required when calling this function
  * outside of boot time (or some other assurance that no concurrent updaters
@@ -6876,8 +6886,10 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 	memset(pcp, 0, sizeof(*pcp));
 	memset(pzstats, 0, sizeof(*pzstats));
 
+	pcp->lp = &ACCESS_PRIVATE(pcp, pcplists);
+
 	for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
-		INIT_LIST_HEAD(&pcp->lists[pindex]);
+		INIT_LIST_HEAD(&pcp->lp->lists[pindex]);
 
 	/*
 	 * Set batch and high values safe for a boot pageset. A true percpu
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8ce2620344b2..5279d3f34e0b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -856,7 +856,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 			 * if not then there is nothing to expire.
 			 */
 			if (!__this_cpu_read(pcp->expire) ||
-			       !__this_cpu_read(pcp->count))
+			       !this_cpu_ptr(pcp)->lp->count)
 				continue;
 
 			/*
@@ -870,7 +870,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 			if (__this_cpu_dec_return(pcp->expire))
 				continue;
 
-			if (__this_cpu_read(pcp->count)) {
+			if (this_cpu_ptr(pcp)->lp->count) {
 				drain_zone_pages(zone, this_cpu_ptr(pcp));
 				changes++;
 			}
@@ -1707,7 +1707,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 			   "\n              high:  %i"
 			   "\n              batch: %i",
 			   i,
-			   pcp->count,
+			   pcp->lp->count,
 			   pcp->high,
 			   pcp->batch);
 #ifdef CONFIG_SMP
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC 3/3] mm/page_alloc: Add remote draining support to per-cpu lists
  2021-10-08 16:19 [RFC 0/3] mm/page_alloc: Remote per-cpu lists drain support Nicolas Saenz Julienne
  2021-10-08 16:19 ` [RFC 1/3] mm/page_alloc: Simplify __rmqueue_pcplist()'s arguments Nicolas Saenz Julienne
  2021-10-08 16:19 ` [RFC 2/3] mm/page_alloc: Access lists in 'struct per_cpu_pages' indirectly Nicolas Saenz Julienne
@ 2021-10-08 16:19 ` Nicolas Saenz Julienne
  2021-10-12 15:45 ` [RFC 0/3] mm/page_alloc: Remote per-cpu lists drain support Vlastimil Babka
  3 siblings, 0 replies; 7+ messages in thread
From: Nicolas Saenz Julienne @ 2021-10-08 16:19 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, frederic, tglx, peterz, mtosatti, nilal,
	mgorman, linux-rt-users, vbabka, cl, paulmck, ppandit,
	Nicolas Saenz Julienne

page_alloc.c's per-cpu page lists are currently protected using local
locks. While good for performance, this doesn't allow for remote access to
these structures. CPUs requiring a system-wide per-cpu list drain get
around this by scheduling drain work on all CPUs. That said, some setups,
such as systems with NOHZ_FULL CPUs, aren't well suited to this, as they
can't handle interruptions of any sort.

To mitigate this, replace the current draining mechanism with one that
allows remotely draining the lists in a lock-less manner. It leverages
the fact that the per-cpu page lists are accessed through indirection,
and that the pointer can be updated atomically. Upon draining we now:

 - Atomically switch the per-cpu lists pointers to ones pointing to
   empty lists.

 - Wait for a grace period so that all concurrent writers holding the
   old per-cpu lists pointer have finished updating them[1].

 - Remotely flush the old lists now that we know nobody holds a
   reference to them. Concurrent access to the drain process is
   protected by a mutex.

RCU guarantees atomicity both when dereferencing the per-cpu lists
pointer and when replacing it. It also checks for RCU critical
section/locking correctness, as all writers have to hold their per-cpu
pageset's local lock. Memory ordering on both pointers' data is
guaranteed by synchronize_rcu() and the 'pcpu_drain_mutex'. Also,
synchronize_rcu_expedited() is used to minimize hangs during low memory
situations.

Accesses to the pcplists like the ones in mm/vmstat.c don't require RCU
supervision since they can handle outdated data, but they do use
READ_ONCE() in order to avoid compiler weirdness and be explicit about
the concurrent nature of the pcplists pointer.

As a side effect of all this, we now have to promote the spin_lock() in
free_pcppages_bulk() to spin_lock_irqsave(), since not all of the function's
users enter with interrupts disabled.

Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>

[1] Note that whatever concurrent writers were doing, the result was
    going to be flushed anyway, as the old mechanism relied on disabling
    preemption for serialization; the per-cpu drain work was already
    stepping over whatever was being processed concurrently with the
    drain call.
---
 include/linux/mmzone.h |  18 ++++++-
 mm/page_alloc.c        | 114 ++++++++++++++++++++---------------------
 mm/vmstat.c            |   6 +--
 3 files changed, 75 insertions(+), 63 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fb023da9a181..c112e7831c54 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -365,13 +365,27 @@ struct per_cpu_pages {
 	short expire;		/* When 0, remote pagesets are drained */
 #endif
 
-	struct pcplists *lp;
+	/*
+	 * Having two pcplists allows us to remotely flush them in a lock-less
+	 * manner: we atomically switch the 'lp' and 'drain' pointers, wait a
+	 * grace period to synchronize against concurrent users of 'lp', and
+	 * safely free whatever is left in 'drain'.
+	 *
+	 * All accesses to 'lp' are protected by local locks, which also serve
+	 * as RCU critical section delimiters. 'lp' should only be dereferenced
+	 * *once* per critical section.
+	 *
+	 * See mm/page_alloc.c's __drain_all_pages() for the bulk of the remote
+	 * drain implementation.
+	 */
+	struct pcplists __rcu *lp;
+	struct pcplists *drain;
 	struct pcplists {
 		/* Number of pages in the lists */
 		int count;
 		/* Lists of pages, one per migrate type stored on the pcp-lists */
 		struct list_head lists[NR_PCP_LISTS];
-	} __private pcplists;
+	} __private pcplists[2];
 };
 
 struct per_cpu_zonestat {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 842816f269da..d56d06dde66a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -147,13 +147,7 @@ DEFINE_PER_CPU(int, _numa_mem_);		/* Kernel "local memory" node */
 EXPORT_PER_CPU_SYMBOL(_numa_mem_);
 #endif
 
-/* work_structs for global per-cpu drains */
-struct pcpu_drain {
-	struct zone *zone;
-	struct work_struct work;
-};
 static DEFINE_MUTEX(pcpu_drain_mutex);
-static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain);
 
 #ifdef CONFIG_GCC_PLUGIN_LATENT_ENTROPY
 volatile unsigned long latent_entropy __latent_entropy;
@@ -1448,6 +1442,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	int prefetch_nr = READ_ONCE(pcp->batch);
 	bool isolated_pageblocks;
 	struct page *page, *tmp;
+	unsigned long flags;
 	LIST_HEAD(head);
 
 	/*
@@ -1511,11 +1506,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	}
 	lp->count -= nr_freed;
 
-	/*
-	 * local_lock_irq held so equivalent to spin_lock_irqsave for
-	 * both PREEMPT_RT and non-PREEMPT_RT configurations.
-	 */
-	spin_lock(&zone->lock);
+	spin_lock_irqsave(&zone->lock, flags);
 	isolated_pageblocks = has_isolate_pageblock(zone);
 
 	/*
@@ -1538,7 +1529,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		__free_one_page(page, page_to_pfn(page), zone, order, mt, FPI_NONE);
 		trace_mm_page_pcpu_drain(page, order, mt);
 	}
-	spin_unlock(&zone->lock);
+	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
 static void free_one_page(struct zone *zone,
@@ -3076,7 +3067,7 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 
 	local_lock_irqsave(&pagesets.lock, flags);
 	batch = READ_ONCE(pcp->batch);
-	lp = pcp->lp;
+	lp = rcu_dereference_check(pcp->lp, lockdep_is_held(this_cpu_ptr(&pagesets.lock)));
 	to_drain = min(lp->count, batch);
 	if (to_drain > 0)
 		free_pcppages_bulk(zone, to_drain, pcp, lp);
@@ -3100,7 +3091,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 	local_lock_irqsave(&pagesets.lock, flags);
 
 	pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
-	lp = pcp->lp;
+	lp = rcu_dereference_check(pcp->lp, lockdep_is_held(this_cpu_ptr(&pagesets.lock)));
 	if (lp->count)
 		free_pcppages_bulk(zone, lp->count, pcp, lp);
 
@@ -3139,24 +3130,6 @@ void drain_local_pages(struct zone *zone)
 		drain_pages(cpu);
 }
 
-static void drain_local_pages_wq(struct work_struct *work)
-{
-	struct pcpu_drain *drain;
-
-	drain = container_of(work, struct pcpu_drain, work);
-
-	/*
-	 * drain_all_pages doesn't use proper cpu hotplug protection so
-	 * we can race with cpu offline when the WQ can move this from
-	 * a cpu pinned worker to an unbound one. We can operate on a different
-	 * cpu which is alright but we also have to make sure to not move to
-	 * a different one.
-	 */
-	preempt_disable();
-	drain_local_pages(drain->zone);
-	preempt_enable();
-}
-
 /*
  * The implementation of drain_all_pages(), exposing an extra parameter to
  * drain on all cpus.
@@ -3169,6 +3142,8 @@ static void drain_local_pages_wq(struct work_struct *work)
  */
 static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 {
+	struct per_cpu_pages *pcp;
+	struct zone *z;
 	int cpu;
 
 	/*
@@ -3177,13 +3152,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 	 */
 	static cpumask_t cpus_with_pcps;
 
-	/*
-	 * Make sure nobody triggers this path before mm_percpu_wq is fully
-	 * initialized.
-	 */
-	if (WARN_ON_ONCE(!mm_percpu_wq))
-		return;
-
 	/*
 	 * Do not drain if one is already in progress unless it's specific to
 	 * a zone. Such callers are primarily CMA and memory hotplug and need
@@ -3202,8 +3170,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 	 * disables preemption as part of its processing
 	 */
 	for_each_online_cpu(cpu) {
-		struct per_cpu_pages *pcp;
-		struct zone *z;
 		bool has_pcps = false;
 		struct pcplists *lp;
 
@@ -3214,12 +3180,12 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 			 */
 			has_pcps = true;
 		} else if (zone) {
-			lp = per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp;
+			lp = READ_ONCE(per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp);
 			if (lp->count)
 				has_pcps = true;
 		} else {
 			for_each_populated_zone(z) {
-				lp = per_cpu_ptr(z->per_cpu_pageset, cpu)->lp;
+				lp = READ_ONCE(per_cpu_ptr(z->per_cpu_pageset, cpu)->lp);
 				if (lp->count) {
 					has_pcps = true;
 					break;
@@ -3233,16 +3199,37 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 			cpumask_clear_cpu(cpu, &cpus_with_pcps);
 	}
 
+	if (!force_all_cpus && cpumask_empty(&cpus_with_pcps))
+	       goto exit;
+
+	for_each_cpu(cpu, &cpus_with_pcps) {
+	       for_each_populated_zone(z) {
+		       if (zone && zone != z)
+			       continue;
+
+		       pcp = per_cpu_ptr(z->per_cpu_pageset, cpu);
+		       pcp->drain = rcu_replace_pointer(pcp->lp, pcp->drain,
+					       mutex_is_locked(&pcpu_drain_mutex));
+	       }
+	}
+
+	synchronize_rcu_expedited();
+
 	for_each_cpu(cpu, &cpus_with_pcps) {
-		struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);
+		for_each_populated_zone(z) {
+			int count;
+
+			pcp = per_cpu_ptr(z->per_cpu_pageset, cpu);
+			count = pcp->drain->count;
+			if (!count)
+			       continue;
 
-		drain->zone = zone;
-		INIT_WORK(&drain->work, drain_local_pages_wq);
-		queue_work_on(cpu, mm_percpu_wq, &drain->work);
+			free_pcppages_bulk(z, count, pcp, pcp->drain);
+			VM_BUG_ON(pcp->drain->count);
+		}
 	}
-	for_each_cpu(cpu, &cpus_with_pcps)
-		flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);
 
+exit:
 	mutex_unlock(&pcpu_drain_mutex);
 }
 
@@ -3378,7 +3365,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn,
 
 	__count_vm_event(PGFREE);
 	pcp = this_cpu_ptr(zone->per_cpu_pageset);
-	lp = pcp->lp;
+	lp = rcu_dereference_check(pcp->lp, lockdep_is_held(this_cpu_ptr(&pagesets.lock)));
 	pindex = order_to_pindex(migratetype, order);
 	list_add(&page->lru, &lp->lists[pindex]);
 	lp->count += 1 << order;
@@ -3614,7 +3601,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 	struct pcplists *lp;
 	struct page *page;
 
-	lp = pcp->lp;
+	lp = rcu_dereference_check(pcp->lp, lockdep_is_held(this_cpu_ptr(&pagesets.lock)));
 	list = &lp->lists[order_to_pindex(migratetype, order)];
 
 	do {
@@ -5886,8 +5873,12 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
 			continue;
 
-		for_each_online_cpu(cpu)
-			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp->count;
+		for_each_online_cpu(cpu) {
+			struct pcplists *lp;
+
+			lp = READ_ONCE(per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp);
+			free_pcp += lp->count;
+		}
 	}
 
 	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
@@ -5980,8 +5971,12 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			continue;
 
 		free_pcp = 0;
-		for_each_online_cpu(cpu)
-			free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp->count;
+		for_each_online_cpu(cpu) {
+			struct pcplists *lp;
+
+			lp = READ_ONCE(per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp);
+			free_pcp += lp->count;
+		}
 
 		show_node(zone);
 		printk(KERN_CONT
@@ -6022,7 +6017,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(zone_page_state(zone, NR_MLOCK)),
 			K(zone_page_state(zone, NR_BOUNCE)),
 			K(free_pcp),
-			K(this_cpu_read(zone->per_cpu_pageset)->lp->count),
+			K(READ_ONCE(this_cpu_ptr(zone->per_cpu_pageset)->lp)->count),
 			K(zone_page_state(zone, NR_FREE_CMA_PAGES)));
 		printk("lowmem_reserve[]:");
 		for (i = 0; i < MAX_NR_ZONES; i++)
@@ -6886,10 +6881,13 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
 	memset(pcp, 0, sizeof(*pcp));
 	memset(pzstats, 0, sizeof(*pzstats));
 
-	pcp->lp = &ACCESS_PRIVATE(pcp, pcplists);
+	pcp->lp = &ACCESS_PRIVATE(pcp, pcplists[0]);
+	pcp->drain = &ACCESS_PRIVATE(pcp, pcplists[1]);
 
-	for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
+	for (pindex = 0; pindex < NR_PCP_LISTS; pindex++) {
 		INIT_LIST_HEAD(&pcp->lp->lists[pindex]);
+		INIT_LIST_HEAD(&pcp->drain->lists[pindex]);
+	}
 
 	/*
 	 * Set batch and high values safe for a boot pageset. A true percpu
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 5279d3f34e0b..1ffa4fc64a4f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -856,7 +856,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 			 * if not then there is nothing to expire.
 			 */
 			if (!__this_cpu_read(pcp->expire) ||
-			       !this_cpu_ptr(pcp)->lp->count)
+			       !READ_ONCE(this_cpu_ptr(pcp)->lp)->count)
 				continue;
 
 			/*
@@ -870,7 +870,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
 			if (__this_cpu_dec_return(pcp->expire))
 				continue;
 
-			if (this_cpu_ptr(pcp)->lp->count) {
+			if (READ_ONCE(this_cpu_ptr(pcp)->lp)->count) {
 				drain_zone_pages(zone, this_cpu_ptr(pcp));
 				changes++;
 			}
@@ -1707,7 +1707,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 			   "\n              high:  %i"
 			   "\n              batch: %i",
 			   i,
-			   pcp->lp->count,
+			   READ_ONCE(pcp->lp)->count,
 			   pcp->high,
 			   pcp->batch);
 #ifdef CONFIG_SMP
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC 0/3] mm/page_alloc: Remote per-cpu lists drain support
  2021-10-08 16:19 [RFC 0/3] mm/page_alloc: Remote per-cpu lists drain support Nicolas Saenz Julienne
                   ` (2 preceding siblings ...)
  2021-10-08 16:19 ` [RFC 3/3] mm/page_alloc: Add remote draining support to per-cpu lists Nicolas Saenz Julienne
@ 2021-10-12 15:45 ` Vlastimil Babka
  2021-10-13 12:50   ` nsaenzju
  3 siblings, 1 reply; 7+ messages in thread
From: Vlastimil Babka @ 2021-10-12 15:45 UTC (permalink / raw)
  To: Nicolas Saenz Julienne, akpm
  Cc: linux-kernel, linux-mm, frederic, tglx, peterz, mtosatti, nilal,
	mgorman, linux-rt-users, cl, paulmck, ppandit

On 10/8/21 18:19, Nicolas Saenz Julienne wrote:
> This series replaces mm/page_alloc's per-cpu lists drain mechanism so that it
> can be run remotely. Currently, only the local CPU is permitted to
> change its per-cpu lists, and it's expected to do so, on-demand, whenever a
> process demands it (by means of queueing a drain task on the local CPU). Most
> systems will handle this promptly, but it'll cause problems for NOHZ_FULL CPUs
> that can't take any sort of interruption without breaking their functional
> guarantees (latency, bandwidth, etc...). Having a way for these processes to
> remotely drain the lists themselves will make co-existing with isolated CPUs
> possible, and comes with minimal performance[1]/memory cost to other users.
> 
> The new algorithm will atomically switch the pointer to the per-cpu lists and
> use RCU to make sure it's not being used before draining them. 
> 
> I'm interested in any sort of feedback, but especially validating that the
> approach is acceptable, and any tests/benchmarks you'd like to see run against

So let's consider the added alloc/free fast paths overhead:
- Patch 1 - __alloc_pages_bulk() used to determine pcp_list once, now it's
determined for each allocated page in __rmqueue_pcplist().
- Patch 2 - adds indirection from pcp->$foo to pcp->lp->$foo in each operation
- Patch 3
  - extra irqsave/irqrestore in free_pcppages_bulk (amortized)
  - rcu_dereference_check() in free_unref_page_commit() and __rmqueue_pcplist()

BTW - I'm not sure if the RCU usage is valid here.

The "read side" (normal operations) is using:
rcu_dereference_check(pcp->lp,
		lockdep_is_held(this_cpu_ptr(&pagesets.lock)));

where the lockdep parameter according to the comments for
rcu_dereference_check() means

"indicate to lockdep that foo->bar may only be dereferenced if either
rcu_read_lock() is held, or that the lock required to replace the bar struct
at foo->bar is held."

but you are not taking rcu_read_lock() and the "write side" (remote
draining) actually doesn't take pagesets.lock, so it's not true that the
"lock required to replace ... is held"? The write side uses
rcu_replace_pointer(...,
			mutex_is_locked(&pcpu_drain_mutex))
which is a different lock.

IOW, synchronize_rcu_expedited() AFAICS has nothing (no rcu_read_lock()) to
synchronize against? Might accidentally work on !RT thanks to disabled irqs,
but not sure about the RT lock semantics of the local_lock...

So back to overhead: if I'm correct above, we can assume that there would
also be rcu_read_lock() in the fast paths.

The alternative proposed by tglx was IIRC that there would be a spinlock on
each cpu, which would be mostly uncontended except when draining. Maybe an
uncontended spin lock/unlock would have lower overhead than all of the
above? It would be certainly simpler, so I would probably try that first and
see if it's acceptable?
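
For the sake of discussion, that could look roughly like the sketch below
(purely hypothetical, not part of this series; the 'lock' field is made up):

	/* Assume a new 'spinlock_t lock' field in struct per_cpu_pages. */

	/* Local alloc/free fast path: almost always uncontended. */
	spin_lock(&pcp->lock);
	/* ... add pages to / take pages from the pcp lists ... */
	spin_unlock(&pcp->lock);

	/* Remote drain: no per-CPU work items, no IPIs. */
	for_each_online_cpu(cpu) {
		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);

		spin_lock(&pcp->lock);
		if (pcp->count)
			free_pcppages_bulk(zone, pcp->count, pcp);
		spin_unlock(&pcp->lock);
	}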

> it. For now, I've been testing this successfully on both arm64 and x86_64
> systems while forcing high memory pressure (i.e. forcing the
> page_alloc's slow path).
> 
> Patches 1-2 serve as cleanups/preparation to make patch 3 easier to follow.
> 
> Here's my previous attempt at fixing this:
> https://lkml.org/lkml/2021/9/21/599
> 
> [1] Proper performance numbers will be provided if the approach is deemed
>     acceptable. That said, mm/page_alloc.c's fast paths only grow by an extra
>     pointer indirection and a compiler barrier, which I think is unlikely to be
>     measurable.
> 
> ---
> 
> Nicolas Saenz Julienne (3):
>   mm/page_alloc: Simplify __rmqueue_pcplist()'s arguments
>   mm/page_alloc: Access lists in 'struct per_cpu_pages' indirectly
>   mm/page_alloc: Add remote draining support to per-cpu lists
> 
>  include/linux/mmzone.h |  24 +++++-
>  mm/page_alloc.c        | 173 +++++++++++++++++++++--------------------
>  mm/vmstat.c            |   6 +-
>  3 files changed, 114 insertions(+), 89 deletions(-)
> 



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC 0/3] mm/page_alloc: Remote per-cpu lists drain support
  2021-10-12 15:45 ` [RFC 0/3] mm/page_alloc: Remote per-cpu lists drain support Vlastimil Babka
@ 2021-10-13 12:50   ` nsaenzju
  2021-10-21  8:27     ` Vlastimil Babka
  0 siblings, 1 reply; 7+ messages in thread
From: nsaenzju @ 2021-10-13 12:50 UTC (permalink / raw)
  To: Vlastimil Babka, akpm
  Cc: linux-kernel, linux-mm, frederic, tglx, peterz, mtosatti, nilal,
	mgorman, linux-rt-users, cl, paulmck, ppandit

Hi Vlastimil, thanks for spending time on this.
Also, excuse me if I over explain things.

On Tue, 2021-10-12 at 17:45 +0200, Vlastimil Babka wrote:
> On 10/8/21 18:19, Nicolas Saenz Julienne wrote:
> > This series replaces mm/page_alloc's per-cpu lists drain mechanism so that it
> > can be run remotely. Currently, only the local CPU is permitted to
> > change its per-cpu lists, and it's expected to do so, on-demand, whenever a
> > process demands it (by means of queueing a drain task on the local CPU). Most
> > systems will handle this promptly, but it'll cause problems for NOHZ_FULL CPUs
> > that can't take any sort of interruption without breaking their functional
> > guarantees (latency, bandwidth, etc...). Having a way for these processes to
> > remotely drain the lists themselves will make co-existing with isolated CPUs
> > possible, and comes with minimal performance[1]/memory cost to other users.
> > 
> > The new algorithm will atomically switch the pointer to the per-cpu lists and
> > use RCU to make sure it's not being used before draining them. 
> > 
> > I'm interested in any sort of feedback, but especially validating that the
> > approach is acceptable, and any tests/benchmarks you'd like to see run against
> 
> So let's consider the added alloc/free fast paths overhead:
> - Patch 1 - __alloc_pages_bulk() used to determine pcp_list once, now it's
> determined for each allocated page in __rmqueue_pcplist().

This one I can avoid; I missed the performance aspect of it. I was aiming at
making the code bearable.

> - Patch 2 - adds indirection from pcp->$foo to pcp->lp->$foo in each operation
> - Patch 3
>   - extra irqsave/irqrestore in free_pcppages_bulk (amortized)
>   - rcu_dereference_check() in free_unref_page_commit() and __rmqueue_pcplist()

Yes.

> BTW - I'm not sure if the RCU usage is valid here.
>
> The "read side" (normal operations) is using:
> rcu_dereference_check(pcp->lp,
> 		lockdep_is_held(this_cpu_ptr(&pagesets.lock)));
> 
> where the lockdep parameter according to the comments for
> rcu_dereference_check() means
> 
> "indicate to lockdep that foo->bar may only be dereferenced if either
> rcu_read_lock() is held, or that the lock required to replace the bar struct
> at foo->bar is held."

You missed the "Could be used to" at the beginning of the sentence :). That
said, I believe this is similar to what I'm doing, only that the situation is
more complex.

> but you are not taking rcu_read_lock() 

I am taking the rcu_read_lock() implicitly; it's explained in 'struct
per_cpu_pages', and in more depth below.

> and the "write side" (remote draining) actually doesn't take pagesets.lock,
> so it's not true that the "lock required to replace ... is held"? The write
> side uses rcu_replace_pointer(..., mutex_is_locked(&pcpu_drain_mutex))
> which is a different lock.

The thing 'pagesets.lock' protects against is concurrent access to pcp->lp's
content, as opposed to its address. pcp->lp is dereferenced atomically, so no
need for locking on that operation.

The drain side never accesses pcp->lp's contents concurrently; it changes
pcp->lp's address and makes sure all CPUs are in sync with the new address
before clearing the stale data.

Just for the record, I think a better representation of what the 'check' in
rcu_dereference_check() means is:

 * Do an rcu_dereference(), but check that the conditions under which the
 * dereference will take place are correct.  Typically the conditions
 * indicate the various locking conditions that should be held at that
 * point. The check should return true if the conditions are satisfied.
 * An implicit check for being in an RCU read-side critical section
 * (rcu_read_lock()) is included.

So for the read side, that is, code reading pcp->lp's address and its contents,
the conditions to be met are: being in an RCU critical section, to make sure RCU
is keeping track of it, and holding 'pagesets.lock', to avoid concurrently
accessing pcp->lp's contents. The latter is achieved either by disabling local
irqs or disabling migration and getting a per-cpu rt_spinlock. Conveniently
these are actions that implicitly delimit an RCU critical section (see [1] and
[2]). So the 'pagesets.lock' check fully covers the read side locking/RCU
concerns.

On the write side, the drain has to make sure the pcp->lp address change is
atomic (this is achieved through rcu_replace_pointer()) and that pcp->drain is
emptied before a new pointer swap happens. So checking for pcpu_drain_mutex
being held is good enough.
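
Or, to put the pairing in code (condensed from patch 3; the real code also
checks counts before freeing):

	/* Read side (alloc/free fast paths), always on the local CPU: */
	local_lock_irqsave(&pagesets.lock, flags);	/* implies an RCU read side */
	lp = rcu_dereference_check(pcp->lp,
			lockdep_is_held(this_cpu_ptr(&pagesets.lock)));
	/* ... use lp->lists / lp->count, dereferencing 'lp' only this once ... */
	local_unlock_irqrestore(&pagesets.lock, flags);

	/* Write side (remote drain), serialized by 'pcpu_drain_mutex': */
	pcp->drain = rcu_replace_pointer(pcp->lp, pcp->drain,
			mutex_is_locked(&pcpu_drain_mutex));
	synchronize_rcu_expedited();	/* waits out readers still using the old lp */
	free_pcppages_bulk(zone, pcp->drain->count, pcp, pcp->drain);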

> IOW, synchronize_rcu_expedited() AFAICS has nothing (no rcu_read_lock()) to
> synchronize against? Might accidentally work on !RT thanks to disabled irqs,
> but not sure about the RT lock semantics of the local_lock...
>
> So back to overhead: if I'm correct above, we can assume that there would
> also be rcu_read_lock() in the fast paths.

As I explained above, no need.

> The alternative proposed by tglx was IIRC that there would be a spinlock on
> each cpu, which would be mostly uncontended except when draining. Maybe an
> uncontended spin lock/unlock would have lower overhead than all of the
> above? It would be certainly simpler, so I would probably try that first and
> see if it's acceptable?

You have a point here. I'll provide a performance rundown of both solutions.
This one is a bit more complex that's for sure.

Thanks!

[1] See rcu_read_lock()'s description: "synchronize_rcu() wait for regions of
    code with preemption disabled, including regions of code with interrupts or
    softirqs disabled."

[2] See kernel/locking/spinlock_rt.c: "The RT [spinlock] substitutions
    explicitly disable migration and take rcu_read_lock() across the lock held
    section."

-- 
Nicolás Sáenz



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC 0/3] mm/page_alloc: Remote per-cpu lists drain support
  2021-10-13 12:50   ` nsaenzju
@ 2021-10-21  8:27     ` Vlastimil Babka
  0 siblings, 0 replies; 7+ messages in thread
From: Vlastimil Babka @ 2021-10-21  8:27 UTC (permalink / raw)
  To: nsaenzju, akpm
  Cc: linux-kernel, linux-mm, frederic, tglx, peterz, mtosatti, nilal,
	mgorman, linux-rt-users, cl, paulmck, ppandit

On 10/13/21 14:50, nsaenzju@redhat.com wrote:
> Hi Vlastimil, thanks for spending time on this.
> Also, excuse me if I over explain things.

Hi, thanks for spending time on the explanation :) It was very useful.

...

> 
>> and the "write side" (remote draining) actually doesn't take pagesets.lock,
>> so it's not true that the "lock required to replace ... is held"? The write
>> side uses rcu_replace_pointer(..., mutex_is_locked(&pcpu_drain_mutex))
>> which is a different lock.
> 
> The thing 'pagesets.lock' protects against is concurrent access to pcp->lp's
> content, as opposed to its address. pcp->lp is dereferenced atomically, so no
> need for locking on that operation.
> 
> The drain side never accesses pcp->lp's contents concurrently, it changes
> pcp->lp's address and makes sure all CPUs are in sync with the new address
> before clearing the stale data.
> 
> Just for the record, I think a better representation of what 'check' in
> rcu_dereference means is:
> 
>  * Do an rcu_dereference(), but check that the conditions under which the
>  * dereference will take place are correct.  Typically the conditions
>  * indicate the various locking conditions that should be held at that
>  * point. The check should return true if the conditions are satisfied.
>  * An implicit check for being in an RCU read-side critical section
>  * (rcu_read_lock()) is included.
> 
> So for the read side, that is, code reading pcp->lp's address and its contents,
> the conditions to be met are: being in an RCU critical section, to make sure RCU
> is keeping track of it, and holding 'pagesets.lock', to avoid concurrently
> accessing pcp->lp's contents. The latter is achieved either by disabling local
> irqs or disabling migration and getting a per-cpu rt_spinlock. Conveniently
> these are actions that implicitly delimit an RCU critical section (see [1] and
> [2]). So the 'pagesets.lock' check fully covers the read side locking/RCU
> concerns.

Yeah, I wasn't aware of [2] especially. It makes sense that RT locks provide
the same guarantees for RCU as non-RT.

> On the write side, the drain has to make sure the pcp->lp address change is
> atomic (this is achieved through rcu_replace_pointer()) and that pcp->drain is
> emptied before a new pointer swap happens. So checking for pcpu_drain_mutex
> being held is good enough.
> 
>> IOW, synchronize_rcu_expedited() AFAICS has nothing (no rcu_read_lock()) to
>> synchronize against? Might accidentally work on !RT thanks to disabled irqs,
>> but not sure about the RT lock semantics of the local_lock...
>>
>> So back to overhead, if I'm correct above we can assume that there would be
>> also rcu_read_lock() in the fast paths.
> 
> As I explained above, no need.
> 
>> The alternative proposed by tglx was IIRC that there would be a spinlock on
>> each cpu, which would be mostly uncontended except when draining. Maybe an
>> uncontended spin lock/unlock would have lower overhead than all of the
>> above? It would be certainly simpler, so I would probably try that first and
>> see if it's acceptable?
> 
> You have a point here. I'll provide a performance rundown of both solutions.
> This one is a bit more complex that's for sure.

Great, thanks!

> Thanks!
> 
> [1] See rcu_read_lock()'s description: "synchronize_rcu() wait for regions of
>     code with preemption disabled, including regions of code with interrupts or
>     softirqs disabled."
> 
> [2] See kernel/locking/spinlock_rt.c: "The RT [spinlock] substitutions
>     explicitly disable migration and take rcu_read_lock() across the lock held
>     section."
> 



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-10-21  8:27 UTC | newest]

Thread overview: 7+ messages
2021-10-08 16:19 [RFC 0/3] mm/page_alloc: Remote per-cpu lists drain support Nicolas Saenz Julienne
2021-10-08 16:19 ` [RFC 1/3] mm/page_alloc: Simplify __rmqueue_pcplist()'s arguments Nicolas Saenz Julienne
2021-10-08 16:19 ` [RFC 2/3] mm/page_alloc: Access lists in 'struct per_cpu_pages' indirectly Nicolas Saenz Julienne
2021-10-08 16:19 ` [RFC 3/3] mm/page_alloc: Add remote draining support to per-cpu lists Nicolas Saenz Julienne
2021-10-12 15:45 ` [RFC 0/3] mm/page_alloc: Remote per-cpu lists drain support Vlastimil Babka
2021-10-13 12:50   ` nsaenzju
2021-10-21  8:27     ` Vlastimil Babka
