linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 00/12] fold per-CPU vmstats remotely
@ 2023-03-05 13:36 Marcelo Tosatti
  2023-03-05 13:36 ` [PATCH v4 01/12] mm/vmstat: remove remote node draining Marcelo Tosatti
                   ` (11 more replies)
  0 siblings, 12 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2023-03-05 13:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86

This patch series addresses the following two problems:

    1. A customer provided some evidence which indicates that
       the idle tick was stopped; albeit, CPU-specific vmstat
       counters still remained populated.

       Thus one can only assume quiet_vmstat() was not
       invoked on return to the idle loop. If I understand
       correctly, I suspect this divergence might erroneously
       prevent a reclaim attempt by kswapd. If the number of
       zone specific free pages are below their per-cpu drift
       value then zone_page_state_snapshot() is used to
       compute a more accurate view of the aforementioned
       statistic.  Thus any task blocked on the NUMA node
       specific pfmemalloc_wait queue will be unable to make
       significant progress via direct reclaim unless it is
       killed after being woken up by kswapd
       (see throttle_direct_reclaim())

    2. With a SCHED_FIFO task that busy loops on a given CPU,
       and kworker for that CPU at SCHED_OTHER priority,
       queuing work to sync per-vmstats will either cause that
       work to never execute, or stalld (i.e. stall daemon)
       boosts kworker priority which causes a latency
       violation

By having vmstat_shepherd flush the per-CPU counters to the
global counters from remote CPUs.

This is done using cmpxchg to manipulate the counters,
both CPU locally (via the account functions),
and remotely (via cpu_vm_stats_fold).

Thanks to Aaron Tomlin for diagnosing issue 1 and writing
the initial patch series.

v4:
- Switch per-CPU vmstat counters to s32, required for
  architectures that do not provide cmpxchg/xchg on
  8 bits

v3:
- Removed unused drain_zone_pages and changes variable (David Hildenbrand)
- Use xchg instead of cmpxchg in refresh_cpu_vm_stats  (Peter Xu)
- Add drain_all_pages to vmstat_refresh to make
  stats more accurate				       (Peter Xu)
- Improve changelog of
  "mm/vmstat: switch counter modification to cmpxchg"  (Peter Xu / David)
- Improve changelog of
  "mm/vmstat: remove remote node draining"	       (David Hildenbrand)


v2:
- actually use LOCK CMPXCHG on counter mod/inc/dec functions
  (Christoph Lameter)
- use try_cmpxchg for cmpxchg loops
  (Uros Bizjak / Matthew Wilcox)


 arch/arm64/include/asm/percpu.h     |   16 ++
 arch/loongarch/include/asm/percpu.h |   23 +++
 arch/s390/include/asm/percpu.h      |    5 
 arch/x86/include/asm/percpu.h       |   39 +++---
 include/asm-generic/percpu.h        |   17 ++
 include/linux/mmzone.h              |   11 -
 include/linux/percpu-defs.h         |    2 
 kernel/fork.c                       |    2 
 kernel/scs.c                        |    2 
 mm/page_alloc.c                     |   23 ---
 mm/vmstat.c                         |  452 +++++++++++++++++++++++++++++++++++++++++------------------------------------
 11 files changed, 325 insertions(+), 267 deletions(-)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4 01/12] mm/vmstat: remove remote node draining
  2023-03-05 13:36 [PATCH v4 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
@ 2023-03-05 13:36 ` Marcelo Tosatti
  2023-03-09 17:11   ` Vlastimil Babka
  2023-03-05 13:36 ` [PATCH v4 02/12] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 17+ messages in thread
From: Marcelo Tosatti @ 2023-03-05 13:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	David Hildenbrand, Marcelo Tosatti

Draining of pages from the local pcp for a remote zone should not be
necessary, since once the system is low on memory (or compaction on a
zone is in effect), drain_all_pages should be called freeing any unused
pcps.

For reference, the original commit which introduces remote node
draining is 4037d452202e34214e8a939fa5621b2b3bbb45b7.

Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/include/linux/mmzone.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/mmzone.h
+++ linux-vmstat-remote/include/linux/mmzone.h
@@ -679,9 +679,6 @@ struct per_cpu_pages {
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
 	short free_factor;	/* batch scaling factor during free */
-#ifdef CONFIG_NUMA
-	short expire;		/* When 0, remote pagesets are drained */
-#endif
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
 	struct list_head lists[NR_PCP_LISTS];
Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -803,20 +803,16 @@ static int fold_diff(int *zone_diff, int
  *
  * The function returns the number of global counters updated.
  */
-static int refresh_cpu_vm_stats(bool do_pagesets)
+static int refresh_cpu_vm_stats(void)
 {
 	struct pglist_data *pgdat;
 	struct zone *zone;
 	int i;
 	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
 	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
-	int changes = 0;
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
-#ifdef CONFIG_NUMA
-		struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
-#endif
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 			int v;
@@ -826,44 +822,8 @@ static int refresh_cpu_vm_stats(bool do_
 
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
-#ifdef CONFIG_NUMA
-				/* 3 seconds idle till flush */
-				__this_cpu_write(pcp->expire, 3);
-#endif
 			}
 		}
-#ifdef CONFIG_NUMA
-
-		if (do_pagesets) {
-			cond_resched();
-			/*
-			 * Deal with draining the remote pageset of this
-			 * processor
-			 *
-			 * Check if there are pages remaining in this pageset
-			 * if not then there is nothing to expire.
-			 */
-			if (!__this_cpu_read(pcp->expire) ||
-			       !__this_cpu_read(pcp->count))
-				continue;
-
-			/*
-			 * We never drain zones local to this processor.
-			 */
-			if (zone_to_nid(zone) == numa_node_id()) {
-				__this_cpu_write(pcp->expire, 0);
-				continue;
-			}
-
-			if (__this_cpu_dec_return(pcp->expire))
-				continue;
-
-			if (__this_cpu_read(pcp->count)) {
-				drain_zone_pages(zone, this_cpu_ptr(pcp));
-				changes++;
-			}
-		}
-#endif
 	}
 
 	for_each_online_pgdat(pgdat) {
@@ -880,8 +840,7 @@ static int refresh_cpu_vm_stats(bool do_
 		}
 	}
 
-	changes += fold_diff(global_zone_diff, global_node_diff);
-	return changes;
+	return fold_diff(global_zone_diff, global_node_diff);
 }
 
 /*
@@ -1867,7 +1826,7 @@ int sysctl_stat_interval __read_mostly =
 #ifdef CONFIG_PROC_FS
 static void refresh_vm_stats(struct work_struct *work)
 {
-	refresh_cpu_vm_stats(true);
+	refresh_cpu_vm_stats();
 }
 
 int vmstat_refresh(struct ctl_table *table, int write,
@@ -1877,6 +1836,8 @@ int vmstat_refresh(struct ctl_table *tab
 	int err;
 	int i;
 
+	drain_all_pages(NULL);
+
 	/*
 	 * The regular update, every sysctl_stat_interval, may come later
 	 * than expected: leaving a significant amount in per_cpu buckets.
@@ -1931,7 +1892,7 @@ int vmstat_refresh(struct ctl_table *tab
 
 static void vmstat_update(struct work_struct *w)
 {
-	if (refresh_cpu_vm_stats(true)) {
+	if (refresh_cpu_vm_stats()) {
 		/*
 		 * Counters were updated so we expect more updates
 		 * to occur in the future. Keep on running the
@@ -1994,7 +1955,7 @@ void quiet_vmstat(void)
 	 * it would be too expensive from this path.
 	 * vmstat_shepherd will take care about that for us.
 	 */
-	refresh_cpu_vm_stats(false);
+	refresh_cpu_vm_stats();
 }
 
 /*
Index: linux-vmstat-remote/mm/page_alloc.c
===================================================================
--- linux-vmstat-remote.orig/mm/page_alloc.c
+++ linux-vmstat-remote/mm/page_alloc.c
@@ -3176,26 +3176,6 @@ static int rmqueue_bulk(struct zone *zon
 	return allocated;
 }
 
-#ifdef CONFIG_NUMA
-/*
- * Called from the vmstat counter updater to drain pagesets of this
- * currently executing processor on remote nodes after they have
- * expired.
- */
-void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
-{
-	int to_drain, batch;
-
-	batch = READ_ONCE(pcp->batch);
-	to_drain = min(pcp->count, batch);
-	if (to_drain > 0) {
-		spin_lock(&pcp->lock);
-		free_pcppages_bulk(zone, to_drain, pcp, 0);
-		spin_unlock(&pcp->lock);
-	}
-}
-#endif
-
 /*
  * Drain pcplists of the indicated processor and zone.
  */



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4 02/12] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-05 13:36 [PATCH v4 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
  2023-03-05 13:36 ` [PATCH v4 01/12] mm/vmstat: remove remote node draining Marcelo Tosatti
@ 2023-03-05 13:36 ` Marcelo Tosatti
  2023-03-05 13:37 ` [PATCH v4 03/12] this_cpu_cmpxchg: loongarch: " Marcelo Tosatti
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2023-03-05 13:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Marcelo Tosatti

Goal is to have vmstat_shepherd to transfer from
per-CPU counters to global counters remotely. For this, 
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
change ARM's this_cpu_cmpxchg_ helpers to be atomic,
and add this_cpu_cmpxchg_local_ helpers which are not atomic.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
+++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
@@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
 	_pcp_protect_return(xchg_relaxed, pcp, val)
 
 #define this_cpu_cmpxchg_1(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_2(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_4(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_8(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg, pcp, o, n)
+
+#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
 	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+
 
 #ifdef __KVM_NVHE_HYPERVISOR__
 extern unsigned long __hyp_per_cpu_offset(unsigned int cpu);



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4 03/12] this_cpu_cmpxchg: loongarch: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-05 13:36 [PATCH v4 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
  2023-03-05 13:36 ` [PATCH v4 01/12] mm/vmstat: remove remote node draining Marcelo Tosatti
  2023-03-05 13:36 ` [PATCH v4 02/12] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
@ 2023-03-05 13:37 ` Marcelo Tosatti
  2023-03-05 13:37 ` [PATCH v4 04/12] this_cpu_cmpxchg: S390: " Marcelo Tosatti
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2023-03-05 13:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Marcelo Tosatti

Goal is to have vmstat_shepherd to transfer from
per-CPU counters to global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
add this_cpu_cmpxchg_local helpers to Loongarch.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/arch/loongarch/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/loongarch/include/asm/percpu.h
+++ linux-vmstat-remote/arch/loongarch/include/asm/percpu.h
@@ -150,6 +150,16 @@ static inline unsigned long __percpu_xch
 }
 
 /* this_cpu_cmpxchg */
+#define _protect_cmpxchg(pcp, o, n)				\
+({								\
+	typeof(*raw_cpu_ptr(&(pcp))) __ret;			\
+	preempt_disable_notrace();				\
+	__ret = cmpxchg(raw_cpu_ptr(&(pcp)), o, n);		\
+	preempt_enable_notrace();				\
+	__ret;							\
+})
+
+/* this_cpu_cmpxchg_local */
 #define _protect_cmpxchg_local(pcp, o, n)			\
 ({								\
 	typeof(*raw_cpu_ptr(&(pcp))) __ret;			\
@@ -222,10 +232,15 @@ do {									\
 #define this_cpu_xchg_4(pcp, val) _percpu_xchg(pcp, val)
 #define this_cpu_xchg_8(pcp, val) _percpu_xchg(pcp, val)
 
-#define this_cpu_cmpxchg_1(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_2(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_4(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_8(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_1(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_2(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_4(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_8(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+
+#define this_cpu_cmpxchg_1(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_2(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_4(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_8(ptr, o, n) _protect_cmpxchg(ptr, o, n)
 
 #include <asm-generic/percpu.h>
 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4 04/12] this_cpu_cmpxchg: S390: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-05 13:36 [PATCH v4 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2023-03-05 13:37 ` [PATCH v4 03/12] this_cpu_cmpxchg: loongarch: " Marcelo Tosatti
@ 2023-03-05 13:37 ` Marcelo Tosatti
  2023-03-05 13:37 ` [PATCH v4 05/12] this_cpu_cmpxchg: x86: " Marcelo Tosatti
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2023-03-05 13:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Marcelo Tosatti

Goal is to have vmstat_shepherd to transfer from
per-CPU counters to global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
add S390's this_cpu_cmpxchg_local.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/arch/s390/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/s390/include/asm/percpu.h
+++ linux-vmstat-remote/arch/s390/include/asm/percpu.h
@@ -148,6 +148,11 @@
 #define this_cpu_cmpxchg_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 #define this_cpu_cmpxchg_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+
 #define arch_this_cpu_xchg(pcp, nval)					\
 ({									\
 	typeof(pcp) *ptr__;						\



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4 05/12] this_cpu_cmpxchg: x86: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-05 13:36 [PATCH v4 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2023-03-05 13:37 ` [PATCH v4 04/12] this_cpu_cmpxchg: S390: " Marcelo Tosatti
@ 2023-03-05 13:37 ` Marcelo Tosatti
  2023-03-06 11:22   ` Peter Zijlstra
  2023-03-05 13:37 ` [PATCH v4 06/12] add this_cpu_cmpxchg_local and asm-generic definitions Marcelo Tosatti
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 17+ messages in thread
From: Marcelo Tosatti @ 2023-03-05 13:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Marcelo Tosatti

Goal is to have vmstat_shepherd to transfer from
per-CPU counters to global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
change x86's this_cpu_cmpxchg_ helpers to be atomic.
and add this_cpu_cmpxchg_local_ helpers which are not atomic.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/arch/x86/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/x86/include/asm/percpu.h
+++ linux-vmstat-remote/arch/x86/include/asm/percpu.h
@@ -197,11 +197,11 @@ do {									\
  * cmpxchg has no such implied lock semantics as a result it is much
  * more efficient for cpu local operations.
  */
-#define percpu_cmpxchg_op(size, qual, _var, _oval, _nval)		\
+#define percpu_cmpxchg_op(size, qual, _var, _oval, _nval, lockp)	\
 ({									\
 	__pcpu_type_##size pco_old__ = __pcpu_cast_##size(_oval);	\
 	__pcpu_type_##size pco_new__ = __pcpu_cast_##size(_nval);	\
-	asm qual (__pcpu_op2_##size("cmpxchg", "%[nval]",		\
+	asm qual (__pcpu_op2_##size(lockp "cmpxchg", "%[nval]",		\
 				    __percpu_arg([var]))		\
 		  : [oval] "+a" (pco_old__),				\
 		    [var] "+m" (_var)					\
@@ -279,16 +279,20 @@ do {									\
 #define raw_cpu_add_return_1(pcp, val)		percpu_add_return_op(1, , pcp, val)
 #define raw_cpu_add_return_2(pcp, val)		percpu_add_return_op(2, , pcp, val)
 #define raw_cpu_add_return_4(pcp, val)		percpu_add_return_op(4, , pcp, val)
-#define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, , pcp, oval, nval)
-#define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, , pcp, oval, nval)
-#define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, , pcp, oval, nval)
+#define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, , pcp, oval, nval, "")
+#define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, , pcp, oval, nval, "")
+#define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, , pcp, oval, nval, "")
 
 #define this_cpu_add_return_1(pcp, val)		percpu_add_return_op(1, volatile, pcp, val)
 #define this_cpu_add_return_2(pcp, val)		percpu_add_return_op(2, volatile, pcp, val)
 #define this_cpu_add_return_4(pcp, val)		percpu_add_return_op(4, volatile, pcp, val)
-#define this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, volatile, pcp, oval, nval)
-#define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval)
-#define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval)
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval)	percpu_cmpxchg_op(1, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval, "")
+
+#define this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, volatile, pcp, oval, nval, LOCK_PREFIX)
+#define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval, LOCK_PREFIX)
+#define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval, LOCK_PREFIX)
 
 #ifdef CONFIG_X86_CMPXCHG64
 #define percpu_cmpxchg8b_double(pcp1, pcp2, o1, o2, n1, n2)		\
@@ -319,16 +323,17 @@ do {									\
 #define raw_cpu_or_8(pcp, val)			percpu_to_op(8, , "or", (pcp), val)
 #define raw_cpu_add_return_8(pcp, val)		percpu_add_return_op(8, , pcp, val)
 #define raw_cpu_xchg_8(pcp, nval)		raw_percpu_xchg_op(pcp, nval)
-#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, , pcp, oval, nval)
+#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, , pcp, oval, nval, "")
 
-#define this_cpu_read_8(pcp)			percpu_from_op(8, volatile, "mov", pcp)
-#define this_cpu_write_8(pcp, val)		percpu_to_op(8, volatile, "mov", (pcp), val)
-#define this_cpu_add_8(pcp, val)		percpu_add_op(8, volatile, (pcp), val)
-#define this_cpu_and_8(pcp, val)		percpu_to_op(8, volatile, "and", (pcp), val)
-#define this_cpu_or_8(pcp, val)			percpu_to_op(8, volatile, "or", (pcp), val)
-#define this_cpu_add_return_8(pcp, val)		percpu_add_return_op(8, volatile, pcp, val)
-#define this_cpu_xchg_8(pcp, nval)		percpu_xchg_op(8, volatile, pcp, nval)
-#define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, volatile, pcp, oval, nval)
+#define this_cpu_read_8(pcp)				percpu_from_op(8, volatile, "mov", pcp)
+#define this_cpu_write_8(pcp, val)			percpu_to_op(8, volatile, "mov", (pcp), val)
+#define this_cpu_add_8(pcp, val)			percpu_add_op(8, volatile, (pcp), val)
+#define this_cpu_and_8(pcp, val)			percpu_to_op(8, volatile, "and", (pcp), val)
+#define this_cpu_or_8(pcp, val)				percpu_to_op(8, volatile, "or", (pcp), val)
+#define this_cpu_add_return_8(pcp, val)			percpu_add_return_op(8, volatile, pcp, val)
+#define this_cpu_xchg_8(pcp, nval)			percpu_xchg_op(8, volatile, pcp, nval)
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval)	percpu_cmpxchg_op(8, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_8(pcp, oval, nval)		percpu_cmpxchg_op(8, volatile, pcp, oval, nval, LOCK_PREFIX)
 
 /*
  * Pretty complex macro to generate cmpxchg16 instruction.  The instruction



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4 06/12] add this_cpu_cmpxchg_local and asm-generic definitions
  2023-03-05 13:36 [PATCH v4 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (4 preceding siblings ...)
  2023-03-05 13:37 ` [PATCH v4 05/12] this_cpu_cmpxchg: x86: " Marcelo Tosatti
@ 2023-03-05 13:37 ` Marcelo Tosatti
  2023-03-05 13:37 ` [PATCH v4 07/12] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2023-03-05 13:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Marcelo Tosatti

Goal is to have vmstat_shepherd to transfer from
per-CPU counters to global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Add this_cpu_cmpxchg_local_ helpers to asm-generic/percpu.h.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/include/asm-generic/percpu.h
===================================================================
--- linux-vmstat-remote.orig/include/asm-generic/percpu.h
+++ linux-vmstat-remote/include/asm-generic/percpu.h
@@ -424,6 +424,23 @@ do {									\
 	this_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
 
+#ifndef this_cpu_cmpxchg_local_1
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_2
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_4
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_8
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+
 #ifndef this_cpu_cmpxchg_double_1
 #define this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \
 	this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
Index: linux-vmstat-remote/include/linux/percpu-defs.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/percpu-defs.h
+++ linux-vmstat-remote/include/linux/percpu-defs.h
@@ -513,6 +513,8 @@ do {									\
 #define this_cpu_xchg(pcp, nval)	__pcpu_size_call_return2(this_cpu_xchg_, pcp, nval)
 #define this_cpu_cmpxchg(pcp, oval, nval) \
 	__pcpu_size_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
+#define this_cpu_cmpxchg_local(pcp, oval, nval) \
+	__pcpu_size_call_return2(this_cpu_cmpxchg_local_, pcp, oval, nval)
 #define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
 	__pcpu_double_call_return_bool(this_cpu_cmpxchg_double_, pcp1, pcp2, oval1, oval2, nval1, nval2)
 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4 07/12] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local
  2023-03-05 13:36 [PATCH v4 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (5 preceding siblings ...)
  2023-03-05 13:37 ` [PATCH v4 06/12] add this_cpu_cmpxchg_local and asm-generic definitions Marcelo Tosatti
@ 2023-03-05 13:37 ` Marcelo Tosatti
  2023-03-05 13:37 ` [PATCH v4 08/12] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2023-03-05 13:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Peter Xu, Marcelo Tosatti

this_cpu_cmpxchg was modified to atomic version, which 
can be more costly than non-atomic version.

Switch users of this_cpu_cmpxchg to this_cpu_cmpxchg_local
(which preserves pre-non-atomic this_cpu_cmpxchg behaviour).

Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/kernel/fork.c
===================================================================
--- linux-vmstat-remote.orig/kernel/fork.c
+++ linux-vmstat-remote/kernel/fork.c
@@ -203,7 +203,7 @@ static bool try_release_thread_stack_to_
 	unsigned int i;
 
 	for (i = 0; i < NR_CACHED_STACKS; i++) {
-		if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm) != NULL)
+		if (this_cpu_cmpxchg_local(cached_stacks[i], NULL, vm) != NULL)
 			continue;
 		return true;
 	}
Index: linux-vmstat-remote/kernel/scs.c
===================================================================
--- linux-vmstat-remote.orig/kernel/scs.c
+++ linux-vmstat-remote/kernel/scs.c
@@ -83,7 +83,7 @@ void scs_free(void *s)
 	 */
 
 	for (i = 0; i < NR_CACHED_SCS; i++)
-		if (this_cpu_cmpxchg(scs_cache[i], 0, s) == NULL)
+		if (this_cpu_cmpxchg_local(scs_cache[i], 0, s) == NULL)
 			return;
 
 	kasan_unpoison_vmalloc(s, SCS_SIZE, KASAN_VMALLOC_PROT_NORMAL);



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4 08/12] mm/vmstat: switch counter modification to cmpxchg
  2023-03-05 13:36 [PATCH v4 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (6 preceding siblings ...)
  2023-03-05 13:37 ` [PATCH v4 07/12] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
@ 2023-03-05 13:37 ` Marcelo Tosatti
  2023-03-05 13:37 ` [PATCH v4 09/12] vmstat: switch per-cpu vmstat counters to 32-bits Marcelo Tosatti
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2023-03-05 13:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Marcelo Tosatti

In preparation for switching vmstat shepherd to flush
per-CPU counters remotely, switch the __{mod,inc,dec} functions that
modify the counters to use cmpxchg.

To facilitate reviewing, functions are ordered in the text file, as:

__{mod,inc,dec}_{zone,node}_page_state
#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
{mod,inc,dec}_{zone,node}_page_state
#else
{mod,inc,dec}_{zone,node}_page_state
#endif

This patch defines the __ versions for the
CONFIG_HAVE_CMPXCHG_LOCAL case to be their non-"__" counterparts:

#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
{mod,inc,dec}_{zone,node}_page_state
__{mod,inc,dec}_{zone,node}_page_state = {mod,inc,dec}_{zone,node}_page_state
#else
{mod,inc,dec}_{zone,node}_page_state
__{mod,inc,dec}_{zone,node}_page_state
#endif

To test the performance difference, a page allocator microbenchmark:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c 
with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.

For the single_page_alloc_free test, which does

        /** Loop to measure **/
        for (i = 0; i < rec->loops; i++) {
                my_page = alloc_page(gfp_mask);
                if (unlikely(my_page == NULL))
                        return 0;
                __free_page(my_page);
        }

Unit is cycles.

Vanilla			Patched		Diff
115.25			117		1.4%

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -334,6 +334,188 @@ void set_pgdat_percpu_threshold(pg_data_
 	}
 }
 
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/*
+ * If we have cmpxchg_local support then we do not need to incur the overhead
+ * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
+ *
+ * mod_state() modifies the zone counter state through atomic per cpu
+ * operations.
+ *
+ * Overstep mode specifies how overstep should handled:
+ *     0       No overstepping
+ *     1       Overstepping half of threshold
+ *     -1      Overstepping minus half of threshold
+ */
+static inline void mod_zone_state(struct zone *zone, enum zone_stat_item item,
+				  long delta, int overstep_mode)
+{
+	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
+	s8 __percpu *p = pcp->vm_stat_diff + item;
+	long o, n, t, z;
+
+	do {
+		z = 0;  /* overflow to zone counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a zone.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (abs(n) > t) {
+			int os = overstep_mode * (t >> 1);
+
+			/* Overflow must be added to zone counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		zone_page_state_add(z, zone, item);
+}
+
+void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+			 long delta)
+{
+	mod_zone_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_zone_page_state);
+
+void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+			   long delta)
+{
+	mod_zone_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_zone_page_state);
+
+void inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_zone_page_state);
+
+void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(__inc_zone_page_state);
+
+void dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_zone_page_state);
+
+void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(__dec_zone_page_state);
+
+static inline void mod_node_state(struct pglist_data *pgdat,
+				  enum node_stat_item item,
+				  int delta, int overstep_mode)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	long o, n, t, z;
+
+	if (vmstat_item_in_bytes(item)) {
+		/*
+		 * Only cgroups use subpage accounting right now; at
+		 * the global level, these items still change in
+		 * multiples of whole pages. Store them as pages
+		 * internally to keep the per-cpu counters compact.
+		 */
+		VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
+		delta >>= PAGE_SHIFT;
+	}
+
+	do {
+		z = 0;  /* overflow to node counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a node.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (abs(n) > t) {
+			int os = overstep_mode * (t >> 1);
+
+			/* Overflow must be added to node counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		node_page_state_add(z, pgdat, item);
+}
+
+void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_node_page_state);
+
+void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_node_page_state);
+
+void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	mod_node_state(pgdat, item, 1, 1);
+}
+
+void inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_node_page_state);
+
+void __inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(__inc_node_page_state);
+
+void dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_node_page_state);
+
+void __dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(__dec_node_page_state);
+#else
 /*
  * For use when we know that interrupts are disabled,
  * or when we know that preemption is disabled and that
@@ -541,149 +723,6 @@ void __dec_node_page_state(struct page *
 }
 EXPORT_SYMBOL(__dec_node_page_state);
 
-#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
-/*
- * If we have cmpxchg_local support then we do not need to incur the overhead
- * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
- *
- * mod_state() modifies the zone counter state through atomic per cpu
- * operations.
- *
- * Overstep mode specifies how overstep should handled:
- *     0       No overstepping
- *     1       Overstepping half of threshold
- *     -1      Overstepping minus half of threshold
-*/
-static inline void mod_zone_state(struct zone *zone,
-       enum zone_stat_item item, long delta, int overstep_mode)
-{
-	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	long o, n, t, z;
-
-	do {
-		z = 0;  /* overflow to zone counters */
-
-		/*
-		 * The fetching of the stat_threshold is racy. We may apply
-		 * a counter threshold to the wrong the cpu if we get
-		 * rescheduled while executing here. However, the next
-		 * counter update will apply the threshold again and
-		 * therefore bring the counter under the threshold again.
-		 *
-		 * Most of the time the thresholds are the same anyways
-		 * for all cpus in a zone.
-		 */
-		t = this_cpu_read(pcp->stat_threshold);
-
-		o = this_cpu_read(*p);
-		n = delta + o;
-
-		if (abs(n) > t) {
-			int os = overstep_mode * (t >> 1) ;
-
-			/* Overflow must be added to zone counters */
-			z = n + os;
-			n = -os;
-		}
-	} while (this_cpu_cmpxchg(*p, o, n) != o);
-
-	if (z)
-		zone_page_state_add(z, zone, item);
-}
-
-void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
-			 long delta)
-{
-	mod_zone_state(zone, item, delta, 0);
-}
-EXPORT_SYMBOL(mod_zone_page_state);
-
-void inc_zone_page_state(struct page *page, enum zone_stat_item item)
-{
-	mod_zone_state(page_zone(page), item, 1, 1);
-}
-EXPORT_SYMBOL(inc_zone_page_state);
-
-void dec_zone_page_state(struct page *page, enum zone_stat_item item)
-{
-	mod_zone_state(page_zone(page), item, -1, -1);
-}
-EXPORT_SYMBOL(dec_zone_page_state);
-
-static inline void mod_node_state(struct pglist_data *pgdat,
-       enum node_stat_item item, int delta, int overstep_mode)
-{
-	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	long o, n, t, z;
-
-	if (vmstat_item_in_bytes(item)) {
-		/*
-		 * Only cgroups use subpage accounting right now; at
-		 * the global level, these items still change in
-		 * multiples of whole pages. Store them as pages
-		 * internally to keep the per-cpu counters compact.
-		 */
-		VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
-		delta >>= PAGE_SHIFT;
-	}
-
-	do {
-		z = 0;  /* overflow to node counters */
-
-		/*
-		 * The fetching of the stat_threshold is racy. We may apply
-		 * a counter threshold to the wrong the cpu if we get
-		 * rescheduled while executing here. However, the next
-		 * counter update will apply the threshold again and
-		 * therefore bring the counter under the threshold again.
-		 *
-		 * Most of the time the thresholds are the same anyways
-		 * for all cpus in a node.
-		 */
-		t = this_cpu_read(pcp->stat_threshold);
-
-		o = this_cpu_read(*p);
-		n = delta + o;
-
-		if (abs(n) > t) {
-			int os = overstep_mode * (t >> 1) ;
-
-			/* Overflow must be added to node counters */
-			z = n + os;
-			n = -os;
-		}
-	} while (this_cpu_cmpxchg(*p, o, n) != o);
-
-	if (z)
-		node_page_state_add(z, pgdat, item);
-}
-
-void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
-					long delta)
-{
-	mod_node_state(pgdat, item, delta, 0);
-}
-EXPORT_SYMBOL(mod_node_page_state);
-
-void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
-{
-	mod_node_state(pgdat, item, 1, 1);
-}
-
-void inc_node_page_state(struct page *page, enum node_stat_item item)
-{
-	mod_node_state(page_pgdat(page), item, 1, 1);
-}
-EXPORT_SYMBOL(inc_node_page_state);
-
-void dec_node_page_state(struct page *page, enum node_stat_item item)
-{
-	mod_node_state(page_pgdat(page), item, -1, -1);
-}
-EXPORT_SYMBOL(dec_node_page_state);
-#else
 /*
  * Use interrupt disable to serialize counter updates
  */
Index: linux-vmstat-remote/mm/page_alloc.c
===================================================================
--- linux-vmstat-remote.orig/mm/page_alloc.c
+++ linux-vmstat-remote/mm/page_alloc.c
@@ -8608,9 +8608,6 @@ static int page_alloc_cpu_dead(unsigned
 	/*
 	 * Zero the differential counters of the dead processor
 	 * so that the vm statistics are consistent.
-	 *
-	 * This is only okay since the processor is dead and cannot
-	 * race with what we are doing.
 	 */
 	cpu_vm_stats_fold(cpu);
 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4 09/12] vmstat: switch per-cpu vmstat counters to 32-bits
  2023-03-05 13:36 [PATCH v4 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (7 preceding siblings ...)
  2023-03-05 13:37 ` [PATCH v4 08/12] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
@ 2023-03-05 13:37 ` Marcelo Tosatti
  2023-03-05 13:37 ` [PATCH v4 10/12] mm/vmstat: use xchg in cpu_vm_stats_fold Marcelo Tosatti
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2023-03-05 13:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Marcelo Tosatti

Some architectures only provide xchg/cmpxchg in 32/64-bit quantities.

Since the next patch is about to use xchg on per-CPU vmstat counters,
switch them to s32.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/include/linux/mmzone.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/mmzone.h
+++ linux-vmstat-remote/include/linux/mmzone.h
@@ -686,8 +686,8 @@ struct per_cpu_pages {
 
 struct per_cpu_zonestat {
 #ifdef CONFIG_SMP
-	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
-	s8 stat_threshold;
+	s32 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
+	s32 stat_threshold;
 #endif
 #ifdef CONFIG_NUMA
 	/*
@@ -700,8 +700,8 @@ struct per_cpu_zonestat {
 };
 
 struct per_cpu_nodestat {
-	s8 stat_threshold;
-	s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
+	s32 stat_threshold;
+	s32 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
 };
 
 #endif /* !__GENERATING_BOUNDS.H */
Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -351,7 +351,7 @@ static inline void mod_zone_state(struct
 				  long delta, int overstep_mode)
 {
 	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
+	s32 __percpu *p = pcp->vm_stat_diff + item;
 	long o, n, t, z;
 
 	do {
@@ -428,7 +428,7 @@ static inline void mod_node_state(struct
 				  int delta, int overstep_mode)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long o, n, t, z;
 
 	if (vmstat_item_in_bytes(item)) {
@@ -525,7 +525,7 @@ void __mod_zone_page_state(struct zone *
 			   long delta)
 {
 	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
+	s32 __percpu *p = pcp->vm_stat_diff + item;
 	long x;
 	long t;
 
@@ -556,7 +556,7 @@ void __mod_node_page_state(struct pglist
 				long delta)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long x;
 	long t;
 
@@ -614,8 +614,8 @@ EXPORT_SYMBOL(__mod_node_page_state);
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_stat_diff + item;
+	s32 v, t;
 
 	/* See __mod_node_page_state */
 	preempt_disable_nested();
@@ -623,7 +623,7 @@ void __inc_zone_state(struct zone *zone,
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		zone_page_state_add(v + overstep, zone, item);
 		__this_cpu_write(*p, -overstep);
@@ -635,8 +635,8 @@ void __inc_zone_state(struct zone *zone,
 void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
 
@@ -646,7 +646,7 @@ void __inc_node_state(struct pglist_data
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v + overstep, pgdat, item);
 		__this_cpu_write(*p, -overstep);
@@ -670,8 +670,8 @@ EXPORT_SYMBOL(__inc_node_page_state);
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_stat_diff + item;
+	s32 v, t;
 
 	/* See __mod_node_page_state */
 	preempt_disable_nested();
@@ -679,7 +679,7 @@ void __dec_zone_state(struct zone *zone,
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		zone_page_state_add(v - overstep, zone, item);
 		__this_cpu_write(*p, overstep);
@@ -691,8 +691,8 @@ void __dec_zone_state(struct zone *zone,
 void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
 
@@ -702,7 +702,7 @@ void __dec_node_state(struct pglist_data
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v - overstep, pgdat, item);
 		__this_cpu_write(*p, overstep);



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4 10/12] mm/vmstat: use xchg in cpu_vm_stats_fold
  2023-03-05 13:36 [PATCH v4 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (8 preceding siblings ...)
  2023-03-05 13:37 ` [PATCH v4 09/12] vmstat: switch per-cpu vmstat counters to 32-bits Marcelo Tosatti
@ 2023-03-05 13:37 ` Marcelo Tosatti
  2023-03-05 13:37 ` [PATCH v4 11/12] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
  2023-03-05 13:37 ` [PATCH v4 12/12] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
  11 siblings, 0 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2023-03-05 13:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Marcelo Tosatti

In preparation to switch vmstat shepherd to flush
per-CPU counters remotely, use xchg instead of a
pair of read/write instructions.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -883,7 +883,7 @@ static int refresh_cpu_vm_stats(void)
 }
 
 /*
- * Fold the data for an offline cpu into the global array.
+ * Fold the data for a cpu into the global array.
  * There cannot be any access by the offline cpu and therefore
  * synchronization is simplified.
  */
@@ -904,8 +904,7 @@ void cpu_vm_stats_fold(int cpu)
 			if (pzstats->vm_stat_diff[i]) {
 				int v;
 
-				v = pzstats->vm_stat_diff[i];
-				pzstats->vm_stat_diff[i] = 0;
+				v = xchg(&pzstats->vm_stat_diff[i], 0);
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
 			}
@@ -915,8 +914,7 @@ void cpu_vm_stats_fold(int cpu)
 			if (pzstats->vm_numa_event[i]) {
 				unsigned long v;
 
-				v = pzstats->vm_numa_event[i];
-				pzstats->vm_numa_event[i] = 0;
+				v = xchg(&pzstats->vm_numa_event[i], 0);
 				zone_numa_event_add(v, zone, i);
 			}
 		}
@@ -932,8 +930,7 @@ void cpu_vm_stats_fold(int cpu)
 			if (p->vm_node_stat_diff[i]) {
 				int v;
 
-				v = p->vm_node_stat_diff[i];
-				p->vm_node_stat_diff[i] = 0;
+				v = xchg(&p->vm_node_stat_diff[i], 0);
 				atomic_long_add(v, &pgdat->vm_stat[i]);
 				global_node_diff[i] += v;
 			}



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4 11/12] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely
  2023-03-05 13:36 [PATCH v4 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (9 preceding siblings ...)
  2023-03-05 13:37 ` [PATCH v4 10/12] mm/vmstat: use xchg in cpu_vm_stats_fold Marcelo Tosatti
@ 2023-03-05 13:37 ` Marcelo Tosatti
  2023-03-05 13:37 ` [PATCH v4 12/12] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
  11 siblings, 0 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2023-03-05 13:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Marcelo Tosatti

Now that the counters are modified via cmpxchg both CPU locally
(via the account functions), and remotely (via cpu_vm_stats_fold),
its possible to switch vmstat_shepherd to perform the per-CPU 
vmstats folding remotely.

This fixes the following two problems:

 1. A customer provided some evidence which indicates that
    the idle tick was stopped; albeit, CPU-specific vmstat
    counters still remained populated.

    Thus one can only assume quiet_vmstat() was not
    invoked on return to the idle loop. If I understand
    correctly, I suspect this divergence might erroneously
    prevent a reclaim attempt by kswapd. If the number of
    zone specific free pages are below their per-cpu drift
    value then zone_page_state_snapshot() is used to
    compute a more accurate view of the aforementioned
    statistic.  Thus any task blocked on the NUMA node
    specific pfmemalloc_wait queue will be unable to make
    significant progress via direct reclaim unless it is
    killed after being woken up by kswapd
    (see throttle_direct_reclaim())

 2. With a SCHED_FIFO task that busy loops on a given CPU,
    and kworker for that CPU at SCHED_OTHER priority,
    queuing work to sync per-vmstats will either cause that
    work to never execute, or stalld (i.e. stall daemon)
    boosts kworker priority which causes a latency
    violation

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -2004,6 +2004,23 @@ static void vmstat_shepherd(struct work_
 
 static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);
 
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
+static void vmstat_shepherd(struct work_struct *w)
+{
+	int cpu;
+
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		cpu_vm_stats_fold(cpu);
+		cond_resched();
+	}
+	cpus_read_unlock();
+
+	schedule_delayed_work(&shepherd,
+		round_jiffies_relative(sysctl_stat_interval));
+}
+#else
 static void vmstat_shepherd(struct work_struct *w)
 {
 	int cpu;
@@ -2023,6 +2040,7 @@ static void vmstat_shepherd(struct work_
 	schedule_delayed_work(&shepherd,
 		round_jiffies_relative(sysctl_stat_interval));
 }
+#endif
 
 static void __init start_shepherd_timer(void)
 {



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v4 12/12] mm/vmstat: refresh stats remotely instead of via work item
  2023-03-05 13:36 [PATCH v4 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (10 preceding siblings ...)
  2023-03-05 13:37 ` [PATCH v4 11/12] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
@ 2023-03-05 13:37 ` Marcelo Tosatti
  11 siblings, 0 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2023-03-05 13:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Marcelo Tosatti

Refresh per-CPU stats remotely, instead of queueing 
work items, for the stat_refresh procfs method.

This fixes sosreport hang (which uses vmstat_refresh) with
spinning SCHED_FIFO process.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -1860,11 +1860,21 @@ static DEFINE_PER_CPU(struct delayed_wor
 int sysctl_stat_interval __read_mostly = HZ;
 
 #ifdef CONFIG_PROC_FS
+
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+static int refresh_all_vm_stats(void);
+#else
 static void refresh_vm_stats(struct work_struct *work)
 {
 	refresh_cpu_vm_stats();
 }
 
+static int refresh_all_vm_stats(void)
+{
+	return schedule_on_each_cpu(refresh_vm_stats);
+}
+#endif
+
 int vmstat_refresh(struct ctl_table *table, int write,
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
@@ -1886,7 +1896,7 @@ int vmstat_refresh(struct ctl_table *tab
 	 * transiently negative values, report an error here if any of
 	 * the stats is negative, so we know to go looking for imbalance.
 	 */
-	err = schedule_on_each_cpu(refresh_vm_stats);
+	err = refresh_all_vm_stats();
 	if (err)
 		return err;
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
@@ -2006,7 +2016,7 @@ static DECLARE_DEFERRABLE_WORK(shepherd,
 
 #ifdef CONFIG_HAVE_CMPXCHG_LOCAL
 /* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
-static void vmstat_shepherd(struct work_struct *w)
+static int refresh_all_vm_stats(void)
 {
 	int cpu;
 
@@ -2016,7 +2026,12 @@ static void vmstat_shepherd(struct work_
 		cond_resched();
 	}
 	cpus_read_unlock();
+	return 0;
+}
 
+static void vmstat_shepherd(struct work_struct *w)
+{
+	refresh_all_vm_stats();
 	schedule_delayed_work(&shepherd,
 		round_jiffies_relative(sysctl_stat_interval));
 }



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 05/12] this_cpu_cmpxchg: x86: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-05 13:37 ` [PATCH v4 05/12] this_cpu_cmpxchg: x86: " Marcelo Tosatti
@ 2023-03-06 11:22   ` Peter Zijlstra
  2023-03-08 21:42     ` Marcelo Tosatti
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2023-03-06 11:22 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86

On Sun, Mar 05, 2023 at 10:37:02AM -0300, Marcelo Tosatti wrote:
> Goal is to have vmstat_shepherd to transfer from
> per-CPU counters to global counters remotely. For this,
> an atomic this_cpu_cmpxchg is necessary.
> 
> Following the kernel convention for cmpxchg/cmpxchg_local,
> change x86's this_cpu_cmpxchg_ helpers to be atomic.
> and add this_cpu_cmpxchg_local_ helpers which are not atomic.

Urgh.. much hate for this. this_cpu_*() is local, per definition,
always.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 05/12] this_cpu_cmpxchg: x86: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-06 11:22   ` Peter Zijlstra
@ 2023-03-08 21:42     ` Marcelo Tosatti
  0 siblings, 0 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2023-03-08 21:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86

On Mon, Mar 06, 2023 at 12:22:40PM +0100, Peter Zijlstra wrote:
> On Sun, Mar 05, 2023 at 10:37:02AM -0300, Marcelo Tosatti wrote:
> > Goal is to have vmstat_shepherd to transfer from
> > per-CPU counters to global counters remotely. For this,
> > an atomic this_cpu_cmpxchg is necessary.
> > 
> > Following the kernel convention for cmpxchg/cmpxchg_local,
> > change x86's this_cpu_cmpxchg_ helpers to be atomic.
> > and add this_cpu_cmpxchg_local_ helpers which are not atomic.
> 
> Urgh.. much hate for this. this_cpu_*() is local, per definition,
> always.

:Author: Christoph Lameter, August 4th, 2014
:Author: Pranith Kumar, Aug 2nd, 2014

this_cpu operations are a way of optimizing access to per cpu
variables associated with the *currently* executing processor. This is
done through the use of segment registers (or a dedicated register where
the cpu permanently stored the beginning of the per cpu area for a
specific processor).

this_cpu operations add a per cpu variable offset to the processor
specific per cpu base and encode that operation in the instruction
operating on the per cpu variable.

This means that there are no atomicity issues between the calculation of
the offset and the operation on the data. Therefore it is not
necessary to disable preemption or interrupts to ensure that the
processor is not changed between the calculation of the address and
the operation on the data.

>> Call the above [1].

>> Up to this point, everything remains valid with adding lock
>> to cmpxchg. That is, it makes sense to have a locked
>> this_cpu_cmpxchg, operating on per-CPU data, without the need
>> to disable preemption/interrupts to ensure processor is not
>> changed.

Read-modify-write operations are of particular interest. Frequently
processors have special lower latency instructions that can operate
without the typical synchronization overhead, but still provide some
sort of relaxed atomicity guarantees. The x86, for example, can execute
RMW (Read Modify Write) instructions like inc/dec/cmpxchg without the
lock prefix and the associated latency penalty.

>> This sentence above makes sense if the data is accessed locally.
>> However, we are extending it (this_cpu_cmpxchg) to the case
>> where we want the data to be accessed remotely (therefore need
>> to be a locked access), but still want the benefits of [1].
>>
>> For example, this_cpu_xchg implies LOCK semantics.

Access to the variable without the lock prefix is not synchronized but
synchronization is not necessary since we are dealing with per cpu
data specific to the currently executing processor. Only the current
processor should be accessing that variable and therefore there are no
concurrency issues with other processors in the system.

----

So in summary, i understand your POV, but:

1) Have outlined arguments above which point that this_cpu_xxx and 
locked instructions can co-exist (in fact, they are already do, for
this_cpu_xchg).

2) In case you still think this is a bad idea, any suggestions for
improvements?

Thanks


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 01/12] mm/vmstat: remove remote node draining
  2023-03-05 13:36 ` [PATCH v4 01/12] mm/vmstat: remove remote node draining Marcelo Tosatti
@ 2023-03-09 17:11   ` Vlastimil Babka
  2023-03-13 16:41     ` Marcelo Tosatti
  0 siblings, 1 reply; 17+ messages in thread
From: Vlastimil Babka @ 2023-03-09 17:11 UTC (permalink / raw)
  To: Marcelo Tosatti, Christoph Lameter, Mel Gorman
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	David Hildenbrand

On 3/5/23 14:36, Marcelo Tosatti wrote:
> Draining of pages from the local pcp for a remote zone should not be
> necessary, since once the system is low on memory (or compaction on a
> zone is in effect), drain_all_pages should be called freeing any unused
> pcps.

Hmm I can't easily say this is a good idea without a proper evaluation. It's
true that drain_all_pages() will be called from
__alloc_pages_direct_reclaim(), but if you look closely, it's only after
__perform_reclaim() and subsequent get_page_from_freelist() failure, which
means very low on memory. There's also kcompactd_do_work() caller, but that
shouldn't respond to order-0 pressure.

Maybe part of the problem is that pages sitting in pcplist are not counted
as NR_FREE_PAGES, so watermark checks will not "see" them as available. That
would be true for both local and remote nodes as we process the zonelist,
but if pages are stuck on remote pcplists, it could make the problem worse
than if they stayed only in local ones.

But the good news should be that for this series you shouldn't actually need
this patch, AFAIU. Last year Mel did the "Drain remote per-cpu directly"
series and followups, so remote draining of pcplists is today done without
sending an IPI to the remote CPU.

BTW I kinda like the implementation that relies on "per-cpu spin lock"
that's used mostly locally thus uncontended, and only rarely remotely. Much
simpler than complicated cmpxchg schemes, wonder if those are really still
such a win these days, e.g. in SLUB or here in vmstat...

> For reference, the original commit which introduces remote node
> draining is 4037d452202e34214e8a939fa5621b2b3bbb45b7.
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Index: linux-vmstat-remote/include/linux/mmzone.h
> ===================================================================
> --- linux-vmstat-remote.orig/include/linux/mmzone.h
> +++ linux-vmstat-remote/include/linux/mmzone.h
> @@ -679,9 +679,6 @@ struct per_cpu_pages {
>  	int high;		/* high watermark, emptying needed */
>  	int batch;		/* chunk size for buddy add/remove */
>  	short free_factor;	/* batch scaling factor during free */
> -#ifdef CONFIG_NUMA
> -	short expire;		/* When 0, remote pagesets are drained */
> -#endif
>  
>  	/* Lists of pages, one per migrate type stored on the pcp-lists */
>  	struct list_head lists[NR_PCP_LISTS];
> Index: linux-vmstat-remote/mm/vmstat.c
> ===================================================================
> --- linux-vmstat-remote.orig/mm/vmstat.c
> +++ linux-vmstat-remote/mm/vmstat.c
> @@ -803,20 +803,16 @@ static int fold_diff(int *zone_diff, int
>   *
>   * The function returns the number of global counters updated.
>   */
> -static int refresh_cpu_vm_stats(bool do_pagesets)
> +static int refresh_cpu_vm_stats(void)
>  {
>  	struct pglist_data *pgdat;
>  	struct zone *zone;
>  	int i;
>  	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
>  	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
> -	int changes = 0;
>  
>  	for_each_populated_zone(zone) {
>  		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
> -#ifdef CONFIG_NUMA
> -		struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
> -#endif
>  
>  		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
>  			int v;
> @@ -826,44 +822,8 @@ static int refresh_cpu_vm_stats(bool do_
>  
>  				atomic_long_add(v, &zone->vm_stat[i]);
>  				global_zone_diff[i] += v;
> -#ifdef CONFIG_NUMA
> -				/* 3 seconds idle till flush */
> -				__this_cpu_write(pcp->expire, 3);
> -#endif
>  			}
>  		}
> -#ifdef CONFIG_NUMA
> -
> -		if (do_pagesets) {
> -			cond_resched();
> -			/*
> -			 * Deal with draining the remote pageset of this
> -			 * processor
> -			 *
> -			 * Check if there are pages remaining in this pageset
> -			 * if not then there is nothing to expire.
> -			 */
> -			if (!__this_cpu_read(pcp->expire) ||
> -			       !__this_cpu_read(pcp->count))
> -				continue;
> -
> -			/*
> -			 * We never drain zones local to this processor.
> -			 */
> -			if (zone_to_nid(zone) == numa_node_id()) {
> -				__this_cpu_write(pcp->expire, 0);
> -				continue;
> -			}
> -
> -			if (__this_cpu_dec_return(pcp->expire))
> -				continue;
> -
> -			if (__this_cpu_read(pcp->count)) {
> -				drain_zone_pages(zone, this_cpu_ptr(pcp));
> -				changes++;
> -			}
> -		}
> -#endif
>  	}
>  
>  	for_each_online_pgdat(pgdat) {
> @@ -880,8 +840,7 @@ static int refresh_cpu_vm_stats(bool do_
>  		}
>  	}
>  
> -	changes += fold_diff(global_zone_diff, global_node_diff);
> -	return changes;
> +	return fold_diff(global_zone_diff, global_node_diff);
>  }
>  
>  /*
> @@ -1867,7 +1826,7 @@ int sysctl_stat_interval __read_mostly =
>  #ifdef CONFIG_PROC_FS
>  static void refresh_vm_stats(struct work_struct *work)
>  {
> -	refresh_cpu_vm_stats(true);
> +	refresh_cpu_vm_stats();
>  }
>  
>  int vmstat_refresh(struct ctl_table *table, int write,
> @@ -1877,6 +1836,8 @@ int vmstat_refresh(struct ctl_table *tab
>  	int err;
>  	int i;
>  
> +	drain_all_pages(NULL);
> +
>  	/*
>  	 * The regular update, every sysctl_stat_interval, may come later
>  	 * than expected: leaving a significant amount in per_cpu buckets.
> @@ -1931,7 +1892,7 @@ int vmstat_refresh(struct ctl_table *tab
>  
>  static void vmstat_update(struct work_struct *w)
>  {
> -	if (refresh_cpu_vm_stats(true)) {
> +	if (refresh_cpu_vm_stats()) {
>  		/*
>  		 * Counters were updated so we expect more updates
>  		 * to occur in the future. Keep on running the
> @@ -1994,7 +1955,7 @@ void quiet_vmstat(void)
>  	 * it would be too expensive from this path.
>  	 * vmstat_shepherd will take care about that for us.
>  	 */
> -	refresh_cpu_vm_stats(false);
> +	refresh_cpu_vm_stats();
>  }
>  
>  /*
> Index: linux-vmstat-remote/mm/page_alloc.c
> ===================================================================
> --- linux-vmstat-remote.orig/mm/page_alloc.c
> +++ linux-vmstat-remote/mm/page_alloc.c
> @@ -3176,26 +3176,6 @@ static int rmqueue_bulk(struct zone *zon
>  	return allocated;
>  }
>  
> -#ifdef CONFIG_NUMA
> -/*
> - * Called from the vmstat counter updater to drain pagesets of this
> - * currently executing processor on remote nodes after they have
> - * expired.
> - */
> -void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
> -{
> -	int to_drain, batch;
> -
> -	batch = READ_ONCE(pcp->batch);
> -	to_drain = min(pcp->count, batch);
> -	if (to_drain > 0) {
> -		spin_lock(&pcp->lock);
> -		free_pcppages_bulk(zone, to_drain, pcp, 0);
> -		spin_unlock(&pcp->lock);
> -	}
> -}
> -#endif
> -
>  /*
>   * Drain pcplists of the indicated processor and zone.
>   */
> 
> 
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v4 01/12] mm/vmstat: remove remote node draining
  2023-03-09 17:11   ` Vlastimil Babka
@ 2023-03-13 16:41     ` Marcelo Tosatti
  0 siblings, 0 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2023-03-13 16:41 UTC (permalink / raw)
  To: Vlastimil Babka, Leonardo Bras
  Cc: Christoph Lameter, Mel Gorman, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, David Hildenbrand

On Thu, Mar 09, 2023 at 06:11:30PM +0100, Vlastimil Babka wrote:
> On 3/5/23 14:36, Marcelo Tosatti wrote:
> > Draining of pages from the local pcp for a remote zone should not be
> > necessary, since once the system is low on memory (or compaction on a
> > zone is in effect), drain_all_pages should be called freeing any unused
> > pcps.
> 
> Hmm I can't easily say this is a good idea without a proper evaluation. It's
> true that drain_all_pages() will be called from
> __alloc_pages_direct_reclaim(), but if you look closely, it's only after
> __perform_reclaim() and subsequent get_page_from_freelist() failure, which
> means very low on memory. There's also kcompactd_do_work() caller, but that
> shouldn't respond to order-0 pressure.
> 
> Maybe part of the problem is that pages sitting in pcplist are not counted
> as NR_FREE_PAGES, so watermark checks will not "see" them as available. That
> would be true for both local and remote nodes as we process the zonelist,
> but if pages are stuck on remote pcplists, it could make the problem worse
> than if they stayed only in local ones.
> 
> But the good news should be that for this series you shouldn't actually need
> this patch, AFAIU. Last year Mel did the "Drain remote per-cpu directly"
> series and followups, so remote draining of pcplists is today done without
> sending an IPI to the remote CPU.

Vlastimil,

Sent -v5 which should address the concerns regarding pages stuck on
remote pcplists. Let me know if you have other concerns.

> BTW I kinda like the implementation that relies on "per-cpu spin lock"
> that's used mostly locally thus uncontended, and only rarely remotely. Much
> simpler than complicated cmpxchg schemes, wonder if those are really still
> such a win these days, e.g. in SLUB or here in vmstat...

Good question (to which i don't know the answer).

Thanks.

> > For reference, the original commit which introduces remote node
> > draining is 4037d452202e34214e8a939fa5621b2b3bbb45b7.
> > 
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > Index: linux-vmstat-remote/include/linux/mmzone.h
> > ===================================================================
> > --- linux-vmstat-remote.orig/include/linux/mmzone.h
> > +++ linux-vmstat-remote/include/linux/mmzone.h
> > @@ -679,9 +679,6 @@ struct per_cpu_pages {
> >  	int high;		/* high watermark, emptying needed */
> >  	int batch;		/* chunk size for buddy add/remove */
> >  	short free_factor;	/* batch scaling factor during free */
> > -#ifdef CONFIG_NUMA
> > -	short expire;		/* When 0, remote pagesets are drained */
> > -#endif
> >  
> >  	/* Lists of pages, one per migrate type stored on the pcp-lists */
> >  	struct list_head lists[NR_PCP_LISTS];
> > Index: linux-vmstat-remote/mm/vmstat.c
> > ===================================================================
> > --- linux-vmstat-remote.orig/mm/vmstat.c
> > +++ linux-vmstat-remote/mm/vmstat.c
> > @@ -803,20 +803,16 @@ static int fold_diff(int *zone_diff, int
> >   *
> >   * The function returns the number of global counters updated.
> >   */
> > -static int refresh_cpu_vm_stats(bool do_pagesets)
> > +static int refresh_cpu_vm_stats(void)
> >  {
> >  	struct pglist_data *pgdat;
> >  	struct zone *zone;
> >  	int i;
> >  	int global_zone_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
> >  	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
> > -	int changes = 0;
> >  
> >  	for_each_populated_zone(zone) {
> >  		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
> > -#ifdef CONFIG_NUMA
> > -		struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
> > -#endif
> >  
> >  		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
> >  			int v;
> > @@ -826,44 +822,8 @@ static int refresh_cpu_vm_stats(bool do_
> >  
> >  				atomic_long_add(v, &zone->vm_stat[i]);
> >  				global_zone_diff[i] += v;
> > -#ifdef CONFIG_NUMA
> > -				/* 3 seconds idle till flush */
> > -				__this_cpu_write(pcp->expire, 3);
> > -#endif
> >  			}
> >  		}
> > -#ifdef CONFIG_NUMA
> > -
> > -		if (do_pagesets) {
> > -			cond_resched();
> > -			/*
> > -			 * Deal with draining the remote pageset of this
> > -			 * processor
> > -			 *
> > -			 * Check if there are pages remaining in this pageset
> > -			 * if not then there is nothing to expire.
> > -			 */
> > -			if (!__this_cpu_read(pcp->expire) ||
> > -			       !__this_cpu_read(pcp->count))
> > -				continue;
> > -
> > -			/*
> > -			 * We never drain zones local to this processor.
> > -			 */
> > -			if (zone_to_nid(zone) == numa_node_id()) {
> > -				__this_cpu_write(pcp->expire, 0);
> > -				continue;
> > -			}
> > -
> > -			if (__this_cpu_dec_return(pcp->expire))
> > -				continue;
> > -
> > -			if (__this_cpu_read(pcp->count)) {
> > -				drain_zone_pages(zone, this_cpu_ptr(pcp));
> > -				changes++;
> > -			}
> > -		}
> > -#endif
> >  	}
> >  
> >  	for_each_online_pgdat(pgdat) {
> > @@ -880,8 +840,7 @@ static int refresh_cpu_vm_stats(bool do_
> >  		}
> >  	}
> >  
> > -	changes += fold_diff(global_zone_diff, global_node_diff);
> > -	return changes;
> > +	return fold_diff(global_zone_diff, global_node_diff);
> >  }
> >  
> >  /*
> > @@ -1867,7 +1826,7 @@ int sysctl_stat_interval __read_mostly =
> >  #ifdef CONFIG_PROC_FS
> >  static void refresh_vm_stats(struct work_struct *work)
> >  {
> > -	refresh_cpu_vm_stats(true);
> > +	refresh_cpu_vm_stats();
> >  }
> >  
> >  int vmstat_refresh(struct ctl_table *table, int write,
> > @@ -1877,6 +1836,8 @@ int vmstat_refresh(struct ctl_table *tab
> >  	int err;
> >  	int i;
> >  
> > +	drain_all_pages(NULL);
> > +
> >  	/*
> >  	 * The regular update, every sysctl_stat_interval, may come later
> >  	 * than expected: leaving a significant amount in per_cpu buckets.
> > @@ -1931,7 +1892,7 @@ int vmstat_refresh(struct ctl_table *tab
> >  
> >  static void vmstat_update(struct work_struct *w)
> >  {
> > -	if (refresh_cpu_vm_stats(true)) {
> > +	if (refresh_cpu_vm_stats()) {
> >  		/*
> >  		 * Counters were updated so we expect more updates
> >  		 * to occur in the future. Keep on running the
> > @@ -1994,7 +1955,7 @@ void quiet_vmstat(void)
> >  	 * it would be too expensive from this path.
> >  	 * vmstat_shepherd will take care about that for us.
> >  	 */
> > -	refresh_cpu_vm_stats(false);
> > +	refresh_cpu_vm_stats();
> >  }
> >  
> >  /*
> > Index: linux-vmstat-remote/mm/page_alloc.c
> > ===================================================================
> > --- linux-vmstat-remote.orig/mm/page_alloc.c
> > +++ linux-vmstat-remote/mm/page_alloc.c
> > @@ -3176,26 +3176,6 @@ static int rmqueue_bulk(struct zone *zon
> >  	return allocated;
> >  }
> >  
> > -#ifdef CONFIG_NUMA
> > -/*
> > - * Called from the vmstat counter updater to drain pagesets of this
> > - * currently executing processor on remote nodes after they have
> > - * expired.
> > - */
> > -void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
> > -{
> > -	int to_drain, batch;
> > -
> > -	batch = READ_ONCE(pcp->batch);
> > -	to_drain = min(pcp->count, batch);
> > -	if (to_drain > 0) {
> > -		spin_lock(&pcp->lock);
> > -		free_pcppages_bulk(zone, to_drain, pcp, 0);
> > -		spin_unlock(&pcp->lock);
> > -	}
> > -}
> > -#endif
> > -
> >  /*
> >   * Drain pcplists of the indicated processor and zone.
> >   */
> > 
> > 
> > 
> 
> 
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-03-13 16:42 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-05 13:36 [PATCH v4 00/12] fold per-CPU vmstats remotely Marcelo Tosatti
2023-03-05 13:36 ` [PATCH v4 01/12] mm/vmstat: remove remote node draining Marcelo Tosatti
2023-03-09 17:11   ` Vlastimil Babka
2023-03-13 16:41     ` Marcelo Tosatti
2023-03-05 13:36 ` [PATCH v4 02/12] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
2023-03-05 13:37 ` [PATCH v4 03/12] this_cpu_cmpxchg: loongarch: " Marcelo Tosatti
2023-03-05 13:37 ` [PATCH v4 04/12] this_cpu_cmpxchg: S390: " Marcelo Tosatti
2023-03-05 13:37 ` [PATCH v4 05/12] this_cpu_cmpxchg: x86: " Marcelo Tosatti
2023-03-06 11:22   ` Peter Zijlstra
2023-03-08 21:42     ` Marcelo Tosatti
2023-03-05 13:37 ` [PATCH v4 06/12] add this_cpu_cmpxchg_local and asm-generic definitions Marcelo Tosatti
2023-03-05 13:37 ` [PATCH v4 07/12] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
2023-03-05 13:37 ` [PATCH v4 08/12] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
2023-03-05 13:37 ` [PATCH v4 09/12] vmstat: switch per-cpu vmstat counters to 32-bits Marcelo Tosatti
2023-03-05 13:37 ` [PATCH v4 10/12] mm/vmstat: use xchg in cpu_vm_stats_fold Marcelo Tosatti
2023-03-05 13:37 ` [PATCH v4 11/12] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
2023-03-05 13:37 ` [PATCH v4 12/12] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).