* [PATCH v2 00/11] fold per-CPU vmstats remotely
@ 2023-02-09 15:01 Marcelo Tosatti
  2023-02-09 15:01 ` [PATCH v2 01/11] mm/vmstat: remove remote node draining Marcelo Tosatti
                   ` (11 more replies)
  0 siblings, 12 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-02-09 15:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

This patch series addresses the following two problems:

    1. A customer provided some evidence which indicates that
       the idle tick was stopped; albeit, CPU-specific vmstat
       counters still remained populated.

       Thus one can only assume quiet_vmstat() was not
       invoked on return to the idle loop. If I understand
       correctly, I suspect this divergence might erroneously
       prevent a reclaim attempt by kswapd. If the number of
       zone specific free pages are below their per-cpu drift
       value then zone_page_state_snapshot() is used to
       compute a more accurate view of the aforementioned
       statistic.  Thus any task blocked on the NUMA node
       specific pfmemalloc_wait queue will be unable to make
       significant progress via direct reclaim unless it is
       killed after being woken up by kswapd
       (see throttle_direct_reclaim())

    2. With a SCHED_FIFO task that busy loops on a given CPU,
       and kworker for that CPU at SCHED_OTHER priority,
       queuing work to sync per-vmstats will either cause that
       work to never execute, or stalld (i.e. stall daemon)
       boosts kworker priority which causes a latency
       violation

These problems are addressed by having vmstat_shepherd flush the
per-CPU counters to the global counters from remote CPUs.

This is done using cmpxchg to manipulate the counters,
both CPU-locally (via the account functions)
and remotely (via cpu_vm_stats_fold).
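
For illustration only, a condensed sketch of the two sides of this
scheme (function names and details are simplified placeholders; the
real code is in the patches below). The accounting side updates its
per-CPU diff with an atomic this_cpu_cmpxchg(), while vmstat_shepherd
steals a remote CPU's diff with try_cmpxchg() and folds it into the
global counter:

	/* Local side (runs on the counter's own CPU), simplified: */
	static void account_one(struct zone *zone, enum zone_stat_item item, int delta)
	{
		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
		s8 __percpu *p = pzstats->vm_stat_diff + item;
		s8 o, n;

		do {
			o = this_cpu_read(*p);
			n = o + delta;
		} while (this_cpu_cmpxchg(*p, o, n) != o);
	}

	/* Remote side (vmstat_shepherd, any CPU), simplified: */
	static void fold_one(struct zone *zone, int cpu, enum zone_stat_item item)
	{
		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
		s8 v = pzstats->vm_stat_diff[item];

		/* try_cmpxchg() refreshes v on failure, so retry until the
		 * observed value is atomically replaced by zero. */
		while (!try_cmpxchg(&pzstats->vm_stat_diff[item], &v, 0))
			;
		if (v)
			atomic_long_add(v, &zone->vm_stat[item]);
	}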

Thanks to Aaron Tomlin for diagnosing issue 1 and writing
the initial patch series.

v2:
- actually use LOCK CMPXCHG on counter mod/inc/dec functions
  (Christoph Lameter)
- use try_cmpxchg for cmpxchg loops
  (Uros Bizjak / Matthew Wilcox)


 arch/arm64/include/asm/percpu.h     |   16 ++-
 arch/loongarch/include/asm/percpu.h |   23 ++++
 arch/s390/include/asm/percpu.h      |    5 +
 arch/x86/include/asm/percpu.h       |   39 ++++----
 include/asm-generic/percpu.h        |   17 +++
 include/linux/mmzone.h              |    3 
 kernel/fork.c                       |    2 
 kernel/scs.c                        |    2 
 mm/vmstat.c                         |  424 ++++++++++++++++++++++++++++++++++++++++++++++++------------------------------------------
 9 files changed, 308 insertions(+), 223 deletions(-)





^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 01/11] mm/vmstat: remove remote node draining
  2023-02-09 15:01 [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
@ 2023-02-09 15:01 ` Marcelo Tosatti
  2023-02-28 15:53   ` David Hildenbrand
  2023-03-02 17:21   ` Peter Xu
  2023-02-09 15:01 ` [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-02-09 15:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Marcelo Tosatti

Draining of pages from the local pcp for a remote zone was necessary
since:

"Note that remote node draining is a somewhat esoteric feature that is
required on large NUMA systems because otherwise significant portions
of system memory can become trapped in pcp queues. The number of pcp is
determined by the number of processors and nodes in a system. A system
with 4 processors and 2 nodes has 8 pcps which is okay. But a system
with 1024 processors and 512 nodes has 512k pcps with a high potential
for large amount of memory being caught in them."

Since commit 443c2accd1b6679a1320167f8f56eed6536b806e
("mm/page_alloc: remotely drain per-cpu lists"), drain_all_pages() is able 
to remotely free those pages when necessary.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/include/linux/mmzone.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/mmzone.h
+++ linux-vmstat-remote/include/linux/mmzone.h
@@ -577,9 +577,6 @@ struct per_cpu_pages {
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
 	short free_factor;	/* batch scaling factor during free */
-#ifdef CONFIG_NUMA
-	short expire;		/* When 0, remote pagesets are drained */
-#endif
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
 	struct list_head lists[NR_PCP_LISTS];
Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -803,7 +803,7 @@ static int fold_diff(int *zone_diff, int
  *
  * The function returns the number of global counters updated.
  */
-static int refresh_cpu_vm_stats(bool do_pagesets)
+static int refresh_cpu_vm_stats(void)
 {
 	struct pglist_data *pgdat;
 	struct zone *zone;
@@ -814,9 +814,6 @@ static int refresh_cpu_vm_stats(bool do_
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
-#ifdef CONFIG_NUMA
-		struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
-#endif
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 			int v;
@@ -826,44 +823,8 @@ static int refresh_cpu_vm_stats(bool do_
 
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
-#ifdef CONFIG_NUMA
-				/* 3 seconds idle till flush */
-				__this_cpu_write(pcp->expire, 3);
-#endif
 			}
 		}
-#ifdef CONFIG_NUMA
-
-		if (do_pagesets) {
-			cond_resched();
-			/*
-			 * Deal with draining the remote pageset of this
-			 * processor
-			 *
-			 * Check if there are pages remaining in this pageset
-			 * if not then there is nothing to expire.
-			 */
-			if (!__this_cpu_read(pcp->expire) ||
-			       !__this_cpu_read(pcp->count))
-				continue;
-
-			/*
-			 * We never drain zones local to this processor.
-			 */
-			if (zone_to_nid(zone) == numa_node_id()) {
-				__this_cpu_write(pcp->expire, 0);
-				continue;
-			}
-
-			if (__this_cpu_dec_return(pcp->expire))
-				continue;
-
-			if (__this_cpu_read(pcp->count)) {
-				drain_zone_pages(zone, this_cpu_ptr(pcp));
-				changes++;
-			}
-		}
-#endif
 	}
 
 	for_each_online_pgdat(pgdat) {
@@ -1864,7 +1825,7 @@ int sysctl_stat_interval __read_mostly =
 #ifdef CONFIG_PROC_FS
 static void refresh_vm_stats(struct work_struct *work)
 {
-	refresh_cpu_vm_stats(true);
+	refresh_cpu_vm_stats();
 }
 
 int vmstat_refresh(struct ctl_table *table, int write,
@@ -1928,7 +1889,7 @@ int vmstat_refresh(struct ctl_table *tab
 
 static void vmstat_update(struct work_struct *w)
 {
-	if (refresh_cpu_vm_stats(true)) {
+	if (refresh_cpu_vm_stats()) {
 		/*
 		 * Counters were updated so we expect more updates
 		 * to occur in the future. Keep on running the
@@ -1991,7 +1952,7 @@ void quiet_vmstat(void)
 	 * it would be too expensive from this path.
 	 * vmstat_shepherd will take care about that for us.
 	 */
-	refresh_cpu_vm_stats(false);
+	refresh_cpu_vm_stats();
 }
 
 /*



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
  2023-02-09 15:01 [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
  2023-02-09 15:01 ` [PATCH v2 01/11] mm/vmstat: remove remote node draining Marcelo Tosatti
@ 2023-02-09 15:01 ` Marcelo Tosatti
  2023-03-02 10:42   ` David Hildenbrand
                     ` (2 more replies)
  2023-02-09 15:01 ` [PATCH v2 03/11] this_cpu_cmpxchg: loongarch: " Marcelo Tosatti
                   ` (9 subsequent siblings)
  11 siblings, 3 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-02-09 15:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer the per-CPU
counters to the global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
change arm64's this_cpu_cmpxchg_ helpers to be atomic,
and add this_cpu_cmpxchg_local_ helpers which are not atomic.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
+++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
@@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
 	_pcp_protect_return(xchg_relaxed, pcp, val)
 
 #define this_cpu_cmpxchg_1(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_2(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_4(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_8(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg, pcp, o, n)
+
+#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
 	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+
 
 #ifdef __KVM_NVHE_HYPERVISOR__
 extern unsigned long __hyp_per_cpu_offset(unsigned int cpu);
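
For illustration only (not part of the patch), a sketch of how the two
flavours are meant to be used once this series is applied;
example_counter and both functions are made-up names:

	static DEFINE_PER_CPU(int, example_counter);

	static void local_only_update(int nval)
	{
		int old;

		/* Only code running on this CPU ever touches example_counter:
		 * the _local flavour (relaxed cmpxchg on arm64) is enough,
		 * since it only has to be safe against interrupts and
		 * preemption on this CPU. */
		do {
			old = this_cpu_read(example_counter);
		} while (this_cpu_cmpxchg_local(example_counter, old, nval) != old);
	}

	static void remotely_visible_update(int nval)
	{
		int old;

		/* A remote CPU may concurrently cmpxchg example_counter (as
		 * vmstat_shepherd does with the vmstat diffs), so the fully
		 * atomic flavour is required. */
		do {
			old = this_cpu_read(example_counter);
		} while (this_cpu_cmpxchg(example_counter, old, nval) != old);
	}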



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 03/11] this_cpu_cmpxchg: loongarch: switch this_cpu_cmpxchg to locked, add _local function
  2023-02-09 15:01 [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
  2023-02-09 15:01 ` [PATCH v2 01/11] mm/vmstat: remove remote node draining Marcelo Tosatti
  2023-02-09 15:01 ` [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
@ 2023-02-09 15:01 ` Marcelo Tosatti
  2023-02-09 15:01 ` [PATCH v2 04/11] this_cpu_cmpxchg: S390: " Marcelo Tosatti
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-02-09 15:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer the per-CPU
counters to the global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
switch LoongArch's this_cpu_cmpxchg_ helpers to be atomic,
and add this_cpu_cmpxchg_local_ helpers which are not atomic.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/arch/loongarch/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/loongarch/include/asm/percpu.h
+++ linux-vmstat-remote/arch/loongarch/include/asm/percpu.h
@@ -150,6 +150,16 @@ static inline unsigned long __percpu_xch
 }
 
 /* this_cpu_cmpxchg */
+#define _protect_cmpxchg(pcp, o, n)				\
+({								\
+	typeof(*raw_cpu_ptr(&(pcp))) __ret;			\
+	preempt_disable_notrace();				\
+	__ret = cmpxchg(raw_cpu_ptr(&(pcp)), o, n);		\
+	preempt_enable_notrace();				\
+	__ret;							\
+})
+
+/* this_cpu_cmpxchg_local */
 #define _protect_cmpxchg_local(pcp, o, n)			\
 ({								\
 	typeof(*raw_cpu_ptr(&(pcp))) __ret;			\
@@ -222,10 +232,15 @@ do {									\
 #define this_cpu_xchg_4(pcp, val) _percpu_xchg(pcp, val)
 #define this_cpu_xchg_8(pcp, val) _percpu_xchg(pcp, val)
 
-#define this_cpu_cmpxchg_1(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_2(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_4(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_8(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_1(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_2(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_4(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_8(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+
+#define this_cpu_cmpxchg_1(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_2(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_4(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_8(ptr, o, n) _protect_cmpxchg(ptr, o, n)
 
 #include <asm-generic/percpu.h>
 



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 04/11] this_cpu_cmpxchg: S390: switch this_cpu_cmpxchg to locked, add _local function
  2023-02-09 15:01 [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2023-02-09 15:01 ` [PATCH v2 03/11] this_cpu_cmpxchg: loongarch: " Marcelo Tosatti
@ 2023-02-09 15:01 ` Marcelo Tosatti
  2023-02-09 15:01 ` [PATCH v2 05/11] this_cpu_cmpxchg: x86: " Marcelo Tosatti
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-02-09 15:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer the per-CPU
counters to the global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
add s390's this_cpu_cmpxchg_local_ helpers.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/arch/s390/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/s390/include/asm/percpu.h
+++ linux-vmstat-remote/arch/s390/include/asm/percpu.h
@@ -148,6 +148,11 @@
 #define this_cpu_cmpxchg_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 #define this_cpu_cmpxchg_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+
 #define arch_this_cpu_xchg(pcp, nval)					\
 ({									\
 	typeof(pcp) *ptr__;						\



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 05/11] this_cpu_cmpxchg: x86: switch this_cpu_cmpxchg to locked, add _local function
  2023-02-09 15:01 [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2023-02-09 15:01 ` [PATCH v2 04/11] this_cpu_cmpxchg: S390: " Marcelo Tosatti
@ 2023-02-09 15:01 ` Marcelo Tosatti
  2023-02-09 15:01 ` [PATCH v2 06/11] this_cpu_cmpxchg: asm-generic: " Marcelo Tosatti
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-02-09 15:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer the per-CPU
counters to the global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
change x86's this_cpu_cmpxchg_ helpers to be atomic,
and add this_cpu_cmpxchg_local_ helpers which are not atomic.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/arch/x86/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/x86/include/asm/percpu.h
+++ linux-vmstat-remote/arch/x86/include/asm/percpu.h
@@ -197,11 +197,11 @@ do {									\
  * cmpxchg has no such implied lock semantics as a result it is much
  * more efficient for cpu local operations.
  */
-#define percpu_cmpxchg_op(size, qual, _var, _oval, _nval)		\
+#define percpu_cmpxchg_op(size, qual, _var, _oval, _nval, lockp)	\
 ({									\
 	__pcpu_type_##size pco_old__ = __pcpu_cast_##size(_oval);	\
 	__pcpu_type_##size pco_new__ = __pcpu_cast_##size(_nval);	\
-	asm qual (__pcpu_op2_##size("cmpxchg", "%[nval]",		\
+	asm qual (__pcpu_op2_##size(lockp "cmpxchg", "%[nval]",		\
 				    __percpu_arg([var]))		\
 		  : [oval] "+a" (pco_old__),				\
 		    [var] "+m" (_var)					\
@@ -279,16 +279,20 @@ do {									\
 #define raw_cpu_add_return_1(pcp, val)		percpu_add_return_op(1, , pcp, val)
 #define raw_cpu_add_return_2(pcp, val)		percpu_add_return_op(2, , pcp, val)
 #define raw_cpu_add_return_4(pcp, val)		percpu_add_return_op(4, , pcp, val)
-#define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, , pcp, oval, nval)
-#define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, , pcp, oval, nval)
-#define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, , pcp, oval, nval)
+#define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, , pcp, oval, nval, "")
+#define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, , pcp, oval, nval, "")
+#define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, , pcp, oval, nval, "")
 
 #define this_cpu_add_return_1(pcp, val)		percpu_add_return_op(1, volatile, pcp, val)
 #define this_cpu_add_return_2(pcp, val)		percpu_add_return_op(2, volatile, pcp, val)
 #define this_cpu_add_return_4(pcp, val)		percpu_add_return_op(4, volatile, pcp, val)
-#define this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, volatile, pcp, oval, nval)
-#define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval)
-#define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval)
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval)	percpu_cmpxchg_op(1, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval, "")
+
+#define this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, volatile, pcp, oval, nval, LOCK_PREFIX)
+#define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval, LOCK_PREFIX)
+#define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval, LOCK_PREFIX)
 
 #ifdef CONFIG_X86_CMPXCHG64
 #define percpu_cmpxchg8b_double(pcp1, pcp2, o1, o2, n1, n2)		\
@@ -319,16 +323,17 @@ do {									\
 #define raw_cpu_or_8(pcp, val)			percpu_to_op(8, , "or", (pcp), val)
 #define raw_cpu_add_return_8(pcp, val)		percpu_add_return_op(8, , pcp, val)
 #define raw_cpu_xchg_8(pcp, nval)		raw_percpu_xchg_op(pcp, nval)
-#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, , pcp, oval, nval)
+#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, , pcp, oval, nval, "")
 
-#define this_cpu_read_8(pcp)			percpu_from_op(8, volatile, "mov", pcp)
-#define this_cpu_write_8(pcp, val)		percpu_to_op(8, volatile, "mov", (pcp), val)
-#define this_cpu_add_8(pcp, val)		percpu_add_op(8, volatile, (pcp), val)
-#define this_cpu_and_8(pcp, val)		percpu_to_op(8, volatile, "and", (pcp), val)
-#define this_cpu_or_8(pcp, val)			percpu_to_op(8, volatile, "or", (pcp), val)
-#define this_cpu_add_return_8(pcp, val)		percpu_add_return_op(8, volatile, pcp, val)
-#define this_cpu_xchg_8(pcp, nval)		percpu_xchg_op(8, volatile, pcp, nval)
-#define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, volatile, pcp, oval, nval)
+#define this_cpu_read_8(pcp)				percpu_from_op(8, volatile, "mov", pcp)
+#define this_cpu_write_8(pcp, val)			percpu_to_op(8, volatile, "mov", (pcp), val)
+#define this_cpu_add_8(pcp, val)			percpu_add_op(8, volatile, (pcp), val)
+#define this_cpu_and_8(pcp, val)			percpu_to_op(8, volatile, "and", (pcp), val)
+#define this_cpu_or_8(pcp, val)				percpu_to_op(8, volatile, "or", (pcp), val)
+#define this_cpu_add_return_8(pcp, val)			percpu_add_return_op(8, volatile, pcp, val)
+#define this_cpu_xchg_8(pcp, nval)			percpu_xchg_op(8, volatile, pcp, nval)
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval)	percpu_cmpxchg_op(8, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_8(pcp, oval, nval)		percpu_cmpxchg_op(8, volatile, pcp, oval, nval, LOCK_PREFIX)
 
 /*
  * Pretty complex macro to generate cmpxchg16 instruction.  The instruction



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 06/11] this_cpu_cmpxchg: asm-generic: switch this_cpu_cmpxchg to locked, add _local function
  2023-02-09 15:01 [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (4 preceding siblings ...)
  2023-02-09 15:01 ` [PATCH v2 05/11] this_cpu_cmpxchg: x86: " Marcelo Tosatti
@ 2023-02-09 15:01 ` Marcelo Tosatti
  2023-02-09 15:01 ` [PATCH v2 07/11] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-02-09 15:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer the per-CPU
counters to the global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Add this_cpu_cmpxchg_local_ helpers to asm-generic/percpu.h.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/include/asm-generic/percpu.h
===================================================================
--- linux-vmstat-remote.orig/include/asm-generic/percpu.h
+++ linux-vmstat-remote/include/asm-generic/percpu.h
@@ -424,6 +424,23 @@ do {									\
 	this_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
 
+#ifndef this_cpu_cmpxchg_local_1
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_2
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_4
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_8
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+
 #ifndef this_cpu_cmpxchg_double_1
 #define this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \
 	this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 07/11] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local
  2023-02-09 15:01 [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (5 preceding siblings ...)
  2023-02-09 15:01 ` [PATCH v2 06/11] this_cpu_cmpxchg: asm-generic: " Marcelo Tosatti
@ 2023-02-09 15:01 ` Marcelo Tosatti
  2023-03-02 20:54   ` Peter Xu
  2023-02-09 15:01 ` [PATCH v2 08/11] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Marcelo Tosatti @ 2023-02-09 15:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Marcelo Tosatti

this_cpu_cmpxchg was switched to the atomic version, which
can be more costly than the non-atomic one.

Convert users of this_cpu_cmpxchg to this_cpu_cmpxchg_local
(which preserves the previous, non-atomic this_cpu_cmpxchg behaviour).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/kernel/fork.c
===================================================================
--- linux-vmstat-remote.orig/kernel/fork.c
+++ linux-vmstat-remote/kernel/fork.c
@@ -203,7 +203,7 @@ static bool try_release_thread_stack_to_
 	unsigned int i;
 
 	for (i = 0; i < NR_CACHED_STACKS; i++) {
-		if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm) != NULL)
+		if (this_cpu_cmpxchg_local(cached_stacks[i], NULL, vm) != NULL)
 			continue;
 		return true;
 	}
Index: linux-vmstat-remote/kernel/scs.c
===================================================================
--- linux-vmstat-remote.orig/kernel/scs.c
+++ linux-vmstat-remote/kernel/scs.c
@@ -79,7 +79,7 @@ void scs_free(void *s)
 	 */
 
 	for (i = 0; i < NR_CACHED_SCS; i++)
-		if (this_cpu_cmpxchg(scs_cache[i], 0, s) == NULL)
+		if (this_cpu_cmpxchg_local(scs_cache[i], 0, s) == NULL)
 			return;
 
 	kasan_unpoison_vmalloc(s, SCS_SIZE, KASAN_VMALLOC_PROT_NORMAL);



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 08/11] mm/vmstat: switch counter modification to cmpxchg
  2023-02-09 15:01 [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (6 preceding siblings ...)
  2023-02-09 15:01 ` [PATCH v2 07/11] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
@ 2023-02-09 15:01 ` Marcelo Tosatti
  2023-03-02 10:47   ` David Hildenbrand
  2023-02-09 15:01 ` [PATCH v2 09/11] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold Marcelo Tosatti
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Marcelo Tosatti @ 2023-02-09 15:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Marcelo Tosatti

In preparation for switching vmstat_shepherd to flush
per-CPU counters remotely, switch all functions that
modify the counters to use cmpxchg.

To measure the performance difference, the page allocator microbenchmark
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
was used with loops=1000000 on an Intel Core i7-11850H @ 2.50GHz.

For the single_page_alloc_free test, which does

        /** Loop to measure **/
        for (i = 0; i < rec->loops; i++) {
                my_page = alloc_page(gfp_mask);
                if (unlikely(my_page == NULL))
                        return 0;
                __free_page(my_page);
        }

Unit is cycles.

Vanilla			Patched		Diff
159			165		3.7%

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -334,6 +334,188 @@ void set_pgdat_percpu_threshold(pg_data_
 	}
 }
 
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/*
+ * If we have cmpxchg_local support then we do not need to incur the overhead
+ * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
+ *
+ * mod_state() modifies the zone counter state through atomic per cpu
+ * operations.
+ *
+ * Overstep mode specifies how overstep should handled:
+ *     0       No overstepping
+ *     1       Overstepping half of threshold
+ *     -1      Overstepping minus half of threshold
+ */
+static inline void mod_zone_state(struct zone *zone, enum zone_stat_item item,
+				  long delta, int overstep_mode)
+{
+	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
+	s8 __percpu *p = pcp->vm_stat_diff + item;
+	long o, n, t, z;
+
+	do {
+		z = 0;  /* overflow to zone counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a zone.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (abs(n) > t) {
+			int os = overstep_mode * (t >> 1);
+
+			/* Overflow must be added to zone counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		zone_page_state_add(z, zone, item);
+}
+
+void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+			 long delta)
+{
+	mod_zone_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_zone_page_state);
+
+void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+			   long delta)
+{
+	mod_zone_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_zone_page_state);
+
+void inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_zone_page_state);
+
+void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(__inc_zone_page_state);
+
+void dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_zone_page_state);
+
+void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(__dec_zone_page_state);
+
+static inline void mod_node_state(struct pglist_data *pgdat,
+				  enum node_stat_item item,
+				  int delta, int overstep_mode)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	long o, n, t, z;
+
+	if (vmstat_item_in_bytes(item)) {
+		/*
+		 * Only cgroups use subpage accounting right now; at
+		 * the global level, these items still change in
+		 * multiples of whole pages. Store them as pages
+		 * internally to keep the per-cpu counters compact.
+		 */
+		VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
+		delta >>= PAGE_SHIFT;
+	}
+
+	do {
+		z = 0;  /* overflow to node counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a node.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (abs(n) > t) {
+			int os = overstep_mode * (t >> 1);
+
+			/* Overflow must be added to node counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		node_page_state_add(z, pgdat, item);
+}
+
+void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_node_page_state);
+
+void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_node_page_state);
+
+void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	mod_node_state(pgdat, item, 1, 1);
+}
+
+void inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_node_page_state);
+
+void __inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(__inc_node_page_state);
+
+void dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_node_page_state);
+
+void __dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(__dec_node_page_state);
+#else
 /*
  * For use when we know that interrupts are disabled,
  * or when we know that preemption is disabled and that
@@ -541,149 +723,6 @@ void __dec_node_page_state(struct page *
 }
 EXPORT_SYMBOL(__dec_node_page_state);
 
-#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
-/*
- * If we have cmpxchg_local support then we do not need to incur the overhead
- * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
- *
- * mod_state() modifies the zone counter state through atomic per cpu
- * operations.
- *
- * Overstep mode specifies how overstep should handled:
- *     0       No overstepping
- *     1       Overstepping half of threshold
- *     -1      Overstepping minus half of threshold
-*/
-static inline void mod_zone_state(struct zone *zone,
-       enum zone_stat_item item, long delta, int overstep_mode)
-{
-	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	long o, n, t, z;
-
-	do {
-		z = 0;  /* overflow to zone counters */
-
-		/*
-		 * The fetching of the stat_threshold is racy. We may apply
-		 * a counter threshold to the wrong the cpu if we get
-		 * rescheduled while executing here. However, the next
-		 * counter update will apply the threshold again and
-		 * therefore bring the counter under the threshold again.
-		 *
-		 * Most of the time the thresholds are the same anyways
-		 * for all cpus in a zone.
-		 */
-		t = this_cpu_read(pcp->stat_threshold);
-
-		o = this_cpu_read(*p);
-		n = delta + o;
-
-		if (abs(n) > t) {
-			int os = overstep_mode * (t >> 1) ;
-
-			/* Overflow must be added to zone counters */
-			z = n + os;
-			n = -os;
-		}
-	} while (this_cpu_cmpxchg(*p, o, n) != o);
-
-	if (z)
-		zone_page_state_add(z, zone, item);
-}
-
-void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
-			 long delta)
-{
-	mod_zone_state(zone, item, delta, 0);
-}
-EXPORT_SYMBOL(mod_zone_page_state);
-
-void inc_zone_page_state(struct page *page, enum zone_stat_item item)
-{
-	mod_zone_state(page_zone(page), item, 1, 1);
-}
-EXPORT_SYMBOL(inc_zone_page_state);
-
-void dec_zone_page_state(struct page *page, enum zone_stat_item item)
-{
-	mod_zone_state(page_zone(page), item, -1, -1);
-}
-EXPORT_SYMBOL(dec_zone_page_state);
-
-static inline void mod_node_state(struct pglist_data *pgdat,
-       enum node_stat_item item, int delta, int overstep_mode)
-{
-	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	long o, n, t, z;
-
-	if (vmstat_item_in_bytes(item)) {
-		/*
-		 * Only cgroups use subpage accounting right now; at
-		 * the global level, these items still change in
-		 * multiples of whole pages. Store them as pages
-		 * internally to keep the per-cpu counters compact.
-		 */
-		VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
-		delta >>= PAGE_SHIFT;
-	}
-
-	do {
-		z = 0;  /* overflow to node counters */
-
-		/*
-		 * The fetching of the stat_threshold is racy. We may apply
-		 * a counter threshold to the wrong the cpu if we get
-		 * rescheduled while executing here. However, the next
-		 * counter update will apply the threshold again and
-		 * therefore bring the counter under the threshold again.
-		 *
-		 * Most of the time the thresholds are the same anyways
-		 * for all cpus in a node.
-		 */
-		t = this_cpu_read(pcp->stat_threshold);
-
-		o = this_cpu_read(*p);
-		n = delta + o;
-
-		if (abs(n) > t) {
-			int os = overstep_mode * (t >> 1) ;
-
-			/* Overflow must be added to node counters */
-			z = n + os;
-			n = -os;
-		}
-	} while (this_cpu_cmpxchg(*p, o, n) != o);
-
-	if (z)
-		node_page_state_add(z, pgdat, item);
-}
-
-void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
-					long delta)
-{
-	mod_node_state(pgdat, item, delta, 0);
-}
-EXPORT_SYMBOL(mod_node_page_state);
-
-void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
-{
-	mod_node_state(pgdat, item, 1, 1);
-}
-
-void inc_node_page_state(struct page *page, enum node_stat_item item)
-{
-	mod_node_state(page_pgdat(page), item, 1, 1);
-}
-EXPORT_SYMBOL(inc_node_page_state);
-
-void dec_node_page_state(struct page *page, enum node_stat_item item)
-{
-	mod_node_state(page_pgdat(page), item, -1, -1);
-}
-EXPORT_SYMBOL(dec_node_page_state);
-#else
 /*
  * Use interrupt disable to serialize counter updates
  */
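
For illustration only (not part of the patch), a worked example of the
overstep logic in mod_zone_state()/mod_node_state() above, with
made-up numbers:

	/*
	 * Hypothetical values: per-cpu threshold t = 32, overstep_mode = 1
	 * (an increment), current per-cpu diff o = 32, delta = +1:
	 *
	 *   n  = o + delta = 33, and abs(n) > t
	 *   os = t >> 1    = 16
	 *   z  = n + os    = 49  -> added to the global zone/node counter
	 *   n  = -os       = -16 -> left in the per-cpu diff
	 *
	 * Net effect: 49 + (-16) - 32 = +1, i.e. the delta is preserved,
	 * while the per-cpu diff is pushed half a threshold below zero so
	 * the next few increments do not immediately overflow again.
	 */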



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 09/11] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold
  2023-02-09 15:01 [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (7 preceding siblings ...)
  2023-02-09 15:01 ` [PATCH v2 08/11] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
@ 2023-02-09 15:01 ` Marcelo Tosatti
  2023-03-01 22:57   ` Peter Xu
  2023-02-09 15:02 ` [PATCH v2 10/11] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Marcelo Tosatti @ 2023-02-09 15:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Marcelo Tosatti

In preparation for switching vmstat_shepherd to flush
per-CPU counters remotely, use a cmpxchg loop
instead of a pair of read/write instructions.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -885,7 +885,7 @@ static int refresh_cpu_vm_stats(void)
 }
 
 /*
- * Fold the data for an offline cpu into the global array.
+ * Fold the data for a cpu into the global array.
  * There cannot be any access by the offline cpu and therefore
  * synchronization is simplified.
  */
@@ -906,8 +906,9 @@ void cpu_vm_stats_fold(int cpu)
 			if (pzstats->vm_stat_diff[i]) {
 				int v;
 
-				v = pzstats->vm_stat_diff[i];
-				pzstats->vm_stat_diff[i] = 0;
+				do {
+					v = pzstats->vm_stat_diff[i];
+				} while (!try_cmpxchg(&pzstats->vm_stat_diff[i], &v, 0));
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
 			}
@@ -917,8 +918,9 @@ void cpu_vm_stats_fold(int cpu)
 			if (pzstats->vm_numa_event[i]) {
 				unsigned long v;
 
-				v = pzstats->vm_numa_event[i];
-				pzstats->vm_numa_event[i] = 0;
+				do {
+					v = pzstats->vm_numa_event[i];
+				} while (!try_cmpxchg(&pzstats->vm_numa_event[i], &v, 0));
 				zone_numa_event_add(v, zone, i);
 			}
 		}
@@ -934,8 +936,9 @@ void cpu_vm_stats_fold(int cpu)
 			if (p->vm_node_stat_diff[i]) {
 				int v;
 
-				v = p->vm_node_stat_diff[i];
-				p->vm_node_stat_diff[i] = 0;
+				do {
+					v = p->vm_node_stat_diff[i];
+				} while	(!try_cmpxchg(&p->vm_node_stat_diff[i], &v, 0));
 				atomic_long_add(v, &pgdat->vm_stat[i]);
 				global_node_diff[i] += v;
 			}



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 10/11] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely
  2023-02-09 15:01 [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (8 preceding siblings ...)
  2023-02-09 15:01 ` [PATCH v2 09/11] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold Marcelo Tosatti
@ 2023-02-09 15:02 ` Marcelo Tosatti
  2023-03-02 21:01   ` Peter Xu
  2023-02-09 15:02 ` [PATCH v2 11/11] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
  2023-02-23 14:54 ` [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
  11 siblings, 1 reply; 47+ messages in thread
From: Marcelo Tosatti @ 2023-02-09 15:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Marcelo Tosatti

Now that the counters are modified via cmpxchg both CPU-locally
(via the account functions) and remotely (via cpu_vm_stats_fold),
it's possible to switch vmstat_shepherd to perform the per-CPU
vmstats folding remotely.

This fixes the following two problems:

 1. A customer provided some evidence which indicates that
    the idle tick was stopped; albeit, CPU-specific vmstat
    counters still remained populated.

    Thus one can only assume quiet_vmstat() was not
    invoked on return to the idle loop. If I understand
    correctly, I suspect this divergence might erroneously
    prevent a reclaim attempt by kswapd. If the number of
    zone specific free pages are below their per-cpu drift
    value then zone_page_state_snapshot() is used to
    compute a more accurate view of the aforementioned
    statistic.  Thus any task blocked on the NUMA node
    specific pfmemalloc_wait queue will be unable to make
    significant progress via direct reclaim unless it is
    killed after being woken up by kswapd
    (see throttle_direct_reclaim())

 2. With a SCHED_FIFO task that busy loops on a given CPU,
    and kworker for that CPU at SCHED_OTHER priority,
    queuing work to sync per-vmstats will either cause that
    work to never execute, or stalld (i.e. stall daemon)
    boosts kworker priority which causes a latency
    violation

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -2007,6 +2007,23 @@ static void vmstat_shepherd(struct work_
 
 static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);
 
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
+static void vmstat_shepherd(struct work_struct *w)
+{
+	int cpu;
+
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		cpu_vm_stats_fold(cpu);
+		cond_resched();
+	}
+	cpus_read_unlock();
+
+	schedule_delayed_work(&shepherd,
+		round_jiffies_relative(sysctl_stat_interval));
+}
+#else
 static void vmstat_shepherd(struct work_struct *w)
 {
 	int cpu;
@@ -2026,6 +2043,7 @@ static void vmstat_shepherd(struct work_
 	schedule_delayed_work(&shepherd,
 		round_jiffies_relative(sysctl_stat_interval));
 }
+#endif
 
 static void __init start_shepherd_timer(void)
 {



^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v2 11/11] mm/vmstat: refresh stats remotely instead of via work item
  2023-02-09 15:01 [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (9 preceding siblings ...)
  2023-02-09 15:02 ` [PATCH v2 10/11] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
@ 2023-02-09 15:02 ` Marcelo Tosatti
  2023-02-23 14:54 ` [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
  11 siblings, 0 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-02-09 15:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Marcelo Tosatti

Refresh per-CPU stats remotely, instead of queueing
work items, for the stat_refresh procfs method.

This fixes a sosreport hang (sosreport uses vmstat_refresh) in the
presence of a spinning SCHED_FIFO process.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -1865,11 +1865,21 @@ static DEFINE_PER_CPU(struct delayed_wor
 int sysctl_stat_interval __read_mostly = HZ;
 
 #ifdef CONFIG_PROC_FS
+
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+static int refresh_all_vm_stats(void);
+#else
 static void refresh_vm_stats(struct work_struct *work)
 {
 	refresh_cpu_vm_stats();
 }
 
+static int refresh_all_vm_stats(void)
+{
+	return schedule_on_each_cpu(refresh_vm_stats);
+}
+#endif
+
 int vmstat_refresh(struct ctl_table *table, int write,
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
@@ -1889,7 +1899,7 @@ int vmstat_refresh(struct ctl_table *tab
 	 * transiently negative values, report an error here if any of
 	 * the stats is negative, so we know to go looking for imbalance.
 	 */
-	err = schedule_on_each_cpu(refresh_vm_stats);
+	err = refresh_all_vm_stats();
 	if (err)
 		return err;
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
@@ -2009,7 +2019,7 @@ static DECLARE_DEFERRABLE_WORK(shepherd,
 
 #ifdef CONFIG_HAVE_CMPXCHG_LOCAL
 /* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
-static void vmstat_shepherd(struct work_struct *w)
+static int refresh_all_vm_stats(void)
 {
 	int cpu;
 
@@ -2019,7 +2029,12 @@ static void vmstat_shepherd(struct work_
 		cond_resched();
 	}
 	cpus_read_unlock();
+	return 0;
+}
 
+static void vmstat_shepherd(struct work_struct *w)
+{
+	refresh_all_vm_stats();
 	schedule_delayed_work(&shepherd,
 		round_jiffies_relative(sysctl_stat_interval));
 }



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 00/11] fold per-CPU vmstats remotely
  2023-02-09 15:01 [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (10 preceding siblings ...)
  2023-02-09 15:02 ` [PATCH v2 11/11] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
@ 2023-02-23 14:54 ` Marcelo Tosatti
  2023-02-24  2:34   ` Hillf Danton
  11 siblings, 1 reply; 47+ messages in thread
From: Marcelo Tosatti @ 2023-02-23 14:54 UTC (permalink / raw)
  To: Christoph Lameter, Andrew Morton
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

On Thu, Feb 09, 2023 at 12:01:50PM -0300, Marcelo Tosatti wrote:
> This patch series addresses the following two problems:
> 
>     1. A customer provided some evidence which indicates that
>        the idle tick was stopped; albeit, CPU-specific vmstat
>        counters still remained populated.
> 
>        Thus one can only assume quiet_vmstat() was not
>        invoked on return to the idle loop. If I understand
>        correctly, I suspect this divergence might erroneously
>        prevent a reclaim attempt by kswapd. If the number of
>        zone specific free pages are below their per-cpu drift
>        value then zone_page_state_snapshot() is used to
>        compute a more accurate view of the aforementioned
>        statistic.  Thus any task blocked on the NUMA node
>        specific pfmemalloc_wait queue will be unable to make
>        significant progress via direct reclaim unless it is
>        killed after being woken up by kswapd
>        (see throttle_direct_reclaim())
> 
>     2. With a SCHED_FIFO task that busy loops on a given CPU,
>        and kworker for that CPU at SCHED_OTHER priority,
>        queuing work to sync per-vmstats will either cause that
>        work to never execute, or stalld (i.e. stall daemon)
>        boosts kworker priority which causes a latency
>        violation
> 
> By having vmstat_shepherd flush the per-CPU counters to the
> global counters from remote CPUs.
> 
> This is done using cmpxchg to manipulate the counters,
> both CPU locally (via the account functions),
> and remotely (via cpu_vm_stats_fold).
> 
> Thanks to Aaron Tomlin for diagnosing issue 1 and writing
> the initial patch series.
> 
> v2:
> - actually use LOCK CMPXCHG on counter mod/inc/dec functions
>   (Christoph Lameter)
> - use try_cmpxchg for cmpxchg loops
>   (Uros Bizjak / Matthew Wilcox)
> 
> 
>  arch/arm64/include/asm/percpu.h     |   16 ++-
>  arch/loongarch/include/asm/percpu.h |   23 ++++
>  arch/s390/include/asm/percpu.h      |    5 +
>  arch/x86/include/asm/percpu.h       |   39 ++++----
>  include/asm-generic/percpu.h        |   17 +++
>  include/linux/mmzone.h              |    3 
>  kernel/fork.c                       |    2 
>  kernel/scs.c                        |    2 
>  mm/vmstat.c                         |  424 ++++++++++++++++++++++++++++++++++++++++++++++++------------------------------------------
>  9 files changed, 308 insertions(+), 223 deletions(-)

Friendly ping, any other concerns with this series?

If not, Acked-by or Reviewed-by tags would be welcome.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 00/11] fold per-CPU vmstats remotely
  2023-02-23 14:54 ` [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
@ 2023-02-24  2:34   ` Hillf Danton
  2023-02-27 19:41     ` Marcelo Tosatti
  0 siblings, 1 reply; 47+ messages in thread
From: Hillf Danton @ 2023-02-24  2:34 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Andrew Morton, Aaron Tomlin,
	Frederic Weisbecker, linux-kernel, linux-mm

On Thu, Feb 09, 2023 at 12:01:50PM -0300, Marcelo Tosatti wrote:
> This patch series addresses the following two problems:
> 
>     1. A customer provided some evidence which indicates that
>        the idle tick was stopped; albeit, CPU-specific vmstat
>        counters still remained populated.
> 
>        Thus one can only assume quiet_vmstat() was not
>        invoked on return to the idle loop. If I understand
>        correctly, I suspect this divergence might erroneously
>        prevent a reclaim attempt by kswapd. If the number of
>        zone specific free pages are below their per-cpu drift
>        value then zone_page_state_snapshot() is used to
>        compute a more accurate view of the aforementioned
>        statistic.  Thus any task blocked on the NUMA node
>        specific pfmemalloc_wait queue will be unable to make
>        significant progress via direct reclaim unless it is
>        killed after being woken up by kswapd
>        (see throttle_direct_reclaim())
> 
>     2. With a SCHED_FIFO task that busy loops on a given CPU,
>        and kworker for that CPU at SCHED_OTHER priority,
>        queuing work to sync per-vmstats will either cause that
>        work to never execute, or stalld (i.e. stall daemon)
>        boosts kworker priority which causes a latency
>        violation
> 
> By having vmstat_shepherd flush the per-CPU counters to the
> global counters from remote CPUs.
> 
> This is done using cmpxchg to manipulate the counters,
> both CPU locally (via the account functions),
> and remotely (via cpu_vm_stats_fold).

Frankly another case of bandaid[1] ?

[1] https://lore.kernel.org/lkml/20230223150624.GA29739@lst.de/


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 00/11] fold per-CPU vmstats remotely
  2023-02-24  2:34   ` Hillf Danton
@ 2023-02-27 19:41     ` Marcelo Tosatti
  0 siblings, 0 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-02-27 19:41 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Christoph Lameter, Andrew Morton, Aaron Tomlin,
	Frederic Weisbecker, linux-kernel, linux-mm

On Fri, Feb 24, 2023 at 10:34:10AM +0800, Hillf Danton wrote:
> On Thu, Feb 09, 2023 at 12:01:50PM -0300, Marcelo Tosatti wrote:
> > This patch series addresses the following two problems:
> > 
> >     1. A customer provided some evidence which indicates that
> >        the idle tick was stopped; albeit, CPU-specific vmstat
> >        counters still remained populated.
> > 
> >        Thus one can only assume quiet_vmstat() was not
> >        invoked on return to the idle loop. If I understand
> >        correctly, I suspect this divergence might erroneously
> >        prevent a reclaim attempt by kswapd. If the number of
> >        zone specific free pages are below their per-cpu drift
> >        value then zone_page_state_snapshot() is used to
> >        compute a more accurate view of the aforementioned
> >        statistic.  Thus any task blocked on the NUMA node
> >        specific pfmemalloc_wait queue will be unable to make
> >        significant progress via direct reclaim unless it is
> >        killed after being woken up by kswapd
> >        (see throttle_direct_reclaim())
> > 
> >     2. With a SCHED_FIFO task that busy loops on a given CPU,
> >        and kworker for that CPU at SCHED_OTHER priority,
> >        queuing work to sync per-vmstats will either cause that
> >        work to never execute, or stalld (i.e. stall daemon)
> >        boosts kworker priority which causes a latency
> >        violation
> > 
> > By having vmstat_shepherd flush the per-CPU counters to the
> > global counters from remote CPUs.
> > 
> > This is done using cmpxchg to manipulate the counters,
> > both CPU locally (via the account functions),
> > and remotely (via cpu_vm_stats_fold).
> 
> Frankly another case of bandaid[1] ?
> 
> [1] https://lore.kernel.org/lkml/20230223150624.GA29739@lst.de/

Only if you disable per-CPU vmstat counters for isolated CPUs
(then maintenance of the data structures on isolated CPUs is
not necessary).

That, however, would be terrible for performance.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 01/11] mm/vmstat: remove remote node draining
  2023-02-09 15:01 ` [PATCH v2 01/11] mm/vmstat: remove remote node draining Marcelo Tosatti
@ 2023-02-28 15:53   ` David Hildenbrand
  2023-02-28 19:36     ` Marcelo Tosatti
  2023-03-02 17:21   ` Peter Xu
  1 sibling, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2023-02-28 15:53 UTC (permalink / raw)
  To: Marcelo Tosatti, Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Mel Gorman

On 09.02.23 16:01, Marcelo Tosatti wrote:
> Draining of pages from the local pcp for a remote zone was necessary
> since:
> 
> "Note that remote node draining is a somewhat esoteric feature that is
> required on large NUMA systems because otherwise significant portions
> of system memory can become trapped in pcp queues. The number of pcp is
> determined by the number of processors and nodes in a system. A system
> with 4 processors and 2 nodes has 8 pcps which is okay. But a system
> with 1024 processors and 512 nodes has 512k pcps with a high potential
> for large amount of memory being caught in them."
> 
> Since commit 443c2accd1b6679a1320167f8f56eed6536b806e
> ("mm/page_alloc: remotely drain per-cpu lists"), drain_all_pages() is able
> to remotely free those pages when necessary.
> 

I'm a bit new to that piece of code, so sorry for the dummy questions. 
I'm staring at linux master,

(1) I think you're removing the single user of drain_zone_pages(). So we
     should remove drain_zone_pages() as well.

(2) drain_zone_pages() documents that we're draining the PCP
     (bulk-freeing them) of the current CPU on remote nodes. That bulk-
     freeing will properly adjust free memory counters. What exactly is
     the impact when no longer doing that? Won't the "snapshot" of some
     counters eventually be wrong? Do we care?

Describing the difference between instructed refresh of vmstat and 
"remotely drain per-cpu lists" in order to move free memory from the pcp 
to the buddy would be great.

Because removing this code here looks nice, but I am not 100% sure about 
the implications. CCing Mel as well.


> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Index: linux-vmstat-remote/include/linux/mmzone.h
> ===================================================================
> --- linux-vmstat-remote.orig/include/linux/mmzone.h
> +++ linux-vmstat-remote/include/linux/mmzone.h
> @@ -577,9 +577,6 @@ struct per_cpu_pages {
>   	int high;		/* high watermark, emptying needed */
>   	int batch;		/* chunk size for buddy add/remove */
>   	short free_factor;	/* batch scaling factor during free */
> -#ifdef CONFIG_NUMA
> -	short expire;		/* When 0, remote pagesets are drained */
> -#endif
>   
>   	/* Lists of pages, one per migrate type stored on the pcp-lists */
>   	struct list_head lists[NR_PCP_LISTS];
> Index: linux-vmstat-remote/mm/vmstat.c
> ===================================================================
> --- linux-vmstat-remote.orig/mm/vmstat.c
> +++ linux-vmstat-remote/mm/vmstat.c
> @@ -803,7 +803,7 @@ static int fold_diff(int *zone_diff, int
>    *
>    * The function returns the number of global counters updated.
>    */
> -static int refresh_cpu_vm_stats(bool do_pagesets)
> +static int refresh_cpu_vm_stats(void)
>   {
>   	struct pglist_data *pgdat;
>   	struct zone *zone;
> @@ -814,9 +814,6 @@ static int refresh_cpu_vm_stats(bool do_
>   
>   	for_each_populated_zone(zone) {
>   		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
> -#ifdef CONFIG_NUMA
> -		struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
> -#endif
>   
>   		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
>   			int v;
> @@ -826,44 +823,8 @@ static int refresh_cpu_vm_stats(bool do_
>   
>   				atomic_long_add(v, &zone->vm_stat[i]);
>   				global_zone_diff[i] += v;
> -#ifdef CONFIG_NUMA
> -				/* 3 seconds idle till flush */
> -				__this_cpu_write(pcp->expire, 3);
> -#endif
>   			}
>   		}
> -#ifdef CONFIG_NUMA
> -
> -		if (do_pagesets) {
> -			cond_resched();
> -			/*
> -			 * Deal with draining the remote pageset of this
> -			 * processor
> -			 *
> -			 * Check if there are pages remaining in this pageset
> -			 * if not then there is nothing to expire.
> -			 */
> -			if (!__this_cpu_read(pcp->expire) ||
> -			       !__this_cpu_read(pcp->count))
> -				continue;
> -
> -			/*
> -			 * We never drain zones local to this processor.
> -			 */
> -			if (zone_to_nid(zone) == numa_node_id()) {
> -				__this_cpu_write(pcp->expire, 0);
> -				continue;
> -			}
> -
> -			if (__this_cpu_dec_return(pcp->expire))
> -				continue;
> -
> -			if (__this_cpu_read(pcp->count)) {
> -				drain_zone_pages(zone, this_cpu_ptr(pcp));
> -				changes++;
> -			}
> -		}
> -#endif
>   	}

I think you can then also get rid of the "changes" local variable and do 
a "return fold_diff(global_zone_diff, global_node_diff);" directly.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 01/11] mm/vmstat: remove remote node draining
  2023-02-28 15:53   ` David Hildenbrand
@ 2023-02-28 19:36     ` Marcelo Tosatti
  2023-03-02 10:10       ` David Hildenbrand
  0 siblings, 1 reply; 47+ messages in thread
From: Marcelo Tosatti @ 2023-02-28 19:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Mel Gorman

On Tue, Feb 28, 2023 at 04:53:47PM +0100, David Hildenbrand wrote:
> On 09.02.23 16:01, Marcelo Tosatti wrote:
> > Draining of pages from the local pcp for a remote zone was necessary
> > since:
> > 
> > "Note that remote node draining is a somewhat esoteric feature that is
> > required on large NUMA systems because otherwise significant portions
> > of system memory can become trapped in pcp queues. The number of pcp is
> > determined by the number of processors and nodes in a system. A system
> > with 4 processors and 2 nodes has 8 pcps which is okay. But a system
> > with 1024 processors and 512 nodes has 512k pcps with a high potential
> > for large amount of memory being caught in them."
> > 
> > Since commit 443c2accd1b6679a1320167f8f56eed6536b806e
> > ("mm/page_alloc: remotely drain per-cpu lists"), drain_all_pages() is able
> > to remotely free those pages when necessary.
> > 
> 
> I'm a bit new to that piece of code, so sorry for the dummy questions. I'm
> staring at linux master,
> 
> (1) I think you're removing the single user of drain_zone_pages(). So we
>     should remove drain_zone_pages() as well.

Done.

> (2) drain_zone_pages() documents that we're draining the PCP
>     (bulk-freeing them) of the current CPU on remote nodes. That bulk-
>     freeing will properly adjust free memory counters. What exactly is
>     the impact when no longer doing that? Won't the "snapshot" of some
>     counters eventually be wrong? Do we care?

I don't see why the snapshot of the counters would be wrong.

Instead of freeing pages on the pcp lists of remote nodes after they are
considered idle ("3 seconds idle till flush"), what will happen is that
drain_all_pages() will free those pcps, for example after an allocation
fails on direct reclaim:

        page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

        /*
         * If an allocation failed after direct reclaim, it could be because
         * pages are pinned on the per-cpu lists or in high alloc reserves.
         * Shrink them and try again
         */
        if (!page && !drained) {
                unreserve_highatomic_pageblock(ac, false);
                drain_all_pages(NULL);
                drained = true;
                goto retry;
        }

In both cases the pages are freed (and counters maintained) here:

static inline void __free_one_page(struct page *page,
                unsigned long pfn,
                struct zone *zone, unsigned int order,
                int migratetype, fpi_t fpi_flags)
{
        struct capture_control *capc = task_capc(zone);
        unsigned long buddy_pfn = 0;
        unsigned long combined_pfn;
        struct page *buddy;
        bool to_tail;

        VM_BUG_ON(!zone_is_initialized(zone));
        VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);

        VM_BUG_ON(migratetype == -1);
        if (likely(!is_migrate_isolate(migratetype)))
                __mod_zone_freepage_state(zone, 1 << order, migratetype);

        VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
        VM_BUG_ON_PAGE(bad_range(zone, page), page);

        while (order < MAX_ORDER - 1) {
                if (compaction_capture(capc, page, order, migratetype)) {
                        __mod_zone_freepage_state(zone, -(1 << order),
                                                                migratetype);
                        return;
                }

> Describing the difference between instructed refresh of vmstat and "remotely
> drain per-cpu lists" in order to move free memory from the pcp to the buddy
> would be great.

The difference is that now remote PCPs will be drained on demand, either via
kcompactd or direct reclaim (through drain_all_pages), when memory is
low.

For example, with the following test:

dd if=/dev/zero of=file bs=1M count=32000 on a tmpfs filesystem:

      kcompactd0-116     [005] ...1 228232.042873: drain_all_pages <-kcompactd_do_work
      kcompactd0-116     [005] ...1 228232.042873: __drain_all_pages <-kcompactd_do_work
              dd-479485  [003] ...1 228232.455130: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
              dd-479485  [011] ...1 228232.721994: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
     gnome-shell-3750    [015] ...1 228232.723729: __drain_all_pages <-__alloc_pages_slowpath.constprop.0

The commit message was indeed incorrect. Updated one:

"mm/vmstat: remove remote node draining

Draining of pages from the local pcp for a remote zone should not be
necessary, since once the system is low on memory (or compaction on a
zone is in effect), drain_all_pages should be called freeing any unused
pcps."

Thanks!

> Because removing this code here looks nice, but I am not 100% sure about the
> implications. CCing Mel as well.
> 
> 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > Index: linux-vmstat-remote/include/linux/mmzone.h
> > ===================================================================
> > --- linux-vmstat-remote.orig/include/linux/mmzone.h
> > +++ linux-vmstat-remote/include/linux/mmzone.h
> > @@ -577,9 +577,6 @@ struct per_cpu_pages {
> >   	int high;		/* high watermark, emptying needed */
> >   	int batch;		/* chunk size for buddy add/remove */
> >   	short free_factor;	/* batch scaling factor during free */
> > -#ifdef CONFIG_NUMA
> > -	short expire;		/* When 0, remote pagesets are drained */
> > -#endif
> >   	/* Lists of pages, one per migrate type stored on the pcp-lists */
> >   	struct list_head lists[NR_PCP_LISTS];
> > Index: linux-vmstat-remote/mm/vmstat.c
> > ===================================================================
> > --- linux-vmstat-remote.orig/mm/vmstat.c
> > +++ linux-vmstat-remote/mm/vmstat.c
> > @@ -803,7 +803,7 @@ static int fold_diff(int *zone_diff, int
> >    *
> >    * The function returns the number of global counters updated.
> >    */
> > -static int refresh_cpu_vm_stats(bool do_pagesets)
> > +static int refresh_cpu_vm_stats(void)
> >   {
> >   	struct pglist_data *pgdat;
> >   	struct zone *zone;
> > @@ -814,9 +814,6 @@ static int refresh_cpu_vm_stats(bool do_
> >   	for_each_populated_zone(zone) {
> >   		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
> > -#ifdef CONFIG_NUMA
> > -		struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
> > -#endif
> >   		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
> >   			int v;
> > @@ -826,44 +823,8 @@ static int refresh_cpu_vm_stats(bool do_
> >   				atomic_long_add(v, &zone->vm_stat[i]);
> >   				global_zone_diff[i] += v;
> > -#ifdef CONFIG_NUMA
> > -				/* 3 seconds idle till flush */
> > -				__this_cpu_write(pcp->expire, 3);
> > -#endif
> >   			}
> >   		}
> > -#ifdef CONFIG_NUMA
> > -
> > -		if (do_pagesets) {
> > -			cond_resched();
> > -			/*
> > -			 * Deal with draining the remote pageset of this
> > -			 * processor
> > -			 *
> > -			 * Check if there are pages remaining in this pageset
> > -			 * if not then there is nothing to expire.
> > -			 */
> > -			if (!__this_cpu_read(pcp->expire) ||
> > -			       !__this_cpu_read(pcp->count))
> > -				continue;
> > -
> > -			/*
> > -			 * We never drain zones local to this processor.
> > -			 */
> > -			if (zone_to_nid(zone) == numa_node_id()) {
> > -				__this_cpu_write(pcp->expire, 0);
> > -				continue;
> > -			}
> > -
> > -			if (__this_cpu_dec_return(pcp->expire))
> > -				continue;
> > -
> > -			if (__this_cpu_read(pcp->count)) {
> > -				drain_zone_pages(zone, this_cpu_ptr(pcp));
> > -				changes++;
> > -			}
> > -		}
> > -#endif
> >   	}
> 
> I think you can then also get rid of the "changes" local variable and do a
> "return fold_diff(global_zone_diff, global_node_diff);" directly.

Done.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 09/11] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold
  2023-02-09 15:01 ` [PATCH v2 09/11] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold Marcelo Tosatti
@ 2023-03-01 22:57   ` Peter Xu
  2023-03-02 13:55     ` Marcelo Tosatti
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2023-03-01 22:57 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Feb 09, 2023 at 12:01:59PM -0300, Marcelo Tosatti wrote:
>  /*
> - * Fold the data for an offline cpu into the global array.
> + * Fold the data for a cpu into the global array.
>   * There cannot be any access by the offline cpu and therefore
>   * synchronization is simplified.
>   */
> @@ -906,8 +906,9 @@ void cpu_vm_stats_fold(int cpu)
>  			if (pzstats->vm_stat_diff[i]) {
>  				int v;
>  
> -				v = pzstats->vm_stat_diff[i];
> -				pzstats->vm_stat_diff[i] = 0;
> +				do {
> +					v = pzstats->vm_stat_diff[i];
> +				} while (!try_cmpxchg(&pzstats->vm_stat_diff[i], &v, 0));

IIUC try_cmpxchg will update "v" already, so I'd assume this'll work the
same:

        while (!try_cmpxchg(&pzstats->vm_stat_diff[i], &v, 0));

Then I figured, maybe it's easier to use xchg()?
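
E.g. roughly like this -- an untested sketch only, reusing the names from
your patch (pzstats/zone/global_zone_diff) and assuming xchg() is fine for
the counter's type on the architectures we care about:

			if (pzstats->vm_stat_diff[i]) {
				/* atomically grab the pending delta and zero it */
				int v = xchg(&pzstats->vm_stat_diff[i], 0);

				atomic_long_add(v, &zone->vm_stat[i]);
				global_zone_diff[i] += v;
			}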

I've no knowledge at all on cpu offline code, so sorry if this will be a
naive question.  But from what I understand this should not be touched by
anyone else.  Reasons:

  (1) cpu_vm_stats_fold() is only called in page_alloc_cpu_dead(), and the
      comment says:
  
	/*
	 * Zero the differential counters of the dead processor
	 * so that the vm statistics are consistent.
	 *
	 * This is only okay since the processor is dead and cannot
	 * race with what we are doing.
	 */
	cpu_vm_stats_fold(cpu);

      so.. I think that's what it says..

  (2) If someone can modify the dead cpu's vm_stat_diff, what guarantees it
      won't be e.g. boosted again right after try_cmpxchg() / xchg()
      returns?  What to do with the left-overs?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 01/11] mm/vmstat: remove remote node draining
  2023-02-28 19:36     ` Marcelo Tosatti
@ 2023-03-02 10:10       ` David Hildenbrand
  2023-03-21 15:20         ` Mel Gorman
  0 siblings, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2023-03-02 10:10 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Mel Gorman,
	Vlastimil Babka

[...]

> 
>> (2) drain_zone_pages() documents that we're draining the PCP
>>      (bulk-freeing them) of the current CPU on remote nodes. That bulk-
>>      freeing will properly adjust free memory counters. What exactly is
>>      the impact when no longer doing that? Won't the "snapshot" of some
>>      counters eventually be wrong? Do we care?
> 
> Don't see why the snapshot of counters will be wrong.
> 
> Instead of freeing pages on pcp list of remote nodes after they are
> considered idle ("3 seconds idle till flush"), what will happen is that
> drain_all_pages() will free those pcps, for example after an allocation
> fails on direct reclaim:
> 
>          page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> 
>          /*
>           * If an allocation failed after direct reclaim, it could be because
>           * pages are pinned on the per-cpu lists or in high alloc reserves.
>           * Shrink them and try again
>           */
>          if (!page && !drained) {
>                  unreserve_highatomic_pageblock(ac, false);
>                  drain_all_pages(NULL);
>                  drained = true;
>                  goto retry;
>          }
> 
> In both cases the pages are freed (and counters maintained) here:
> 
> static inline void __free_one_page(struct page *page,
>                  unsigned long pfn,
>                  struct zone *zone, unsigned int order,
>                  int migratetype, fpi_t fpi_flags)
> {
>          struct capture_control *capc = task_capc(zone);
>          unsigned long buddy_pfn = 0;
>          unsigned long combined_pfn;
>          struct page *buddy;
>          bool to_tail;
> 
>          VM_BUG_ON(!zone_is_initialized(zone));
>          VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> 
>          VM_BUG_ON(migratetype == -1);
>          if (likely(!is_migrate_isolate(migratetype)))
>                  __mod_zone_freepage_state(zone, 1 << order, migratetype);
> 
>          VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
>          VM_BUG_ON_PAGE(bad_range(zone, page), page);
> 
>          while (order < MAX_ORDER - 1) {
>                  if (compaction_capture(capc, page, order, migratetype)) {
>                          __mod_zone_freepage_state(zone, -(1 << order),
>                                                                  migratetype);
>                          return;
>                  }
> 
>> Describing the difference between instructed refresh of vmstat and "remotely
>> drain per-cpu lists" in order to move free memory from the pcp to the buddy
>> would be great.
> 
> The difference is that now remote PCPs will be drained on demand, either via
> kcompactd or direct reclaim (through drain_all_pages), when memory is
> low.
> 
> For example, with the following test:
> 
> dd if=/dev/zero of=file bs=1M count=32000 on a tmpfs filesystem:
> 
>        kcompactd0-116     [005] ...1 228232.042873: drain_all_pages <-kcompactd_do_work
>        kcompactd0-116     [005] ...1 228232.042873: __drain_all_pages <-kcompactd_do_work
>                dd-479485  [003] ...1 228232.455130: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
>                dd-479485  [011] ...1 228232.721994: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
>       gnome-shell-3750    [015] ...1 228232.723729: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
> 
> The commit message was indeed incorrect. Updated one:
> 
> "mm/vmstat: remove remote node draining
> 
> Draining of pages from the local pcp for a remote zone should not be
> necessary, since once the system is low on memory (or compaction on a
> zone is in effect), drain_all_pages should be called freeing any unused
> pcps."
> 
> Thanks!

Thanks for the explanation, that makes sense to me. Feel free to add my

Acked-by: David Hildenbrand <david@redhat.com>

... hoping that some others (Mel, Vlastimil?) can have another look.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
  2023-02-09 15:01 ` [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
@ 2023-03-02 10:42   ` David Hildenbrand
  2023-03-02 10:51     ` David Hildenbrand
  2023-03-02 14:32     ` Marcelo Tosatti
  2023-03-02 20:53   ` Peter Xu
  2023-03-15 23:56   ` Christoph Lameter
  2 siblings, 2 replies; 47+ messages in thread
From: David Hildenbrand @ 2023-03-02 10:42 UTC (permalink / raw)
  To: Marcelo Tosatti, Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

On 09.02.23 16:01, Marcelo Tosatti wrote:
> Goal is to have vmstat_shepherd to transfer from
> per-CPU counters to global counters remotely. For this,
> an atomic this_cpu_cmpxchg is necessary.
> 
> Following the kernel convention for cmpxchg/cmpxchg_local,
> change ARM's this_cpu_cmpxchg_ helpers to be atomic,
> and add this_cpu_cmpxchg_local_ helpers which are not atomic.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> ===================================================================
> --- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
> +++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> @@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
>   	_pcp_protect_return(xchg_relaxed, pcp, val)
>   
>   #define this_cpu_cmpxchg_1(pcp, o, n)	\
> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>   #define this_cpu_cmpxchg_2(pcp, o, n)	\
> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>   #define this_cpu_cmpxchg_4(pcp, o, n)	\
> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>   #define this_cpu_cmpxchg_8(pcp, o, n)	\
> +	_pcp_protect_return(cmpxchg, pcp, o, n)
> +
> +#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
>   	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +

Call me confused (not necessarily your fault :) ).

We have cmpxchg_local, cmpxchg_relaxed and cmpxchg. 
this_cpu_cmpxchg_local_* now calls ... *drumroll* ... cmpxchg_relaxed.

IIUC, cmpxchg_local is only guaranteed to be atomic WRO the current CPU 
(especially, protection against interrupts when the operation is 
implemented using multiple instructions). We do have a generic 
implementation that disables/enables interrupts.

IIUC, cmpxchg_relaxed is an atomic update without any memory ordering
guarantees (in contrast to cmpxchg, cmpxchg_acquire, cmpxchg_release).
We default to arch_cmpxchg if we don't have arch_cmpxchg_relaxed. 
arch_cmpxchg defaults to arch_cmpxchg_local, if not supported.


Naturally I wonder:

(a) Should these new variants be rather called
     this_cpu_cmpxchg_relaxed_* ?

(b) Should these new variants rather call the "_local" variant?


Shedding some light on this would be great.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 08/11] mm/vmstat: switch counter modification to cmpxchg
  2023-02-09 15:01 ` [PATCH v2 08/11] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
@ 2023-03-02 10:47   ` David Hildenbrand
  2023-03-02 14:47     ` Marcelo Tosatti
  0 siblings, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2023-03-02 10:47 UTC (permalink / raw)
  To: Marcelo Tosatti, Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

On 09.02.23 16:01, Marcelo Tosatti wrote:
> In preparation to switch vmstat shepherd to flush
> per-CPU counters remotely, switch all functions that
> modify the counters to use cmpxchg.
> 
> To test the performance difference, a page allocator microbenchmark:
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
> with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.
> 
> For the single_page_alloc_free test, which does
> 
>          /** Loop to measure **/
>          for (i = 0; i < rec->loops; i++) {
>                  my_page = alloc_page(gfp_mask);
>                  if (unlikely(my_page == NULL))
>                          return 0;
>                  __free_page(my_page);
>          }
> 
> Unit is cycles.
> 
> Vanilla			Patched		Diff
> 159			165		3.7%
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Index: linux-vmstat-remote/mm/vmstat.c
> ===================================================================
> --- linux-vmstat-remote.orig/mm/vmstat.c
> +++ linux-vmstat-remote/mm/vmstat.c
> @@ -334,6 +334,188 @@ void set_pgdat_percpu_threshold(pg_data_
>   	}
>   }

I wonder why we get a diff that is rather hard to review because it 
removes all existing code and replaces it with almost-identical code. Are 
you maybe moving a bunch of code while modifying some tiny bits at the 
same time?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-02 10:42   ` David Hildenbrand
@ 2023-03-02 10:51     ` David Hildenbrand
  2023-03-02 14:32     ` Marcelo Tosatti
  1 sibling, 0 replies; 47+ messages in thread
From: David Hildenbrand @ 2023-03-02 10:51 UTC (permalink / raw)
  To: Marcelo Tosatti, Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

On 02.03.23 11:42, David Hildenbrand wrote:
> On 09.02.23 16:01, Marcelo Tosatti wrote:
>> Goal is to have vmstat_shepherd to transfer from
>> per-CPU counters to global counters remotely. For this,
>> an atomic this_cpu_cmpxchg is necessary.
>>
>> Following the kernel convention for cmpxchg/cmpxchg_local,
>> change ARM's this_cpu_cmpxchg_ helpers to be atomic,
>> and add this_cpu_cmpxchg_local_ helpers which are not atomic.
>>
>> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>>
>> Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
>> ===================================================================
>> --- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
>> +++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
>> @@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
>>    	_pcp_protect_return(xchg_relaxed, pcp, val)
>>    
>>    #define this_cpu_cmpxchg_1(pcp, o, n)	\
>> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>>    #define this_cpu_cmpxchg_2(pcp, o, n)	\
>> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>>    #define this_cpu_cmpxchg_4(pcp, o, n)	\
>> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>>    #define this_cpu_cmpxchg_8(pcp, o, n)	\
>> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>> +
>> +#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
>>    	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> +#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
>> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> +#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
>> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> +#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
>> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
>> +
> 
> Call me confused (not necessarily your fault :) ).
> 
> We have cmpxchg_local, cmpxchg_relaxed and cmpxchg.
> this_cpu_cmpxchg_local_* now calls ... *drumroll* ... cmpxchg_relaxed.
> 
> IIUC, cmpxchg_local is only guaranteed to be atomic WRO the current CPU
> (especially, protection against interrupts when the operation is
> implemented using multiple instructions). We do have a generic
> implementation that disables/enables interrupts.
> 
> IIUC, cmpxchg_relaxed an atomic update without any memory ordering
> guarantees (in contrast to cmpxchg, cmpxchg_acquire, cmpxchg_acquire).
> We default to arch_cmpxchg if we don't have arch_cmpxchg_relaxed.
> arch_cmpxchg defaults to arch_cmpxchg_local, if not supported.
> 
> 
> Naturally I wonder:
> 
> (a) Should these new variants be rather called
>       this_cpu_cmpxchg_relaxed_* ?
> 
> (b) Should these new variants rather call the "_local" variant?
> 
> 
> Shedding some light on this would be great.

Nevermind, looking at the other patches I realized that this is 
arch-specific. Other archs that have _local variants call the _local 
variants. So I assume we really want the name this_cpu_cmpxchg_local_*, 
and using _relaxed here is just the aarch64 way of implementing _local 
via _relaxed.

Confusing :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 09/11] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold
  2023-03-01 22:57   ` Peter Xu
@ 2023-03-02 13:55     ` Marcelo Tosatti
  2023-03-02 21:19       ` Peter Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Marcelo Tosatti @ 2023-03-02 13:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Wed, Mar 01, 2023 at 05:57:08PM -0500, Peter Xu wrote:
> On Thu, Feb 09, 2023 at 12:01:59PM -0300, Marcelo Tosatti wrote:
> >  /*
> > - * Fold the data for an offline cpu into the global array.
> > + * Fold the data for a cpu into the global array.
> >   * There cannot be any access by the offline cpu and therefore
> >   * synchronization is simplified.
> >   */
> > @@ -906,8 +906,9 @@ void cpu_vm_stats_fold(int cpu)
> >  			if (pzstats->vm_stat_diff[i]) {
> >  				int v;
> >  
> > -				v = pzstats->vm_stat_diff[i];
> > -				pzstats->vm_stat_diff[i] = 0;
> > +				do {
> > +					v = pzstats->vm_stat_diff[i];
> > +				} while (!try_cmpxchg(&pzstats->vm_stat_diff[i], &v, 0));
> 
> IIUC try_cmpxchg will update "v" already, so I'd assume this'll work the
> same:
> 
>         while (!try_cmpxchg(&pzstats->vm_stat_diff[i], &v, 0));
> 
> Then I figured, maybe it's easier to use xchg()?

Yes, fixed.

> I've no knowledge at all on cpu offline code, so sorry if this will be a
> naive question.  But from what I understand this should not be touched by
> anyone else.  Reasons:
> 
>   (1) cpu_vm_stats_fold() is only called in page_alloc_cpu_dead(), and the
>       comment says:
>   
> 	/*
> 	 * Zero the differential counters of the dead processor
> 	 * so that the vm statistics are consistent.
> 	 *
> 	 * This is only okay since the processor is dead and cannot
> 	 * race with what we are doing.
> 	 */
> 	cpu_vm_stats_fold(cpu);
> 
>       so.. I think that's what it says..

This refers to the use of this_cpu operations being performed by the
counter updates.

If both the updater and reader use atomic accesses (which is the case after patch 8:
"mm/vmstat: switch counter modification to cmpxchg"), and
CONFIG_HAVE_CMPXCHG_LOCAL is set, then the comment is stale.

Removed it.

>   (2) If someone can modify the dead cpu's vm_stat_diff,

The only contexts that can modify the CPU's vm_stat_diff are:

1) The CPU itself (increases the counter).
2) cpu_vm_stats_fold (from vmstat_shepherd kernel thread), from 
x -> 0 only.

So you should not be able to increase the counter after this point. 
I suppose this is what this comment refers to.

>       what guarantees it
>       won't be e.g. boosted again right after try_cmpxchg() / xchg()
>       returns?  What to do with the left-overs?

If any code runs on the CPU that is being hotunplugged,
after cpu_vm_stats_fold (from page_alloc_cpu_dead), then there will 
be left-overs. But such bugs would exist today as well.

Or, if that bug exists, you could replace "for_each_online_cpu" with
"for_each_cpu" here:

static void vmstat_shepherd(struct work_struct *w)
{
        int cpu;

        cpus_read_lock();
        /* Check processors whose vmstat worker threads have been disabled */
        for_each_online_cpu(cpu) {


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-02 10:42   ` David Hildenbrand
  2023-03-02 10:51     ` David Hildenbrand
@ 2023-03-02 14:32     ` Marcelo Tosatti
  1 sibling, 0 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-03-02 14:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 02, 2023 at 11:42:57AM +0100, David Hildenbrand wrote:
> On 09.02.23 16:01, Marcelo Tosatti wrote:
> > Goal is to have vmstat_shepherd to transfer from
> > per-CPU counters to global counters remotely. For this,
> > an atomic this_cpu_cmpxchg is necessary.
> > 
> > Following the kernel convention for cmpxchg/cmpxchg_local,
> > change ARM's this_cpu_cmpxchg_ helpers to be atomic,
> > and add this_cpu_cmpxchg_local_ helpers which are not atomic.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > ===================================================================
> > --- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
> > +++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > @@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
> >   	_pcp_protect_return(xchg_relaxed, pcp, val)
> >   #define this_cpu_cmpxchg_1(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >   #define this_cpu_cmpxchg_2(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >   #define this_cpu_cmpxchg_4(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >   #define this_cpu_cmpxchg_8(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > +
> > +#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
> >   	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +
> 
> Call me confused (not necessarily your fault :) ).
> 
> We have cmpxchg_local, cmpxchg_relaxed and cmpxchg. this_cpu_cmpxchg_local_*
> now calls ... *drumroll* ... cmpxchg_relaxed.
> IIUC, cmpxchg_local is only guaranteed to be atomic WRO the current CPU
> (especially, protection against interrupts when the operation is implemented
> using multiple instructions). We do have a generic implementation that
> disables/enables interrupts.
>
> IIUC, cmpxchg_relaxed an atomic update without any memory ordering
> guarantees (in contrast to cmpxchg, cmpxchg_acquire, cmpxchg_acquire). We
> default to arch_cmpxchg if we don't have arch_cmpxchg_relaxed. arch_cmpxchg
> defaults to arch_cmpxchg_local, if not supported.
> 
> 
> Naturally I wonder:
> 
> (a) Should these new variants be rather called
>     this_cpu_cmpxchg_relaxed_* ?

No: it happens that on ARM-64 cmpxchg_local == cmpxchg_relaxed.

See cf10b79a7d88edc689479af989b3a88e9adf07ff.
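
I.e., if I remember the arm64 headers correctly (worth double-checking
against arch/arm64/include/asm/cmpxchg.h), something like:

	#define arch_cmpxchg_local	arch_cmpxchg_relaxed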

> (b) Should these new variants rather call the "_local" variant?

They probably should. But this patchset maintains the current behaviour
of this_cpu_cmpxchg (for this_cpu_cmpxchg_local), which was:

 #define this_cpu_cmpxchg_1(pcp, o, n)  \
-       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+       _pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_2(pcp, o, n)  \
-       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+       _pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_4(pcp, o, n)  \
-       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+       _pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_8(pcp, o, n)  \
+       _pcp_protect_return(cmpxchg, pcp, o, n)


Thanks.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 08/11] mm/vmstat: switch counter modification to cmpxchg
  2023-03-02 10:47   ` David Hildenbrand
@ 2023-03-02 14:47     ` Marcelo Tosatti
  2023-03-02 16:20       ` Peter Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Marcelo Tosatti @ 2023-03-02 14:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 02, 2023 at 11:47:35AM +0100, David Hildenbrand wrote:
> On 09.02.23 16:01, Marcelo Tosatti wrote:
> > In preparation to switch vmstat shepherd to flush
> > per-CPU counters remotely, switch all functions that
> > modify the counters to use cmpxchg.
> > 
> > To test the performance difference, a page allocator microbenchmark:
> > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
> > with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.
> > 
> > For the single_page_alloc_free test, which does
> > 
> >          /** Loop to measure **/
> >          for (i = 0; i < rec->loops; i++) {
> >                  my_page = alloc_page(gfp_mask);
> >                  if (unlikely(my_page == NULL))
> >                          return 0;
> >                  __free_page(my_page);
> >          }
> > 
> > Unit is cycles.
> > 
> > Vanilla			Patched		Diff
> > 159			165		3.7%
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > Index: linux-vmstat-remote/mm/vmstat.c
> > ===================================================================
> > --- linux-vmstat-remote.orig/mm/vmstat.c
> > +++ linux-vmstat-remote/mm/vmstat.c
> > @@ -334,6 +334,188 @@ void set_pgdat_percpu_threshold(pg_data_
> >   	}
> >   }
> 
> I wonder why we get a diff that is rather hard to review because it removes
> all existing codes and replaces it by almost-identical code. Are you maybe
> moving a bunch of code while modifying some tiny bits at the same time?

Current code has functions defined like so:

__mod_zone_page_state
__mod_node_page_state
__inc_zone_page_state
__inc_node_page_state
__dec_zone_page_state
__dec_node_page_state
#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
mod_zone_page_state
inc_zone_page_state
dec_zone_page_state
mod_node_page_state
inc_node_page_state
dec_node_page_state
#else
mod_zone_page_state
inc_zone_page_state
dec_zone_page_state
mod_node_page_state
inc_node_page_state
dec_node_page_state
#endif

What this patch does is define the "__" versions, for the
CONFIG_HAVE_CMPXCHG_LOCAL case, to call their non-"__" counterparts.

So it will be:

#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
mod_zone_page_state
inc_zone_page_state
dec_zone_page_state
mod_node_page_state
inc_node_page_state
dec_node_page_state
__mod_zone_page_state (new function, calls mod_zone_page_state).
__mod_node_page_state (new function, calls mod_node_page_state).
__inc_zone_page_state
__inc_node_page_state
__dec_zone_page_state
__dec_node_page_state
#else
__mod_zone_page_state (old, shared function for both CONFIG_HAVE_CMPXCHG_LOCAL and not)
__mod_node_page_state
__inc_zone_page_state
__inc_node_page_state
__dec_zone_page_state
__dec_node_page_state
mod_zone_page_state
inc_zone_page_state
dec_zone_page_state
mod_node_page_state
inc_node_page_state
dec_node_page_state
#endif
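
For the CONFIG_HAVE_CMPXCHG_LOCAL side the "__" helpers then become thin
wrappers, along these lines (a simplified sketch, not the exact hunks from
the patch):

#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
			   long delta)
{
	/* the cmpxchg-based version is safe to call from any context */
	mod_zone_page_state(zone, item, delta);
}

void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
			   long delta)
{
	mod_node_page_state(pgdat, item, delta);
}
#endif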

Any suggestion on how to split this into multiple patchsets for easier
reviewing? (can't think of anything obvious).


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 08/11] mm/vmstat: switch counter modification to cmpxchg
  2023-03-02 14:47     ` Marcelo Tosatti
@ 2023-03-02 16:20       ` Peter Xu
  2023-03-02 19:11         ` Marcelo Tosatti
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2023-03-02 16:20 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: David Hildenbrand, Christoph Lameter, Aaron Tomlin,
	Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 02, 2023 at 11:47:35AM -0300, Marcelo Tosatti wrote:
> So it will be:
> 
> #ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> mod_zone_page_state
> inc_zone_page_state
> dec_zone_page_state
> mod_node_page_state
> inc_node_page_state
> dec_node_page_state
> __mod_zone_page_state (new function, calls mod_zone_page_state).
> __mod_node_page_state (new function, calls mod_node_page_state).
> __inc_zone_page_state
> __inc_node_page_state
> __dec_zone_page_state
> __dec_node_page_state
> #else
> __mod_zone_page_state (old, shared function for both CONFIG_HAVE_CMPXCHG_LOCAL and not)
> __mod_node_page_state
> __inc_zone_page_state
> __inc_node_page_state
> __dec_zone_page_state
> __dec_node_page_state
> mod_zone_page_state
> inc_zone_page_state
> dec_zone_page_state
> mod_node_page_state
> inc_node_page_state
> dec_node_page_state
> #endif
> 
> Any suggestion on how to split this into multiple patchsets for easier
> reviewing? (can't think of anything obvious).

I figured this out before I saw this, but it did take me some time to read
carefully into the code base..  Maybe it'll be a good idea to mention
something like the above in the commit message to ease future reviewers (and
make it more likely to attract the experts to chime in)?

One fundamental (but maybe another naive.. :) question on this code piece
(so not directly related to the changeset but maybe it is still..):

AFAICT CONFIG_HAVE_CMPXCHG_LOCAL only means we can do cmpxchg() without
locking the memory bus, however when !CONFIG_HAVE_CMPXCHG_LOCAL here we're
not using the non-local version but using preempt_disable_nested() to make
sure the read is atomic.  Is it really worth it?  What happens if we use
cmpxchg() unconditionally, but just use the local (e.g. no "LOCK" prefix)
version when CONFIG_HAVE_CMPXCHG_LOCAL?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 01/11] mm/vmstat: remove remote node draining
  2023-02-09 15:01 ` [PATCH v2 01/11] mm/vmstat: remove remote node draining Marcelo Tosatti
  2023-02-28 15:53   ` David Hildenbrand
@ 2023-03-02 17:21   ` Peter Xu
  2023-03-02 17:27     ` Peter Xu
  2023-03-02 18:56     ` Marcelo Tosatti
  1 sibling, 2 replies; 47+ messages in thread
From: Peter Xu @ 2023-03-02 17:21 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Feb 09, 2023 at 12:01:51PM -0300, Marcelo Tosatti wrote:
> Draining of pages from the local pcp for a remote zone was necessary
> since:
> 
> "Note that remote node draining is a somewhat esoteric feature that is
> required on large NUMA systems because otherwise significant portions
> of system memory can become trapped in pcp queues. The number of pcp is
> determined by the number of processors and nodes in a system. A system
> with 4 processors and 2 nodes has 8 pcps which is okay. But a system
> with 1024 processors and 512 nodes has 512k pcps with a high potential
> for large amount of memory being caught in them."

How about mentioning more details on where this comes from?

afaict it's from commit 4037d45 since 2007.

So I dug that out mostly because I want to know why we flush the pcp at
all during the vmstat update.  It already sounds weird to me but I could
have been missing important details.

The rationale I had here is that refresh_cpu_vm_stats(true) is mostly being
called by the shepherd afaict, while:

  (1) The frequency of that interval is defined as sysctl_stat_interval,
      which has nothing yet to do with pcp pages but only stat at least in
      the name of it, and,

  (2) vmstat_work is only queued if need_update() here:

	for_each_online_cpu(cpu) {
		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

		if (!delayed_work_pending(dw) && need_update(cpu))
			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);

		cond_resched();
	}

      need_update() tells us "we should flush vmstats"; it says nothing
      about "we should flush the pcp list"..

I looked into the 2007 commit; besides what Marcelo quoted, I do see
there's a major benefit of reusing cache lines, quoting from the commit:

        Move the node draining so that it is done when the vm statistics
        are updated.  At that point we are already touching all the
        cachelines with the pagesets of a processor.

However I didn't see why it's rational to flush the pcp list when vmstat
needs flushing either.  I also don't know whether that "cacheline locality"
holds true or not, because I saw that the pcp page list was split from the
vmstats in 2021:

    commit 28f836b6777b6f42dce068a40d83a891deaaca37
    Author: Mel Gorman <mgorman@techsingularity.net>
    Date:   Mon Jun 28 19:41:38 2021 -0700

    mm/page_alloc: split per cpu page lists and zone stats

I'm not even sure my A-b or R-b is worth anything at all here, I just offer
something I got from git archaeology that may be helpful to readers and to
the reasoning for this patch.  The correctness of the archaeology needs help
from others (Christoph and Mel?)..  I would just say that if there's anything
useful or correct here, it may be worth collecting some of it into the
commit log.

So from what I can tell this patch makes sense.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 01/11] mm/vmstat: remove remote node draining
  2023-03-02 17:21   ` Peter Xu
@ 2023-03-02 17:27     ` Peter Xu
  2023-03-02 19:17       ` Marcelo Tosatti
  2023-03-02 18:56     ` Marcelo Tosatti
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Xu @ 2023-03-02 17:27 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 02, 2023 at 12:21:15PM -0500, Peter Xu wrote:
> On Thu, Feb 09, 2023 at 12:01:51PM -0300, Marcelo Tosatti wrote:
> > Draining of pages from the local pcp for a remote zone was necessary
> > since:
> > 
> > "Note that remote node draining is a somewhat esoteric feature that is
> > required on large NUMA systems because otherwise significant portions
> > of system memory can become trapped in pcp queues. The number of pcp is
> > determined by the number of processors and nodes in a system. A system
> > with 4 processors and 2 nodes has 8 pcps which is okay. But a system
> > with 1024 processors and 512 nodes has 512k pcps with a high potential
> > for large amount of memory being caught in them."
> 
> How about mentioning more details on where does this come from?
> 
> afaict it's from commit 4037d45 since 2007.
> 
> So I digged that out mostly because I want to know why we did flush pcp at
> all during vmstat update.  It already sounds weird to me but I could have
> been missing important details.
> 
> The rational I had here is refresh_cpu_vm_stats(true) is mostly being
> called by the shepherd afaict, while:
> 
>   (1) The frequency of that interval is defined as sysctl_stat_interval,
>       which has nothing yet to do with pcp pages but only stat at least in
>       the name of it, and,
> 
>   (2) vmstat_work is only queued if need_update() here:
> 
> 	for_each_online_cpu(cpu) {
> 		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
> 
> 		if (!delayed_work_pending(dw) && need_update(cpu))
> 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
> 
> 		cond_resched();
> 	}
> 
>       need_update() tells us "we should flush vmstats", nothing it tells
>       about "we should flush pcp list"..
> 
> I looked into the 2007 commit, besides what Marcelo quoted, I do see
> there's a major benefit of reusing cache lines, quotting from the commit:
> 
>         Move the node draining so that is is done when the vm statistics
>         are updated.  At that point we are already touching all the
>         cachelines with the pagesets of a processor.
> 
> However I didn't see why it's rational to flush pcp list when vmstat needs
> flushing either.  I also don't know whether that "cacheline locality" hold
> true or not, because I saw that the pcp page list is split from vmstats
> since 2021:
> 
>     commit 28f836b6777b6f42dce068a40d83a891deaaca37
>     Author: Mel Gorman <mgorman@techsingularity.net>
>     Date:   Mon Jun 28 19:41:38 2021 -0700
> 
>     mm/page_alloc: split per cpu page lists and zone stats
> 
> I'm not even sure my A-b or R-b worth anything at all here, just offer
> something I got from git archaeology so maybe helpful to readers and
> reasoning to this patch.  The correctness of archaeology needs help from
> others (Christoph and Gel?)..  I would just say if there's anything useful
> or correct may worth collect some into the commit log.
> 
> So from what I can tell this patch makes sense.

One thing I forgot to mention, which may be a slight ABI change, is that I
think the pcp page drain is also triggered by /proc/PID/refresh_vm_stats
(even though again I don't see why flushing the pcp is strictly needed).  It's
just that I don't know whether there's a potential user app that can leverage
this.

The worst case is we can drain pcp list for refresh_vm_stats procfs
explicitly, but I'm not sure whether it'll be worthwhile either, probably
just to be safe.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 01/11] mm/vmstat: remove remote node draining
  2023-03-02 17:21   ` Peter Xu
  2023-03-02 17:27     ` Peter Xu
@ 2023-03-02 18:56     ` Marcelo Tosatti
  1 sibling, 0 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-03-02 18:56 UTC (permalink / raw)
  To: Peter Xu
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 02, 2023 at 12:21:15PM -0500, Peter Xu wrote:
> On Thu, Feb 09, 2023 at 12:01:51PM -0300, Marcelo Tosatti wrote:
> > Draining of pages from the local pcp for a remote zone was necessary
> > since:
> > 
> > "Note that remote node draining is a somewhat esoteric feature that is
> > required on large NUMA systems because otherwise significant portions
> > of system memory can become trapped in pcp queues. The number of pcp is
> > determined by the number of processors and nodes in a system. A system
> > with 4 processors and 2 nodes has 8 pcps which is okay. But a system
> > with 1024 processors and 512 nodes has 512k pcps with a high potential
> > for large amount of memory being caught in them."
> 
> How about mentioning more details on where does this come from?
> 
> afaict it's from commit 4037d45 since 2007.
> 
> So I digged that out mostly because I want to know why we did flush pcp at
> all during vmstat update.  It already sounds weird to me but I could have
> been missing important details.
> 
> The rational I had here is refresh_cpu_vm_stats(true) is mostly being
> called by the shepherd afaict, while:
> 
>   (1) The frequency of that interval is defined as sysctl_stat_interval,
>       which has nothing yet to do with pcp pages but only stat at least in
>       the name of it, and,
> 
>   (2) vmstat_work is only queued if need_update() here:
> 
> 	for_each_online_cpu(cpu) {
> 		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
> 
> 		if (!delayed_work_pending(dw) && need_update(cpu))
> 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
> 
> 		cond_resched();
> 	}
> 
>       need_update() tells us "we should flush vmstats", nothing it tells
>       about "we should flush pcp list"..
> 
> I looked into the 2007 commit, besides what Marcelo quoted, I do see
> there's a major benefit of reusing cache lines, quotting from the commit:
> 
>         Move the node draining so that is is done when the vm statistics
>         are updated.  At that point we are already touching all the
>         cachelines with the pagesets of a processor.
> 
> However I didn't see why it's rational to flush pcp list when vmstat needs
> flushing either. I also don't know whether that "cacheline locality" hold
> true or not, because I saw that the pcp page list is split from vmstats
> since 2021:
> 
>     commit 28f836b6777b6f42dce068a40d83a891deaaca37
>     Author: Mel Gorman <mgorman@techsingularity.net>
>     Date:   Mon Jun 28 19:41:38 2021 -0700
> 
>     mm/page_alloc: split per cpu page lists and zone stats
> 
> I'm not even sure my A-b or R-b worth anything at all here, 

"A-b or R-b" ? 

I think your points are valid.

Also, the fact that sysctl_stat_interval is large (a second or more)
means that any cache locality concern would be limited to that time
span.

> just offer
> something I got from git archaeology so maybe helpful to readers and
> reasoning to this patch.  

> The correctness of archaeology needs help from
> others (Christoph and Gel?)..  I would just say if there's anything useful
> or correct may worth collect some into the commit log.

Agreed, I forgot to include the commit id in the changelog.

> So from what I can tell this patch makes sense.

Thanks!


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 08/11] mm/vmstat: switch counter modification to cmpxchg
  2023-03-02 16:20       ` Peter Xu
@ 2023-03-02 19:11         ` Marcelo Tosatti
  2023-03-02 20:06           ` Peter Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Marcelo Tosatti @ 2023-03-02 19:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Christoph Lameter, Aaron Tomlin,
	Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 02, 2023 at 11:20:49AM -0500, Peter Xu wrote:
> On Thu, Mar 02, 2023 at 11:47:35AM -0300, Marcelo Tosatti wrote:
> > So it will be:
> > 
> > #ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> > mod_zone_page_state
> > inc_zone_page_state
> > dec_zone_page_state
> > mod_node_page_state
> > inc_node_page_state
> > dec_node_page_state
> > __mod_zone_page_state (new function, calls mod_zone_page_state).
> > __mod_node_page_state (new function, calls mod_node_page_state).
> > __inc_zone_page_state
> > __inc_node_page_state
> > __dec_zone_page_state
> > __dec_node_page_state
> > #else
> > __mod_zone_page_state (old, shared function for both CONFIG_HAVE_CMPXCHG_LOCAL and not)
> > __mod_node_page_state
> > __inc_zone_page_state
> > __inc_node_page_state
> > __dec_zone_page_state
> > __dec_node_page_state
> > mod_zone_page_state
> > inc_zone_page_state
> > dec_zone_page_state
> > mod_node_page_state
> > inc_node_page_state
> > dec_node_page_state
> > #endif
> > 
> > Any suggestion on how to split this into multiple patchsets for easier
> > reviewing? (can't think of anything obvious).
> 
> I figured this out before saw this, but it did take me some time to read
> carefully into the code base..  maybe it'll be a good idea to mention
> something like above in the commit message to ease future reviewers (and
> more likelyhood to attract the experts to start chim in)?
> 
> One fundamental (but maybe another naive.. :) question on this code piece
> (so not directly related to the changeset but maybe it is still..):
> 
> AFAICT CONFIG_HAVE_CMPXCHG_LOCAL only means we can do cmpxchg() without
> locking memory bus, 

CONFIG_HAVE_CMPXCHG_LOCAL means cmpxchg_local is implemented (that is, a
cmpxchg which is atomic with respect to the local CPU).

LOCK cmpxchg is necessary for cmpxchg to be atomic on SMP.

> however when !CONFIG_HAVE_CMPXCHG_LOCAL here we're not
> using non-local version but using preempt_disable_nested() to make sure the
> read is atomic.  Does it really worth it?  What happens if we use cmpxchg()
> unconditionally, but just use local (e.g. no "LOCK" prefix) version when
> CONFIG_HAVE_CMPXCHG_LOCAL?

We can't use the local version of cmpxchg because the vmstat counters are
supposed to be accessed from different CPUs simultaneously (this is the
objective of the patchset):

CPU-0					CPU-1

vmstat_shepherd				mod_zone_page_state
xchg location				LOCK cmpxchg location

xchg locks the memory bus implicitly.
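
To make the interleaving above concrete, a rough sketch of the two sides
(names simplified, threshold/overflow handling omitted):

	/* local updater, on the counter's own CPU (mod_zone_page_state path): */
	s8 __percpu *p = pcp->vm_stat_diff + item;
	s8 o, n;

	do {
		o = this_cpu_read(*p);
		n = o + delta;
	} while (this_cpu_cmpxchg(*p, o, n) != o);	/* LOCK cmpxchg on x86 */

	/* vmstat_shepherd path, folding the same counter from another CPU: */
	s8 v = xchg(per_cpu_ptr(p, cpu), 0);		/* xchg implies LOCK */

	atomic_long_add(v, &zone->vm_stat[item]);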

Is this what you are thinking about, or did I misunderstand what you
mean?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 01/11] mm/vmstat: remove remote node draining
  2023-03-02 17:27     ` Peter Xu
@ 2023-03-02 19:17       ` Marcelo Tosatti
  0 siblings, 0 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-03-02 19:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 02, 2023 at 12:27:11PM -0500, Peter Xu wrote:
> On Thu, Mar 02, 2023 at 12:21:15PM -0500, Peter Xu wrote:
> > On Thu, Feb 09, 2023 at 12:01:51PM -0300, Marcelo Tosatti wrote:
> > > Draining of pages from the local pcp for a remote zone was necessary
> > > since:
> > > 
> > > "Note that remote node draining is a somewhat esoteric feature that is
> > > required on large NUMA systems because otherwise significant portions
> > > of system memory can become trapped in pcp queues. The number of pcp is
> > > determined by the number of processors and nodes in a system. A system
> > > with 4 processors and 2 nodes has 8 pcps which is okay. But a system
> > > with 1024 processors and 512 nodes has 512k pcps with a high potential
> > > for large amount of memory being caught in them."
> > 
> > How about mentioning more details on where does this come from?
> > 
> > afaict it's from commit 4037d45 since 2007.
> > 
> > So I digged that out mostly because I want to know why we did flush pcp at
> > all during vmstat update.  It already sounds weird to me but I could have
> > been missing important details.
> > 
> > The rational I had here is refresh_cpu_vm_stats(true) is mostly being
> > called by the shepherd afaict, while:
> > 
> >   (1) The frequency of that interval is defined as sysctl_stat_interval,
> >       which has nothing yet to do with pcp pages but only stat at least in
> >       the name of it, and,
> > 
> >   (2) vmstat_work is only queued if need_update() here:
> > 
> > 	for_each_online_cpu(cpu) {
> > 		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
> > 
> > 		if (!delayed_work_pending(dw) && need_update(cpu))
> > 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
> > 
> > 		cond_resched();
> > 	}
> > 
> >       need_update() tells us "we should flush vmstats", nothing it tells
> >       about "we should flush pcp list"..
> > 
> > I looked into the 2007 commit, besides what Marcelo quoted, I do see
> > there's a major benefit of reusing cache lines, quotting from the commit:
> > 
> >         Move the node draining so that is is done when the vm statistics
> >         are updated.  At that point we are already touching all the
> >         cachelines with the pagesets of a processor.
> > 
> > However I didn't see why it's rational to flush pcp list when vmstat needs
> > flushing either.  I also don't know whether that "cacheline locality" hold
> > true or not, because I saw that the pcp page list is split from vmstats
> > since 2021:
> > 
> >     commit 28f836b6777b6f42dce068a40d83a891deaaca37
> >     Author: Mel Gorman <mgorman@techsingularity.net>
> >     Date:   Mon Jun 28 19:41:38 2021 -0700
> > 
> >     mm/page_alloc: split per cpu page lists and zone stats
> > 
> > I'm not even sure my A-b or R-b worth anything at all here, just offer
> > something I got from git archaeology so maybe helpful to readers and
> > reasoning to this patch.  The correctness of archaeology needs help from
> > others (Christoph and Gel?)..  I would just say if there's anything useful
> > or correct may worth collect some into the commit log.
> > 
> > So from what I can tell this patch makes sense.
> 
> One thing I forgot to mention, which may be a slight abi change, is that I
> think the pcp page drain is also triggered by /proc/PID/refresh_vm_stats
> (even though again I don't see why flushing pcp is strictly needed).  It's
> just that I don't know whether there's potential user app that can leverage
> this.
> 
> The worst case is we can drain pcp list for refresh_vm_stats procfs
> explicitly, but I'm not sure whether it'll be worthwhile either, probably
> just to be safe.

This is a good point.

Documentation/admin-guide/sysctl/vm.rst:

stat_refresh
============

Any read or write (by root only) flushes all the per-cpu vm statistics
into their global totals, for more accurate reports when testing
e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo

As a side-effect, it also checks for negative totals (elsewhere reported
as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
(At time of writing, a few stats are known sometimes to be found negative,
with no ill effects: errors and warnings on these stats are suppressed.)


Will add "drain_all_pages(NULL)" call to the start of stat_refresh
handler.
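
Roughly something like this (schematic, not an exact hunk; vmstat_refresh()
being the handler behind /proc/sys/vm/stat_refresh):

--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ int vmstat_refresh(struct ctl_table *table, int write,
+	/*
+	 * Flush the remote pcp lists first, so that pages sitting in them
+	 * are reflected in the global counters checked below.
+	 */
+	drain_all_pages(NULL);
+
 	err = schedule_on_each_cpu(refresh_vm_stats);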

Thanks.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 08/11] mm/vmstat: switch counter modification to cmpxchg
  2023-03-02 19:11         ` Marcelo Tosatti
@ 2023-03-02 20:06           ` Peter Xu
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2023-03-02 20:06 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: David Hildenbrand, Christoph Lameter, Aaron Tomlin,
	Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 02, 2023 at 04:11:32PM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 02, 2023 at 11:20:49AM -0500, Peter Xu wrote:
> > On Thu, Mar 02, 2023 at 11:47:35AM -0300, Marcelo Tosatti wrote:
> > > So it will be:
> > > 
> > > #ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> > > mod_zone_page_state
> > > inc_zone_page_state
> > > dec_zone_page_state
> > > mod_node_page_state
> > > inc_node_page_state
> > > dec_node_page_state
> > > __mod_zone_page_state (new function, calls mod_zone_page_state).
> > > __mod_node_page_state (new function, calls mod_node_page_state).
> > > __inc_zone_page_state
> > > __inc_node_page_state
> > > __dec_zone_page_state
> > > __dec_node_page_state
> > > #else
> > > __mod_zone_page_state (old, shared function for both CONFIG_HAVE_CMPXCHG_LOCAL and not)
> > > __mod_node_page_state
> > > __inc_zone_page_state
> > > __inc_node_page_state
> > > __dec_zone_page_state
> > > __dec_node_page_state
> > > mod_zone_page_state
> > > inc_zone_page_state
> > > dec_zone_page_state
> > > mod_node_page_state
> > > inc_node_page_state
> > > dec_node_page_state
> > > #endif
> > > 
> > > Any suggestion on how to split this into multiple patchsets for easier
> > > reviewing? (can't think of anything obvious).
> > 
> > I figured this out before I saw this, but it did take me some time to read
> > carefully into the code base..  Maybe it'll be a good idea to mention
> > something like the above in the commit message to ease future reviewers (and
> > make it more likely to attract the experts to chime in)?
> > 
> > One fundamental (but maybe another naive.. :) question on this code piece
> > (so not directly related to the changeset but maybe it is still..):
> > 
> > AFAICT CONFIG_HAVE_CMPXCHG_LOCAL only means we can do cmpxchg() without
> > locking memory bus, 
> 
> CONFIG_HAVE_CMPXCHG_LOCAL means cmpxchg_local is implemented (that is
> cmpxchg which is atomic with respect to local CPU).
> 
> LOCK cmpxchg is necessary for cmpxchg to be atomic on SMP.
> 
> > however when !CONFIG_HAVE_CMPXCHG_LOCAL here we're not
> > using the non-local version but using preempt_disable_nested() to make sure
> > the read is atomic.  Is it really worth it?  What happens if we use cmpxchg()
> > unconditionally, but just use the local (e.g. no "LOCK" prefix) version when
> > CONFIG_HAVE_CMPXCHG_LOCAL?
> 
> Can't use local version of cmpxchg because the vmstat counters are supposed
> to be accessed from different CPUs simultaneously (this is the objective
> of the patchset):
> 
> CPU-0					CPU-1
> 
> vmstat_shepherd				mod_zone_page_state
> xchg location				LOCK cmpxchg location
> 
> xchg locks memory bus implicitly.
> 
> Is this what you are thinking about or did i misunderstood what you
> mean?

Yes, I think I wrongly interpreted cmpxchg_local() before, assuming it was
still atomic but an accelerated version of cmpxchg() that was only
supported on a few archs.  I double checked the code and spec on x86 - I
believe you're right.  Thanks,
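
So, concretely, the pattern the series relies on is something like this
(a rough sketch only; threshold/overstep handling omitted, names roughly
as in mm/vmstat.c):

	/* CPU-1, local updater (mod_zone_page_state() path): LOCK cmpxchg */
	s8 __percpu *p = zone->per_cpu_zonestats->vm_stat_diff + item;
	s8 o, n;

	do {
		o = this_cpu_read(*p);
		n = o + delta;
	} while (this_cpu_cmpxchg(*p, o, n) != o);

	/* CPU-0, vmstat_shepherd folding cpu's counters remotely: atomic xchg */
	struct per_cpu_zonestat *pzstats =
		per_cpu_ptr(zone->per_cpu_zonestats, cpu);
	s8 v = xchg(&pzstats->vm_stat_diff[item], 0);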

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
  2023-02-09 15:01 ` [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
  2023-03-02 10:42   ` David Hildenbrand
@ 2023-03-02 20:53   ` Peter Xu
  2023-03-02 21:04     ` Marcelo Tosatti
  2023-03-03 15:47     ` Marcelo Tosatti
  2023-03-15 23:56   ` Christoph Lameter
  2 siblings, 2 replies; 47+ messages in thread
From: Peter Xu @ 2023-03-02 20:53 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Feb 09, 2023 at 12:01:52PM -0300, Marcelo Tosatti wrote:
> Goal is to have vmstat_shepherd to transfer from
> per-CPU counters to global counters remotely. For this, 
> an atomic this_cpu_cmpxchg is necessary.
> 
> Following the kernel convention for cmpxchg/cmpxchg_local,
> change ARM's this_cpu_cmpxchg_ helpers to be atomic,
> and add this_cpu_cmpxchg_local_ helpers which are not atomic.

I can follow the necessity of having the _local version; however, two
questions below.

> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> ===================================================================
> --- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
> +++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> @@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
>  	_pcp_protect_return(xchg_relaxed, pcp, val)
>  
>  #define this_cpu_cmpxchg_1(pcp, o, n)	\
> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>  #define this_cpu_cmpxchg_2(pcp, o, n)	\
> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>  #define this_cpu_cmpxchg_4(pcp, o, n)	\
> -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +	_pcp_protect_return(cmpxchg, pcp, o, n)
>  #define this_cpu_cmpxchg_8(pcp, o, n)	\
> +	_pcp_protect_return(cmpxchg, pcp, o, n)

This makes this_cpu_cmpxchg_*() not only non-local, but it also (especially
for arm64) gains memory barrier implications, since cmpxchg() has a strong
memory barrier while the old this_cpu_cmpxchg*() didn't, afaiu.

Maybe it's not a big deal if the audience of this helper is still limited
(e.g. we can add memory barriers if we don't want the strict ordering
implication), but just to check with you whether it's intended, and if
so whether it may be worth some comments.

> +
> +#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
>  	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
> +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)

I think cmpxchg_relaxed()==cmpxchg_local() here for aarch64; however, should
we still use cmpxchg_local() to pair with this_cpu_cmpxchg_local_*()?

Nothing about your patch alone, since it was the same before, but I'm
wondering whether this is a good time to switch over.

The other thing is: would it be good to copy the arch lists for each arch
patch?  Maybe it'll help to extend the audience too.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 07/11] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local
  2023-02-09 15:01 ` [PATCH v2 07/11] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
@ 2023-03-02 20:54   ` Peter Xu
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2023-03-02 20:54 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Feb 09, 2023 at 12:01:57PM -0300, Marcelo Tosatti wrote:
> this_cpu_cmpxchg was modified to an atomic version, which
> can be more costly than the non-atomic version.
> 
> Switch users of this_cpu_cmpxchg to this_cpu_cmpxchg_local
> (which preserves the previous, non-atomic this_cpu_cmpxchg behaviour).
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Yes, afaict these are the only pieces that use the helper besides the vmstat
updates.

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 10/11] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely
  2023-02-09 15:02 ` [PATCH v2 10/11] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
@ 2023-03-02 21:01   ` Peter Xu
  2023-03-02 21:16     ` Marcelo Tosatti
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2023-03-02 21:01 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Feb 09, 2023 at 12:02:00PM -0300, Marcelo Tosatti wrote:
> +#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> +/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
> +static void vmstat_shepherd(struct work_struct *w)
> +{
> +	int cpu;
> +
> +	cpus_read_lock();
> +	for_each_online_cpu(cpu) {
> +		cpu_vm_stats_fold(cpu);

Nitpick: IIUC this line is the only change with CONFIG_HAVE_CMPXCHG_LOCAL
to replace the queuing.  Would it be cleaner to move the ifdef into
vmstat_shepherd, then, and keep the common logic?

> +		cond_resched();
> +	}
> +	cpus_read_unlock();
> +
> +	schedule_delayed_work(&shepherd,
> +		round_jiffies_relative(sysctl_stat_interval));
> +}
> +#else
>  static void vmstat_shepherd(struct work_struct *w)
>  {
>  	int cpu;
> @@ -2026,6 +2043,7 @@ static void vmstat_shepherd(struct work_
>  	schedule_delayed_work(&shepherd,
>  		round_jiffies_relative(sysctl_stat_interval));
>  }
> +#endif
>  
>  static void __init start_shepherd_timer(void)
>  {
> 
> 
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-02 20:53   ` Peter Xu
@ 2023-03-02 21:04     ` Marcelo Tosatti
  2023-03-02 21:25       ` Peter Xu
  2023-03-03 15:47     ` Marcelo Tosatti
  1 sibling, 1 reply; 47+ messages in thread
From: Marcelo Tosatti @ 2023-03-02 21:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 02, 2023 at 03:53:12PM -0500, Peter Xu wrote:
> On Thu, Feb 09, 2023 at 12:01:52PM -0300, Marcelo Tosatti wrote:
> > Goal is to have vmstat_shepherd to transfer from
> > per-CPU counters to global counters remotely. For this, 
> > an atomic this_cpu_cmpxchg is necessary.
> > 
> > Following the kernel convention for cmpxchg/cmpxchg_local,
> > change ARM's this_cpu_cmpxchg_ helpers to be atomic,
> > and add this_cpu_cmpxchg_local_ helpers which are not atomic.
> 
> I can follow on the necessity of having the _local version, however two
> questions below.
> 
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > ===================================================================
> > --- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
> > +++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > @@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
> >  	_pcp_protect_return(xchg_relaxed, pcp, val)
> >  
> >  #define this_cpu_cmpxchg_1(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_2(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_4(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_8(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> 
> This makes this_cpu_cmpxchg_*() not only non-local, but also (especially
> for arm64) memory barrier implications since cmpxchg() has a strong memory
> barrier, while the old this_cpu_cmpxchg*() doesn't have, afaiu.
> 
> Maybe it's not a big deal if the audience of this helper is still limited
> (e.g. we can add memory barriers if we don't want strict ordering
> implication), but just to check with you on whether it's intended, and if
> so whether it may worth some comments.

It happens that on ARM-64 cmpxchg_local == cmpxchg_relaxed.

See cf10b79a7d88edc689479af989b3a88e9adf07ff.

This patchset maintains the current behaviour
of this_cpu_cmpxchg (for this_cpu_cmpxchg_local), which was:

 #define this_cpu_cmpxchg_1(pcp, o, n)  \                                                                           
-       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+       _pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_2(pcp, o, n)  \                                                                           
-       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+       _pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_4(pcp, o, n)  \                                                                           
-       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+       _pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_8(pcp, o, n)  \                                                                           
+       _pcp_protect_return(cmpxchg, pcp, o, n)

> > +
> > +#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
> >  	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> 
> I think cmpxchg_relaxed()==cmpxchg_local() here for aarch64, however should
> we still use cmpxchg_local() to pair with this_cpu_cmpxchg_local_*()?

Since cmpxchg_local = cmpxchg_relaxed, seems like this is not necessary.

> Nothing about your patch along since it was the same before, but I'm
> wondering whether this is a good time to switchover.

I would say that another patch is more appropriate to change this, 
if desired.

> The other thing is would it be good to copy arch-list for each arch patch?
> Maybe it'll help to extend the audience too.

Yes, should have done that (or CC each individual maintainer). Will do
on next version.

Thanks.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 10/11] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely
  2023-03-02 21:01   ` Peter Xu
@ 2023-03-02 21:16     ` Marcelo Tosatti
  2023-03-02 21:30       ` Peter Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Marcelo Tosatti @ 2023-03-02 21:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 02, 2023 at 04:01:07PM -0500, Peter Xu wrote:
> On Thu, Feb 09, 2023 at 12:02:00PM -0300, Marcelo Tosatti wrote:
> > +#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> > +/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
> > +static void vmstat_shepherd(struct work_struct *w)
> > +{
> > +	int cpu;
> > +
> > +	cpus_read_lock();
> > +	for_each_online_cpu(cpu) {
> > +		cpu_vm_stats_fold(cpu);
> 
> Nitpick: IIUC this line is the only change with CONFIG_HAVE_CMPXCHG_LOCAL
> to replace the queuing.  Would it be cleaner to move the ifdef into
> vmstat_shepherd, then, and keep the common logic?

https://lore.kernel.org/lkml/20221223144150.GA79369@lothringen/

Could have

#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
static void cpu_flush_vm_stats(int cpu)
{
	cpu_vm_stats_fold(cpu);
}
#else
static void cpu_flush_vm_stats(int cpu)
{
	struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

	if (!delayed_work_pending(dw) && need_update(cpu))
		queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
}
#endif

static void vmstat_shepherd(struct work_struct *w)
{
	int cpu;

	cpus_read_lock();
	for_each_online_cpu(cpu) {
		cpu_flush_vm_stats(cpu);
		cond_resched();
	}
	cpus_read_unlock();

	schedule_delayed_work(&shepherd,
		round_jiffies_relative(sysctl_stat_interval));
}

This looks really awkward to me. But then, we don't want
schedule_delayed_work if !CONFIG_HAVE_CMPXCHG_LOCAL.
The common part would be the cpus_read_lock and for_each_online_cpu
loop.

So it seems the current separation is quite readable
(unless you have a suggestion).

> > +		cond_resched();
> > +	}
> > +	cpus_read_unlock();
> > +
> > +	schedule_delayed_work(&shepherd,
> > +		round_jiffies_relative(sysctl_stat_interval));
> > +}
> > +#else
> >  static void vmstat_shepherd(struct work_struct *w)
> >  {
> >  	int cpu;
> > @@ -2026,6 +2043,7 @@ static void vmstat_shepherd(struct work_
> >  	schedule_delayed_work(&shepherd,
> >  		round_jiffies_relative(sysctl_stat_interval));
> >  }
> > +#endif
> >  
> >  static void __init start_shepherd_timer(void)
> >  {
> > 
> > 
> > 
> 
> -- 
> Peter Xu
> 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 09/11] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold
  2023-03-02 13:55     ` Marcelo Tosatti
@ 2023-03-02 21:19       ` Peter Xu
  2023-03-03 15:17         ` Marcelo Tosatti
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2023-03-02 21:19 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 02, 2023 at 10:55:09AM -0300, Marcelo Tosatti wrote:
> >   (2) If someone can modify the dead cpu's vm_stat_diff,
> 
> The only context that can modify the cpu's vm_stat_diff are:
> 
> 1) The CPU itself (increases the counter).
> 2) cpu_vm_stats_fold (from vmstat_shepherd kernel thread), from 
> x -> 0 only.

I think I didn't continue reading, so when commenting I didn't see that
cpu_vm_stats_fold() would be reused, sorry.

Now, with a reworked (and SMP-safe) cpu_vm_stats_fold() and vmstats, I'm
wondering about the possibility of merging it with refresh_cpu_vm_stats(),
since they really look similar.

IIUC the new refresh_cpu_vm_stats() logically doesn't need the small
preempt-disabled sections anymore, if a cpu_id is passed over to
cpu_vm_stats_fold(), which even seems to be a good side effect.  But I'm
not sure whether I missed something.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-02 21:04     ` Marcelo Tosatti
@ 2023-03-02 21:25       ` Peter Xu
  2023-03-03 15:39         ` Marcelo Tosatti
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2023-03-02 21:25 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 02, 2023 at 06:04:25PM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 02, 2023 at 03:53:12PM -0500, Peter Xu wrote:
> > On Thu, Feb 09, 2023 at 12:01:52PM -0300, Marcelo Tosatti wrote:
> > > Goal is to have vmstat_shepherd to transfer from
> > > per-CPU counters to global counters remotely. For this, 
> > > an atomic this_cpu_cmpxchg is necessary.
> > > 
> > > Following the kernel convention for cmpxchg/cmpxchg_local,
> > > change ARM's this_cpu_cmpxchg_ helpers to be atomic,
> > > and add this_cpu_cmpxchg_local_ helpers which are not atomic.
> > 
> > I can follow on the necessity of having the _local version, however two
> > questions below.
> > 
> > > 
> > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > > 
> > > Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > > ===================================================================
> > > --- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
> > > +++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > > @@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
> > >  	_pcp_protect_return(xchg_relaxed, pcp, val)
> > >  
> > >  #define this_cpu_cmpxchg_1(pcp, o, n)	\
> > > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > >  #define this_cpu_cmpxchg_2(pcp, o, n)	\
> > > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > >  #define this_cpu_cmpxchg_4(pcp, o, n)	\
> > > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > >  #define this_cpu_cmpxchg_8(pcp, o, n)	\
> > > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > 
> > This makes this_cpu_cmpxchg_*() not only non-local, but also (especially
> > for arm64) memory barrier implications since cmpxchg() has a strong memory
> > barrier, while the old this_cpu_cmpxchg*() doesn't have, afaiu.
> > 
> > Maybe it's not a big deal if the audience of this helper is still limited
> > (e.g. we can add memory barriers if we don't want strict ordering
> > implication), but just to check with you on whether it's intended, and if
> > so whether it may worth some comments.
> 
> It happens that on ARM-64 cmpxchg_local == cmpxchg_relaxed.
> 
> See cf10b79a7d88edc689479af989b3a88e9adf07ff.

This is more or less a comment in general, rather than for arm only.

Fundamentally, starting from this patch we're redefining this_cpu_cmpxchg().
What I meant is whether we should define it properly and then implement the
arch patches according to what is defined.

We're adding non-local semantics into it, which is obvious to me.

We're also (silently, in this patch for aarch64) adding memory barrier
semantics; it is not obvious to me whether all archs should implement this
API the same way.

It will make a difference IMHO when the helpers are used in any other code
paths, because IIUC a proper definition of the memory barrier implications
will decide whether the callers need explicit barriers when ordering is
required.

> 
> This patchset maintains the current behaviour
> of this_cpu_cmpxchg (for this_cpu_cmpxchg_local), which was:
> 
>  #define this_cpu_cmpxchg_1(pcp, o, n)  \                                                                           
> -       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +       _pcp_protect_return(cmpxchg, pcp, o, n)
>  #define this_cpu_cmpxchg_2(pcp, o, n)  \                                                                           
> -       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +       _pcp_protect_return(cmpxchg, pcp, o, n)
>  #define this_cpu_cmpxchg_4(pcp, o, n)  \                                                                           
> -       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> +       _pcp_protect_return(cmpxchg, pcp, o, n)
>  #define this_cpu_cmpxchg_8(pcp, o, n)  \                                                                           
> +       _pcp_protect_return(cmpxchg, pcp, o, n)
> 
> > > +
> > > +#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
> > >  	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > +#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
> > > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > +#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
> > > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > +#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
> > > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > 
> > I think cmpxchg_relaxed()==cmpxchg_local() here for aarch64, however should
> > we still use cmpxchg_local() to pair with this_cpu_cmpxchg_local_*()?
> 
> Since cmpxchg_local = cmpxchg_relaxed, seems like this is not necessary.
> 
> > Nothing about your patch along since it was the same before, but I'm
> > wondering whether this is a good time to switchover.
> 
> I would say that another patch is more appropriate to change this, 
> if desired.

Sure on this one.  Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 10/11] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely
  2023-03-02 21:16     ` Marcelo Tosatti
@ 2023-03-02 21:30       ` Peter Xu
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2023-03-02 21:30 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 02, 2023 at 06:16:42PM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 02, 2023 at 04:01:07PM -0500, Peter Xu wrote:
> > On Thu, Feb 09, 2023 at 12:02:00PM -0300, Marcelo Tosatti wrote:
> > > +#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
> > > +/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
> > > +static void vmstat_shepherd(struct work_struct *w)
> > > +{
> > > +	int cpu;
> > > +
> > > +	cpus_read_lock();
> > > +	for_each_online_cpu(cpu) {
> > > +		cpu_vm_stats_fold(cpu);
> > 
> > Nitpick: IIUC this line is the only change with CONFIG_HAVE_CMPXCHG_LOCAL
> > to replace the queuing.  Would it be cleaner to move the ifdef into
> > vmstat_shepherd, then, and keep the common logic?
> 
> https://lore.kernel.org/lkml/20221223144150.GA79369@lothringen/

:-)

[...]

> So it seems the current separation is quite readable
> (unless you have a suggestion).

No, feel free to ignore any of my nitpicks when you don't think they are
appropriate. :)  Keeping it is fine with me.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 09/11] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold
  2023-03-02 21:19       ` Peter Xu
@ 2023-03-03 15:17         ` Marcelo Tosatti
  0 siblings, 0 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-03-03 15:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 02, 2023 at 04:19:50PM -0500, Peter Xu wrote:
> On Thu, Mar 02, 2023 at 10:55:09AM -0300, Marcelo Tosatti wrote:
> > >   (2) If someone can modify the dead cpu's vm_stat_diff,
> > 
> > The only context that can modify the cpu's vm_stat_diff are:
> > 
> > 1) The CPU itself (increases the counter).
> > 2) cpu_vm_stats_fold (from vmstat_shepherd kernel thread), from 
> > x -> 0 only.
> 
> I think I didn't continue reading so I didn't see cpu_vm_stats_fold() will
> be reused when commenting, sorry.
> 
> Now with a reworked (and SMP-safe) cpu_vm_stats_fold() and vmstats, I'm
> wondering the possibility of merging it with refresh_cpu_vm_stats() since
> they really look similar.

Seems like a possibility. However, that might require replacing

v = this_cpu_xchg(pzstats->vm_stat_diff[i], 0);

with

pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);

which would drop the this_cpu optimization described in

7340a0b15280c9d902c7dd0608b8e751b5a7c403

Also you would not want the unified function to sync NUMA events 
(as it would be called from NOHZ entry and exit).

See f19298b9516c1a031b34b4147773457e3efe743b
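
I.e., the two access patterns being contrasted are roughly (a sketch
only):

	/* local fold (refresh_cpu_vm_stats): segment-register based */
	v = this_cpu_xchg(pzstats->vm_stat_diff[i], 0);

	/* remote fold (cpu_vm_stats_fold): plain addressing of another
	 * CPU's data, hence a full atomic xchg */
	pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
	v = xchg(&pzstats->vm_stat_diff[i], 0);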

> IIUC the new refresh_cpu_vm_stats() logically doesn't need the small
> preempt disabled sections, not anymore, 

Which preempt-disabled sections do you refer to?

> if with a cpu_id passed over to
> cpu_vm_stats_fold(), which seems to be even a good side effect. But not
> sure I missed something.
> 
> -- 
> Peter Xu
> 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-02 21:25       ` Peter Xu
@ 2023-03-03 15:39         ` Marcelo Tosatti
  0 siblings, 0 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-03-03 15:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 02, 2023 at 04:25:08PM -0500, Peter Xu wrote:
> On Thu, Mar 02, 2023 at 06:04:25PM -0300, Marcelo Tosatti wrote:
> > On Thu, Mar 02, 2023 at 03:53:12PM -0500, Peter Xu wrote:
> > > On Thu, Feb 09, 2023 at 12:01:52PM -0300, Marcelo Tosatti wrote:
> > > > Goal is to have vmstat_shepherd to transfer from
> > > > per-CPU counters to global counters remotely. For this, 
> > > > an atomic this_cpu_cmpxchg is necessary.
> > > > 
> > > > Following the kernel convention for cmpxchg/cmpxchg_local,
> > > > change ARM's this_cpu_cmpxchg_ helpers to be atomic,
> > > > and add this_cpu_cmpxchg_local_ helpers which are not atomic.
> > > 
> > > I can follow on the necessity of having the _local version, however two
> > > questions below.
> > > 
> > > > 
> > > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > > > 
> > > > Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > > > ===================================================================
> > > > --- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
> > > > +++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > > > @@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
> > > >  	_pcp_protect_return(xchg_relaxed, pcp, val)
> > > >  
> > > >  #define this_cpu_cmpxchg_1(pcp, o, n)	\
> > > > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > > >  #define this_cpu_cmpxchg_2(pcp, o, n)	\
> > > > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > > >  #define this_cpu_cmpxchg_4(pcp, o, n)	\
> > > > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > > >  #define this_cpu_cmpxchg_8(pcp, o, n)	\
> > > > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> > > 
> > > This makes this_cpu_cmpxchg_*() not only non-local, but also (especially
> > > for arm64) memory barrier implications since cmpxchg() has a strong memory
> > > barrier, while the old this_cpu_cmpxchg*() doesn't have, afaiu.
> > > 
> > > Maybe it's not a big deal if the audience of this helper is still limited
> > > (e.g. we can add memory barriers if we don't want strict ordering
> > > implication), but just to check with you on whether it's intended, and if
> > > so whether it may worth some comments.
> > 
> > It happens that on ARM-64 cmpxchg_local == cmpxchg_relaxed.
> > 
> > See cf10b79a7d88edc689479af989b3a88e9adf07ff.
> 
> This is more or less a comment in general, rather than for arm only.
> 
> Fundamentally starting from this patch it's redefining this_cpu_cmpxchg().
> What I meant is whether we should define it properly then implement the
> arch patches with what is defined.
> 
> We're adding non-local semantics into it, which is obvious to me.

Which matches the cmpxchg() function semantics.

> We're (silently, in this patch for aarch64) adding memory barrier semantics
> too, this is not obvious to me on whether all archs should implement this
> api the same way.

Documentation/atomic_t.txt says that _relaxed means "no barriers".

So I'd assume:

cmpxchg_relaxed: no additional barriers.
cmpxchg_local:   only guarantees atomicity with respect to the local CPU.
cmpxchg:	 atomic in SMP context.

https://lore.kernel.org/linux-arm-kernel/20180505103550.s7xsnto7tgppkmle@gmail.com/#r

There seems to be a lack of clarity in the documentation.

> It will make a difference IMHO when the helpers are used in any other code
> clips, because IIUC proper definition of memory barrier implications will
> decide whether the callers need explicit barriers when ordering is required.

I'm trying to limit the scope of the changes to solving the problem at hand.

More specifically, what this patch does is:

1) Add this_cpu_cmpxchg_local, using the arch cmpxchg_local implementation
to back it.
2) Add this_cpu_cmpxchg, using the arch cmpxchg implementation to back it.

Note that this now becomes consistent with the cmpxchg and cmpxchg_local
semantics.
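
I.e., after this series the intended usage is (an illustrative sketch;
the per-CPU variable names here are hypothetical):

	/* counter that may also be folded by a remote CPU: full atomicity */
	old = this_cpu_cmpxchg(vmstat_diff, o, n);

	/* data only ever touched by the owning CPU (atomic vs. IRQs and
	 * preemption only): cheaper, the previous behaviour */
	old = this_cpu_cmpxchg_local(cache_slot, o, n);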

> > This patchset maintains the current behaviour
> > of this_cpu_cmpxchg (for this_cpu_cmpxchg_local), which was:
> > 
> >  #define this_cpu_cmpxchg_1(pcp, o, n)  \                                                                           
> > -       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +       _pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_2(pcp, o, n)  \                                                                           
> > -       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +       _pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_4(pcp, o, n)  \                                                                           
> > -       _pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +       _pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_8(pcp, o, n)  \                                                                           
> > +       _pcp_protect_return(cmpxchg, pcp, o, n)
> > 
> > > > +
> > > > +#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
> > > >  	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > > +#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
> > > > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > > +#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
> > > > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > > +#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
> > > > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > > 
> > > I think cmpxchg_relaxed()==cmpxchg_local() here for aarch64, however should
> > > we still use cmpxchg_local() to pair with this_cpu_cmpxchg_local_*()?
> > 
> > Since cmpxchg_local = cmpxchg_relaxed, seems like this is not necessary.
> > 
> > > Nothing about your patch along since it was the same before, but I'm
> > > wondering whether this is a good time to switchover.
> > 
> > I would say that another patch is more appropriate to change this, 
> > if desired.
> 
> Sure on this one.  Thanks,
> 
> -- 
> Peter Xu
> 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-02 20:53   ` Peter Xu
  2023-03-02 21:04     ` Marcelo Tosatti
@ 2023-03-03 15:47     ` Marcelo Tosatti
  1 sibling, 0 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-03-03 15:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 02, 2023 at 03:53:12PM -0500, Peter Xu wrote:
> On Thu, Feb 09, 2023 at 12:01:52PM -0300, Marcelo Tosatti wrote:
> > Goal is to have vmstat_shepherd to transfer from
> > per-CPU counters to global counters remotely. For this, 
> > an atomic this_cpu_cmpxchg is necessary.
> > 
> > Following the kernel convention for cmpxchg/cmpxchg_local,
> > change ARM's this_cpu_cmpxchg_ helpers to be atomic,
> > and add this_cpu_cmpxchg_local_ helpers which are not atomic.
> 
> I can follow on the necessity of having the _local version, however two
> questions below.
> 
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > ===================================================================
> > --- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
> > +++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
> > @@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
> >  	_pcp_protect_return(xchg_relaxed, pcp, val)
> >  
> >  #define this_cpu_cmpxchg_1(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_2(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_4(pcp, o, n)	\
> > -	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> >  #define this_cpu_cmpxchg_8(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg, pcp, o, n)
> 
> This makes this_cpu_cmpxchg_*() not only non-local, but also (especially
> for arm64) memory barrier implications since cmpxchg() has a strong memory
> barrier, while the old this_cpu_cmpxchg*() doesn't have, afaiu.

A later patch changes users of this_cpu_cmpxchg to
this_cpu_cmpxchg_local, which maintains behaviour.

> Maybe it's not a big deal if the audience of this helper is still limited
> (e.g. we can add memory barriers if we don't want strict ordering
> implication), but just to check with you on whether it's intended, and if
> so whether it may worth some comments.
> 
> > +
> > +#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
> >  	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> > +#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
> > +	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
> 
> I think cmpxchg_relaxed()==cmpxchg_local() here for aarch64, however should
> we still use cmpxchg_local() to pair with this_cpu_cmpxchg_local_*()?
> 
> Nothing about your patch along since it was the same before, but I'm
> wondering whether this is a good time to switchover.
> 
> The other thing is would it be good to copy arch-list for each arch patch?
> Maybe it'll help to extend the audience too.
> 
> Thanks,
> 
> -- 
> Peter Xu
> 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
  2023-02-09 15:01 ` [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
  2023-03-02 10:42   ` David Hildenbrand
  2023-03-02 20:53   ` Peter Xu
@ 2023-03-15 23:56   ` Christoph Lameter
  2023-03-16 10:54     ` Marcelo Tosatti
  2 siblings, 1 reply; 47+ messages in thread
From: Christoph Lameter @ 2023-03-15 23:56 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

On Thu, 9 Feb 2023, Marcelo Tosatti wrote:

> Goal is to have vmstat_shepherd to transfer from
> per-CPU counters to global counters remotely. For this,
> an atomic this_cpu_cmpxchg is necessary.

The definition of this_cpu functionality is that it does *not* incur
atomic overhead, and it was introduced to *avoid* the overhead of atomic
operations.

This sabotages this_cpu functionality.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-15 23:56   ` Christoph Lameter
@ 2023-03-16 10:54     ` Marcelo Tosatti
  0 siblings, 0 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-03-16 10:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

On Thu, Mar 16, 2023 at 12:56:20AM +0100, Christoph Lameter wrote:
> On Thu, 9 Feb 2023, Marcelo Tosatti wrote:
> 
> > Goal is to have vmstat_shepherd to transfer from
> > per-CPU counters to global counters remotely. For this,
> > an atomic this_cpu_cmpxchg is necessary.
> 
> The definition for this_cpu_functionality is that it is *not* incurring
> atomic overhead and it was introduced to *avoid* the overhead of atomic
> operations.
> 
> This sabotages this_cpu functionality,

Christoph,

Two points:

1) If you look at patch 7, users of this_cpu_cmpxchg are converted
to this_cpu_cmpxchg_local (except for the per-CPU vmstat counters).
It's up to the user of the interface, depending on its requirements,
to decide whether or not atomic operations are necessary
(atomic with respect to other processors).

this_cpu_cmpxchg still has the benefit of using segment registers:

:Author: Christoph Lameter, August 4th, 2014
:Author: Pranith Kumar, Aug 2nd, 2014

this_cpu operations are a way of optimizing access to per cpu
variables associated with the *currently* executing processor. This is
done through the use of segment registers (or a dedicated register where
the cpu permanently stored the beginning of the per cpu area for a
specific processor).

this_cpu operations add a per cpu variable offset to the processor
specific per cpu base and encode that operation in the instruction
operating on the per cpu variable.

This means that there are no atomicity issues between the calculation of
the offset and the operation on the data. Therefore it is not
necessary to disable preemption or interrupts to ensure that the
processor is not changed between the calculation of the address and
the operation on the data.
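
For example (illustration only; the per-CPU variable is hypothetical and
the generated code is approximate):

	/* hypothetical per-CPU counter, for illustration only */
	DEFINE_PER_CPU(long, some_percpu_counter);

	/*
	 * this_cpu_add(some_percpu_counter, 1) compiles, on x86-64, to
	 * roughly a single gs-prefixed instruction:
	 *	addq $1, %gs:some_percpu_counter
	 * so there is no separate address calculation that could race with
	 * preemption or migration.
	 */
	this_cpu_add(some_percpu_counter, 1);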

2) The performance results seem to indicate that
cache locking is effective on modern processors (in this particular case and
in others as well):

4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock

    As preparation for dealing with both of those problems, protect the
    lists with a spinlock.  The IRQ-unsafe version of the lock is used
    because IRQs are already disabled by local_lock_irqsave.  spin_trylock
    is used in combination with local_lock_irqsave() but later will be
    replaced with a spin_trylock_irqsave when the local_lock is removed.

    The per_cpu_pages still fits within the same number of cache lines after
    this patch relative to before the series.

    struct per_cpu_pages {
            spinlock_t                 lock;                 /*     0     4 */
            int                        count;                /*     4     4 */
            int                        high;                 /*     8     4 */
            int                        batch;                /*    12     4 */
            short int                  free_factor;          /*    16     2 */
            short int                  expire;               /*    18     2 */

            /* XXX 4 bytes hole, try to pack */

            struct list_head           lists[13];            /*    24   208 */

            /* size: 256, cachelines: 4, members: 7 */
            /* sum members: 228, holes: 1, sum holes: 4 */
            /* padding: 24 */
    } __attribute__((__aligned__(64)));

    There is overhead in the fast path due to acquiring the spinlock even
    though the spinlock is per-cpu and uncontended in the common case.  Page
    Fault Test (PFT) running on a 1-socket reported the following results on a
    1 socket machine.

                                         5.19.0-rc3               5.19.0-rc3
                                            vanilla      mm-pcpspinirq-v5r16
    Hmean     faults/sec-1   869275.7381 (   0.00%)   874597.5167 *   0.61%*
    Hmean     faults/sec-3  2370266.6681 (   0.00%)  2379802.0362 *   0.40%*
    Hmean     faults/sec-5  2701099.7019 (   0.00%)  2664889.7003 *  -1.34%*
    Hmean     faults/sec-7  3517170.9157 (   0.00%)  3491122.8242 *  -0.74%*
    Hmean     faults/sec-8  3965729.6187 (   0.00%)  3939727.0243 *  -0.66%*

And for this case:

To test the performance difference, a page allocator microbenchmark:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.

For the single_page_alloc_free test, which does

        /** Loop to measure **/
        for (i = 0; i < rec->loops; i++) {
                my_page = alloc_page(gfp_mask);
                if (unlikely(my_page == NULL))
                        return 0;
                __free_page(my_page);
        }

Unit is cycles.

Vanilla                 Patched         Diff
115.25                  117             1.4%

(To be honest, the results are in the noise as well: during the tests
the "LOCK cmpxchg" version shows no significant difference from the "cmpxchg"
version for the page allocator benchmark.)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 01/11] mm/vmstat: remove remote node draining
  2023-03-02 10:10       ` David Hildenbrand
@ 2023-03-21 15:20         ` Mel Gorman
  2023-03-21 17:31           ` Marcelo Tosatti
  0 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2023-03-21 15:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Marcelo Tosatti, Christoph Lameter, Aaron Tomlin,
	Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm,
	Vlastimil Babka

On Thu, Mar 02, 2023 at 11:10:03AM +0100, David Hildenbrand wrote:
> [...]
> 
> > 
> > > (2) drain_zone_pages() documents that we're draining the PCP
> > >      (bulk-freeing them) of the current CPU on remote nodes. That bulk-
> > >      freeing will properly adjust free memory counters. What exactly is
> > >      the impact when no longer doing that? Won't the "snapshot" of some
> > >      counters eventually be wrong? Do we care?
> > 
> > Don't see why the snapshot of counters will be wrong.
> > 
> > Instead of freeing pages on pcp list of remote nodes after they are
> > considered idle ("3 seconds idle till flush"), what will happen is that
> > drain_all_pages() will free those pcps, for example after an allocation
> > fails on direct reclaim:
> > 
> >          page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> > 
> >          /*
> >           * If an allocation failed after direct reclaim, it could be because
> >           * pages are pinned on the per-cpu lists or in high alloc reserves.
> >           * Shrink them and try again
> >           */
> >          if (!page && !drained) {
> >                  unreserve_highatomic_pageblock(ac, false);
> >                  drain_all_pages(NULL);
> >                  drained = true;
> >                  goto retry;
> >          }
> > 
> > In both cases the pages are freed (and counters maintained) here:
> > 
> > static inline void __free_one_page(struct page *page,
> >                  unsigned long pfn,
> >                  struct zone *zone, unsigned int order,
> >                  int migratetype, fpi_t fpi_flags)
> > {
> >          struct capture_control *capc = task_capc(zone);
> >          unsigned long buddy_pfn = 0;
> >          unsigned long combined_pfn;
> >          struct page *buddy;
> >          bool to_tail;
> > 
> >          VM_BUG_ON(!zone_is_initialized(zone));
> >          VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> > 
> >          VM_BUG_ON(migratetype == -1);
> >          if (likely(!is_migrate_isolate(migratetype)))
> >                  __mod_zone_freepage_state(zone, 1 << order, migratetype);
> > 
> >          VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
> >          VM_BUG_ON_PAGE(bad_range(zone, page), page);
> > 
> >          while (order < MAX_ORDER - 1) {
> >                  if (compaction_capture(capc, page, order, migratetype)) {
> >                          __mod_zone_freepage_state(zone, -(1 << order),
> >                                                                  migratetype);
> >                          return;
> >                  }
> > 
> > > Describing the difference between instructed refresh of vmstat and "remotely
> > > drain per-cpu lists" in order to move free memory from the pcp to the buddy
> > > would be great.
> > 
> > The difference is that now remote PCPs will be drained on demand, either via
> > kcompactd or direct reclaim (through drain_all_pages), when memory is
> > low.
> > 
> > For example, with the following test:
> > 
> > dd if=/dev/zero of=file bs=1M count=32000 on a tmpfs filesystem:
> > 
> >        kcompactd0-116     [005] ...1 228232.042873: drain_all_pages <-kcompactd_do_work
> >        kcompactd0-116     [005] ...1 228232.042873: __drain_all_pages <-kcompactd_do_work
> >                dd-479485  [003] ...1 228232.455130: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
> >                dd-479485  [011] ...1 228232.721994: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
> >       gnome-shell-3750    [015] ...1 228232.723729: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
> > 
> > The commit message was indeed incorrect. Updated one:
> > 
> > "mm/vmstat: remove remote node draining
> > 
> > Draining of pages from the local pcp for a remote zone should not be
> > necessary, since once the system is low on memory (or compaction on a
> > zone is in effect), drain_all_pages should be called freeing any unused
> > pcps."
> > 
> > Thanks!
> 
> Thanks for the explanation, that makes sense to me. Feel free to add my
> 
> Acked-by: David Hildenbrand <david@redhat.com>
> 
> ... hoping that some others (Mel, Vlastimil?) can have another look.
> 

I was on extended leave and am still in the process of triaging a few
thousand mails so I'm working off memory here instead of the code. This
is a straight-forward enough question to answer quickly in case I forget
later.

Short answer: I'm not a fan of the patch in concept and I do not think it
should be merged.

I agree that drain_all_pages() would free the PCP pages on demand in
direct reclaim context, but that happens only after reclaim has already
taken place. Hence, the reclaim may be unnecessary and may cause overreclaim
in some situations due to remote CPUs pinning memory in PCP lists.

Similarly, kswapd may trigger early because PCP pages do not contribute
to NR_FREE_PAGES so watermark checks can fail even though pages are
free, just inaccessible.
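
(Roughly the check being referred to, as a sketch of the zone_watermark_ok()
path rather than the exact code; NR_FREE_PAGES does not include pages on
pcp lists:)

	long free_pages = zone_page_state(z, NR_FREE_PAGES);	/* excludes pcp */

	if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
		return false;	/* watermark "fails", although pages may be
				 * sitting idle on remote pcp lists */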

Finally, remote pages expire because ideally CPUs allocate local memory
assuming memory policies are not forcing use of remote nodes. The expiry
means that remote pages get freed back to the buddy lists after a short
period. By removing the expiry, it's possible that a local allocation will
fail and spill over to a remote node prematurely because free pages were
pinned on the PCP lists.

As this patch has the possibility of reclaiming early in both direct and
kswapd context and increases the risk of remote node fallback, I think it
needs much stronger justification and a warning about the side-effects. For
this version unless I'm very wrong -- NAK :(

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 01/11] mm/vmstat: remove remote node draining
  2023-03-21 15:20         ` Mel Gorman
@ 2023-03-21 17:31           ` Marcelo Tosatti
  0 siblings, 0 replies; 47+ messages in thread
From: Marcelo Tosatti @ 2023-03-21 17:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Hildenbrand, Christoph Lameter, Aaron Tomlin,
	Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm,
	Vlastimil Babka

On Tue, Mar 21, 2023 at 03:20:31PM +0000, Mel Gorman wrote:
> On Thu, Mar 02, 2023 at 11:10:03AM +0100, David Hildenbrand wrote:
> > [...]
> > 
> > > 
> > > > (2) drain_zone_pages() documents that we're draining the PCP
> > > >      (bulk-freeing them) of the current CPU on remote nodes. That bulk-
> > > >      freeing will properly adjust free memory counters. What exactly is
> > > >      the impact when no longer doing that? Won't the "snapshot" of some
> > > >      counters eventually be wrong? Do we care?
> > > 
> > > Don't see why the snapshot of counters will be wrong.
> > > 
> > > Instead of freeing pages on pcp list of remote nodes after they are
> > > considered idle ("3 seconds idle till flush"), what will happen is that
> > > drain_all_pages() will free those pcps, for example after an allocation
> > > fails on direct reclaim:
> > > 
> > >          page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> > > 
> > >          /*
> > >           * If an allocation failed after direct reclaim, it could be because
> > >           * pages are pinned on the per-cpu lists or in high alloc reserves.
> > >           * Shrink them and try again
> > >           */
> > >          if (!page && !drained) {
> > >                  unreserve_highatomic_pageblock(ac, false);
> > >                  drain_all_pages(NULL);
> > >                  drained = true;
> > >                  goto retry;
> > >          }
> > > 
> > > In both cases the pages are freed (and counters maintained) here:
> > > 
> > > static inline void __free_one_page(struct page *page,
> > >                  unsigned long pfn,
> > >                  struct zone *zone, unsigned int order,
> > >                  int migratetype, fpi_t fpi_flags)
> > > {
> > >          struct capture_control *capc = task_capc(zone);
> > >          unsigned long buddy_pfn = 0;
> > >          unsigned long combined_pfn;
> > >          struct page *buddy;
> > >          bool to_tail;
> > > 
> > >          VM_BUG_ON(!zone_is_initialized(zone));
> > >          VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
> > > 
> > >          VM_BUG_ON(migratetype == -1);
> > >          if (likely(!is_migrate_isolate(migratetype)))
> > >                  __mod_zone_freepage_state(zone, 1 << order, migratetype);
> > > 
> > >          VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
> > >          VM_BUG_ON_PAGE(bad_range(zone, page), page);
> > > 
> > >          while (order < MAX_ORDER - 1) {
> > >                  if (compaction_capture(capc, page, order, migratetype)) {
> > >                          __mod_zone_freepage_state(zone, -(1 << order),
> > >                                                                  migratetype);
> > >                          return;
> > >                  }
> > > 
> > > > Describing the difference between instructed refresh of vmstat and "remotely
> > > > drain per-cpu lists" in order to move free memory from the pcp to the buddy
> > > > would be great.
> > > 
> > > The difference is that now remote PCPs will be drained on demand, either via
> > > kcompactd or direct reclaim (through drain_all_pages), when memory is
> > > low.
> > > 
> > > For example, with the following test:
> > > 
> > > dd if=/dev/zero of=file bs=1M count=32000 on a tmpfs filesystem:
> > > 
> > >        kcompactd0-116     [005] ...1 228232.042873: drain_all_pages <-kcompactd_do_work
> > >        kcompactd0-116     [005] ...1 228232.042873: __drain_all_pages <-kcompactd_do_work
> > >                dd-479485  [003] ...1 228232.455130: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
> > >                dd-479485  [011] ...1 228232.721994: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
> > >       gnome-shell-3750    [015] ...1 228232.723729: __drain_all_pages <-__alloc_pages_slowpath.constprop.0
> > > 
> > > The commit message was indeed incorrect. Updated one:
> > > 
> > > "mm/vmstat: remove remote node draining
> > > 
> > > Draining of pages from the local pcp for a remote zone should not be
> > > necessary, since once the system is low on memory (or compaction on a
> > > zone is in effect), drain_all_pages should be called freeing any unused
> > > pcps."
> > > 
> > > Thanks!
> > 
> > Thanks for the explanation, that makes sense to me. Feel free to add my
> > 
> > Acked-by: David Hildenbrand <david@redhat.com>
> > 
> > ... hoping that some others (Mel, Vlastimil?) can have another look.
> > 
> 
> I was on extended leave and am still in the process of triaging a few
> thousand mails so I'm working off memory here instead of the code. This
> is a straight-forward enough question to answer quickly in case I forget
> later.
> 
> Short answer: I'm not a fan of the patch in concept and I do not think it
> should be merged.
> 
> I agree that drain_all_pages() would free the PCP pages on demand in
> direct reclaim context but it happens after reclaim has already
> happened. Hence, the reclaim may be necessary and may cause overreclaim
> in some situations due to remote CPUs pinning memory in PCP lists.
> 
> Similarly, kswapd may trigger early because PCP pages do not contribute
> to NR_FREE_PAGES so watermark checks can fail even though pages are
> free, just inaccessible.
> 
> Finally, remote pages expire because ideally CPUs allocate local memory
> assuming memory policies are not forcing use of remote nodes. The expiry
> means that remote pages get freed back to the buddy lists after a short
> period. By removing the expiry, it's possible that a local allocation will
> fail and spill over to a remote node prematurely because free pages were
> pinned on the PCP lists.
> 
> As this patch has the possibility of reclaiming early in both direct and
> kswapd context and increases the risk of remote node fallback, I think it
> needs much stronger justification and a warning about the side-effects. For
> this version unless I'm very wrong -- NAK :(

Mel,

Agreed. -v7 of the series dropped this patch and instead implements draining
of non-local NUMA node memory from the pcp caches in cpu_vm_stats_fold
(which might execute remotely, from vmstat_shepherd).
If you can take a look at -v7, that would be awesome.

Subject: [PATCH v7 13/13] vmstat: add pcp remote node draining via cpu_vm_stats_fold

Large NUMA systems might have significant portions
of system memory trapped in pcp queues. The number of pcps is
determined by the number of processors and nodes in a system. A system
with 4 processors and 2 nodes has 8 pcps, which is okay. But a system
with 1024 processors and 512 nodes has 512k pcps, with a high potential
for a large amount of memory being caught in them.

Enable remote node draining for the CONFIG_HAVE_CMPXCHG_LOCAL case,
where vmstat_shepherd will perform the aging and draining via
cpu_vm_stats_fold.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
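
Roughly, the idea is the following (a sketch only; the actual v7 patch
differs in details such as locking and function signatures):

	/* executed by vmstat_shepherd on behalf of another CPU */
	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);

	if (pcp->count && pcp->expire) {
		if (!--pcp->expire)			/* age the remote-node pcp */
			drain_zone_pages(zone, pcp);	/* free pages back to buddy */
	}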




> 
> -- 
> Mel Gorman
> SUSE Labs
> 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2023-03-21 17:32 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-09 15:01 [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
2023-02-09 15:01 ` [PATCH v2 01/11] mm/vmstat: remove remote node draining Marcelo Tosatti
2023-02-28 15:53   ` David Hildenbrand
2023-02-28 19:36     ` Marcelo Tosatti
2023-03-02 10:10       ` David Hildenbrand
2023-03-21 15:20         ` Mel Gorman
2023-03-21 17:31           ` Marcelo Tosatti
2023-03-02 17:21   ` Peter Xu
2023-03-02 17:27     ` Peter Xu
2023-03-02 19:17       ` Marcelo Tosatti
2023-03-02 18:56     ` Marcelo Tosatti
2023-02-09 15:01 ` [PATCH v2 02/11] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
2023-03-02 10:42   ` David Hildenbrand
2023-03-02 10:51     ` David Hildenbrand
2023-03-02 14:32     ` Marcelo Tosatti
2023-03-02 20:53   ` Peter Xu
2023-03-02 21:04     ` Marcelo Tosatti
2023-03-02 21:25       ` Peter Xu
2023-03-03 15:39         ` Marcelo Tosatti
2023-03-03 15:47     ` Marcelo Tosatti
2023-03-15 23:56   ` Christoph Lameter
2023-03-16 10:54     ` Marcelo Tosatti
2023-02-09 15:01 ` [PATCH v2 03/11] this_cpu_cmpxchg: loongarch: " Marcelo Tosatti
2023-02-09 15:01 ` [PATCH v2 04/11] this_cpu_cmpxchg: S390: " Marcelo Tosatti
2023-02-09 15:01 ` [PATCH v2 05/11] this_cpu_cmpxchg: x86: " Marcelo Tosatti
2023-02-09 15:01 ` [PATCH v2 06/11] this_cpu_cmpxchg: asm-generic: " Marcelo Tosatti
2023-02-09 15:01 ` [PATCH v2 07/11] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
2023-03-02 20:54   ` Peter Xu
2023-02-09 15:01 ` [PATCH v2 08/11] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
2023-03-02 10:47   ` David Hildenbrand
2023-03-02 14:47     ` Marcelo Tosatti
2023-03-02 16:20       ` Peter Xu
2023-03-02 19:11         ` Marcelo Tosatti
2023-03-02 20:06           ` Peter Xu
2023-02-09 15:01 ` [PATCH v2 09/11] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold Marcelo Tosatti
2023-03-01 22:57   ` Peter Xu
2023-03-02 13:55     ` Marcelo Tosatti
2023-03-02 21:19       ` Peter Xu
2023-03-03 15:17         ` Marcelo Tosatti
2023-02-09 15:02 ` [PATCH v2 10/11] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
2023-03-02 21:01   ` Peter Xu
2023-03-02 21:16     ` Marcelo Tosatti
2023-03-02 21:30       ` Peter Xu
2023-02-09 15:02 ` [PATCH v2 11/11] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
2023-02-23 14:54 ` [PATCH v2 00/11] fold per-CPU vmstats remotely Marcelo Tosatti
2023-02-24  2:34   ` Hillf Danton
2023-02-27 19:41     ` Marcelo Tosatti
