linux-kernel.vger.kernel.org archive mirror
* [PATCH v7 00/13] fold per-CPU vmstats remotely
@ 2023-03-20 18:03 Marcelo Tosatti
  2023-03-20 18:03 ` [PATCH v7 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot Marcelo Tosatti
                   ` (14 more replies)
  0 siblings, 15 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-20 18:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko

This patch series addresses the following two problems:

1. A customer provided evidence indicating that a process
   was stalled in direct reclaim:

 - The process was trapped in throttle_direct_reclaim().
   wait_event_killable() was called to wait for the condition
   allow_direct_reclaim(pgdat) to become true for the current node.
   allow_direct_reclaim(pgdat) examined the number of free pages
   on the node via zone_page_state(), which simply returns the value
   in zone->vm_stat[NR_FREE_PAGES].

 - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
   However, the freelist on this node was not empty.

 - This inconsistency in the vmstat value was caused by the per-CPU
   vmstat counters on nohz_full cpus. Every increment/decrement of a
   vmstat counter is performed on a per-CPU vmstat counter first, and
   the pooled diffs are then accumulated into the zone's vmstat counter
   in a timely manner. However, on nohz_full cpus (in the case of this
   customer's system, 48 of 52 cpus) these pooled diffs were not
   accumulated once a cpu had no events on it and started sleeping
   indefinitely.
   I checked the per-CPU vmstat counters and found a total of 69 counts
   not yet accumulated into the zone's vmstat counter.

 - In this situation, kswapd did not help the trapped process.
   In pgdat_balanced(), zone_watermark_ok_safe() examined the number
   of free pages on the node via zone_page_state_snapshot(), which
   includes the pending counts in the per-CPU vmstat counters.
   Therefore kswapd correctly saw the 69 free pages.
   Since zone->_watermark = {8, 20, 32}, kswapd did not run because
   69 was greater than the high watermark of 32.

 2. With a task that busy loops on a given CPU,
    the kworker interruption to execute vmstat_update
    is undesired and may exceed latency thresholds
    for certain applications.

Both problems are addressed by having vmstat_shepherd flush the
per-CPU counters to the global counters from remote CPUs.

This is done using cmpxchg to manipulate the counters,
both CPU-locally (via the account functions)
and remotely (via cpu_vm_stats_fold).
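
In outline, a condensed sketch of the scheme implemented in patches
08-11 (threshold/overstep handling omitted; names match the actual
patches):

        /* CPU-local accounting (mod_zone_state, patch 08): */
        do {
                o = this_cpu_read(*p);
                n = delta + o;
        } while (this_cpu_cmpxchg(*p, o, n) != o);

        /* Remote folding (cpu_vm_stats_fold, patches 10 and 11): */
        v = xchg(&pzstats->vm_stat_diff[i], 0);
        atomic_long_add(v, &zone->vm_stat[i]);

Since both sides operate on the per-CPU slot atomically, a local
counter update cannot be lost to a concurrent remote fold.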

Thanks to Aaron Tomlin for diagnosing issue 1 and writing
the initial patch series.


Performance details for the kworker interruption:

oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
 
The example above shows an additional 7us for the

        oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold for certain applications.

v7:
- Fix allow_direct_reclaim issue by using
  zone_page_state_snapshot				(Michal Hocko)

v6:
- Add more information on throttle_direct_reclaim problem 
  to commit logs	       			       (Michal Hocko)

v5:
- Drop "mm/vmstat: remove remote node draining"        (Vlastimil Babka)
- Implement remote node draining for cpu_vm_stats_fold (Vlastimil Babka)

v4:
- Switch per-CPU vmstat counters to s32, required 
  by RISC-V, ARC architectures			

v3:
- Removed unused drain_zone_pages and changes variable (David Hildenbrand)
- Use xchg instead of cmpxchg in refresh_cpu_vm_stats  (Peter Xu)
- Add drain_all_pages to vmstat_refresh to make
  stats more accurate				       (Peter Xu)
- Improve changelog of
  "mm/vmstat: switch counter modification to cmpxchg"  (Peter Xu / David)
- Improve changelog of
  "mm/vmstat: remove remote node draining"	       (David Hildenbrand)


v2:
- actually use LOCK CMPXCHG on counter mod/inc/dec functions
  (Christoph Lameter)
- use try_cmpxchg for cmpxchg loops
  (Uros Bizjak / Matthew Wilcox)

 arch/arm64/include/asm/percpu.h     |   16 ++
 arch/loongarch/include/asm/percpu.h |   23 +++-
 arch/s390/include/asm/percpu.h      |    5 
 arch/x86/include/asm/percpu.h       |   39 +++---
 include/asm-generic/percpu.h        |   17 ++
 include/linux/mmzone.h              |    8 -
 include/linux/percpu-defs.h         |    2 
 include/linux/vmstat.h              |    2 
 kernel/fork.c                       |    2 
 kernel/scs.c                        |    2 
 mm/page_alloc.c                     |    5 
 mm/vmscan.c                         |    2 
 mm/vmstat.c                         |  440 +++++++++++++++++++++++++++++++++++++++++++++++------------------------------
 13 files changed, 361 insertions(+), 202 deletions(-)




* [PATCH v7 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot
  2023-03-20 18:03 [PATCH v7 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
@ 2023-03-20 18:03 ` Marcelo Tosatti
  2023-03-20 18:21   ` Michal Hocko
  2023-03-20 18:03 ` [PATCH v7 02/13] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-20 18:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

A customer provided evidence indicating that a process
was stalled in direct reclaim:

 - The process was trapped in throttle_direct_reclaim().
   wait_event_killable() was called to wait for the condition
   allow_direct_reclaim(pgdat) to become true for the current node.
   allow_direct_reclaim(pgdat) examined the number of free pages
   on the node via zone_page_state(), which simply returns the value
   in zone->vm_stat[NR_FREE_PAGES].

 - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
   However, the freelist on this node was not empty.

 - This inconsistency in the vmstat value was caused by the per-CPU
   vmstat counters on nohz_full cpus. Every increment/decrement of a
   vmstat counter is performed on a per-CPU vmstat counter first, and
   the pooled diffs are then accumulated into the zone's vmstat counter
   in a timely manner. However, on nohz_full cpus (in the case of this
   customer's system, 48 of 52 cpus) these pooled diffs were not
   accumulated once a cpu had no events on it and started sleeping
   indefinitely.
   I checked the per-CPU vmstat counters and found a total of 69 counts
   not yet accumulated into the zone's vmstat counter.

 - In this situation, kswapd did not help the trapped process.
   In pgdat_balanced(), zone_watermark_ok_safe() examined the number
   of free pages on the node via zone_page_state_snapshot(), which
   includes the pending counts in the per-CPU vmstat counters.
   Therefore kswapd correctly saw the 69 free pages.
   Since zone->_watermark = {8, 20, 32}, kswapd did not run because
   69 was greater than the high watermark of 32.

Change allow_direct_reclaim to use zone_page_state_snapshot, which
provides a more precise view of the vmstat counters.

allow_direct_reclaim is only called from try_to_free_pages,
which is not a hot path.
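
For reference, zone_page_state_snapshot() differs from zone_page_state()
by also folding in the pending per-CPU diffs. This is essentially the
existing helper (lightly simplified):

        static inline unsigned long zone_page_state_snapshot(struct zone *zone,
                                        enum zone_stat_item item)
        {
                long x = atomic_long_read(&zone->vm_stat[item]);
        #ifdef CONFIG_SMP
                int cpu;

                for_each_online_cpu(cpu)
                        x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_stat_diff[item];

                if (x < 0)
                        x = 0;
        #endif
                return x;
        }

That is why the snapshot variant sees the 69 pages that plain
zone_page_state() misses.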

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/mm/vmscan.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmscan.c
+++ linux-vmstat-remote/mm/vmscan.c
@@ -6861,7 +6861,7 @@ static bool allow_direct_reclaim(pg_data
 			continue;
 
 		pfmemalloc_reserve += min_wmark_pages(zone);
-		free_pages += zone_page_state(zone, NR_FREE_PAGES);
+		free_pages += zone_page_state_snapshot(zone, NR_FREE_PAGES);
 	}
 
 	/* If there are no reserves (unexpected config) then do not throttle */




* [PATCH v7 02/13] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-20 18:03 [PATCH v7 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
  2023-03-20 18:03 ` [PATCH v7 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot Marcelo Tosatti
@ 2023-03-20 18:03 ` Marcelo Tosatti
  2023-03-20 18:03 ` [PATCH v7 03/13] this_cpu_cmpxchg: loongarch: " Marcelo Tosatti
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-20 18:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer counts from the
per-CPU counters to the global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
change arm64's this_cpu_cmpxchg_ helpers to be atomic,
and add this_cpu_cmpxchg_local_ helpers which are not atomic.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
+++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
@@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
 	_pcp_protect_return(xchg_relaxed, pcp, val)
 
 #define this_cpu_cmpxchg_1(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_2(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_4(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_8(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg, pcp, o, n)
+
+#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
 	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+
 
 #ifdef __KVM_NVHE_HYPERVISOR__
 extern unsigned long __hyp_per_cpu_offset(unsigned int cpu);




* [PATCH v7 03/13] this_cpu_cmpxchg: loongarch: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-20 18:03 [PATCH v7 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
  2023-03-20 18:03 ` [PATCH v7 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot Marcelo Tosatti
  2023-03-20 18:03 ` [PATCH v7 02/13] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
@ 2023-03-20 18:03 ` Marcelo Tosatti
  2023-03-20 18:03 ` [PATCH v7 04/13] this_cpu_cmpxchg: S390: " Marcelo Tosatti
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-20 18:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer counts from the
per-CPU counters to the global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
add this_cpu_cmpxchg_local helpers to LoongArch.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/arch/loongarch/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/loongarch/include/asm/percpu.h
+++ linux-vmstat-remote/arch/loongarch/include/asm/percpu.h
@@ -150,6 +150,16 @@ static inline unsigned long __percpu_xch
 }
 
 /* this_cpu_cmpxchg */
+#define _protect_cmpxchg(pcp, o, n)				\
+({								\
+	typeof(*raw_cpu_ptr(&(pcp))) __ret;			\
+	preempt_disable_notrace();				\
+	__ret = cmpxchg(raw_cpu_ptr(&(pcp)), o, n);		\
+	preempt_enable_notrace();				\
+	__ret;							\
+})
+
+/* this_cpu_cmpxchg_local */
 #define _protect_cmpxchg_local(pcp, o, n)			\
 ({								\
 	typeof(*raw_cpu_ptr(&(pcp))) __ret;			\
@@ -222,10 +232,15 @@ do {									\
 #define this_cpu_xchg_4(pcp, val) _percpu_xchg(pcp, val)
 #define this_cpu_xchg_8(pcp, val) _percpu_xchg(pcp, val)
 
-#define this_cpu_cmpxchg_1(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_2(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_4(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_8(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_1(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_2(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_4(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_8(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+
+#define this_cpu_cmpxchg_1(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_2(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_4(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_8(ptr, o, n) _protect_cmpxchg(ptr, o, n)
 
 #include <asm-generic/percpu.h>
 




* [PATCH v7 04/13] this_cpu_cmpxchg: S390: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-20 18:03 [PATCH v7 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2023-03-20 18:03 ` [PATCH v7 03/13] this_cpu_cmpxchg: loongarch: " Marcelo Tosatti
@ 2023-03-20 18:03 ` Marcelo Tosatti
  2023-03-20 18:03 ` [PATCH v7 05/13] this_cpu_cmpxchg: x86: " Marcelo Tosatti
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-20 18:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer counts from the
per-CPU counters to the global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
add S390's this_cpu_cmpxchg_local_ helpers.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/arch/s390/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/s390/include/asm/percpu.h
+++ linux-vmstat-remote/arch/s390/include/asm/percpu.h
@@ -148,6 +148,11 @@
 #define this_cpu_cmpxchg_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 #define this_cpu_cmpxchg_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+
 #define arch_this_cpu_xchg(pcp, nval)					\
 ({									\
 	typeof(pcp) *ptr__;						\




* [PATCH v7 05/13] this_cpu_cmpxchg: x86: switch this_cpu_cmpxchg to locked, add _local function
  2023-03-20 18:03 [PATCH v7 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2023-03-20 18:03 ` [PATCH v7 04/13] this_cpu_cmpxchg: S390: " Marcelo Tosatti
@ 2023-03-20 18:03 ` Marcelo Tosatti
  2023-03-20 18:03 ` [PATCH v7 06/13] add this_cpu_cmpxchg_local and asm-generic definitions Marcelo Tosatti
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-20 18:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer counts from the
per-CPU counters to the global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
change x86's this_cpu_cmpxchg_ helpers to be atomic,
and add this_cpu_cmpxchg_local_ helpers which are not atomic.
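
The only difference between the two x86 variants is the lock prefix
passed to percpu_cmpxchg_op(). Illustrative expansion for the 4-byte
case (register and variable names abbreviated):

        this_cpu_cmpxchg_local_4(pcp, o, n);  /* emits:      cmpxchg %<reg>, %gs:<pcp> */
        this_cpu_cmpxchg_4(pcp, o, n);        /* emits: lock cmpxchg %<reg>, %gs:<pcp> */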

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/arch/x86/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/x86/include/asm/percpu.h
+++ linux-vmstat-remote/arch/x86/include/asm/percpu.h
@@ -197,11 +197,11 @@ do {									\
  * cmpxchg has no such implied lock semantics as a result it is much
  * more efficient for cpu local operations.
  */
-#define percpu_cmpxchg_op(size, qual, _var, _oval, _nval)		\
+#define percpu_cmpxchg_op(size, qual, _var, _oval, _nval, lockp)	\
 ({									\
 	__pcpu_type_##size pco_old__ = __pcpu_cast_##size(_oval);	\
 	__pcpu_type_##size pco_new__ = __pcpu_cast_##size(_nval);	\
-	asm qual (__pcpu_op2_##size("cmpxchg", "%[nval]",		\
+	asm qual (__pcpu_op2_##size(lockp "cmpxchg", "%[nval]",		\
 				    __percpu_arg([var]))		\
 		  : [oval] "+a" (pco_old__),				\
 		    [var] "+m" (_var)					\
@@ -279,16 +279,20 @@ do {									\
 #define raw_cpu_add_return_1(pcp, val)		percpu_add_return_op(1, , pcp, val)
 #define raw_cpu_add_return_2(pcp, val)		percpu_add_return_op(2, , pcp, val)
 #define raw_cpu_add_return_4(pcp, val)		percpu_add_return_op(4, , pcp, val)
-#define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, , pcp, oval, nval)
-#define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, , pcp, oval, nval)
-#define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, , pcp, oval, nval)
+#define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, , pcp, oval, nval, "")
+#define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, , pcp, oval, nval, "")
+#define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, , pcp, oval, nval, "")
 
 #define this_cpu_add_return_1(pcp, val)		percpu_add_return_op(1, volatile, pcp, val)
 #define this_cpu_add_return_2(pcp, val)		percpu_add_return_op(2, volatile, pcp, val)
 #define this_cpu_add_return_4(pcp, val)		percpu_add_return_op(4, volatile, pcp, val)
-#define this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, volatile, pcp, oval, nval)
-#define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval)
-#define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval)
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval)	percpu_cmpxchg_op(1, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval, "")
+
+#define this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, volatile, pcp, oval, nval, LOCK_PREFIX)
+#define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval, LOCK_PREFIX)
+#define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval, LOCK_PREFIX)
 
 #ifdef CONFIG_X86_CMPXCHG64
 #define percpu_cmpxchg8b_double(pcp1, pcp2, o1, o2, n1, n2)		\
@@ -319,16 +323,17 @@ do {									\
 #define raw_cpu_or_8(pcp, val)			percpu_to_op(8, , "or", (pcp), val)
 #define raw_cpu_add_return_8(pcp, val)		percpu_add_return_op(8, , pcp, val)
 #define raw_cpu_xchg_8(pcp, nval)		raw_percpu_xchg_op(pcp, nval)
-#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, , pcp, oval, nval)
+#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, , pcp, oval, nval, "")
 
-#define this_cpu_read_8(pcp)			percpu_from_op(8, volatile, "mov", pcp)
-#define this_cpu_write_8(pcp, val)		percpu_to_op(8, volatile, "mov", (pcp), val)
-#define this_cpu_add_8(pcp, val)		percpu_add_op(8, volatile, (pcp), val)
-#define this_cpu_and_8(pcp, val)		percpu_to_op(8, volatile, "and", (pcp), val)
-#define this_cpu_or_8(pcp, val)			percpu_to_op(8, volatile, "or", (pcp), val)
-#define this_cpu_add_return_8(pcp, val)		percpu_add_return_op(8, volatile, pcp, val)
-#define this_cpu_xchg_8(pcp, nval)		percpu_xchg_op(8, volatile, pcp, nval)
-#define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, volatile, pcp, oval, nval)
+#define this_cpu_read_8(pcp)				percpu_from_op(8, volatile, "mov", pcp)
+#define this_cpu_write_8(pcp, val)			percpu_to_op(8, volatile, "mov", (pcp), val)
+#define this_cpu_add_8(pcp, val)			percpu_add_op(8, volatile, (pcp), val)
+#define this_cpu_and_8(pcp, val)			percpu_to_op(8, volatile, "and", (pcp), val)
+#define this_cpu_or_8(pcp, val)				percpu_to_op(8, volatile, "or", (pcp), val)
+#define this_cpu_add_return_8(pcp, val)			percpu_add_return_op(8, volatile, pcp, val)
+#define this_cpu_xchg_8(pcp, nval)			percpu_xchg_op(8, volatile, pcp, nval)
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval)	percpu_cmpxchg_op(8, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_8(pcp, oval, nval)		percpu_cmpxchg_op(8, volatile, pcp, oval, nval, LOCK_PREFIX)
 
 /*
  * Pretty complex macro to generate cmpxchg16 instruction.  The instruction




* [PATCH v7 06/13] add this_cpu_cmpxchg_local and asm-generic definitions
  2023-03-20 18:03 [PATCH v7 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (4 preceding siblings ...)
  2023-03-20 18:03 ` [PATCH v7 05/13] this_cpu_cmpxchg: x86: " Marcelo Tosatti
@ 2023-03-20 18:03 ` Marcelo Tosatti
  2023-03-20 18:03 ` [PATCH v7 07/13] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-20 18:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer counts from the
per-CPU counters to the global counters remotely. For this,
an atomic this_cpu_cmpxchg is necessary.

Add this_cpu_cmpxchg_local_ helpers to asm-generic/percpu.h.
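
On architectures without their own helpers, the new _local_ definitions
fall back to this_cpu_generic_cmpxchg, which is (roughly) the
IRQ-protected sequence below, so the _local and non-_local variants
behave identically there. That is fine because the remote-flushing
vmstat code is only compiled on CONFIG_HAVE_CMPXCHG_LOCAL
architectures (see patches 08 and 11):

        #define this_cpu_generic_cmpxchg(pcp, oval, nval)               \
        ({                                                              \
                typeof(pcp) __ret;                                      \
                unsigned long __flags;                                  \
                raw_local_irq_save(__flags);                            \
                __ret = raw_cpu_generic_cmpxchg(pcp, oval, nval);       \
                raw_local_irq_restore(__flags);                         \
                __ret;                                                  \
        })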

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/include/asm-generic/percpu.h
===================================================================
--- linux-vmstat-remote.orig/include/asm-generic/percpu.h
+++ linux-vmstat-remote/include/asm-generic/percpu.h
@@ -424,6 +424,23 @@ do {									\
 	this_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
 
+#ifndef this_cpu_cmpxchg_local_1
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_2
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_4
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_8
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+
 #ifndef this_cpu_cmpxchg_double_1
 #define this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \
 	this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
Index: linux-vmstat-remote/include/linux/percpu-defs.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/percpu-defs.h
+++ linux-vmstat-remote/include/linux/percpu-defs.h
@@ -513,6 +513,8 @@ do {									\
 #define this_cpu_xchg(pcp, nval)	__pcpu_size_call_return2(this_cpu_xchg_, pcp, nval)
 #define this_cpu_cmpxchg(pcp, oval, nval) \
 	__pcpu_size_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
+#define this_cpu_cmpxchg_local(pcp, oval, nval) \
+	__pcpu_size_call_return2(this_cpu_cmpxchg_local_, pcp, oval, nval)
 #define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
 	__pcpu_double_call_return_bool(this_cpu_cmpxchg_double_, pcp1, pcp2, oval1, oval2, nval1, nval2)
 




* [PATCH v7 07/13] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local
  2023-03-20 18:03 [PATCH v7 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (5 preceding siblings ...)
  2023-03-20 18:03 ` [PATCH v7 06/13] add this_cpu_cmpxchg_local and asm-generic definitions Marcelo Tosatti
@ 2023-03-20 18:03 ` Marcelo Tosatti
  2023-03-20 18:03 ` [PATCH v7 08/13] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-20 18:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Peter Xu, Marcelo Tosatti

this_cpu_cmpxchg was changed to an atomic version, which
can be more costly than the non-atomic version.

Switch users of this_cpu_cmpxchg to this_cpu_cmpxchg_local
(which preserves the previous, non-atomic this_cpu_cmpxchg behaviour).

Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/kernel/fork.c
===================================================================
--- linux-vmstat-remote.orig/kernel/fork.c
+++ linux-vmstat-remote/kernel/fork.c
@@ -203,7 +203,7 @@ static bool try_release_thread_stack_to_
 	unsigned int i;
 
 	for (i = 0; i < NR_CACHED_STACKS; i++) {
-		if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm) != NULL)
+		if (this_cpu_cmpxchg_local(cached_stacks[i], NULL, vm) != NULL)
 			continue;
 		return true;
 	}
Index: linux-vmstat-remote/kernel/scs.c
===================================================================
--- linux-vmstat-remote.orig/kernel/scs.c
+++ linux-vmstat-remote/kernel/scs.c
@@ -83,7 +83,7 @@ void scs_free(void *s)
 	 */
 
 	for (i = 0; i < NR_CACHED_SCS; i++)
-		if (this_cpu_cmpxchg(scs_cache[i], 0, s) == NULL)
+		if (this_cpu_cmpxchg_local(scs_cache[i], 0, s) == NULL)
 			return;
 
 	kasan_unpoison_vmalloc(s, SCS_SIZE, KASAN_VMALLOC_PROT_NORMAL);




* [PATCH v7 08/13] mm/vmstat: switch counter modification to cmpxchg
  2023-03-20 18:03 [PATCH v7 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (6 preceding siblings ...)
  2023-03-20 18:03 ` [PATCH v7 07/13] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
@ 2023-03-20 18:03 ` Marcelo Tosatti
  2023-03-20 18:03 ` [PATCH v7 09/13] vmstat: switch per-cpu vmstat counters to 32-bits Marcelo Tosatti
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-20 18:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

In preparation for switching vmstat shepherd to flush
per-CPU counters remotely, switch the __{mod,inc,dec} functions that
modify the counters to use cmpxchg.

To facilitate review, the functions are ordered in the file as follows:

__{mod,inc,dec}_{zone,node}_page_state
#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
{mod,inc,dec}_{zone,node}_page_state
#else
{mod,inc,dec}_{zone,node}_page_state
#endif

This patch defines the __ versions for the
CONFIG_HAVE_CMPXCHG_LOCAL case to be their non-"__" counterparts:

#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
{mod,inc,dec}_{zone,node}_page_state
__{mod,inc,dec}_{zone,node}_page_state = {mod,inc,dec}_{zone,node}_page_state
#else
{mod,inc,dec}_{zone,node}_page_state
__{mod,inc,dec}_{zone,node}_page_state
#endif

To measure the performance difference, the page allocator microbenchmark
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
was used with loops=1000000, on an Intel Core i7-11850H @ 2.50GHz.

For the single_page_alloc_free test, which does

        /** Loop to measure **/
        for (i = 0; i < rec->loops; i++) {
                my_page = alloc_page(gfp_mask);
                if (unlikely(my_page == NULL))
                        return 0;
                __free_page(my_page);
        }

Unit is cycles.

Vanilla			Patched		Diff
115.25			117		1.4%

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -334,6 +334,188 @@ void set_pgdat_percpu_threshold(pg_data_
 	}
 }
 
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/*
+ * If we have cmpxchg_local support then we do not need to incur the overhead
+ * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
+ *
+ * mod_state() modifies the zone counter state through atomic per cpu
+ * operations.
+ *
+ * Overstep mode specifies how overstep should handled:
+ *     0       No overstepping
+ *     1       Overstepping half of threshold
+ *     -1      Overstepping minus half of threshold
+ */
+static inline void mod_zone_state(struct zone *zone, enum zone_stat_item item,
+				  long delta, int overstep_mode)
+{
+	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
+	s8 __percpu *p = pcp->vm_stat_diff + item;
+	long o, n, t, z;
+
+	do {
+		z = 0;  /* overflow to zone counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a zone.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (abs(n) > t) {
+			int os = overstep_mode * (t >> 1);
+
+			/* Overflow must be added to zone counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		zone_page_state_add(z, zone, item);
+}
+
+void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+			 long delta)
+{
+	mod_zone_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_zone_page_state);
+
+void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+			   long delta)
+{
+	mod_zone_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_zone_page_state);
+
+void inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_zone_page_state);
+
+void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(__inc_zone_page_state);
+
+void dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_zone_page_state);
+
+void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(__dec_zone_page_state);
+
+static inline void mod_node_state(struct pglist_data *pgdat,
+				  enum node_stat_item item,
+				  int delta, int overstep_mode)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	long o, n, t, z;
+
+	if (vmstat_item_in_bytes(item)) {
+		/*
+		 * Only cgroups use subpage accounting right now; at
+		 * the global level, these items still change in
+		 * multiples of whole pages. Store them as pages
+		 * internally to keep the per-cpu counters compact.
+		 */
+		VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
+		delta >>= PAGE_SHIFT;
+	}
+
+	do {
+		z = 0;  /* overflow to node counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a node.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (abs(n) > t) {
+			int os = overstep_mode * (t >> 1);
+
+			/* Overflow must be added to node counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		node_page_state_add(z, pgdat, item);
+}
+
+void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_node_page_state);
+
+void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_node_page_state);
+
+void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	mod_node_state(pgdat, item, 1, 1);
+}
+
+void inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_node_page_state);
+
+void __inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(__inc_node_page_state);
+
+void dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_node_page_state);
+
+void __dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(__dec_node_page_state);
+#else
 /*
  * For use when we know that interrupts are disabled,
  * or when we know that preemption is disabled and that
@@ -541,149 +723,6 @@ void __dec_node_page_state(struct page *
 }
 EXPORT_SYMBOL(__dec_node_page_state);
 
-#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
-/*
- * If we have cmpxchg_local support then we do not need to incur the overhead
- * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
- *
- * mod_state() modifies the zone counter state through atomic per cpu
- * operations.
- *
- * Overstep mode specifies how overstep should handled:
- *     0       No overstepping
- *     1       Overstepping half of threshold
- *     -1      Overstepping minus half of threshold
-*/
-static inline void mod_zone_state(struct zone *zone,
-       enum zone_stat_item item, long delta, int overstep_mode)
-{
-	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	long o, n, t, z;
-
-	do {
-		z = 0;  /* overflow to zone counters */
-
-		/*
-		 * The fetching of the stat_threshold is racy. We may apply
-		 * a counter threshold to the wrong the cpu if we get
-		 * rescheduled while executing here. However, the next
-		 * counter update will apply the threshold again and
-		 * therefore bring the counter under the threshold again.
-		 *
-		 * Most of the time the thresholds are the same anyways
-		 * for all cpus in a zone.
-		 */
-		t = this_cpu_read(pcp->stat_threshold);
-
-		o = this_cpu_read(*p);
-		n = delta + o;
-
-		if (abs(n) > t) {
-			int os = overstep_mode * (t >> 1) ;
-
-			/* Overflow must be added to zone counters */
-			z = n + os;
-			n = -os;
-		}
-	} while (this_cpu_cmpxchg(*p, o, n) != o);
-
-	if (z)
-		zone_page_state_add(z, zone, item);
-}
-
-void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
-			 long delta)
-{
-	mod_zone_state(zone, item, delta, 0);
-}
-EXPORT_SYMBOL(mod_zone_page_state);
-
-void inc_zone_page_state(struct page *page, enum zone_stat_item item)
-{
-	mod_zone_state(page_zone(page), item, 1, 1);
-}
-EXPORT_SYMBOL(inc_zone_page_state);
-
-void dec_zone_page_state(struct page *page, enum zone_stat_item item)
-{
-	mod_zone_state(page_zone(page), item, -1, -1);
-}
-EXPORT_SYMBOL(dec_zone_page_state);
-
-static inline void mod_node_state(struct pglist_data *pgdat,
-       enum node_stat_item item, int delta, int overstep_mode)
-{
-	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	long o, n, t, z;
-
-	if (vmstat_item_in_bytes(item)) {
-		/*
-		 * Only cgroups use subpage accounting right now; at
-		 * the global level, these items still change in
-		 * multiples of whole pages. Store them as pages
-		 * internally to keep the per-cpu counters compact.
-		 */
-		VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
-		delta >>= PAGE_SHIFT;
-	}
-
-	do {
-		z = 0;  /* overflow to node counters */
-
-		/*
-		 * The fetching of the stat_threshold is racy. We may apply
-		 * a counter threshold to the wrong the cpu if we get
-		 * rescheduled while executing here. However, the next
-		 * counter update will apply the threshold again and
-		 * therefore bring the counter under the threshold again.
-		 *
-		 * Most of the time the thresholds are the same anyways
-		 * for all cpus in a node.
-		 */
-		t = this_cpu_read(pcp->stat_threshold);
-
-		o = this_cpu_read(*p);
-		n = delta + o;
-
-		if (abs(n) > t) {
-			int os = overstep_mode * (t >> 1) ;
-
-			/* Overflow must be added to node counters */
-			z = n + os;
-			n = -os;
-		}
-	} while (this_cpu_cmpxchg(*p, o, n) != o);
-
-	if (z)
-		node_page_state_add(z, pgdat, item);
-}
-
-void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
-					long delta)
-{
-	mod_node_state(pgdat, item, delta, 0);
-}
-EXPORT_SYMBOL(mod_node_page_state);
-
-void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
-{
-	mod_node_state(pgdat, item, 1, 1);
-}
-
-void inc_node_page_state(struct page *page, enum node_stat_item item)
-{
-	mod_node_state(page_pgdat(page), item, 1, 1);
-}
-EXPORT_SYMBOL(inc_node_page_state);
-
-void dec_node_page_state(struct page *page, enum node_stat_item item)
-{
-	mod_node_state(page_pgdat(page), item, -1, -1);
-}
-EXPORT_SYMBOL(dec_node_page_state);
-#else
 /*
  * Use interrupt disable to serialize counter updates
  */
Index: linux-vmstat-remote/mm/page_alloc.c
===================================================================
--- linux-vmstat-remote.orig/mm/page_alloc.c
+++ linux-vmstat-remote/mm/page_alloc.c
@@ -8628,9 +8628,6 @@ static int page_alloc_cpu_dead(unsigned
 	/*
 	 * Zero the differential counters of the dead processor
 	 * so that the vm statistics are consistent.
-	 *
-	 * This is only okay since the processor is dead and cannot
-	 * race with what we are doing.
 	 */
 	cpu_vm_stats_fold(cpu);
 




* [PATCH v7 09/13] vmstat: switch per-cpu vmstat counters to 32-bits
  2023-03-20 18:03 [PATCH v7 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (7 preceding siblings ...)
  2023-03-20 18:03 ` [PATCH v7 08/13] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
@ 2023-03-20 18:03 ` Marcelo Tosatti
  2023-03-20 18:03 ` [PATCH v7 10/13] mm/vmstat: use xchg in cpu_vm_stats_fold Marcelo Tosatti
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-20 18:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

Some architectures only provide xchg/cmpxchg in 32/64-bit quantities.

Since the next patch is about to use xchg on per-CPU vmstat counters,
switch them to s32.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/include/linux/mmzone.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/mmzone.h
+++ linux-vmstat-remote/include/linux/mmzone.h
@@ -689,8 +689,8 @@ struct per_cpu_pages {
 
 struct per_cpu_zonestat {
 #ifdef CONFIG_SMP
-	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
-	s8 stat_threshold;
+	s32 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
+	s32 stat_threshold;
 #endif
 #ifdef CONFIG_NUMA
 	/*
@@ -703,8 +703,8 @@ struct per_cpu_zonestat {
 };
 
 struct per_cpu_nodestat {
-	s8 stat_threshold;
-	s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
+	s32 stat_threshold;
+	s32 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
 };
 
 #endif /* !__GENERATING_BOUNDS.H */
Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -351,7 +351,7 @@ static inline void mod_zone_state(struct
 				  long delta, int overstep_mode)
 {
 	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
+	s32 __percpu *p = pcp->vm_stat_diff + item;
 	long o, n, t, z;
 
 	do {
@@ -428,7 +428,7 @@ static inline void mod_node_state(struct
 				  int delta, int overstep_mode)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long o, n, t, z;
 
 	if (vmstat_item_in_bytes(item)) {
@@ -525,7 +525,7 @@ void __mod_zone_page_state(struct zone *
 			   long delta)
 {
 	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
+	s32 __percpu *p = pcp->vm_stat_diff + item;
 	long x;
 	long t;
 
@@ -556,7 +556,7 @@ void __mod_node_page_state(struct pglist
 				long delta)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long x;
 	long t;
 
@@ -614,8 +614,8 @@ EXPORT_SYMBOL(__mod_node_page_state);
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_stat_diff + item;
+	s32 v, t;
 
 	/* See __mod_node_page_state */
 	preempt_disable_nested();
@@ -623,7 +623,7 @@ void __inc_zone_state(struct zone *zone,
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		zone_page_state_add(v + overstep, zone, item);
 		__this_cpu_write(*p, -overstep);
@@ -635,8 +635,8 @@ void __inc_zone_state(struct zone *zone,
 void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
 
@@ -646,7 +646,7 @@ void __inc_node_state(struct pglist_data
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v + overstep, pgdat, item);
 		__this_cpu_write(*p, -overstep);
@@ -670,8 +670,8 @@ EXPORT_SYMBOL(__inc_node_page_state);
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_stat_diff + item;
+	s32 v, t;
 
 	/* See __mod_node_page_state */
 	preempt_disable_nested();
@@ -679,7 +679,7 @@ void __dec_zone_state(struct zone *zone,
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		zone_page_state_add(v - overstep, zone, item);
 		__this_cpu_write(*p, overstep);
@@ -691,8 +691,8 @@ void __dec_zone_state(struct zone *zone,
 void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
 
@@ -702,7 +702,7 @@ void __dec_node_state(struct pglist_data
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v - overstep, pgdat, item);
 		__this_cpu_write(*p, overstep);




* [PATCH v7 10/13] mm/vmstat: use xchg in cpu_vm_stats_fold
  2023-03-20 18:03 [PATCH v7 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (8 preceding siblings ...)
  2023-03-20 18:03 ` [PATCH v7 09/13] vmstat: switch per-cpu vmstat counters to 32-bits Marcelo Tosatti
@ 2023-03-20 18:03 ` Marcelo Tosatti
  2023-03-20 18:03 ` [PATCH v7 11/13] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-20 18:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

In preparation for switching the vmstat shepherd to flush
per-CPU counters remotely, use xchg instead of a
pair of read/write instructions.
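
To illustrate the window being closed (hypothetical interleaving, not
taken from the patch):

        /*
         * Previous read/write pair:
         *
         *      v = pzstats->vm_stat_diff[i];
         *              <-- owning CPU's cmpxchg updates vm_stat_diff[i]
         *      pzstats->vm_stat_diff[i] = 0;   <-- that update is lost
         *
         * xchg() makes the read-and-clear a single atomic operation: a
         * concurrent update is either observed in v or left in place.
         */
        v = xchg(&pzstats->vm_stat_diff[i], 0);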

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -924,7 +924,7 @@ static int refresh_cpu_vm_stats(bool do_
 }
 
 /*
- * Fold the data for an offline cpu into the global array.
+ * Fold the data for a cpu into the global array.
  * There cannot be any access by the offline cpu and therefore
  * synchronization is simplified.
  */
@@ -945,8 +945,7 @@ void cpu_vm_stats_fold(int cpu)
 			if (pzstats->vm_stat_diff[i]) {
 				int v;
 
-				v = pzstats->vm_stat_diff[i];
-				pzstats->vm_stat_diff[i] = 0;
+				v = xchg(&pzstats->vm_stat_diff[i], 0);
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
 			}
@@ -956,8 +955,7 @@ void cpu_vm_stats_fold(int cpu)
 			if (pzstats->vm_numa_event[i]) {
 				unsigned long v;
 
-				v = pzstats->vm_numa_event[i];
-				pzstats->vm_numa_event[i] = 0;
+				v = xchg(&pzstats->vm_numa_event[i], 0);
 				zone_numa_event_add(v, zone, i);
 			}
 		}
@@ -973,8 +971,7 @@ void cpu_vm_stats_fold(int cpu)
 			if (p->vm_node_stat_diff[i]) {
 				int v;
 
-				v = p->vm_node_stat_diff[i];
-				p->vm_node_stat_diff[i] = 0;
+				v = xchg(&p->vm_node_stat_diff[i], 0);
 				atomic_long_add(v, &pgdat->vm_stat[i]);
 				global_node_diff[i] += v;
 			}




* [PATCH v7 11/13] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely
  2023-03-20 18:03 [PATCH v7 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (9 preceding siblings ...)
  2023-03-20 18:03 ` [PATCH v7 10/13] mm/vmstat: use xchg in cpu_vm_stats_fold Marcelo Tosatti
@ 2023-03-20 18:03 ` Marcelo Tosatti
  2023-03-20 18:03 ` [PATCH v7 12/13] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-20 18:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

With a task that busy loops on a given CPU,
the kworker interruption to execute vmstat_update
is undesired and may exceed latency thresholds
for certain applications.

Performance details for the kworker interruption:

oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...

The example above shows an additional 7us for the

        oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold for certain applications.

To fix this, now that the counters are modified via cmpxchg
both CPU-locally (via the account functions) and remotely (via
cpu_vm_stats_fold), it is possible to switch vmstat_shepherd to
perform the per-CPU vmstat folding remotely.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -2043,6 +2043,23 @@ static void vmstat_shepherd(struct work_
 
 static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);
 
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
+static void vmstat_shepherd(struct work_struct *w)
+{
+	int cpu;
+
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		cpu_vm_stats_fold(cpu);
+		cond_resched();
+	}
+	cpus_read_unlock();
+
+	schedule_delayed_work(&shepherd,
+		round_jiffies_relative(sysctl_stat_interval));
+}
+#else
 static void vmstat_shepherd(struct work_struct *w)
 {
 	int cpu;
@@ -2062,6 +2079,7 @@ static void vmstat_shepherd(struct work_
 	schedule_delayed_work(&shepherd,
 		round_jiffies_relative(sysctl_stat_interval));
 }
+#endif
 
 static void __init start_shepherd_timer(void)
 {




* [PATCH v7 12/13] mm/vmstat: refresh stats remotely instead of via work item
  2023-03-20 18:03 [PATCH v7 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (10 preceding siblings ...)
  2023-03-20 18:03 ` [PATCH v7 11/13] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
@ 2023-03-20 18:03 ` Marcelo Tosatti
  2023-03-20 18:03 ` [PATCH v7 13/13] vmstat: add pcp remote node draining via cpu_vm_stats_fold Marcelo Tosatti
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-20 18:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

Refresh per-CPU stats remotely, instead of queueing
work items, for the stat_refresh procfs method.

This fixes a sosreport hang (sosreport uses vmstat_refresh)
in the presence of a spinning SCHED_FIFO process.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -1901,11 +1901,20 @@ static DEFINE_PER_CPU(struct delayed_wor
 int sysctl_stat_interval __read_mostly = HZ;
 
 #ifdef CONFIG_PROC_FS
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+static int refresh_all_vm_stats(void);
+#else
 static void refresh_vm_stats(struct work_struct *work)
 {
 	refresh_cpu_vm_stats(true);
 }
 
+static int refresh_all_vm_stats(void)
+{
+	return schedule_on_each_cpu(refresh_vm_stats);
+}
+#endif
+
 int vmstat_refresh(struct ctl_table *table, int write,
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
@@ -1925,7 +1934,7 @@ int vmstat_refresh(struct ctl_table *tab
 	 * transiently negative values, report an error here if any of
 	 * the stats is negative, so we know to go looking for imbalance.
 	 */
-	err = schedule_on_each_cpu(refresh_vm_stats);
+	err = refresh_all_vm_stats();
 	if (err)
 		return err;
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
@@ -2045,7 +2054,7 @@ static DECLARE_DEFERRABLE_WORK(shepherd,
 
 #ifdef CONFIG_HAVE_CMPXCHG_LOCAL
 /* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
-static void vmstat_shepherd(struct work_struct *w)
+static int refresh_all_vm_stats(void)
 {
 	int cpu;
 
@@ -2055,7 +2064,12 @@ static void vmstat_shepherd(struct work_
 		cond_resched();
 	}
 	cpus_read_unlock();
+	return 0;
+}
 
+static void vmstat_shepherd(struct work_struct *w)
+{
+	refresh_all_vm_stats();
 	schedule_delayed_work(&shepherd,
 		round_jiffies_relative(sysctl_stat_interval));
 }




* [PATCH v7 13/13] vmstat: add pcp remote node draining via cpu_vm_stats_fold
  2023-03-20 18:03 [PATCH v7 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (11 preceding siblings ...)
  2023-03-20 18:03 ` [PATCH v7 12/13] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
@ 2023-03-20 18:03 ` Marcelo Tosatti
  2023-03-20 20:43   ` Tim Chen
  2023-03-20 18:25 ` [PATCH v7 00/13] fold per-CPU vmstats remotely Michal Hocko
  2023-04-18 22:02 ` Andrew Morton
  14 siblings, 1 reply; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-20 18:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

Large NUMA systems might have significant portions
of system memory trapped in pcp queues. The number of pcps is
determined by the number of processors and nodes in a system. A system
with 4 processors and 2 nodes has 8 pcps, which is okay. But a system
with 1024 processors and 512 nodes has 512k pcps, with a high potential
for a large amount of memory being caught in them.

Enable remote node draining for the CONFIG_HAVE_CMPXCHG_LOCAL case,
where vmstat_shepherd will perform the aging and draining via
cpu_vm_stats_fold.
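
For scale, some illustrative arithmetic (the per-pcp page count below
is an assumption, not a measurement):

        1024 cpus * 512 nodes            = 524,288 pcps
        524,288 pcps * 100 pages * 4KiB ~= 200 GiB trapped in pcp queues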

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -928,7 +928,7 @@ static int refresh_cpu_vm_stats(bool do_
  * There cannot be any access by the offline cpu and therefore
  * synchronization is simplified.
  */
-void cpu_vm_stats_fold(int cpu)
+void cpu_vm_stats_fold(int cpu, bool do_pagesets)
 {
 	struct pglist_data *pgdat;
 	struct zone *zone;
@@ -938,6 +938,9 @@ void cpu_vm_stats_fold(int cpu)
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_zonestat *pzstats;
+#ifdef CONFIG_NUMA
+		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+#endif
 
 		pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
 
@@ -948,6 +951,11 @@ void cpu_vm_stats_fold(int cpu)
 				v = xchg(&pzstats->vm_stat_diff[i], 0);
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
+#ifdef CONFIG_NUMA
+				/* 3 seconds idle till flush */
+				if (do_pagesets)
+					pcp->expire = 3;
+#endif
 			}
 		}
 #ifdef CONFIG_NUMA
@@ -959,6 +967,38 @@ void cpu_vm_stats_fold(int cpu)
 				zone_numa_event_add(v, zone, i);
 			}
 		}
+
+		if (do_pagesets) {
+			cond_resched();
+			/*
+			 * Deal with draining the remote pageset of a
+			 * processor
+			 *
+			 * Check if there are pages remaining in this pageset
+			 * if not then there is nothing to expire.
+			 */
+			if (!pcp->expire || !pcp->count)
+				continue;
+
+			/*
+			 * We never drain zones local to this processor.
+			 */
+			if (zone_to_nid(zone) == cpu_to_node(cpu)) {
+				pcp->expire = 0;
+				continue;
+			}
+
+			WARN_ON(pcp->expire < 0);
+			/*
+			 * pcp->expire is only accessed from vmstat_shepherd context,
+			 * therefore no locking is required.
+			 */
+			if (--pcp->expire)
+				continue;
+
+			if (pcp->count)
+				drain_zone_pages(zone, pcp);
+		}
 #endif
 	}
 
@@ -2060,7 +2100,7 @@ static int refresh_all_vm_stats(void)
 
 	cpus_read_lock();
 	for_each_online_cpu(cpu) {
-		cpu_vm_stats_fold(cpu);
+		cpu_vm_stats_fold(cpu, true);
 		cond_resched();
 	}
 	cpus_read_unlock();
Index: linux-vmstat-remote/include/linux/vmstat.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/vmstat.h
+++ linux-vmstat-remote/include/linux/vmstat.h
@@ -291,7 +291,7 @@ extern void __dec_zone_state(struct zone
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
-void cpu_vm_stats_fold(int cpu);
+void cpu_vm_stats_fold(int cpu, bool do_pagesets);
 void refresh_zone_stat_thresholds(void);
 
 struct ctl_table;
Index: linux-vmstat-remote/mm/page_alloc.c
===================================================================
--- linux-vmstat-remote.orig/mm/page_alloc.c
+++ linux-vmstat-remote/mm/page_alloc.c
@@ -8629,7 +8629,7 @@ static int page_alloc_cpu_dead(unsigned
 	 * Zero the differential counters of the dead processor
 	 * so that the vm statistics are consistent.
 	 */
-	cpu_vm_stats_fold(cpu);
+	cpu_vm_stats_fold(cpu, false);
 
 	for_each_populated_zone(zone)
 		zone_pcp_update(zone, 0);



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot
  2023-03-20 18:03 ` [PATCH v7 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot Marcelo Tosatti
@ 2023-03-20 18:21   ` Michal Hocko
  2023-03-20 18:32     ` Marcelo Tosatti
  0 siblings, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2023-03-20 18:21 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Mon 20-03-23 15:03:33, Marcelo Tosatti wrote:
> A customer provided evidence indicating that a process
> was stalled in direct reclaim:
> 
>  - The process was trapped in throttle_direct_reclaim().
>    The function wait_event_killable() was called to wait condition     
>    allow_direct_reclaim(pgdat) for current node to be true.     
>    The allow_direct_reclaim(pgdat) examined the number of free pages     
>    on the node by zone_page_state() which just returns value in     
>    zone->vm_stat[NR_FREE_PAGES].     
>                                                 
>  - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.            
>    However, the freelist on this node was not empty.     
>                            
>  - This inconsistent of vmstat value was caused by percpu vmstat on     
>    nohz_full cpus. Every increment/decrement of vmstat is performed     
>    on percpu vmstat counter at first, then pooled diffs are cumulated     
>    to the zone's vmstat counter in timely manner. However, on nohz_full     
>    cpus (in case of this customer's system, 48 of 52 cpus) these pooled     
>    diffs were not cumulated once the cpu had no event on it so that     
>    the cpu started sleeping infinitely.                       
>    I checked percpu vmstat and found there were total 69 counts not         
>    cumulated to the zone's vmstat counter yet.     
>                                          
>  - In this situation, kswapd did not help the trapped process.     
>    In pgdat_balanced(), zone_wakermark_ok_safe() examined the number     
>    of free pages on the node by zone_page_state_snapshot() which     
>    checks pending counts on percpu vmstat.     
>    Therefore kswapd could know there were 69 free pages correctly.     
>    Since zone->_watermark = {8, 20, 32}, kswapd did not work because     
>    69 was greater than 32 as high watermark.     
> 
> Change allow_direct_reclaim to use zone_page_state_snapshot, which
> allows a more precise version of the vmstat counters to be used.
> 
> allow_direct_reclaim will only be called from try_to_free_pages,
> which is not a hot path.

Have you managed to test this patch to confirm it addresses the above
issue? It should, but better to double-check that.

> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

The patch makes sense regardless but a note about testing should be
added.

Acked-by: Michal Hocko <mhocko@suse.com>

> 
> ---
> 
> Index: linux-vmstat-remote/mm/vmscan.c
> ===================================================================
> --- linux-vmstat-remote.orig/mm/vmscan.c
> +++ linux-vmstat-remote/mm/vmscan.c
> @@ -6861,7 +6861,7 @@ static bool allow_direct_reclaim(pg_data
>  			continue;
>  
>  		pfmemalloc_reserve += min_wmark_pages(zone);
> -		free_pages += zone_page_state(zone, NR_FREE_PAGES);
> +		free_pages += zone_page_state_snapshot(zone, NR_FREE_PAGES);
>  	}
>  
>  	/* If there are no reserves (unexpected config) then do not throttle */
> 
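
For reference, the difference between the two accessors: zone_page_state()
just reads zone->vm_stat[item], while zone_page_state_snapshot() also folds
in the pending per-CPU diffs. Roughly (a simplified sketch of the upstream
include/linux/vmstat.h helper):

	static inline unsigned long zone_page_state_snapshot(struct zone *zone,
					enum zone_stat_item item)
	{
		long x = atomic_long_read(&zone->vm_stat[item]);
	#ifdef CONFIG_SMP
		int cpu;

		/* Add each CPU's not-yet-folded delta for this item. */
		for_each_online_cpu(cpu)
			x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_stat_diff[item];

		if (x < 0)	/* pending diffs can transiently go negative */
			x = 0;
	#endif
		return x;
	}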

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-03-20 18:03 [PATCH v7 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (12 preceding siblings ...)
  2023-03-20 18:03 ` [PATCH v7 13/13] vmstat: add pcp remote node draining via cpu_vm_stats_fold Marcelo Tosatti
@ 2023-03-20 18:25 ` Michal Hocko
  2023-03-20 19:07   ` Marcelo Tosatti
  2023-04-18 22:02 ` Andrew Morton
  14 siblings, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2023-03-20 18:25 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote:
> This patch series addresses the following two problems:
> 
> 1. A customer provided evidence indicating that a process
>    was stalled in direct reclaim:
> 
This is addressed by the trivial patch 1.

[...]
>  2. With a task that busy loops on a given CPU,
>     the kworker interruption to execute vmstat_update
>     is undesired and may exceed latency thresholds
>     for certain applications.

Yes it can but why does that matter?

> By having vmstat_shepherd flush the per-CPU counters to the
> global counters from remote CPUs.
> 
> This is done using cmpxchg to manipulate the counters,
> both CPU locally (via the account functions),
> and remotely (via cpu_vm_stats_fold).
> 
> Thanks to Aaron Tomlin for diagnosing issue 1 and writing
> the initial patch series.
> 
> 
> Performance details for the kworker interruption:
> 
> oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
>  
> The example above shows an additional 7us for the
> 
>         oslat -> kworker -> oslat
> 
> switches. In the case of a virtualized CPU, and the vmstat_update
> interruption in the host (of a qemu-kvm vcpu), the latency penalty
> observed in the guest is higher than 50us, violating the acceptable
> latency threshold for certain applications.

I do not think we have ever promised any specific latency guarantees
for vmstat. These statistics have been mostly used for debugging
purposes AFAIK. I am not aware of any specific user space use case that
would be latency sensitive. Your changelog doesn't go into details there
either.

[...]
>  mm/vmstat.c                         |  440 +++++++++++++++++++++++++++++++++++++++++++++++------------------------------

This requires much more detailed story why we really need that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot
  2023-03-20 18:21   ` Michal Hocko
@ 2023-03-20 18:32     ` Marcelo Tosatti
  2023-03-22 10:03       ` Michal Hocko
  0 siblings, 1 reply; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-20 18:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Mon, Mar 20, 2023 at 07:21:04PM +0100, Michal Hocko wrote:
> On Mon 20-03-23 15:03:33, Marcelo Tosatti wrote:
> > A customer provided evidence indicating that a process
> > was stalled in direct reclaim:
> > 
> >  - The process was trapped in throttle_direct_reclaim().
> >    The function wait_event_killable() was called to wait condition     
> >    allow_direct_reclaim(pgdat) for current node to be true.     
> >    The allow_direct_reclaim(pgdat) examined the number of free pages     
> >    on the node by zone_page_state() which just returns value in     
> >    zone->vm_stat[NR_FREE_PAGES].     
> >                                                 
> >  - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.            
> >    However, the freelist on this node was not empty.     
> >                            
> >  - This inconsistent of vmstat value was caused by percpu vmstat on     
> >    nohz_full cpus. Every increment/decrement of vmstat is performed     
> >    on percpu vmstat counter at first, then pooled diffs are cumulated     
> >    to the zone's vmstat counter in timely manner. However, on nohz_full     
> >    cpus (in case of this customer's system, 48 of 52 cpus) these pooled     
> >    diffs were not cumulated once the cpu had no event on it so that     
> >    the cpu started sleeping infinitely.                       
> >    I checked percpu vmstat and found there were total 69 counts not         
> >    cumulated to the zone's vmstat counter yet.     
> >                                          
> >  - In this situation, kswapd did not help the trapped process.     
> >    In pgdat_balanced(), zone_wakermark_ok_safe() examined the number     
> >    of free pages on the node by zone_page_state_snapshot() which     
> >    checks pending counts on percpu vmstat.     
> >    Therefore kswapd could know there were 69 free pages correctly.     
> >    Since zone->_watermark = {8, 20, 32}, kswapd did not work because     
> >    69 was greater than 32 as high watermark.     
> > 
> > Change allow_direct_reclaim to use zone_page_state_snapshot, which
> > allows a more precise version of the vmstat counters to be used.
> > 
> > allow_direct_reclaim will only be called from try_to_free_pages,
> > which is not a hot path.
> 
> Have you managed to test this patch to confirm it addresses the above
> issue? It should but better double check that.
> 
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> The patch makes sense regardless but a note about testing should be
> added.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Michal,

The patch has not been tested in the original setup where the problem
was found; however, I don't think it's easy to do that validation
(checking with the reporter anyway).

Perhaps one could find a synthetic reproducer.

It is pretty easy to note that, on an isolated nohz_full CPU, the
deferrable timer queued on it (the timer which should queue vmstat_update
on that CPU) does not execute for long periods.
This makes the global stats stale (the per-CPU diffs can remain
unfolded for as long as the CPU has tick processing stopped),
which matches the available data.
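
A minimal sketch of what such a reproducer could look like (untested;
assumes CPU 5 is in the nohz_full set, with the pending diffs observed
from a housekeeping CPU, e.g. in the per-CPU sections of /proc/zoneinfo):

	/* repro.c: build up per-CPU vmstat diffs on a nohz_full CPU,
	 * then spin without entering the kernel, so the deferrable
	 * vmstat_update timer never fires to fold them. */
	#define _GNU_SOURCE
	#include <sched.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		cpu_set_t set;
		size_t len = 1 << 20;
		char *p;

		CPU_ZERO(&set);
		CPU_SET(5, &set);		/* assumed nohz_full CPU */
		sched_setaffinity(0, sizeof(set), &set);

		/* Fault in and free pages to dirty the NR_FREE_PAGES diff. */
		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		memset(p, 1, len);
		munmap(p, len);

		for (;;)			/* no further kernel entries */
			;
	}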

Thanks!


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-03-20 18:25 ` [PATCH v7 00/13] fold per-CPU vmstats remotely Michal Hocko
@ 2023-03-20 19:07   ` Marcelo Tosatti
  2023-03-22 10:13     ` Michal Hocko
  0 siblings, 1 reply; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-20 19:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Mon, Mar 20, 2023 at 07:25:55PM +0100, Michal Hocko wrote:
> On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote:
> > This patch series addresses the following two problems:
> > 
> > 1. A customer provided evidence indicating that a process
> >    was stalled in direct reclaim:
> > 
> This is addressed by the trivial patch 1.
> 
> [...]
> >  2. With a task that busy loops on a given CPU,
> >     the kworker interruption to execute vmstat_update
> >     is undesired and may exceed latency thresholds
> >     for certain applications.
> 
> Yes it can but why does that matter?

It matters for the application that is executing and expects
not to be interrupted.

> > By having vmstat_shepherd flush the per-CPU counters to the
> > global counters from remote CPUs.
> > 
> > This is done using cmpxchg to manipulate the counters,
> > both CPU locally (via the account functions),
> > and remotely (via cpu_vm_stats_fold).
> > 
> > Thanks to Aaron Tomlin for diagnosing issue 1 and writing
> > the initial patch series.
> > 
> > 
> > Performance details for the kworker interruption:
> > 
> > oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> >  
> > The example above shows an additional 7us for the
> > 
> >         oslat -> kworker -> oslat
> > 
> > switches. In the case of a virtualized CPU, and the vmstat_update
> > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > observed in the guest is higher than 50us, violating the acceptable
> > latency threshold for certain applications.
> 
> I do not think we have ever promissed any specific latency guarantees
> for vmstat. These are statistics have been mostly used for debugging
> purposes AFAIK. I am not aware of any specific user space use case that
> would be latency sensitive. Your changelog doesn't go into details there
> either.

There is a class of workloads for which response time can be
of interest. MAC scheduler is an example:

https://par.nsf.gov/servlets/purl/10090368

Thanks!


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 13/13] vmstat: add pcp remote node draining via cpu_vm_stats_fold
  2023-03-20 18:03 ` [PATCH v7 13/13] vmstat: add pcp remote node draining via cpu_vm_stats_fold Marcelo Tosatti
@ 2023-03-20 20:43   ` Tim Chen
  2023-03-22  1:20     ` Marcelo Tosatti
  0 siblings, 1 reply; 56+ messages in thread
From: Tim Chen @ 2023-03-20 20:43 UTC (permalink / raw)
  To: Marcelo Tosatti, Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko

On Mon, 2023-03-20 at 15:03 -0300, Marcelo Tosatti wrote:
> 
> +
> +		if (do_pagesets) {
> +			cond_resched();
> +			/*
> +			 * Deal with draining the remote pageset of a
> +			 * processor
> +			 *
> +			 * Check if there are pages remaining in this pageset
> +			 * if not then there is nothing to expire.
> +			 */
> +			if (!pcp->expire || !pcp->count)
> +				continue;
> +
> +			/*
> +			 * We never drain zones local to this processor.
> +			 */
> +			if (zone_to_nid(zone) == cpu_to_node(cpu)) {
> +				pcp->expire = 0;
> +				continue;
> +			}
> +
> +			WARN_ON(pcp->expire < 0);
> +			/*
> +			 * pcp->expire is only accessed from vmstat_shepherd context,
> +			 * therefore no locking is required.
> +			 */
> +			if (--pcp->expire)
> +				continue;
> +
> +			if (pcp->count)
> +				drain_zone_pages(zone, pcp);
> +		}

This logic is the same as the do_pagesets portion of code in refresh_cpu_vm_stats().
Is it possible to consolidate them, to avoid replicating the logic across two functions?

Tim 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 13/13] vmstat: add pcp remote node draining via cpu_vm_stats_fold
  2023-03-20 20:43   ` Tim Chen
@ 2023-03-22  1:20     ` Marcelo Tosatti
  0 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-22  1:20 UTC (permalink / raw)
  To: Tim Chen
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka, Michal Hocko

On Mon, Mar 20, 2023 at 01:43:09PM -0700, Tim Chen wrote:
> On Mon, 2023-03-20 at 15:03 -0300, Marcelo Tosatti wrote:
> > 
> > +
> > +		if (do_pagesets) {
> > +			cond_resched();
> > +			/*
> > +			 * Deal with draining the remote pageset of a
> > +			 * processor
> > +			 *
> > +			 * Check if there are pages remaining in this pageset
> > +			 * if not then there is nothing to expire.
> > +			 */
> > +			if (!pcp->expire || !pcp->count)
> > +				continue;
> > +
> > +			/*
> > +			 * We never drain zones local to this processor.
> > +			 */
> > +			if (zone_to_nid(zone) == cpu_to_node(cpu)) {
> > +				pcp->expire = 0;
> > +				continue;
> > +			}
> > +
> > +			WARN_ON(pcp->expire < 0);
> > +			/*
> > +			 * pcp->expire is only accessed from vmstat_shepherd context,
> > +			 * therefore no locking is required.
> > +			 */
> > +			if (--pcp->expire)
> > +				continue;
> > +
> > +			if (pcp->count)
> > +				drain_zone_pages(zone, pcp);
> > +		}
> 
> This logic is the same to that for the do_pagesets portion of code in refresh_cpu_vm_stats().
> Is it possible to consolidate to avoid replicating the logic across two functions?
> 
> Tim 

Tim,

One of the versions is for remote access, while the other is for local
access (that's why merging the two is not so simple).

I suppose the next logical step would be to ensure that the
!CONFIG_HAVE_CMPXCHG_LOCAL case also uses the cmpxchg remotely (then
refresh_cpu_vm_stats can be removed).


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot
  2023-03-20 18:32     ` Marcelo Tosatti
@ 2023-03-22 10:03       ` Michal Hocko
  0 siblings, 0 replies; 56+ messages in thread
From: Michal Hocko @ 2023-03-22 10:03 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Mon 20-03-23 15:32:15, Marcelo Tosatti wrote:
> On Mon, Mar 20, 2023 at 07:21:04PM +0100, Michal Hocko wrote:
> > On Mon 20-03-23 15:03:33, Marcelo Tosatti wrote:
> > > A customer provided evidence indicating that a process
> > > was stalled in direct reclaim:
> > > 
> > >  - The process was trapped in throttle_direct_reclaim().
> > >    The function wait_event_killable() was called to wait condition     
> > >    allow_direct_reclaim(pgdat) for current node to be true.     
> > >    The allow_direct_reclaim(pgdat) examined the number of free pages     
> > >    on the node by zone_page_state() which just returns value in     
> > >    zone->vm_stat[NR_FREE_PAGES].     
> > >                                                 
> > >  - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.            
> > >    However, the freelist on this node was not empty.     
> > >                            
> > >  - This inconsistent of vmstat value was caused by percpu vmstat on     
> > >    nohz_full cpus. Every increment/decrement of vmstat is performed     
> > >    on percpu vmstat counter at first, then pooled diffs are cumulated     
> > >    to the zone's vmstat counter in timely manner. However, on nohz_full     
> > >    cpus (in case of this customer's system, 48 of 52 cpus) these pooled     
> > >    diffs were not cumulated once the cpu had no event on it so that     
> > >    the cpu started sleeping infinitely.                       
> > >    I checked percpu vmstat and found there were total 69 counts not         
> > >    cumulated to the zone's vmstat counter yet.     
> > >                                          
> > >  - In this situation, kswapd did not help the trapped process.     
> > >    In pgdat_balanced(), zone_wakermark_ok_safe() examined the number     
> > >    of free pages on the node by zone_page_state_snapshot() which     
> > >    checks pending counts on percpu vmstat.     
> > >    Therefore kswapd could know there were 69 free pages correctly.     
> > >    Since zone->_watermark = {8, 20, 32}, kswapd did not work because     
> > >    69 was greater than 32 as high watermark.     
> > > 
> > > Change allow_direct_reclaim to use zone_page_state_snapshot, which
> > > allows a more precise version of the vmstat counters to be used.
> > > 
> > > allow_direct_reclaim will only be called from try_to_free_pages,
> > > which is not a hot path.
> > 
> > Have you managed to test this patch to confirm it addresses the above
> > issue? It should but better double check that.
> > 
> > > Suggested-by: Michal Hocko <mhocko@suse.com>
> > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > The patch makes sense regardless but a note about testing should be
> > added.
> > 
> > Acked-by: Michal Hocko <mhocko@suse.com>
> 
> Michal,
> 
> The patch has not been tested in the original setup where the problem 
> was found, however i don't think its easy to do that validation
> (checking with the reporter anyway).

This is a fair point and I would just add it to the changelog for
future reference.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-03-20 19:07   ` Marcelo Tosatti
@ 2023-03-22 10:13     ` Michal Hocko
  2023-03-22 11:23       ` Marcelo Tosatti
  0 siblings, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2023-03-22 10:13 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Mon 20-03-23 16:07:29, Marcelo Tosatti wrote:
> On Mon, Mar 20, 2023 at 07:25:55PM +0100, Michal Hocko wrote:
> > On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote:
> > > This patch series addresses the following two problems:
> > > 
> > > 1. A customer provided evidence indicating that a process
> > >    was stalled in direct reclaim:
> > > 
> > This is addressed by the trivial patch 1.
> > 
> > [...]
> > >  2. With a task that busy loops on a given CPU,
> > >     the kworker interruption to execute vmstat_update
> > >     is undesired and may exceed latency thresholds
> > >     for certain applications.
> > 
> > Yes it can but why does that matter?
> 
> It matters for the application that is executing and expects
> not to be interrupted.

Those workloads shouldn't enter the kernel in the first place, no?
Otherwise the in-kernel execution, with all the direct or indirect
dependencies (e.g. via locks), can throw any latency expectations out
the window.

> > > By having vmstat_shepherd flush the per-CPU counters to the
> > > global counters from remote CPUs.
> > > 
> > > This is done using cmpxchg to manipulate the counters,
> > > both CPU locally (via the account functions),
> > > and remotely (via cpu_vm_stats_fold).
> > > 
> > > Thanks to Aaron Tomlin for diagnosing issue 1 and writing
> > > the initial patch series.
> > > 
> > > 
> > > Performance details for the kworker interruption:
> > > 
> > > oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > >  
> > > The example above shows an additional 7us for the
> > > 
> > >         oslat -> kworker -> oslat
> > > 
> > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > observed in the guest is higher than 50us, violating the acceptable
> > > latency threshold for certain applications.
> > 
> > I do not think we have ever promissed any specific latency guarantees
> > for vmstat. These are statistics have been mostly used for debugging
> > purposes AFAIK. I am not aware of any specific user space use case that
> > would be latency sensitive. Your changelog doesn't go into details there
> > either.
> 
> There is a class of workloads for which response time can be
> of interest. MAC scheduler is an example:
> 
> https://par.nsf.gov/servlets/purl/10090368

Yes, I am not disputing low latency workloads in general. I am just
saying that you haven't really established a very sound justification
here. Of course there are workloads which do not want to conflict with
any in-kernel housekeeping. Those have to be configured and implemented
very carefully though. Vmstat as such should not collide with those
workloads as long as they do not interact with the kernel in a way
that updates the counters. Is this hard or impossible to avoid?

I can imagine that those workloads have a start-up sequence where the
kernel is involved and counters are updated, so that deferred flushing
could interfere with the later, latency-sensitive phase. Is that a real
problem in practice? Please tell us much more about why we need to make
the vmstat code more complex.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-03-22 10:13     ` Michal Hocko
@ 2023-03-22 11:23       ` Marcelo Tosatti
  2023-03-22 13:35         ` Michal Hocko
  0 siblings, 1 reply; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-22 11:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Wed, Mar 22, 2023 at 11:13:02AM +0100, Michal Hocko wrote:
> On Mon 20-03-23 16:07:29, Marcelo Tosatti wrote:
> > On Mon, Mar 20, 2023 at 07:25:55PM +0100, Michal Hocko wrote:
> > > On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote:
> > > > This patch series addresses the following two problems:
> > > > 
> > > > 1. A customer provided evidence indicating that a process
> > > >    was stalled in direct reclaim:
> > > > 
> > > This is addressed by the trivial patch 1.
> > > 
> > > [...]
> > > >  2. With a task that busy loops on a given CPU,
> > > >     the kworker interruption to execute vmstat_update
> > > >     is undesired and may exceed latency thresholds
> > > >     for certain applications.
> > > 
> > > Yes it can but why does that matter?
> > 
> > It matters for the application that is executing and expects
> > not to be interrupted.
> 
> Those workloads shouldn't enter the kernel in the first place, no?

It depends on the latency requirements and individual system calls.

> Otherwise the in kernel execution with all the direct or indirect
> dependencies (e.g. via locks) can throw any latency expectations off the
> window.
> 
> > > > By having vmstat_shepherd flush the per-CPU counters to the
> > > > global counters from remote CPUs.
> > > > 
> > > > This is done using cmpxchg to manipulate the counters,
> > > > both CPU locally (via the account functions),
> > > > and remotely (via cpu_vm_stats_fold).
> > > > 
> > > > Thanks to Aaron Tomlin for diagnosing issue 1 and writing
> > > > the initial patch series.
> > > > 
> > > > 
> > > > Performance details for the kworker interruption:
> > > > 
> > > > oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > >  
> > > > The example above shows an additional 7us for the
> > > > 
> > > >         oslat -> kworker -> oslat
> > > > 
> > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > observed in the guest is higher than 50us, violating the acceptable
> > > > latency threshold for certain applications.
> > > 
> > > I do not think we have ever promissed any specific latency guarantees
> > > for vmstat. These are statistics have been mostly used for debugging
> > > purposes AFAIK. I am not aware of any specific user space use case that
> > > would be latency sensitive. Your changelog doesn't go into details there
> > > either.
> > 
> > There is a class of workloads for which response time can be
> > of interest. MAC scheduler is an example:
> > 
> > https://par.nsf.gov/servlets/purl/10090368
> 
> Yes, I am not disputing low latency workloads in general. I am just
> saying that you haven't really established a very sound justification
> here.

The -v7 cover letter was updated with additional details, 
as you requested (perhaps you missed it):

"Performance details for the kworker interruption:

oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...

The example above shows an additional 7us for the

        oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold for certain applications."

> Of course there are workloads which do not want to conflict with
> any in kernel house keeping. Those have to be configured and implemented
> very carefully though. Vmstat as such should not collide with those
> workloads as long as they do not interact with the kernel in a way
> counters are updated. Is this hard or impossible to avoid? 

The practical problem we have been seeing is -RT app initialization.
For example:

1) mlock();
2) enter loop without system calls
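
Roughly, for illustration (a minimal sketch of the pattern, nothing more):

	/* rt-app.c: lock memory, set RT priority, then enter the
	 * latency-sensitive loop without further system calls. */
	#define _GNU_SOURCE
	#include <sched.h>
	#include <sys/mman.h>

	int main(void)
	{
		struct sched_param sp = { .sched_priority = 1 };

		mlockall(MCL_CURRENT | MCL_FUTURE);	/* 1) mlock() */
		sched_setscheduler(0, SCHED_FIFO, &sp);

		for (;;)		/* 2) loop without system calls */
			;
	}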

> I can imagine that those workloads have an start up sequence where the
> kernel is involved and counters updated so that deferred flushing could
> interfere with the later and latency sensitive phase. Is that a real
> problem in practice? Please tell us much more why we need to make the
> vmstat code more complex.

Yes, it is. I have attached traces and performance numbers above.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-03-22 11:23       ` Marcelo Tosatti
@ 2023-03-22 13:35         ` Michal Hocko
  2023-03-22 14:20           ` Marcelo Tosatti
  0 siblings, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2023-03-22 13:35 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Wed 22-03-23 08:23:21, Marcelo Tosatti wrote:
> On Wed, Mar 22, 2023 at 11:13:02AM +0100, Michal Hocko wrote:
> > On Mon 20-03-23 16:07:29, Marcelo Tosatti wrote:
> > > On Mon, Mar 20, 2023 at 07:25:55PM +0100, Michal Hocko wrote:
> > > > On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote:
> > > > > This patch series addresses the following two problems:
> > > > > 
> > > > > 1. A customer provided evidence indicating that a process
> > > > >    was stalled in direct reclaim:
> > > > > 
> > > > This is addressed by the trivial patch 1.
> > > > 
> > > > [...]
> > > > >  2. With a task that busy loops on a given CPU,
> > > > >     the kworker interruption to execute vmstat_update
> > > > >     is undesired and may exceed latency thresholds
> > > > >     for certain applications.
> > > > 
> > > > Yes it can but why does that matter?
> > > 
> > > It matters for the application that is executing and expects
> > > not to be interrupted.
> > 
> > Those workloads shouldn't enter the kernel in the first place, no?
> 
> It depends on the latency requirements and individual system calls.
> 
> > Otherwise the in kernel execution with all the direct or indirect
> > dependencies (e.g. via locks) can throw any latency expectations off the
> > window.
> > 
> > > > > By having vmstat_shepherd flush the per-CPU counters to the
> > > > > global counters from remote CPUs.
> > > > > 
> > > > > This is done using cmpxchg to manipulate the counters,
> > > > > both CPU locally (via the account functions),
> > > > > and remotely (via cpu_vm_stats_fold).
> > > > > 
> > > > > Thanks to Aaron Tomlin for diagnosing issue 1 and writing
> > > > > the initial patch series.
> > > > > 
> > > > > 
> > > > > Performance details for the kworker interruption:
> > > > > 
> > > > > oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > > oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > > oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > > >  
> > > > > The example above shows an additional 7us for the
> > > > > 
> > > > >         oslat -> kworker -> oslat
> > > > > 
> > > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > > observed in the guest is higher than 50us, violating the acceptable
> > > > > latency threshold for certain applications.
> > > > 
> > > > I do not think we have ever promissed any specific latency guarantees
> > > > for vmstat. These are statistics have been mostly used for debugging
> > > > purposes AFAIK. I am not aware of any specific user space use case that
> > > > would be latency sensitive. Your changelog doesn't go into details there
> > > > either.
> > > 
> > > There is a class of workloads for which response time can be
> > > of interest. MAC scheduler is an example:
> > > 
> > > https://par.nsf.gov/servlets/purl/10090368
> > 
> > Yes, I am not disputing low latency workloads in general. I am just
> > saying that you haven't really established a very sound justification
> > here.
> 
> The -v7 cover letter was updated with additional details, 
> as you requested (perhaps you missed it):
> 
> "Performance details for the kworker interruption:
> 
> oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> 
> The example above shows an additional 7us for the
> 
>         oslat -> kworker -> oslat
> 
> switches. In the case of a virtualized CPU, and the vmstat_update
> interruption in the host (of a qemu-kvm vcpu), the latency penalty
> observed in the guest is higher than 50us, violating the acceptable
> latency threshold for certain applications."

Yes, I have seen that but it doesn't really give a wider context to
understand why those numbers matter.

> > Of course there are workloads which do not want to conflict with
> > any in kernel house keeping. Those have to be configured and implemented
> > very carefully though. Vmstat as such should not collide with those
> > workloads as long as they do not interact with the kernel in a way
> > counters are updated. Is this hard or impossible to avoid? 
> 
> The practical problem we have been seeing is -RT app initialization.
> For example:
> 
> 1) mlock();
> 2) enter loop without system calls

OK, that is what I kinda expected. It would have been better to
mention it explicitly.

I expect this to be a very common pattern and vmstat might not be the
only subsystem that could interfere later on. Would it make more sense
to address this by a more generic solution? E.g. a syscall to flush all
per-cpu caches so they won't interfere later unless userspace hits the
kernel path in some way (e.g. flush_cpu_caches(cpu_set_t cpumask, int flags))?
The above pattern could then be implemented as

	do_initial_setup()
	sched_setaffinity(getpid(), cpumask);
	flush_cpu_caches(cpumask, 0);
	do_userspace_loop()
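
Spelled out with real affinity plumbing, something like the below (sketch
only: flush_cpu_caches() is the hypothetical syscall proposed above, and
do_initial_setup()/do_userspace_loop() stand for application code):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <unistd.h>

	int main(void)
	{
		cpu_set_t cpumask;

		CPU_ZERO(&cpumask);
		CPU_SET(5, &cpumask);		/* the isolated CPU */

		do_initial_setup();		/* may dirty per-cpu state */
		sched_setaffinity(getpid(), sizeof(cpumask), &cpumask);
		flush_cpu_caches(cpumask, 0);	/* hypothetical syscall */
		do_userspace_loop();		/* no further kernel entry */
	}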

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-03-22 13:35         ` Michal Hocko
@ 2023-03-22 14:20           ` Marcelo Tosatti
  2023-03-23  7:51             ` Michal Hocko
  0 siblings, 1 reply; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-22 14:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote:
> On Wed 22-03-23 08:23:21, Marcelo Tosatti wrote:
> > On Wed, Mar 22, 2023 at 11:13:02AM +0100, Michal Hocko wrote:
> > > On Mon 20-03-23 16:07:29, Marcelo Tosatti wrote:
> > > > On Mon, Mar 20, 2023 at 07:25:55PM +0100, Michal Hocko wrote:
> > > > > On Mon 20-03-23 15:03:32, Marcelo Tosatti wrote:
> > > > > > This patch series addresses the following two problems:
> > > > > > 
> > > > > > 1. A customer provided evidence indicating that a process
> > > > > >    was stalled in direct reclaim:
> > > > > > 
> > > > > This is addressed by the trivial patch 1.
> > > > > 
> > > > > [...]
> > > > > >  2. With a task that busy loops on a given CPU,
> > > > > >     the kworker interruption to execute vmstat_update
> > > > > >     is undesired and may exceed latency thresholds
> > > > > >     for certain applications.
> > > > > 
> > > > > Yes it can but why does that matter?
> > > > 
> > > > It matters for the application that is executing and expects
> > > > not to be interrupted.
> > > 
> > > Those workloads shouldn't enter the kernel in the first place, no?
> > 
> > It depends on the latency requirements and individual system calls.
> > 
> > > Otherwise the in kernel execution with all the direct or indirect
> > > dependencies (e.g. via locks) can throw any latency expectations off the
> > > window.
> > > 
> > > > > > By having vmstat_shepherd flush the per-CPU counters to the
> > > > > > global counters from remote CPUs.
> > > > > > 
> > > > > > This is done using cmpxchg to manipulate the counters,
> > > > > > both CPU locally (via the account functions),
> > > > > > and remotely (via cpu_vm_stats_fold).
> > > > > > 
> > > > > > Thanks to Aaron Tomlin for diagnosing issue 1 and writing
> > > > > > the initial patch series.
> > > > > > 
> > > > > > 
> > > > > > Performance details for the kworker interruption:
> > > > > > 
> > > > > > oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > > > oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > > > oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > > > >  
> > > > > > The example above shows an additional 7us for the
> > > > > > 
> > > > > >         oslat -> kworker -> oslat
> > > > > > 
> > > > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > > > observed in the guest is higher than 50us, violating the acceptable
> > > > > > latency threshold for certain applications.
> > > > > 
> > > > > I do not think we have ever promissed any specific latency guarantees
> > > > > for vmstat. These are statistics have been mostly used for debugging
> > > > > purposes AFAIK. I am not aware of any specific user space use case that
> > > > > would be latency sensitive. Your changelog doesn't go into details there
> > > > > either.
> > > > 
> > > > There is a class of workloads for which response time can be
> > > > of interest. MAC scheduler is an example:
> > > > 
> > > > https://par.nsf.gov/servlets/purl/10090368
> > > 
> > > Yes, I am not disputing low latency workloads in general. I am just
> > > saying that you haven't really established a very sound justification
> > > here.
> > 
> > The -v7 cover letter was updated with additional details, 
> > as you requested (perhaps you missed it):
> > 
> > "Performance details for the kworker interruption:
> > 
> > oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > 
> > The example above shows an additional 7us for the
> > 
> >         oslat -> kworker -> oslat
> > 
> > switches. In the case of a virtualized CPU, and the vmstat_update
> > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > observed in the guest is higher than 50us, violating the acceptable
> > latency threshold for certain applications."
> 
> Yes, I have seen that but it doesn't really give a wider context to
> understand why those numbers matter.

OK.

"In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us
interruption observed in a guest (which is above the safety
threshold for this application)."

Is that OK?


> > > Of course there are workloads which do not want to conflict with
> > > any in kernel house keeping. Those have to be configured and implemented
> > > very carefully though. Vmstat as such should not collide with those
> > > workloads as long as they do not interact with the kernel in a way
> > > counters are updated. Is this hard or impossible to avoid? 
> > 
> > The practical problem we have been seeing is -RT app initialization.
> > For example:
> > 
> > 1) mlock();
> > 2) enter loop without system calls
> 
> OK, that is what I have kinda expected. Would have been better to
> mention it explicitly.
> 
> I expect this to be a very common pattern and vmstat might not be the
> only subsystem that could interfere later on. Would it make more sense
> to address this by a more generic solution? E.g. a syscall to flush all
> per-cpu caches so they won't interfere later unless userspace hits the
> kernel path in some way (e.g. flush_cpu_caches(cpu_set_t cpumask, int flags)?
> The above pattern could then be implemented as
> 
> 	do_initial_setup()
> 	sched_setaffinity(getpid(), cpumask);
> 	flush_cpu_caches(cpumask, 0);
> 	do_userspace_loop()

I would argue that fixing this without introducing a userspace tunable
is more generic, since all programs (whether modified to use a syscall
or not) benefit from the improvement. HPC workloads, for example.

But it might be necessary to do what you suggest for other
reasons (where you'd want a behaviour to be enabled which
is undesired for other application types).


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-03-22 14:20           ` Marcelo Tosatti
@ 2023-03-23  7:51             ` Michal Hocko
  2023-03-23 10:52               ` Marcelo Tosatti
  0 siblings, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2023-03-23  7:51 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote:
> On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote:
[...]
> > > "Performance details for the kworker interruption:
> > > 
> > > oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > 
> > > The example above shows an additional 7us for the
> > > 
> > >         oslat -> kworker -> oslat
> > > 
> > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > observed in the guest is higher than 50us, violating the acceptable
> > > latency threshold for certain applications."
> > 
> > Yes, I have seen that but it doesn't really give a wider context to
> > understand why those numbers matter.
> 
> OK.
> 
> "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us
> interruption observed in a guest (which is above the safety
> threshold for this application)."
> 
> Is that OK?

This might be sufficient information for somebody familiar with the
matter (not me). So no, not enough. We need to hear a more complete
story.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-03-23  7:51             ` Michal Hocko
@ 2023-03-23 10:52               ` Marcelo Tosatti
  2023-03-23 10:59                 ` Marcelo Tosatti
  2023-03-23 12:17                 ` Michal Hocko
  0 siblings, 2 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-23 10:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Thu, Mar 23, 2023 at 08:51:14AM +0100, Michal Hocko wrote:
> On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote:
> > On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote:
> [...]
> > > > "Performance details for the kworker interruption:
> > > > 
> > > > oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > > 
> > > > The example above shows an additional 7us for the
> > > > 
> > > >         oslat -> kworker -> oslat
> > > > 
> > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > observed in the guest is higher than 50us, violating the acceptable
> > > > latency threshold for certain applications."
> > > 
> > > Yes, I have seen that but it doesn't really give a wider context to
> > > understand why those numbers matter.
> > 
> > OK.
> > 
> > "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us
> > interruption observed in a guest (which is above the safety
> > threshold for this application)."
> > 
> > Is that OK?
> 
> This might be a sufficient information for somebody familiar with the
> matter (not me). So no, not enough. We need to hear a more complete
> story. 

Michal,

Please refer to 
https://www.diva-portal.org/smash/get/diva2:541460/FULLTEXT01.pdf

2.3 Channel Dependent Scheduling
The purpose of scheduling is to decide which terminal will transmit data on which set
of resource blocks with what transport format to use. The objective is to assign
resources to the terminal such that the quality of service (QoS) requirement is fulfilled.
Scheduling decision is taken every 1 ms by base station (termed as eNodeB) as the
same length of Transmission Time Interval (TTI) in LTE system.

In general:

https://en.wikipedia.org/wiki/Real-time_computing

Real-time computing (RTC) is the computer science term for hardware and
software systems subject to a "real-time constraint", for example from
event to system response.[1] Real-time programs must guarantee response
within specified time constraints, often referred to as "deadlines".[2]

Real-time responses are often understood to be in the order of
milliseconds, and sometimes microseconds. A system not specified as
operating in real time cannot usually guarantee a response within any
timeframe, although typical or expected response times may be given.
Real-time processing fails if not completed within a specified deadline
relative to an event; deadlines must always be met, regardless of system
load.

For example, for the MAC scheduler, processing must occur every 1ms,
and a certain amount of computation takes place (and must finish before
the next 1ms timeframe). A >50us latency spike, as observed by cyclictest,
is considered a "failure".
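
To put numbers on the budget (illustrative): with TTI = 1ms, a 100us
interruption alone consumes 10% of the interval, on top of the MAC
scheduler's own worst-case execution time; hence even a >50us (5%)
spike is treated as a deadline violation.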



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-03-23 10:52               ` Marcelo Tosatti
@ 2023-03-23 10:59                 ` Marcelo Tosatti
  2023-03-23 12:17                 ` Michal Hocko
  1 sibling, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-23 10:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Thu, Mar 23, 2023 at 07:52:22AM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 23, 2023 at 08:51:14AM +0100, Michal Hocko wrote:
> > On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote:
> > > On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote:
> > [...]
> > > > > "Performance details for the kworker interruption:
> > > > > 
> > > > > oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > > oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > > oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > > > 
> > > > > The example above shows an additional 7us for the
> > > > > 
> > > > >         oslat -> kworker -> oslat
> > > > > 
> > > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > > observed in the guest is higher than 50us, violating the acceptable
> > > > > latency threshold for certain applications."
> > > > 
> > > > Yes, I have seen that but it doesn't really give a wider context to
> > > > understand why those numbers matter.
> > > 
> > > OK.
> > > 
> > > "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us
> > > interruption observed in a guest (which is above the safety
> > > threshold for this application)."
> > > 
> > > Is that OK?
> > 
> > This might be a sufficient information for somebody familiar with the
> > matter (not me). So no, not enough. We need to hear a more complete
> > story. 
> 
> Michal,
> 
> Please refer to 
> https://www.diva-portal.org/smash/get/diva2:541460/FULLTEXT01.pdf
> 
> 2.3 Channel Dependent Scheduling
> The purpose of scheduling is to decide which terminal will transmit data on which set
> of resource blocks with what transport format to use. The objective is to assign
> resources to the terminal such that the quality of service (QoS) requirement is fulfilled.
> Scheduling decision is taken every 1 ms by base station (termed as eNodeB) as the
> same length of Transmission Time Interval (TTI) in LTE system.
> 
> In general:
> 
> https://en.wikipedia.org/wiki/Real-time_computing
> 
> Real-time computing (RTC) is the computer science term for hardware and
> software systems subject to a "real-time constraint", for example from
> event to system response.[1] Real-time programs must guarantee response
> within specified time constraints, often referred to as "deadlines".[2]
> 
> Real-time responses are often understood to be in the order of
> milliseconds, and sometimes microseconds. A system not specified as
> operating in real time cannot usually guarantee a response within any
> timeframe, although typical or expected response times may be given.
> Real-time processing fails if not completed within a specified deadline
> relative to an event; deadlines must always be met, regardless of system
> load.
> 
> For example, for the MAC scheduler processing must occur every 1ms,
> and a certain amount of computation takes place (and must finish before
> the next 1ms timeframe). A > 50us latency spike as observed by cyclictest
> is considered a "failure".

If you need more detail, you will have to ask someone else, because that
is all I know.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-03-23 10:52               ` Marcelo Tosatti
  2023-03-23 10:59                 ` Marcelo Tosatti
@ 2023-03-23 12:17                 ` Michal Hocko
  2023-03-23 13:30                   ` Marcelo Tosatti
  1 sibling, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2023-03-23 12:17 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Thu 23-03-23 07:52:22, Marcelo Tosatti wrote:
> On Thu, Mar 23, 2023 at 08:51:14AM +0100, Michal Hocko wrote:
> > On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote:
> > > On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote:
> > [...]
> > > > > "Performance details for the kworker interruption:
> > > > > 
> > > > > oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > > oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > > oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > > > 
> > > > > The example above shows an additional 7us for the
> > > > > 
> > > > >         oslat -> kworker -> oslat
> > > > > 
> > > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > > observed in the guest is higher than 50us, violating the acceptable
> > > > > latency threshold for certain applications."
> > > > 
> > > > Yes, I have seen that but it doesn't really give a wider context to
> > > > understand why those numbers matter.
> > > 
> > > OK.
> > > 
> > > "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us
> > > interruption observed in a guest (which is above the safety
> > > threshold for this application)."
> > > 
> > > Is that OK?
> > 
> > This might be a sufficient information for somebody familiar with the
> > matter (not me). So no, not enough. We need to hear a more complete
> > story. 
> 
> Michal,
> 
> Please refer to 
> https://www.diva-portal.org/smash/get/diva2:541460/FULLTEXT01.pdf
> 
> 2.3 Channel Dependent Scheduling
> The purpose of scheduling is to decide which terminal will transmit data on which set
> of resource blocks with what transport format to use. The objective is to assign
> resources to the terminal such that the quality of service (QoS) requirement is fulfilled.
> Scheduling decision is taken every 1 ms by base station (termed as eNodeB) as the
> same length of Transmission Time Interval (TTI) in LTE system.
> 
> In general:
> 
> https://en.wikipedia.org/wiki/Real-time_computing

Thank you, but not something I was really asking for (repeatedly). I am
pretty aware of what RT computing is about. I am not really interested
in generic fluff. I am asking about the specific use cases you have in mind
when pushing these changes.
 
> For example, for the MAC scheduler processing must occur every 1ms,
> and a certain amount of computation takes place (and must finish before
> the next 1ms timeframe). A > 50us latency spike as observed by cyclictest
> is considered a "failure".

OK, you are claiming that much, but you are not really filling in the
other holes in your story. Let me just outline a few questions I have.
Your measurements talk about a 7us overhead that the vmstat processing
might add. This is really far from the >50us above. You suggest that
this is an effect of the workload running in a guest, without more
details. I am quite surprised to hear about RT expectations inside a
guest system TBH.

All that being said, it would be really helpful if you were more
specific about the workload and why there is no other way but making
vmstat infrastructure more complex (it is quite complex on its own).

Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-03-23 12:17                 ` Michal Hocko
@ 2023-03-23 13:30                   ` Marcelo Tosatti
  2023-03-23 13:32                     ` Marcelo Tosatti
  0 siblings, 1 reply; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-23 13:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Thu, Mar 23, 2023 at 01:17:32PM +0100, Michal Hocko wrote:
> On Thu 23-03-23 07:52:22, Marcelo Tosatti wrote:
> > On Thu, Mar 23, 2023 at 08:51:14AM +0100, Michal Hocko wrote:
> > > On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote:
> > > > On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote:
> > > [...]
> > > > > > "Performance details for the kworker interruption:
> > > > > > 
> > > > > > oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > > > oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > > > oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > > > > 
> > > > > > The example above shows an additional 7us for the
> > > > > > 
> > > > > >         oslat -> kworker -> oslat
> > > > > > 
> > > > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > > > observed in the guest is higher than 50us, violating the acceptable
> > > > > > latency threshold for certain applications."
> > > > > 
> > > > > Yes, I have seen that but it doesn't really give a wider context to
> > > > > understand why those numbers matter.
> > > > 
> > > > OK.
> > > > 
> > > > "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us
> > > > interruption observed in a guest (which is above the safety
> > > > threshold for this application)."
> > > > 
> > > > Is that OK?
> > > 
> > > This might be sufficient information for somebody familiar with the
> > > matter (not me). So no, not enough. We need to hear a more complete
> > > story. 
> > 
> > Michal,
> > 
> > Please refer to 
> > https://www.diva-portal.org/smash/get/diva2:541460/FULLTEXT01.pdf
> > 
> > 2.3 Channel Dependent Scheduling
> > The purpose of scheduling is to decide which terminal will transmit data on which set
> > of resource blocks with what transport format to use. The objective is to assign
> > resources to the terminal such that the quality of service (QoS) requirement is fulfilled.
> > Scheduling decision is taken every 1 ms by base station (termed as eNodeB) as the
> > same length of Transmission Time Interval (TTI) in LTE system.
> > 
> > In general:
> > 
> > https://en.wikipedia.org/wiki/Real-time_computing
> 
> Thank you, but not something I was really asking for (repeatedly). I am
> pretty aware of what RT computing is about. I am not really interested
> in generic fluff. I am asking about specific use cases you have in mind
> when pushing these changes.
>  
> > For example, for the MAC scheduler processing must occur every 1ms,
> > and a certain amount of computation takes place (and must finish before
> > the next 1ms timeframe). A > 50us latency spike as observed by cyclictest
> > is considered a "failure".
> 
> OK, you are claiming that much but you are not really filling in the other
> holes in your story. Let me just outline a few questions I have. Your
> measurements talk about a 7us overhead that the vmstat processing might add.
> This is really far from > 50us above. 

7us in the host, for the following sched_switch events:

	oslat -> kworker
	kworker -> oslat

However, if the application is virtualized (oslat executing via a
qemu-vcpu process in the host), the cost of the sequence:

	oslat executing
	qemu-vcpu VM-EXIT
	qemu-vcpu -> kworker
	kworker -> qemu-vcpu
	qemu-vcpu VM-ENTRY

is much higher than 7us (it can be above 100us).

> You suggest that this is an effect
> of the workload running in a guest without more details. I am quite
> surprised to hear about RT expectations inside a guest system TBH.

https://www.youtube.com/watch?v=zIDwc6uDszY

> All that being said, it would be really helpful if you were more
> specific about the workload and why there is no other way but making
> vmstat infrastructure more complex (it is quite complex on its own).

The patchset is just changing vmstat_shepherd from happening locally
to happening remotely. There are a number of algorithms in the kernel
that deal with concurrent access already.
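
To illustrate the idea, here is a minimal, self-contained userspace C
sketch of the cmpxchg scheme (illustrative only; the actual series
operates on the vmstat per-CPU diff arrays, not on these toy counters):

#include <stdatomic.h>
#include <stdio.h>

/* Per-CPU pending delta and the global counter it folds into. */
static _Atomic int pcpu_diff;
static _Atomic long global_count;

/* CPU-local accounting: add delta to the per-CPU diff via cmpxchg. */
static void account(int delta)
{
	int old = atomic_load(&pcpu_diff);

	/* on failure, 'old' is refreshed with the current value */
	while (!atomic_compare_exchange_weak(&pcpu_diff, &old, old + delta))
		;
}

/* Remote fold (the vmstat_shepherd side): atomically steal the diff
 * and add it to the global counter. */
static void fold_remote(void)
{
	int val = atomic_load(&pcpu_diff);

	while (!atomic_compare_exchange_weak(&pcpu_diff, &val, 0))
		;

	atomic_fetch_add(&global_count, val);
}

int main(void)
{
	account(3);
	account(-1);
	fold_remote();
	/* prints: global=2 diff=0 */
	printf("global=%ld diff=%d\n", atomic_load(&global_count),
	       atomic_load(&pcpu_diff));
	return 0;
}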

Why do you think this particular patchset makes things complicated,
and what can be done to make it simpler?




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-03-23 13:30                   ` Marcelo Tosatti
@ 2023-03-23 13:32                     ` Marcelo Tosatti
  0 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-03-23 13:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Thu, Mar 23, 2023 at 10:30:13AM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 23, 2023 at 01:17:32PM +0100, Michal Hocko wrote:
> > On Thu 23-03-23 07:52:22, Marcelo Tosatti wrote:
> > > On Thu, Mar 23, 2023 at 08:51:14AM +0100, Michal Hocko wrote:
> > > > On Wed 22-03-23 11:20:55, Marcelo Tosatti wrote:
> > > > > On Wed, Mar 22, 2023 at 02:35:20PM +0100, Michal Hocko wrote:
> > > > [...]
> > > > > > > "Performance details for the kworker interruption:
> > > > > > > 
> > > > > > > oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > > > > > > oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > > > > > > oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > > > > > > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > > > > > > 
> > > > > > > The example above shows an additional 7us for the
> > > > > > > 
> > > > > > >         oslat -> kworker -> oslat
> > > > > > > 
> > > > > > > switches. In the case of a virtualized CPU, and the vmstat_update
> > > > > > > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > > > > > > observed in the guest is higher than 50us, violating the acceptable
> > > > > > > latency threshold for certain applications."
> > > > > > 
> > > > > > Yes, I have seen that but it doesn't really give a wider context to
> > > > > > understand why those numbers matter.
> > > > > 
> > > > > OK.
> > > > > 
> > > > > "In the case of RAN, a MAC scheduler with TTI=1ms, this causes >100us
> > > > > interruption observed in a guest (which is above the safety
> > > > > threshold for this application)."
> > > > > 
> > > > > Is that OK?
> > > > 
> > > > This might be sufficient information for somebody familiar with the
> > > > matter (not me). So no, not enough. We need to hear a more complete
> > > > story. 
> > > 
> > > Michal,
> > > 
> > > Please refer to 
> > > https://www.diva-portal.org/smash/get/diva2:541460/FULLTEXT01.pdf
> > > 
> > > 2.3 Channel Dependent Scheduling
> > > The purpose of scheduling is to decide which terminal will transmit data on which set
> > > of resource blocks with what transport format to use. The objective is to assign
> > > resources to the terminal such that the quality of service (QoS) requirement is fulfilled.
> > > Scheduling decision is taken every 1 ms by base station (termed as eNodeB) as the
> > > same length of Transmission Time Interval (TTI) in LTE system.
> > > 
> > > In general:
> > > 
> > > https://en.wikipedia.org/wiki/Real-time_computing
> > 
> > Thank you, but not something I was really asking for (repeatedly). I am
> > pretty aware of what RT computing is about. I am not really interested
> > in generic fluff. I am asking about specific use cases you have in mind
> > when pushing these changes.
> >  
> > > For example, for the MAC scheduler processing must occur every 1ms,
> > > and a certain amount of computation takes place (and must finish before
> > > the next 1ms timeframe). A > 50us latency spike as observed by cyclictest
> > > is considered a "failure".
> > 
> > OK, you are claiming that much but you are not really filling in the other
> > holes in your story. Let me just outline a few questions I have. Your
> > measurements talk about a 7us overhead that the vmstat processing might add.
> > This is really far from > 50us above. 
> 
> 7us in the host, for the following sched_switch events:
> 
> 	oslat -> kworker
> 	kworker -> oslat
> 
> However, if the application is virtualized (oslat executing via a
> qemu-vcpu process in the host), the cost of the sequence:
> 
> 	oslat executing
> 	qemu-vcpu VM-EXIT
> 	qemu-vcpu -> kworker
> 	kworker -> qemu-vcpu
> 	qemu-vcpu VM-ENTRY
> 
> is much higher than 7us (it can be above 100us).

And nothing prevents this from happening:

 	oslat executing
 	qemu-vcpu VM-EXIT
 	qemu-vcpu -> kworker (in the host, to handle vmstat_update)
 	kworker -> qemu-vcpu
 	qemu-vcpu VM-ENTRY
	oslat -> kworker (in the guest, to handle vmstat_update)
	kworker -> oslat




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-03-20 18:03 [PATCH v7 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (13 preceding siblings ...)
  2023-03-20 18:25 ` [PATCH v7 00/13] fold per-CPU vmstats remotely Michal Hocko
@ 2023-04-18 22:02 ` Andrew Morton
  2023-04-19 11:14   ` Marcelo Tosatti
  14 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2023-04-18 22:02 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka, Michal Hocko

On Mon, 20 Mar 2023 15:03:32 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:

> This patch series addresses the following two problems:
> 
> 1. A customer provided evidence indicating that a process
>    was stalled in direct reclaim:
> 
> ...
>
>  2. With a task that busy loops on a given CPU,
>     the kworker interruption to execute vmstat_update
>     is undesired and may exceed latency thresholds
>     for certain applications.
> 

I don't think I'll be sending this upstream in the next merge window. 
Because it isn't clear that the added complexity in vmstat handling is
justified.

- Michal's request for more clarity on the end-user requirements
  seems reasonable.

- You have indicated that additional changelog material is forthcoming.

- The alternative idea of adding a syscall which tells the kernel
  "I'm about to go realtime, so please clear away all the pending crap
  which might later interrupt me" sounds pretty good.

  Partly because there are surely other places where we can use this.

  Partly because it moves all the crap-clearing into special
  crap-clearing code paths while adding less burden to the
  commonly-executed code.

  And I don't think this alternative has been fully investigated and
  discussed.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-18 22:02 ` Andrew Morton
@ 2023-04-19 11:14   ` Marcelo Tosatti
  2023-04-19 11:15     ` Marcelo Tosatti
  2023-04-19 11:29     ` Marcelo Tosatti
  0 siblings, 2 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-04-19 11:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka, Michal Hocko

On Tue, Apr 18, 2023 at 03:02:00PM -0700, Andrew Morton wrote:
> On Mon, 20 Mar 2023 15:03:32 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > This patch series addresses the following two problems:
> > 
> > 1. A customer provided evidence indicating that a process
> >    was stalled in direct reclaim:
> > 
> > ...
> >
> >  2. With a task that busy loops on a given CPU,
> >     the kworker interruption to execute vmstat_update
> >     is undesired and may exceed latency thresholds
> >     for certain applications.
> > 
> 
> I don't think I'll be sending this upstream in the next merge window. 
> Because it isn't clear that the added complexity in vmstat handling is
> justified.

From my POV this is an incorrect statement (that the complexity in
vmstat handling is not justified).

Andrew, this is the 3rd attempt to fix this problem:

First try:  https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/

Second try: https://patchew.org/linux/20230105125218.031928326@redhat.com/

Third try: syncing vmstats remotely from vmstat_shepherd (this
patchset).

And also, can you please explain: what is so complicated about the
vmstat handling? cmpxchg has been around and is used all over the
kernel, and nobody considers it "excessively complicated".

> - Michal's request for more clarity on the end-user requirements
>   seems reasonable.

And I explained to Michal in great detail where the end-user 
requirements come from. For virtualized workloads, there are two
types of use-cases:

1) For example, for the MAC scheduler processing must occur every 1ms,
and a certain amount of computation takes place (and must finish before
the next 1ms timeframe). A > 50us latency spike as observed by cyclictest
is considered a "failure".

I showed him a 7us trace caused by the vmstat_update kworker, and
explained that it will extend to > 50us in the case of a virtualized vCPU.

2) PLCs. These workloads will also suffer > 50us latency spikes
which is undesirable.

Can you please explain what additional clarity is required?

RH's performance team, for example, has been performing packet
latency tests and waiting for this issue to be fixed for about 2
years now.

Andrew Theurer, can you please explain what problem the vmstat_work
interruption is causing in your testing?

> - You have indicated that additional changelog material is forthcoming.

Not really.

Do you think additional information on the changelog is necessary?

> - The alternative idea of adding a syscall which tells the kernel
>   "I'm about to go realtime, so please clear away all the pending crap
>   which might later interrupt me" sounds pretty good.
>
>   Partly because there are surely other places where we can use this.
> 
>   Partly because it moves all the crap-clearing into special
>   crap-clearing code paths while adding less burden to the
>   commonly-executed code.
> 
>   And I don't think this alternative has been fully investigated and
>   discussed.

This was tried before:
https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/

My conclusion from that discussion (and work) is that a special system
call:

1) Does not allow the benefits to be widely applied (only modified
applications will benefit). Is not portable across different operating systems. 

Removing the vmstat_work interruption is a benefit for HPC workloads, 
for example (in fact, it is a benefit for any kind of application, 
since the interruption causes cache misses).

2) Increases the system call cost for applications which would use
the interface.

So avoiding the vmstat_update interruption, without userspace
knowledge or modifications, is a better solution than a modified
userspace.






^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-19 11:14   ` Marcelo Tosatti
@ 2023-04-19 11:15     ` Marcelo Tosatti
  2023-04-19 13:44       ` Andrew Theurer
  2023-04-19 11:29     ` Marcelo Tosatti
  1 sibling, 1 reply; 56+ messages in thread
From: Marcelo Tosatti @ 2023-04-19 11:15 UTC (permalink / raw)
  To: Andrew Morton, Andrew Theurer
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka, Michal Hocko

On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
> On Tue, Apr 18, 2023 at 03:02:00PM -0700, Andrew Morton wrote:
> > On Mon, 20 Mar 2023 15:03:32 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > 
> > > This patch series addresses the following two problems:
> > > 
> > > 1. A customer provided evidence indicating that a process
> > >    was stalled in direct reclaim:
> > > 
> > > ...
> > >
> > >  2. With a task that busy loops on a given CPU,
> > >     the kworker interruption to execute vmstat_update
> > >     is undesired and may exceed latency thresholds
> > >     for certain applications.
> > > 
> > 
> > I don't think I'll be sending this upstream in the next merge window. 
> > Because it isn't clear that the added complexity in vmstat handling is
> > justified.
> 
> From my POV this is an incorrect statement (that the complexity in
> vmstat handling is not justified).
> 
> Andrew, this is the 3rd attempt to fix this problem:
> 
> First try:  https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/
> 
> Second try: https://patchew.org/linux/20230105125218.031928326@redhat.com/
> 
> Third try: syncing vmstats remotely from vmstat_shepherd (this
> patchset).
> 
> And also, can you please explain: what is so complicated about the
> vmstat handling? cmpxchg has been around and is used all over the
> kernel, and nobody considers it "excessively complicated".
> 
> > - Michal's request for more clarity on the end-user requirements
> >   seems reasonable.
> 
> And I explained to Michal in great detail where the end-user 
> requirements come from. For virtualized workloads, there are two
> types of use-cases:
> 
> 1) For example, for the MAC scheduler processing must occur every 1ms,
> and a certain amount of computation takes place (and must finish before
> the next 1ms timeframe). A > 50us latency spike as observed by cyclictest
> is considered a "failure".
> 
> I showed him a 7us trace caused by the vmstat_update kworker, and
> explained that it will extend to > 50us in the case of a virtualized vCPU.
> 
> 2) PLCs. These workloads will also suffer > 50us latency spikes
> which is undesirable.
> 
> Can you please explain what additional clarity is required?
> 
> RH's performance team, for example, has been performing packet
> latency tests and waiting for this issue to be fixed for about 2
> years now.
> 
> Andrew Theurer, can you please explain what problem the vmstat_work
> interruption is causing in your testing?

+CC Andrew.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-19 11:14   ` Marcelo Tosatti
  2023-04-19 11:15     ` Marcelo Tosatti
@ 2023-04-19 11:29     ` Marcelo Tosatti
  2023-04-19 11:59       ` Marcelo Tosatti
  2023-04-19 16:47       ` Vlastimil Babka
  1 sibling, 2 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-04-19 11:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka, Michal Hocko

On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
> This was tried before:
> https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/
> 
> My conclusion from that discussion (and work) is that a special system
> call:
> 
> 1) Does not allow the benefits to be widely applied (only modified
> applications will benefit). Is not portable across different operating systems. 
> 
> Removing the vmstat_work interruption is a benefit for HPC workloads, 
> for example (in fact, it is a benefit for any kind of application, 
> since the interruption causes cache misses).
> 
> 2) Increases the system call cost for applications which would use
> the interface.
> 
> So avoiding the vmstat_update interruption, without userspace
> knowledge or modifications, is a better solution than a modified
> userspace.

Another important point is this: if an application dirties
its own per-CPU vmstat cache, while performing a system call,
and a vmstat sync event is triggered on a different CPU, you'd have to:

1) Wait for that CPU to return to userspace and sync its stats
(unfeasible).

2) Queue work to execute on that CPU (undesirable, as that causes
an interruption).

3) Remotely sync the vmstat for that CPU.
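
Concretely (an illustrative kernel fragment, roughly following
mm/vmstat.c naming; not the exact series code):

	struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

	/* Option 2 is what vmstat_shepherd does today: queue a kworker
	 * on the target CPU, i.e. interrupt it. */
	if (!delayed_work_pending(dw) && need_update(cpu))
		queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);

	/* Option 3 is what this series does instead: fold cpu's pending
	 * diffs from the shepherd (housekeeping) CPU, with cmpxchg so
	 * concurrent updates on the remote CPU are not lost. */
	cpu_vm_stats_fold(cpu);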



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-19 11:29     ` Marcelo Tosatti
@ 2023-04-19 11:59       ` Marcelo Tosatti
  2023-04-19 12:24         ` Frederic Weisbecker
  2023-04-19 16:47       ` Vlastimil Babka
  1 sibling, 1 reply; 56+ messages in thread
From: Marcelo Tosatti @ 2023-04-19 11:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka, Michal Hocko

On Wed, Apr 19, 2023 at 08:29:47AM -0300, Marcelo Tosatti wrote:
> On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
> > This was tried before:
> > https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/
> > 
> > My conclusion from that discussion (and work) is that a special system
> > call:
> > 
> > 1) Does not allow the benefits to be widely applied (only modified
> > applications will benefit). Is not portable across different operating systems. 
> > 
> > Removing the vmstat_work interruption is a benefit for HPC workloads, 
> > for example (in fact, it is a benefit for any kind of application, 
> > since the interruption causes cache misses).
> > 
> > 2) Increases the system call cost for applications which would use
> > the interface.
> > 
> > So avoiding the vmstat_update interruption, without userspace
> > knowledge or modifications, is a better solution than a modified
> > userspace.
> 
> Another important point is this: if an application dirties
> its own per-CPU vmstat cache, while performing a system call,

Or while handling a VM-exit from a vCPU.

These are, in my mind, sufficient reasons to discard the "flush per-cpu
caches" idea. This is also why I chose to abandon the prctl interface
patchset.

> and a vmstat sync event is triggered on a different CPU, you'd have to:
> 
> 1) Wait for that CPU to return to userspace and sync its stats
> (unfeasible).
> 
> 2) Queue work to execute on that CPU (undesirable, as that causes
> an interruption).
> 
> 3) Remotely sync the vmstat for that CPU.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-19 11:59       ` Marcelo Tosatti
@ 2023-04-19 12:24         ` Frederic Weisbecker
  2023-04-19 13:48           ` Marcelo Tosatti
  0 siblings, 1 reply; 56+ messages in thread
From: Frederic Weisbecker @ 2023-04-19 12:24 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andrew Morton, Christoph Lameter, Aaron Tomlin, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko

Le Wed, Apr 19, 2023 at 08:59:28AM -0300, Marcelo Tosatti a écrit :
> On Wed, Apr 19, 2023 at 08:29:47AM -0300, Marcelo Tosatti wrote:
> > On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
> > > This was tried before:
> > > https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/
> > > 
> > > My conclusion from that discussion (and work) is that a special system
> > > call:
> > > 
> > > 1) Does not allow the benefits to be widely applied (only modified
> > > applications will benefit). Is not portable across different operating systems. 
> > > 
> > > Removing the vmstat_work interruption is a benefit for HPC workloads, 
> > > for example (in fact, it is a benefit for any kind of application, 
> > > since the interruption causes cache misses).
> > > 
> > > 2) Increases the system call cost for applications which would use
> > > the interface.
> > > 
> > > So avoiding the vmstat_update interruption, without userspace
> > > knowledge or modifications, is a better solution than a modified
> > > userspace.
> > 
> > Another important point is this: if an application dirties
> > its own per-CPU vmstat cache, while performing a system call,
> 
> Or while handling a VM-exit from a vCPU.
> 
> These are, in my mind, sufficient reasons to discard the "flush per-cpu
> caches" idea. This is also why I chose to abandon the prctl interface
> patchset.

If you're running your isolated workloads on guests, which sounds quite
challenging but I guess you guys managed, I'd expect that VMEXITs are
absolutely out of the question while the task runs critical code, so I'm not
sure why you would care. I guess not only your guests but also your hosts
run nohz_full, right?

I can't tell if the prctl solution which quiesces everything is the solution
for you, I don't know well enough your workloads, but I would expect that
the pattern is as follows:

1) Arrange for full isolation (no more interrupts/exceptions/VMEXITs)
2) Run critical code
3) Optionally do something once you're done

If vmstat is going to be the only thing to wait for on 1), then the remote
solution looks good enough (although I leave that to -mm guys as I'm too
clueless about those matters), if there is more to be expected, I guess the
quiescing prctl (or whatever syscall) is something to consider.

Thanks.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-19 11:15     ` Marcelo Tosatti
@ 2023-04-19 13:44       ` Andrew Theurer
  2023-04-20  7:55         ` Michal Hocko
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Theurer @ 2023-04-19 13:44 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andrew Morton, Christoph Lameter, Aaron Tomlin,
	Frederic Weisbecker, linux-kernel, linux-mm, Russell King,
	Huacai Chen, Heiko Carstens, x86, Vlastimil Babka, Michal Hocko



> On Apr 19, 2023, at 6:15 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
>> On Tue, Apr 18, 2023 at 03:02:00PM -0700, Andrew Morton wrote:
>>> On Mon, 20 Mar 2023 15:03:32 -0300 Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>> 
>>>> This patch series addresses the following two problems:
>>>> 
>>>> 1. A customer provided evidence indicating that a process
>>>>   was stalled in direct reclaim:
>>>> 
>>>> ...
>>>> 
>>>> 2. With a task that busy loops on a given CPU,
>>>>    the kworker interruption to execute vmstat_update
>>>>    is undesired and may exceed latency thresholds
>>>>    for certain applications.
>>>> 
>>> 
>>> I don't think I'll be sending this upstream in the next merge window. 
>>> Because it isn't clear that the added complexity in vmstat handling is
>>> justified.
>> 
>> From my POV this is an incorrect statement (that the complexity in
>> vmstat handling is not justified).
>> 
>> Andrew, this is the 3rd attempt to fix this problem:
>> 
>> First try:  https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/
>> 
>> Second try: https://patchew.org/linux/20230105125218.031928326@redhat.com/
>> 
>> Third try: syncing vmstats remotely from vmstat_shepherd (this
>> patchset).
>> 
>> And also, can you please explain: what is so complicated about the
>> vmstat handling? cmpxchg has been around and is used all over the
>> kernel, and nobody considers it "excessively complicated".
>> 
>>> - Michal's request for more clarity on the end-user requirements
>>>  seems reasonable.
>> 
>> And I explained to Michal in great detail where the end-user 
>> requirements come from. For virtualized workloads, there are two
>> types of use-cases:
>> 
>> 1) For example, for the MAC scheduler processing must occur every 1ms,
>> and a certain amount of computation takes place (and must finish before
>> the next 1ms timeframe). A > 50us latency spike as observed by cyclictest
>> is considered a "failure".
>> 
>> I showed him a 7us trace caused by the vmstat_update kworker, and
>> explained that it will extend to > 50us in the case of a virtualized vCPU.
>> 
>> 2) PLCs. These workloads will also suffer > 50us latency spikes
>> which is undesirable.
>> 
>> Can you please explain what additional clarity is required?
>> 
>> RH's performance team, for example, has been performing packet
>> latency tests and waiting for this issue to be fixed for about 2
>> years now.
>> 
>> Andrew Theurer, can you please explain what problem the vmstat_work
>> interruption is causing in your testing?
> 
> +CC Andrew.

Nearly every telco we work with for 5G RAN is demanding <20 usec CPU latency as measured by cyclictest & oslat.  We cannot achieve under 20 usec with the vmstats interruption.

-Andrew

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-19 12:24         ` Frederic Weisbecker
@ 2023-04-19 13:48           ` Marcelo Tosatti
  2023-04-19 14:35             ` Michal Hocko
  0 siblings, 1 reply; 56+ messages in thread
From: Marcelo Tosatti @ 2023-04-19 13:48 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, Christoph Lameter, Aaron Tomlin, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko

On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote:
> Le Wed, Apr 19, 2023 at 08:59:28AM -0300, Marcelo Tosatti a écrit :
> > On Wed, Apr 19, 2023 at 08:29:47AM -0300, Marcelo Tosatti wrote:
> > > On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
> > > > This was tried before:
> > > > https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/
> > > > 
> > > > My conclusion from that discussion (and work) is that a special system
> > > > call:
> > > > 
> > > > 1) Does not allow the benefits to be widely applied (only modified
> > > > applications will benefit). Is not portable across different operating systems. 
> > > > 
> > > > Removing the vmstat_work interruption is a benefit for HPC workloads, 
> > > > for example (in fact, it is a benefit for any kind of application, 
> > > > since the interruption causes cache misses).
> > > > 
> > > > 2) Increases the system call cost for applications which would use
> > > > the interface.
> > > > 
> > > > So avoiding the vmstat_update interruption, without userspace
> > > > knowledge or modifications, is a better solution than a modified
> > > > userspace.
> > > 
> > > Another important point is this: if an application dirties
> > > its own per-CPU vmstat cache, while performing a system call,
> > 
> > Or while handling a VM-exit from a vCPU.
> > 
> > These are, in my mind, sufficient reasons to discard the "flush per-cpu
> > caches" idea. This is also why I chose to abandon the prctl interface
> > patchset.
> 
> If you're running your isolated workloads on guests, which sounds quite
> challenging but I guess you guys managed, I'd expect that VMEXITs are
> absolutely out of the question while the task runs critical code, so I'm not
> sure why you would care. I guess not only your guests but also your hosts
> run nohz_full, right?

The answer is: there are VM-exits, for example to write MSRs to program
the LAPIC timer.

Yes, both host and guest are nohz_full (but, for example, cyclictest
or a PLC program can call nanosleep in the guest, which translates to
MSR writes to program the LAPIC timer, which is a VM-exit).
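
For instance, even the per-cycle sleep that cyclictest performs in the
guest ends with the guest kernel arming the wakeup timer; with a
TSC-deadline LAPIC timer that is a WRMSR, which can trap to the host
depending on the configuration. A minimal sketch of that guest-side
pattern:

	#include <time.h>

	static void wait_cycle(void)
	{
		struct timespec ts = { .tv_sec = 0, .tv_nsec = 100 * 1000 };

		/* the guest kernel arms the LAPIC timer for this wakeup */
		clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, NULL);
	}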

> I can't tell if the prctl solution which quiesces everything is the solution
> for you, I don't know well enough your workloads, but I would expect that
> the pattern is as follows:
> 
> 1) Arrange for full isolation (no more interrupts/exceptions/VMEXITs)

Yes, this is the general scheme. Full isolation is automated by
tuned (realtime-virtual-host/realtime-virtual-guest profiles).

There are VM-exits in our use-case.
There might be use-cases where interrupts are desired.

For more details:
https://www.youtube.com/watch?v=SyhfctYqjc8

> 2) Run critical code
> 3) Optionally do something once you're done
> 
> If vmstat is going to be the only thing to wait for on 1), then the remote
> solution looks good enough (although I leave that to -mm guys as I'm too
> clueless about those matters), 

I am mostly clueless too, but I don't see a problem with the proposed
patch (and no one has pointed out any problem either).

> if there is more to be expected, I guess the
> quiescing prctl (or whatever syscall) is something to consider.
> 
> Thanks.

I don't know of anything else to consider ATM, and for all cases we have
analyzed so far there has always been the possibility to do the work remotely,
via RCU or some other locking scheme, rather than requiring the application
to be modified (which decreases the number of userspace applications that
can benefit).





^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-19 13:48           ` Marcelo Tosatti
@ 2023-04-19 14:35             ` Michal Hocko
  2023-04-19 16:35               ` Marcelo Tosatti
  0 siblings, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2023-04-19 14:35 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Frederic Weisbecker, Andrew Morton, Christoph Lameter,
	Aaron Tomlin, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Wed 19-04-23 10:48:03, Marcelo Tosatti wrote:
> On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote:
[...]
> > 2) Run critical code
> > 3) Optionally do something once you're done
> > 
> > If vmstat is going to be the only thing to wait for on 1), then the remote
> > solution looks good enough (although I leave that to -mm guys as I'm too
> > clueless about those matters), 
> 
> I am mostly clueless too, but I don't see a problem with the proposed
> patch (and no one has pointed out any problem either).

I really hate to repeat myself again. The biggest pushback has been on
a) justification and b) single purpose solution which is very likely
incomplete. For a) we are getting the story piece by piece which doesn't
speed up the process. You are proposing a non-trivial change to an
already convoluted code so having a solid justification is something
that shouldn't be all that surprising.

b) is what concerns me more though. There are other per-cpu specific
things going on that require some regular flushing. Just to mention
another one that your group has brought up: the memcg pcp caches.
Again with a non-trivial proposal to deal with that problem
[1]. It has turned out that we can do a simpler thing [2]. I do not
think it is a stretch to expect that similar things will pop out every
now and then and rather than dealing with each one in its own way it
kinda makes sense to come up with a more general concept so that all
those cases can be handled at a single place at least. All I hear about
that is that the code of those special applications would need to be
changed to use that. Well, true but is that bar so impractical that we
are going to grow kernel complexity and therefore a maintenance burden?
Everything for very specialized workloads?

[1] http://lkml.kernel.org/r/20221102020243.522358-1-leobras@redhat.com
[2] http://lkml.kernel.org/r/20230317134448.11082-1-mhocko@kernel.org
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-19 14:35             ` Michal Hocko
@ 2023-04-19 16:35               ` Marcelo Tosatti
  2023-04-20  8:40                 ` Michal Hocko
  2023-04-20 13:45                 ` Marcelo Tosatti
  0 siblings, 2 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-04-19 16:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Frederic Weisbecker, Andrew Morton, Christoph Lameter,
	Aaron Tomlin, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

Hi Michal,

On Wed, Apr 19, 2023 at 04:35:51PM +0200, Michal Hocko wrote:
> On Wed 19-04-23 10:48:03, Marcelo Tosatti wrote:
> > On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote:
> [...]
> > > 2) Run critical code
> > > 3) Optionally do something once you're done
> > > 
> > > If vmstat is going to be the only thing to wait for on 1), then the remote
> > > solution looks good enough (although I leave that to -mm guys as I'm too
> > > clueless about those matters), 
> > 
> > I am mostly clueless too, but I don't see a problem with the proposed
> > patch (and no one has pointed out any problem either).
> 
> I really hate to repeat myself again. The biggest pushback has been on
> a) justification and b) single purpose solution which is very likely
> incomplete. For a) we are getting the story piece by piece which doesn't
> speed up the process. You are proposing a non-trivial change to an
> already convoluted code so having a solid justification is something
> that shouldn't be all that surprising.

The justification is simple and concise:

 2. With a task that busy loops on a given CPU,
    the kworker interruption to execute vmstat_update
    is undesired and may exceed latency thresholds
    for certain applications.

Performance details for the kworker interruption:   

oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...

The example above shows an additional 7us for the

       	oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable 
latency threshold for certain applications.

---

An additional use-case is what has been noted by Andrew Theurer:

Nearly every telco we work with for 5G RAN is demanding <20 usec CPU latency
as measured by cyclictest & oslat.  We cannot achieve under 20 usec with 
the vmstats interruption.

---

It seems to me this is solid justification (it seems you want
particular microsecond values, but those depend on the application
and the CPU). The point is there are several applications which do not
want to be interrupted (we can ignore the specifics and focus on
that fact).

Moreover, unrelated interruptions might occur close in time
(for example, random function call IPIs generated by other
kernel subsystems), which renders the "let's just consider this
one application, running on this CPU, to decide what to do"
approach short-sighted.

> b) is what concerns me more though. There are other per-cpu specific
> things going on that require some regular flushing. Just to mention
> another one that your group has brought up: the memcg pcp caches.
> Again with a non-trivial proposal to deal with that problem
> [1]. 

Yes.

> It has turned out that we can do a simpler thing [2]. 

For the particular memcg case, there was a simpler fix.

For the vmstat_update case, I don't see a simpler fix.

> I do not think it is a stretch to expect that similar things will pop
> out every now and then

Agree.

> and rather than dealing with each one in its own way it
> kinda makes sense to come up with a more general concept so that all
> those cases can be handled at a single place at least. 

I can understand where you are coming from. Unfortunately,
for some cases it is appropriate to perform the work from a
remote CPU (and I think this is one such case).

> All I hear about
> that is that the code of those special applications would need to be
> changed to use that. 

This is a burden for application writers and for system configuration.

Or it could be done automatically (from outside of the application).
Which is what is described and implemented here:

https://lore.kernel.org/lkml/20220204173537.429902988@fedora.localdomain/

"Task isolation is divided in two main steps: configuration and
activation.

Each step can be performed by an external tool or the latency
sensitive application itself. util-linux contains the "chisol" tool
for this purpose."
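
The usage pattern is two steps via prctl (the names below are
illustrative only, not the actual ABI of that series; see the link
above for the real interface):

	/* configuration step: select which activities to quiesce
	 * (hypothetical command/flag names) */
	prctl(PR_TASK_ISOL_CFG_SET, TASK_ISOL_QUIESCE_VMSTAT, 0, 0, 0);

	/* activation step: flush pending per-CPU state before entering
	 * the latency sensitive loop (hypothetical command name) */
	prctl(PR_TASK_ISOL_ACTIVATE, 1, 0, 0, 0);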

But not only that, the second thing is:

"> Another important point is this: if an application dirties                                                                          
> its own per-CPU vmstat cache, while performing a system call,                                                                       

Or while handling a VM-exit from a vCPU.

This are, in my mind, sufficient reasons to discard the "flush per-cpu
caches" idea. This is also why i chose to abandon the prctrl interface
patchset.

> and a vmstat sync event is triggered on a different CPU, you'd have to:                                                             
>                                                                                                                                     
> 1) Wait for that CPU to return to userspace and sync its stats                                                                      
> (unfeasible).                                                                                                                       
>                                                                                                                                     
> 2) Queue work to execute on that CPU (undesirable, as that causes                                                                   
> an interruption).                                                                                                                   
>                                                                                                                                     
> 3) Remotely sync the vmstat for that CPU."

So the only option is to remotely sync vmstat for the CPU
(unless you have a better suggestion).

> Well, true but is that bar so impractical that we
> are going to grow kernel complexity and therefore a maintenance burden?

Honestly, this patchset is just using cmpxchg to transfer data from
per-CPU counters to global counters. I don't see why it's that
complicated.

If you have a suggestion on how to reduce the apparent complexity,
that would be great.

> Everything for very specialized workloads?

Well, the kernel has been increasing in complexity, and the maintenance
burden has increased as a side-effect, to accommodate more workloads
than it was initially designed for.

> [1] http://lkml.kernel.org/r/20221102020243.522358-1-leobras@redhat.com
> [2] http://lkml.kernel.org/r/20230317134448.11082-1-mhocko@kernel.org
> -- 
> Michal Hocko
> SUSE Labs
> 
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-19 11:29     ` Marcelo Tosatti
  2023-04-19 11:59       ` Marcelo Tosatti
@ 2023-04-19 16:47       ` Vlastimil Babka
  2023-04-19 19:15         ` Marcelo Tosatti
  1 sibling, 1 reply; 56+ messages in thread
From: Vlastimil Babka @ 2023-04-19 16:47 UTC (permalink / raw)
  To: Marcelo Tosatti, Andrew Morton
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Michal Hocko

On 4/19/23 13:29, Marcelo Tosatti wrote:
> On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
>> This was tried before:
>> https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/
>> 
>> My conclusion from that discussion (and work) is that a special system
>> call:
>> 
>> 1) Does not allow the benefits to be widely applied (only modified
>> applications will benefit). Is not portable across different operating systems. 
>> 
>> Removing the vmstat_work interruption is a benefit for HPC workloads, 
>> for example (in fact, it is a benefit for any kind of application, 
>> since the interruption causes cache misses).
>> 
>> 2) Increases the system call cost for applications which would use
>> the interface.
>> 
>> So avoiding the vmstat_update interruption, without userspace
>> knowledge or modifications, is a better solution than a modified
>> userspace.
> 
> Another important point is this: if an application dirties
> its own per-CPU vmstat cache, while performing a system call,
> and a vmstat sync event is triggered on a different CPU, you'd have to:
> 
> 1) Wait for that CPU to return to userspace and sync its stats
> (unfeasible).
> 
> 2) Queue work to execute on that CPU (undesirable, as that causes
> an interruption).

So you're saying the application might do a syscall from the isolcpu, so
IIUC it cannot expect any latency guarantees at that very moment, but then
it immediately starts expecting them again after returning to userspace, and
a single interruption for a one-time flush after the syscall would be too
intrusive?

(elsewhere in the thread you described an RT app initialization that may
generate vmstats to flush and then enter its userspace loop; again, would a
single interruption soon after entering the loop be so critical?)

> 3) Remotely sync the vmstat for that CPU.
> 
> 
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-19 16:47       ` Vlastimil Babka
@ 2023-04-19 19:15         ` Marcelo Tosatti
  2023-05-03 13:51           ` Marcelo Tosatti
  0 siblings, 1 reply; 56+ messages in thread
From: Marcelo Tosatti @ 2023-04-19 19:15 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Christoph Lameter, Aaron Tomlin,
	Frederic Weisbecker, linux-kernel, linux-mm, Russell King,
	Huacai Chen, Heiko Carstens, x86, Michal Hocko

On Wed, Apr 19, 2023 at 06:47:30PM +0200, Vlastimil Babka wrote:
> On 4/19/23 13:29, Marcelo Tosatti wrote:
> > On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
> >> This was tried before:
> >> https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/
> >> 
> >> My conclusion from that discussion (and work) is that a special system
> >> call:
> >> 
> >> 1) Does not allow the benefits to be widely applied (only modified
> >> applications will benefit). Is not portable across different operating systems. 
> >> 
> >> Removing the vmstat_work interruption is a benefit for HPC workloads, 
> >> for example (in fact, it is a benefit for any kind of application, 
> >> since the interruption causes cache misses).
> >> 
> >> 2) Increases the system call cost for applications which would use
> >> the interface.
> >> 
> >> So avoiding the vmstat_update interruption, without userspace
> >> knowledge or modifications, is a better solution than a modified
> >> userspace.
> > 
> > Another important point is this: if an application dirties
> > its own per-CPU vmstat cache, while performing a system call,
> > and a vmstat sync event is triggered on a different CPU, you'd have to:
> > 
> > 1) Wait for that CPU to return to userspace and sync its stats
> > (unfeasible).
> > 
> > 2) Queue work to execute on that CPU (undesirable, as that causes
> > an interruption).
> 
> So you're saying the application might do a syscall from the isolcpu, so
> IIUC it cannot expect any latency guarantees at that very moment, 

Why not? cyclictest uses nanosleep and it's the main tool for measuring
latency.

> but then
> it immediately starts expecting them again after returning to userspace, 

No, the expectation more generally is this:

For certain types of applications (for example PLC software or
RAN processing), upon occurrence of an event, it is necessary to
complete a certain task in a maximum amount of time (deadline).

One way to express this requirement is with a pair of numbers,
deadline time and execution time, where:

        * deadline time: length of time between event and deadline.
        * execution time: length of time it takes for processing of event
                          to occur on a particular hardware platform
                          (uninterrupted).
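
For example (hypothetical numbers): with a deadline time of 1ms and an
execution time of 950us, the budget left for all interruptions combined
is 1ms - 950us = 50us, so a single 50us stall already misses the
deadline.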

The particular values depend on use-case. For the case
where the realtime application executes in a virtualized
guest, an interruption which must be serviced in the host will cause
the following sequence of events:

        1) VM-exit
        2) execution of IPI (and function call) (or switch to kwork
	thread to execute some work item).
        3) VM-entry

This causes an excess of 50us of latency as observed by cyclictest
(which violates the latency requirement of a vRAN application with a
1ms TTI, for example).

> and
> a single interruption for a one-time flush after the syscall would be too
> intrusive?

Generally, if you can't complete the task (which involves executing a
number of instructions) before the deadline, then it's a problem.

One-time flush? You mean to switch between:

rt-app -> kworker (to execute vmstat_update flush) -> rt-app

My measurement, which probably had vmstat_update code/data in cache, took 7us.
It might be the case that the code to execute must be brought in from
memory, which takes even longer.

> (elsewhere in the thread you described an RT app initialization that may
> generate vmstats to flush and then enter its userspace loop; again, would a
> single interruption soon after entering the loop be so critical?)

1) It depends on the application. For the use-case above, where < 50us
interruption is desired, yes it is critical.

2) The interruptions can come from different sources.

Time
0			rt-app executing instruction 1
1			rt-app executing instruction 2
2			scheduler switches between rt-app and kworker
3			kworker runs vmstat_work
4			scheduler switches between kworker and rt-app
5			rt-app executing instruction 3
6			IPI to handle a KVM request
7			fill in your preferred IPI handler

So the argument "a single interruption might not cause your deadline
to be exceeded" fails (because the time to handle the 
different interruptions might sum).

Does that make sense?

> > 3) Remotely sync the vmstat for that CPU.
> > 
> > 
> > 
> 
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-19 13:44       ` Andrew Theurer
@ 2023-04-20  7:55         ` Michal Hocko
  2023-04-23  1:25           ` Marcelo Tosatti
  0 siblings, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2023-04-20  7:55 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: Marcelo Tosatti, Andrew Morton, Christoph Lameter, Aaron Tomlin,
	Frederic Weisbecker, linux-kernel, linux-mm, Russell King,
	Huacai Chen, Heiko Carstens, x86, Vlastimil Babka

On Wed 19-04-23 08:44:23, Andrew Theurer wrote:
> > On Apr 19, 2023, at 6:15 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >> Andrew Theurer, can you please explain what problem the vmstat_work
> >> interruption is causing in your testing?
> > 
> > +CC Andrew.
> 
> Nearly every telco we work with for 5G RAN is demanding <20 usec CPU
> latency as measured by cyclictest & oslat.  We cannot achieve under 20
> usec with the vmstats interruption.

Are you able to get those latency requirements with PREEMPT_RT?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-19 16:35               ` Marcelo Tosatti
@ 2023-04-20  8:40                 ` Michal Hocko
  2023-04-23  1:10                   ` Marcelo Tosatti
  2023-04-20 13:45                 ` Marcelo Tosatti
  1 sibling, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2023-04-20  8:40 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Frederic Weisbecker, Andrew Morton, Christoph Lameter,
	Aaron Tomlin, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Wed 19-04-23 13:35:12, Marcelo Tosatti wrote:
[...]
> This is a burden for application writers and for system configuration.

Yes. And I find it reasonable to expect that burden to be put there, as
there are non-trivial requirements for those workloads anyway. It is not
an out-of-the-box thing, right?

> Or it could be done automatically (from outside of the application).
> Which is what is described and implemented here:
> 
> https://lore.kernel.org/lkml/20220204173537.429902988@fedora.localdomain/
> 
> "Task isolation is divided in two main steps: configuration and
> activation.
> 
> Each step can be performed by an external tool or the latency
> sensitive application itself. util-linux contains the "chisol" tool
> for this purpose."

I cannot say I would be a fan of prctl interfaces in general but I do
agree with the overall idea of forcing a quiescent state on a set of
CPUs.

> But not only that, the second thing is:
> 
> "> Another important point is this: if an application dirties                                                                          
> > its own per-CPU vmstat cache, while performing a system call,                                                                       
> 
> Or while handling a VM-exit from a vCPU.

Do you have any specific examples on this?

> These are, in my mind, sufficient reasons to discard the "flush per-cpu
> caches" idea. This is also why I chose to abandon the prctl interface
> patchset.
> 
> > and a vmstat sync event is triggered on a different CPU, you'd have to:
> > 
> > 1) Wait for that CPU to return to userspace and sync its stats
> > (unfeasible).
> > 
> > 2) Queue work to execute on that CPU (undesirable, as that causes
> > an interruption).
> > 
> > 3) Remotely sync the vmstat for that CPU."
> 
> So the only option is to remotely sync vmstat for the CPU
> (unless you have a better suggestion).

`echo 1 > /proc/sys/vm/stat_refresh' achieves essentially the same
without any kernel changes.

But let me repeat, this is not just about vmstats. Just have a look at
other queue_work_on users. You do not want to hand-pick each and every
one and do so in the future as well.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-19 16:35               ` Marcelo Tosatti
  2023-04-20  8:40                 ` Michal Hocko
@ 2023-04-20 13:45                 ` Marcelo Tosatti
  2023-04-26 14:34                   ` Marcelo Tosatti
  2023-04-26 15:04                   ` Vlastimil Babka
  1 sibling, 2 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-04-20 13:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Frederic Weisbecker, Andrew Morton, Christoph Lameter,
	Aaron Tomlin, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Wed, Apr 19, 2023 at 01:35:12PM -0300, Marcelo Tosatti wrote:
> Hi Michal,
> 
> On Wed, Apr 19, 2023 at 04:35:51PM +0200, Michal Hocko wrote:
> > On Wed 19-04-23 10:48:03, Marcelo Tosatti wrote:
> > > On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote:
> > [...]
> > > > 2) Run critical code
> > > > 3) Optionally do something once you're done
> > > > 
> > > > If vmstat is going to be the only thing to wait for on 1), then the remote
> > > > solution looks good enough (although I leave that to -mm guys as I'm too
> > > > clueless about those matters), 
> > > 
> > > I am mostly clueless too, but I don't see a problem with the proposed
> > > patch (and no one has pointed out any problem either).
> > 
> > I really hate to repeat myself again. The biggest pushback has been on
> > a) justification and b) single purpose solution which is very likely
> > incomplete. For a) we are getting the story piece by piece which doesn't
> > speed up the process. You are proposing a non-trivial change to an
> > already convoluted code so having a solid justification is something
> > that shouldn't be all that surprising.
> 
> The justification is simple and concise:
> 
>  2. With a task that busy loops on a given CPU,
>     the kworker interruption to execute vmstat_update
>     is undesired and may exceed latency thresholds
>     for certain applications.
> 
> Performance details for the kworker interruption:   
> 
> oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> 
> The example above shows an additional 7us for the
> 
>        	oslat -> kworker -> oslat
> 
> switches. In the case of a virtualized CPU, and the vmstat_update
> interruption in the host (of a qemu-kvm vcpu), the latency penalty
> observed in the guest is higher than 50us, violating the acceptable 
> latency threshold for certain applications.
> 
> ---
> 
> An additional use-case is what has been noted by Andrew Theurer:
> 
> Nearly every telco we work with for 5G RAN is demanding <20 usec CPU latency
> as measured by cyclictest & oslat.  We cannot achieve under 20 usec with 
> the vmstats interruption.
> 
> ---
> 
> It seems to me this is solid justification (it seems you want 
> particular microsecond values, but those depend on the application
> and the CPU). The point is there are several applications which do not
> want to be interrupted (we can ignore the specifics and focus on
> that fact).
> 
> Moreover, unrelated interruptions might occur close in time
> (for example, random function call IPIs generated by other
> kernel subsystems), which renders the "let's just consider this
> one application, running on this CPU, to decide what to do"
> short-sighted.
> 
> > b) is what concerns me more though. There are other per-cpu specific
> > things going on that require some regular flushing. Just to mention
> > another one that your group has brought up was the memcg pcp
> > caches. Again with a non-trivial proposal to deal with that problem
> > [1]. 
> 
> Yes.
> 
> > It has turned out that we can do a simpler thing [2]. 
> 
> For the particular memcg case, there was a simpler fix.
> 
> For the vmstat_update case, I don't see a simpler fix.
> 
> > I do not think it is a stretch to expect that similar things will pop
> > out every now and then
> 
> Agree.
> 
> > and rather than dealing with each one in its own way it
> > kinda makes sense to come up with a more general concept so that all
> > those cases can be handled at a single place at least. 
> 
> I can understand where you are coming from. Unfortunately,
> for some cases it is appropriate to perform the work from a
> remote CPU (and i think this is one such case).
> 
> > All I hear about
> > that is that the code of those special applications would need to be
> > changed to use that. 
> 
> This is a burden for application writers and for system configuration.
> 
> Or it could be done automatically (from outside of the application).
> Which is what is described and implemented here:
> 
> https://lore.kernel.org/lkml/20220204173537.429902988@fedora.localdomain/
> 
> "Task isolation is divided in two main steps: configuration and
> activation.
> 
> Each step can be performed by an external tool or the latency
> sensitive application itself. util-linux contains the "chisol" tool
> for this purpose."
> 
> But not only that, the second thing is:
> 
> "> Another important point is this: if an application dirties                                                                          
> > its own per-CPU vmstat cache, while performing a system call,                                                                       
> 
> Or while handling a VM-exit from a vCPU.
> 
> These are, in my mind, sufficient reasons to discard the "flush per-cpu
> caches" idea. This is also why I chose to abandon the prctl interface
> patchset.

There are additional details that were not mentioned. When we think
of flushing caches, or disabling per-CPU caches, this means that the
isolated application loses the benefit of those caches (which means you
are turning a "general purpose" programming environment into a
potentially slower environment for applications to execute).

(yes, of course, one has to be mindful of which system calls can be
used, for example the execution time of system calls which take locks will
depend on whether, and how many, users of those locks there are at a
given moment).

For example, if we flush the vm-stats at every system call return,
and the application happens to switch between different phases of 

	isolated, userspace only behaviour (computation)
	system call intensive behaviour

(which is an HPC example Thomas Gleixner mentioned in a different
thread), then you see the disadvantage of the "special" environment
where flushing is performed on return from system calls.
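
To make the pattern concrete, here is a minimal sketch of such a
phase-switching application (the function name, file descriptors and
buffer size are invented for illustration):

	#include <unistd.h>

	static char buf[1 << 20];

	/* pure userspace computation: no kernel entry, so the per-CPU
	 * vmstat cache is not dirtied and no flush is needed */
	static void compute_set(void)
	{
		for (size_t i = 0; i < sizeof(buf); i++)
			buf[i] ^= 0x5a;
	}

	int main(void)
	{
		for (;;) {
			read(0, buf, sizeof(buf));   /* syscall-intensive */
			compute_set();               /* isolated computation */
			write(1, buf, sizeof(buf));  /* syscall-intensive */
		}
	}

With flush-on-syscall-return, the syscall-intensive phases pay the flush
cost on every single system call; with a remote fold, neither phase is
touched.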

So it seems to me (unless there are points that show otherwise, which
would indicate that explicit userspace interfaces are preferred) that _not_
requiring userspace changes is a superior solution.

Perhaps the complexity should be judged for individual cases 
of interruptions, and if a given interruption-free conversion 
is seen as too complex, then a "disable feature which makes use of per-CPU
caches" style solution can be made (and then userspace has to
explicitly request for that per-CPU feature to be disabled).

But I don't see that this patchset introduces unmanageable complexity,
and neither did:

01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock




* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-20  8:40                 ` Michal Hocko
@ 2023-04-23  1:10                   ` Marcelo Tosatti
  0 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-04-23  1:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Frederic Weisbecker, Andrew Morton, Christoph Lameter,
	Aaron Tomlin, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Thu, Apr 20, 2023 at 10:40:25AM +0200, Michal Hocko wrote:
> On Wed 19-04-23 13:35:12, Marcelo Tosatti wrote:
> [...]
> > This is a burden for application writers and for system configuration.
> 
> Yes. And I find it reasonable to expect that burden put there as there
> are non-trivial requirements for those workloads anyway. It is not
> out-of-the-box thing, right?

See below.

> > Or it could be done automatically (from outside of the application).
> > Which is what is described and implemented here:
> > 
> > https://lore.kernel.org/lkml/20220204173537.429902988@fedora.localdomain/
> > 
> > "Task isolation is divided in two main steps: configuration and
> > activation.
> > 
> > Each step can be performed by an external tool or the latency
> > sensitive application itself. util-linux contains the "chisol" tool
> > for this purpose."
> 
> I cannot say I would be a fan of prctl interfaces in general but I do
> agree with the overall idea of forcing a quiescent state on a set of
> CPUs.

This has been avoided with success so far.

> > But not only that, the second thing is:
> > 
> > "> Another important point is this: if an application dirties
> > > its own per-CPU vmstat cache, while performing a system call,
> > 
> > Or while handling a VM-exit from a vCPU.
> 
> Do you have any specific examples of this?

Interrupt handling freeing a page. 

handle_access_fault             (ARM64) ->
handle_changed_spte_acc_track   (x86)   -> kvm_set_pfn_accessed -> kvm_set_page_accessed -> mark_page_accessed ->
                                           folio_mark_accessed -> folio_activate -> folio_activate_fn ->
                                           lruvec_add_folio -> update_lru_size -> __update_lru_size ->
                                           __mod_zone_page_state(&pgdat->node_zones[zid],
                                                                 NR_ZONE_LRU_BASE + lru, nr_pages);


The other option would be to _FORBID_ the use of __mod_zone_page_state in certain code sections.
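
Something along these lines could catch offenders at runtime (a sketch
only; the flag and helpers below are invented for illustration and do
not exist in the tree):

	/* hypothetical per-CPU flag marking sections which promised not
	 * to dirty the per-CPU vmstat caches */
	static DEFINE_PER_CPU(bool, vmstat_forbidden);

	static inline void vmstat_forbid_begin(void)
	{
		this_cpu_write(vmstat_forbidden, true);
	}

	static inline void vmstat_forbid_end(void)
	{
		this_cpu_write(vmstat_forbidden, false);
	}

	/* then, in __mod_zone_page_state and friends: */
	WARN_ON_ONCE(this_cpu_read(vmstat_forbidden));

But given call chains like the one above, auditing and annotating every
such section does not look practical.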

> > These are, in my mind, sufficient reasons to discard the "flush per-cpu
> > caches" idea. This is also why I chose to abandon the prctl interface
> > patchset.
> > 
> > > and a vmstat sync event is triggered on a different CPU, you'd have to:                                                             
> > >                                                                                                                                     
> > > 1) Wait for that CPU to return to userspace and sync its stats
> > > (unfeasible).
> > >
> > > 2) Queue work to execute on that CPU (undesirable, as that causes
> > > an interruption).
> > >
> > > 3) Remotely sync the vmstat for that CPU."
> > 
> > So the only option is to remotely sync vmstat for the CPU
> > (unless you have a better suggestion).
> 
> `echo 1 > /proc/sys/vm/stat_refresh' achieves essentially the same
> without any kernel changes.

It is unsuitable. You'd have to guarantee that, by the time you return
from the write() system call to that file, there has been no other
mod_zone_page_state call: for example, no interrupt or exception that
frees or allocates a page through rmqueue (touching the NR_FREE_PAGES
counter), and no call to bounce_end_io (which calls
dec_zone_page_state).

It has been used internally as a workaround, but it is not reliable.
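
A hypothetical interleaving illustrating the problem (CPU numbers
invented):

	CPU0 (housekeeping)                     CPU5 (isolated, nohz_full)
	echo 1 > /proc/sys/vm/stat_refresh
	  vmstat_refresh()
	    schedule_on_each_cpu(refresh_vm_stats)
	  write() returns, stats look clean
	                                        interrupt allocates a page via
	                                        rmqueue -> __mod_zone_page_state()
	                                        the per-CPU NR_FREE_PAGES delta
	                                        is dirty again

By the time the caller acts on the "refreshed" counters, they can
already be stale again.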

> But let me repeat, this is not just about vmstats. Just have a look at
> other queue_work_on users. You do not want to hand-pick each and every
> one, and keep doing so in the future as well.

The problematic ones have been getting fixed for some time now. For example:

commit 2de79ee27fdb52626ac4ac48ec6d8d52ba6f9047
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Thu Sep 10 23:33:18 2020 +0200

    net: try to avoid unneeded backlog flush

    flush_all_backlogs() may cause deadlock on systems
    running processes with FIFO scheduling policy.

    The above is critical in -RT scenarios, where user-space
    specifically ensure no network activity is scheduled on
    the CPU running the mentioned FIFO process, but still get
    stuck.

    This commit tries to address the problem checking the
    backlog status on the remote CPUs before scheduling the
    flush operation. If the backlog is empty, we can skip it.

    v1 -> v2:
     - explicitly clear flushed cpu mask - Eric

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>


And that has been the normal process so far.

I think what needs to be done is to prevent new queue_work_on()
users from being introduced into the tree (the number of
existing ones is finite and can therefore be fixed).



* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-20  7:55         ` Michal Hocko
@ 2023-04-23  1:25           ` Marcelo Tosatti
  0 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-04-23  1:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Theurer, Andrew Morton, Christoph Lameter, Aaron Tomlin,
	Frederic Weisbecker, linux-kernel, linux-mm, Russell King,
	Huacai Chen, Heiko Carstens, x86, Vlastimil Babka

On Thu, Apr 20, 2023 at 09:55:40AM +0200, Michal Hocko wrote:
> On Wed 19-04-23 08:44:23, Andrew Theurer wrote:
> > > On Apr 19, 2023, at 6:15 AM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > >> Andrew Theurer, can you please explain what problem is the vmstat_work
> > >> interruption causing in your testing?
> > > 
> > > +CC Andrew.
> > 
> > Nearly every telco we work with for 5G RAN is demanding <20 usec CPU
> > latency as measured by cyclictest & oslat.  We cannot achieve under 20
> > usec with the vmstats interruption.
> 
> Are you able to get those latency requirements with PREEMPT_RT?

What do you mean, exactly?

PREEMPT_RT allows for the preemption of tasks in kernel context
(so that higher-priority tasks can preempt lower-priority tasks).
It also enables IRQ handling to happen in thread context
(so that a given thread can be given higher priority than a particular
IRQ handler).

If the question is "Are you able to achieve <20 usec latency while
allowing switching between different tasks?", then the answer with
current processor and memory speeds is probably no.
But with more performant processors, you might.

However, with a fully isolated processor which does not require
switching between tasks, yes, you can achieve <20 usec latency.




* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-20 13:45                 ` Marcelo Tosatti
@ 2023-04-26 14:34                   ` Marcelo Tosatti
  2023-04-27  8:31                     ` Michal Hocko
  2023-04-26 15:04                   ` Vlastimil Babka
  1 sibling, 1 reply; 56+ messages in thread
From: Marcelo Tosatti @ 2023-04-26 14:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Frederic Weisbecker, Andrew Morton, Christoph Lameter,
	Aaron Tomlin, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Thu, Apr 20, 2023 at 10:45:20AM -0300, Marcelo Tosatti wrote:
> On Wed, Apr 19, 2023 at 01:35:12PM -0300, Marcelo Tosatti wrote:
> > Hi Michal,
> > 
> > On Wed, Apr 19, 2023 at 04:35:51PM +0200, Michal Hocko wrote:
> > > On Wed 19-04-23 10:48:03, Marcelo Tosatti wrote:
> > > > On Wed, Apr 19, 2023 at 02:24:01PM +0200, Frederic Weisbecker wrote:
> > > [...]
> > > > > 2) Run critical code
> > > > > 3) Optionally do something once you're done
> > > > > 
> > > > > If vmstat is going to be the only thing to wait for on 1), then the remote
> > > > > solution looks good enough (although I leave that to -mm guys as I'm too
> > > > > clueless about those matters), 
> > > > 
> > > > I am mostly clueless too, but I don't see a problem with the proposed
> > > > patch (and no one has pointed out any problem either).
> > > 
> > > I really hate to repeat myself again. The biggest pushback has been on
> > > a) justification and b) single purpose solution which is very likely
> > > incomplete. For a) we are getting the story piece by piece which doesn't
> > > speed up the process. You are proposing a non-trivial change to an
> > > already convoluted code so having a solid justification is something
> > > that shouldn't be all that surprising.
> > 
> > The justification is simple and concise:
> > 
> >  2. With a task that busy loops on a given CPU,
> >     the kworker interruption to execute vmstat_update
> >     is undesired and may exceed latency thresholds
> >     for certain applications.
> > 
> > Performance details for the kworker interruption:   
> > 
> > oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
> > oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
> > oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
> > kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
> > 
> > The example above shows an additional 7us for the
> > 
> >        	oslat -> kworker -> oslat
> > 
> > switches. In the case of a virtualized CPU, and the vmstat_update
> > interruption in the host (of a qemu-kvm vcpu), the latency penalty
> > observed in the guest is higher than 50us, violating the acceptable 
> > latency threshold for certain applications.
> > 
> > ---
> > 
> > An additional use-case is what has been noted by Andrew Theurer:
> > 
> > Nearly every telco we work with for 5G RAN is demanding <20 usec CPU latency
> > as measured by cyclictest & oslat.  We cannot achieve under 20 usec with 
> > the vmstats interruption.
> > 
> > ---
> > 
> > It seems to me this is solid justification (it seems you want 
> > particular microsecond values, but those depend on the application
> > and the CPU). The point is there are several applications which do not
> > want to be interrupted (we can ignore the specifics and focus on
> > that fact).
> > 
> > Moreover, unrelated interruptions might occur close in time
> > (for example, random function call IPIs generated by other
> > kernel subsystems), which renders the "let's just consider this
> > one application, running on this CPU, to decide what to do"
> > short-sighted.
> > 
> > > b) is what concerns me more though. There are other per-cpu specific
> > > things going on that require some regular flushing. Just to mention
> > > another one that your group has brought up was the memcg pcp
> > > caches. Again with a non-trivial proposal to deal with that problem
> > > [1]. 
> > 
> > Yes.
> > 
> > > It has turned out that we can do a simpler thing [2]. 
> > 
> > For the particular memcg case, there was a simpler fix.
> > 
> > For the vmstat_update case, I don't see a simpler fix.
> > 
> > > I do not think it is a stretch to expect that similar things will pop
> > > out every now and then
> > 
> > Agree.
> > 
> > > and rather than dealing with each one in its own way it
> > > kinda makes sense to come up with a more general concept so that all
> > > those cases can be handled at a single place at least. 
> > 
> > I can understand where you are coming from. Unfortunately,
> > for some cases it is appropriate to perform the work from a
> > remote CPU (and i think this is one such case).
> > 
> > > All I hear about
> > > that is that the code of those special applications would need to be
> > > changed to use that. 
> > 
> > This is a burden for application writers and for system configuration.
> > 
> > Or it could be done automatically (from outside of the application).
> > Which is what is described and implemented here:
> > 
> > https://lore.kernel.org/lkml/20220204173537.429902988@fedora.localdomain/
> > 
> > "Task isolation is divided in two main steps: configuration and
> > activation.
> > 
> > Each step can be performed by an external tool or the latency
> > sensitive application itself. util-linux contains the "chisol" tool
> > for this purpose."
> > 
> > But not only that, the second thing is:
> > 
> > "> Another important point is this: if an application dirties                                                                          
> > > its own per-CPU vmstat cache, while performing a system call,                                                                       
> > 
> > Or while handling a VM-exit from a vCPU.
> > 
> > These are, in my mind, sufficient reasons to discard the "flush per-cpu
> > caches" idea. This is also why I chose to abandon the prctl interface
> > patchset.
> 
> There are additional details that were not mentioned. When we think
> of flushing caches, or disabling per-CPU caches, this means that the
> isolated application loses the benefit of those caches (which means you
> are turning a "general purpose" programming environment into a
> potentially slower environment for applications to execute).

https://www.uwsg.indiana.edu/hypermail/linux/kernel/2012.0/06823.html

> (yes, of course, one has to be mindful of which system calls can be
> used, for example the execution time of system calls which take locks will
> depend on whether, and how many, users of those locks there are at a
> given moment).
> 
> For example, if we flush the vm-stats at every system call return,
> and the application happens to switch between different phases of 
> 
> 	isolated, userspace only behaviour (computation)
> 	system call intensive behaviour
> 
> (which is an HPC example Thomas Gleixner mentioned in a different
> thread), then you see the disadvantage of the "special" environment
> where flushing is performed on return from system calls.

https://lore.kernel.org/linux-mm/87im9d4ezq.fsf@nanos.tec.linutronix.de/

"In a real-world usecase we had the situation of compute bursts and an
unfortunate hw enforced requirement to go into the kernel between them
for synchronization between the compute threads and hardware (A quick
hardware assisted save/load).

Unmodified NOHZ full accumulated to more than 6% loss compared to a
fully undisturbed run. Most of it was caused by cache effects and not by
the actually used CPU time.

A full enforced quiescing upfront gained ~2-3%, but a lazy approach of
accepting that some stuff might happen once and does not happen again
gained almost 5%. In that particular scenario 5% _is_ a huge win."

> 
> So it seems to me (unless there are points that show otherwise, which
> would indicate that explicit userspace interfaces are preferred) that _not_
> requiring userspace changes is a superior solution.
> 
> Perhaps the complexity should be judged for individual cases 
> of interruptions, and if a given interruption-free conversion 
> is seen as too complex, then a "disable feature which makes use of per-CPU
> caches" style solution can be made (and then userspace has to
> explicitly request for that per-CPU feature to be disabled).
> 
> But I don't see that this patchset introduces unmanageable complexity,
> and neither did:
> 
> 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
> 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock
> 
> 



* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-20 13:45                 ` Marcelo Tosatti
  2023-04-26 14:34                   ` Marcelo Tosatti
@ 2023-04-26 15:04                   ` Vlastimil Babka
  2023-04-26 16:10                     ` Marcelo Tosatti
  1 sibling, 1 reply; 56+ messages in thread
From: Vlastimil Babka @ 2023-04-26 15:04 UTC (permalink / raw)
  To: Marcelo Tosatti, Michal Hocko
  Cc: Frederic Weisbecker, Andrew Morton, Christoph Lameter,
	Aaron Tomlin, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86

On 4/20/23 15:45, Marcelo Tosatti wrote:
> Perhaps the complexity should be judged for individual cases 
> of interruptions, and if a given interruption-free conversion 
> is seen as too complex, then a "disable feature which makes use of per-CPU
> caches" style solution can be made (and then userspace has to
> explicitly request for that per-CPU feature to be disabled).
> 
> But I don't see that this patchset introduces unmanageable complexity,
> and neither did:
> 
> 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
> 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock

Well that one is a bit different, as there was one kind of lock replaced
with another kind of lock; the lock is uncontended unless remote flushes
are happening, so it does not cause extra overhead for the fast paths,
and later even the IRQ disabling was removed, which should improve
things further. But this patchset turns every vmstat counter increment
into a cmpxchg.



* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-26 15:04                   ` Vlastimil Babka
@ 2023-04-26 16:10                     ` Marcelo Tosatti
  2023-04-27  8:39                       ` Michal Hocko
  0 siblings, 1 reply; 56+ messages in thread
From: Marcelo Tosatti @ 2023-04-26 16:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Frederic Weisbecker, Andrew Morton,
	Christoph Lameter, Aaron Tomlin, linux-kernel, linux-mm,
	Russell King, Huacai Chen, Heiko Carstens, x86

On Wed, Apr 26, 2023 at 05:04:49PM +0200, Vlastimil Babka wrote:
> On 4/20/23 15:45, Marcelo Tosatti wrote:
> > Perhaps the complexity should be judged for individual cases 
> > of interruptions, and if a given interruption-free conversion 
> > is seen as too complex, then a "disable feature which makes use of per-CPU
> > caches" style solution can be made (and then userspace has to
> > explicitly request for that per-CPU feature to be disabled).
> > 
> > But I don't see that this patchset introduces unmanageable complexity,
> > and neither did:
> > 
> > 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
> > 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock
> 
> Well that one is a bit different, as there was one kind of lock replaced
> with another kind of lock;

local_lock compiles down to essentially nothing (no actual lock) when
CONFIG_PREEMPT_RT is not set, so for the !CONFIG_PREEMPT_RT case the
change did introduce a lock.

> the lock is uncontended unless remote
> flushes are happening, so it does not cause extra overhead for the fast
> paths, and later even the IRQ disabling was removed, which should
> improve things further. But this patchset turns every vmstat counter
> increment into a cmpxchg.

Yes, and we have a similar situation in this case:

1) CMPXCHG is already used to protect many vmstat counter increments.
2) The patchset adds the LOCK prefix to an existing CMPXCHG user.
3) The performance decrease is negligible, because cache locking is
effective.
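
In rough terms, the counter update loop keeps the same shape before and
after the series; only the cmpxchg flavour changes. A sketch of the
mod_zone_state-style loop (variable names abbreviated for illustration):

	s8 __percpu *p = pcp->vm_stat_diff + item;
	s8 o, n;

	do {
		o = this_cpu_read(*p);
		n = o + delta;
		/* before the series: this_cpu_cmpxchg() emits a non-LOCK
		 * cmpxchg on x86, safe only against interrupts on the
		 * local CPU; after: it emits LOCK cmpxchg, so that
		 * vmstat_shepherd can concurrently xchg the delta to
		 * zero from a remote CPU */
	} while (this_cpu_cmpxchg(*p, o, n) != o);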

"To test the performance difference, a page allocator microbenchmark:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.

For the single_page_alloc_free test, which does

       	/** Loop to measure **/
       	for (i = 0; i < rec->loops; i++) {
               	my_page = alloc_page(gfp_mask);
                if (unlikely(my_page == NULL))
                       	return 0;
                __free_page(my_page);
        }                                                                                                           

Unit is cycles.

Vanilla                 Patched         Diff
115.25                  117             1.4%"

To be honest, that 1.4% difference was not stable but fluctuated between
positive and negative percentages (so the performance difference was in
the noise).

So performance is not a decisive factor in this case.



* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-26 14:34                   ` Marcelo Tosatti
@ 2023-04-27  8:31                     ` Michal Hocko
  2023-04-27 14:59                       ` Marcelo Tosatti
  0 siblings, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2023-04-27  8:31 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Frederic Weisbecker, Andrew Morton, Christoph Lameter,
	Aaron Tomlin, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Wed 26-04-23 11:34:00, Marcelo Tosatti wrote:
> On Thu, Apr 20, 2023 at 10:45:20AM -0300, Marcelo Tosatti wrote:
[...]
> > There are additional details that were not mentioned. When we think
> > of flushing caches, or disabling per-CPU caches, this means that the
> > isolated application loses the benefit of those caches (which means you
> > are turning a "general purpose" programming environment into a
> > potentially slower environment for applications to execute).

I do not really buy this argument! Nothing is really free and somebody
has to pay for the overhead. You want a highly specialized workload to
enjoy all the performance while having high demands on latency, yet
everybody else has to pay the overhead.

> https://www.uwsg.indiana.edu/hypermail/linux/kernel/2012.0/06823.html

This is just talking about who benefits from isolation and I do not
think there is any dispute in that regard. I haven't questioned that. My
main argument was that those workloads really need to be special and
careful to achieve their goal, and Thomas says a very similar thing. I
do not see any objection to an explicit programming model to achieve
that goal.
 
> > (yes, of course, one has to be mindful of which system calls can be
> > used, for example the execution time of system calls which take locks will
> > depend on whether, and how many, users of those locks there are at a
> > given moment).

This is simply not a maintainable state. Once you enter the kernel you
cannot really hold on to your _ultra low_ latency expectations.
 
[...]
> > So it seems to me (unless there are points that show otherwise, which
> > would indicate that explicit userspace interfaces are preferred) that _not_
> > requiring userspace changes is a superior solution.
> > 
> > Perhaps the complexity should be judged for individual cases 
> > of interruptions, and if a given interruption-free conversion 
> > is seen as too complex, then a "disable feature which makes use of per-CPU
> > caches" style solution can be made (and then userspace has to
> > explicitly request for that per-CPU feature to be disabled).
> > 
> > > But I don't see that this patchset introduces unmanageable complexity,
> > > and neither did:

As I've tried to explain, I disagree with the approach you are taking.
You are fixing your problem at the wrong layer. You really need to
address the fundamental issue, which is that you do not want
housekeeping done on isolated CPU(s) while your workload is running
there.

vmstat updates are just one of the schedule_on_cpu users that could
disturb your workload. We do not want to chase every single one and
keep doing that forever as new callers of that API are added. See the
point? "Fixing" vmstat will not make your isolated workload more
reliable. You really need a more generic solution rather than a quick
hack.

Also vmstat already has a concept of silencing - i.e. quiet_vmstat. IIRC
this is used by NOHZ. I do not remember any details but if anything this
is something I would have a look into.

There is close to zero benefit in teaching the code remote stat
flushing. As I've said, stats are only for debugging purposes and
imprecise values shouldn't matter. So this just adds complexity without
any actual real benefit.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-26 16:10                     ` Marcelo Tosatti
@ 2023-04-27  8:39                       ` Michal Hocko
  2023-04-27 16:25                         ` Marcelo Tosatti
  0 siblings, 1 reply; 56+ messages in thread
From: Michal Hocko @ 2023-04-27  8:39 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Vlastimil Babka, Frederic Weisbecker, Andrew Morton,
	Christoph Lameter, Aaron Tomlin, linux-kernel, linux-mm,
	Russell King, Huacai Chen, Heiko Carstens, x86

On Wed 26-04-23 13:10:54, Marcelo Tosatti wrote:
[...]
> "To test the performance difference, a page allocator microbenchmark:
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
> with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.
> 
> For the single_page_alloc_free test, which does
> 
>        	/** Loop to measure **/
>        	for (i = 0; i < rec->loops; i++) {
>                	my_page = alloc_page(gfp_mask);
>                 if (unlikely(my_page == NULL))
>                        	return 0;
>                 __free_page(my_page);
>         }                                                                                                           
> 
> Unit is cycles.
> 
> Vanilla                 Patched         Diff
> 115.25                  117             1.4%"
> 
> To be honest, that 1.4% difference was not stable but fluctuated between
> positive and negative percentages (so the performance difference was in
> the noise).
> 
> So performance is not a decisive factor in this case.

It is not negligible considering that the majority of workloads will not
benefit from this change. You are clearly ignoring that the vmstat code
has been highly optimized for local per-CPU access exactly to avoid
locked operations and cache-line bouncing.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-27  8:31                     ` Michal Hocko
@ 2023-04-27 14:59                       ` Marcelo Tosatti
  0 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-04-27 14:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Frederic Weisbecker, Andrew Morton, Christoph Lameter,
	Aaron Tomlin, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Thu, Apr 27, 2023 at 10:31:21AM +0200, Michal Hocko wrote:
> On Wed 26-04-23 11:34:00, Marcelo Tosatti wrote:
> > On Thu, Apr 20, 2023 at 10:45:20AM -0300, Marcelo Tosatti wrote:
> [...]
> > > There are additional details that were not mentioned. When we think
> > > of flushing caches, or disabling per-CPU caches, this means that the
> > > isolated application loses the benefit of those caches (which means you
> > > are turning a "general purpose" programming environment into a
> > > potentially slower environment for applications to execute).
> 
> I do not really buy this argument! Nothing is really free and somebody
> has to pay for the overhead. 

About the overhead, modern processors perform "cache locking":

https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-261.html

This means that as long as the memory is completely contained in a
cache line (for write-back memory), it is not necessary to assert the
LOCK signal on the bus.
Multiple experiments have confirmed this is the case.

This is the case with per-CPU vmstat memory as well.

> You want highly specialized workload to
> enjoy all the performance while having high demand on latency yet the
> overhead has to pay everybody else.

Yes, and the overhead paid by everybody else is that code should avoid
interrupting isolated CPUs.

Your argument is: "Avoiding the interruptions adds insurmountable
complexity, therefore those workloads should not be supported"?

It seems to me this argument could be used against any new piece of code
or functionality that is added to the kernel, couldn't it?

The same argument could have been used to reject (at the time) new
additions such as RCU (because systems with a large number of processors
are a highly specialized workload), memory hotplug (same thing), or PCI
hotplug (same thing).

> > https://www.uwsg.indiana.edu/hypermail/linux/kernel/2012.0/06823.html
> 
> This is just talking about who benefits from isolation and I do not
> think there is any dispute in that regard. I haven't questioned that. My
> main argument was that those really need to be special and careful to
> achieve their goal 

I see.

> and Thomas says a very similar thing. I do not see
> any objection to an explicit programming model to achieve that goal.

Yes, but it seems to me that the best (and most widely applicable)
solution is to avoid any explicit programming interface if possible.

> > > (yes, of course, one has to be mindful of which system calls can be
> > > used, for example the execution time of system calls which take locks will
> > > depend on whether, and how many, users of those locks there are at a
> > > given moment).
> 
> This is simply not maintainble state. Once you enter the kernel you
> cannot really expect your _ultra low_ latency expectations.

Whether or not it's OK to perform system calls is up to the application:
what matters is the latency expectation from the outside world [1]
vs. how long it takes to execute a set of instructions.

I can give two concrete examples:

1) Cyclictest uses sys_nanosleep(). It makes sense
to abstract the details of HLT'ing to the operating system.

A whole class of programs (which must handle periodic tasks)
will sleep via the kernel.

2) The HPC example from Thomas, where:

"
 1     read_data_set() <- involving syscalls/OS obviously
 2     compute_set()   <- let me alone
 3     save_data_set() <- involving syscalls/OS obviously

       repeat the above...

then it's at his discretion to decide to inflict a particular isolation
set on the task which is obviously ineffective while doing #1 and #3 but
might provide the so desired 0.9% boost for compute_set() which
dominates the judgement."

It seems the operating system is capable of providing an
interruption-free environment for the application without explicit
knowledge on its part (other than taskset).

So all of the above are good reasons to try to avoid an explicit
programming interface (again, I did write an explicit programming
interface, and saw its downsides in practice). I will assume you now
understand and agree that the additional complexity added to the kernel
is worthwhile (it is not hugely complex to perform particular work
remotely, with appropriate locking and/or lockless algorithms; these
techniques have been around and used in the kernel for a while now).

As for the next topic...

> [...]
> > > So it seems to me (unless there are points that show otherwise, which
> > > would indicate that explicit userspace interfaces are preferred) that _not_
> > > requiring userspace changes is a superior solution.
> > > 
> > > Perhaps the complexity should be judged for individual cases 
> > > of interruptions, and if a given interruption-free conversion 
> > > is seen as too complex, then a "disable feature which makes use of per-CPU
> > > caches" style solution can be made (and then userspace has to
> > > explicitly request for that per-CPU feature to be disabled).
> > > 
> > > But I don't see that this patchset introduces unmanageable complexity,
> > > and neither did:
> 
> As I've tried to explain, I disagree about the approach you are taking.
> You are fixing your problem at a wrong layer. You really need to address
> the fundamental issue and that is that you do not want housekeeping done
> on isolated cpu(s) while your workload is running there.

OK, that is a problem. But the fact is that there are interfaces to request
work to be performed on remote CPUs (usually on per-CPU data), and those
must be addressed one-by-one. We (as in the community) are looking into
ways to address multiple classes of interruptions at once, for example:

https://lpc.events/event/16/contributions/1218/
https://www.spinics.net/lists/linux-s390/msg57118.html

"The current CPU isolation is a best effort approach and I agree that for
more strict isolation modes we need to be able to enforce that and hunt
down offenders and think about them one by one."

But we can't block IPIs or requests for work to be executed remotely.

> vmstat updates are just one of schedule_on_cpu users who could disturb
> your workload.  

Yes. But it's the one case we see right now. So our approach so far has
been to monitor the workload and remove the individual interruptions
that are observed.

But as long as you have code that is doing:

	queue_work_on(CPU, &this_work);
	flush_work(&this_work);

There is nothing one can do about it other than:

	1) Return an error to avoid the interruption
	   (which is what we are trying here:
	   https://lpc.events/event/16/contributions/1218/).
	   However this might not be suitable for all cases
	   (because you rely on the functionality).
	2) Convert it to avoid the remote work somehow.
	3) Don't queue the work (which might result in incorrect
	   system operation).

Again, we are trying to widen the number of callsites that can be 
handled with certain approaches (see the above URLs).

> We do not want to chase every single one and keep
> doing that for ever as new callers of that API are added. See the
> point? 

Yes, I agree with this point. Possible solutions:

1) Change the APIs so that new users are encouraged to avoid executing
code on isolated CPUs (or have to handle the errors).

/**
 * queue_work_on - queue work on specific cpu
 * @cpu: CPU number to execute work on
 * @wq: workqueue to use
 * @work: work to queue
 *
 * We queue the work to a specific CPU, the caller must ensure it
 * can't go away.  Callers that fail to ensure that the specified
 * CPU cannot go away will execute on a randomly chosen CPU.
 *
 * Return: %false if @work was already on a queue, %true otherwise.
 */
bool queue_work_on(int cpu, struct workqueue_struct *wq,
                   struct work_struct *work)
{

Perhaps _fail variants of the APIs that queue remote work
(https://lore.kernel.org/lkml/20220908192859.546633738@redhat.com/T/#mc25ddea62ff095dba61d244fbfdca1f61221c915),
plus a checkpatch.pl error on new additions of queue_work_on, would
be helpful?
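
For illustration, such a variant could look roughly like this (a sketch
only; the name is invented, and it assumes housekeeping_cpu() with
HK_TYPE_WQ is the right predicate):

	/* refuse to queue remote work on an isolated CPU rather than
	 * interrupting it; callers must handle the failure */
	bool queue_work_on_fail(int cpu, struct workqueue_struct *wq,
				struct work_struct *work)
	{
		if (!housekeeping_cpu(cpu, HK_TYPE_WQ))
			return false;

		return queue_work_on(cpu, wq, work);
	}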

2) Change the patch acceptance criteria to avoid introducing 
queue_work_on's to isolated CPUs.

If you have additional suggestions or ideas, they are welcome.

> "Fixing" vmstat will not make your isolated workload more
> reliable. You really need a more generic solution rather than a quick
> hack.

It does, as long as no one else executes queue_work_on() :-)

Yes, I understand and agree this is weak and fragile.

> Also vmstat already has a concept of silencing - i.e. quiet_vmstat. IIRC
> this is used by NOHZ. I do not remember any details but if anything this
> is something I would have a look into.

Its not sufficient since what it does is only flushing the per-CPU
vmstats when entering idle. There has been work on a patchset
(https://lkml.org/lkml/2022/9/24/306) to improve the infrastructure, to
call quiet_vmstat on return to userspace when nohz_full is used, but
honestly the remote flushing is superior: it potentially avoids many
unecessary flushes (which can happen if the application performs a lot
of system calls).

> There is close to 0 benefit to teaching remote stat flushing. As I've
> said stats are only for debugging purposes and imprecise values
> shouldn't matter. So this just adds a complexity without any actual real
> benefit.

Well there is the need to request a synchronization of the events from
all CPUs, right? For example:

int vmstat_refresh(struct ctl_table *table, int write,
                   void *buffer, size_t *lenp, loff_t *ppos)
{
        long val;
        int err;
        int i;

        /*
         * The regular update, every sysctl_stat_interval, may come later
         * than expected: leaving a significant amount in per_cpu buckets.
         * This is particularly misleading when checking a quantity of HUGE
         * pages, immediately after running a test.  /proc/sys/vm/stat_refresh,
         * which can equally be echo'ed to or cat'ted from (by root),
         * can be used to update the stats just before reading them.
         *
         * Oh, and since global_zone_page_state() etc. are so careful to hide
         * transiently negative values, report an error here if any of
         * the stats is negative, so we know to go looking for imbalance.
         */
        err = schedule_on_each_cpu(refresh_vm_stats);

So if you let a remote CPU dirty its own cache, then at some point you
need to flush that cache back.

Are you suggesting disabling synchronization of the per-CPU vmstats for
a given CPU (if isolated, for example), and then having any callsites
which require accurate values loop over all CPUs when requesting
up-to-date statistics?

Works for me, and that addresses the requests from users
(since it removes the interruption to isolated CPUs).
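
The reader side would then follow the existing snapshot helpers; a
sketch modelled on zone_page_state_snapshot():

	static unsigned long zone_state_snapshot(struct zone *zone,
						 enum zone_stat_item item)
	{
		long x = atomic_long_read(&zone->vm_stat[item]);
		int cpu;

		/* fold the pending per-CPU deltas into the reported
		 * value instead of interrupting each CPU to flush them */
		for_each_online_cpu(cpu)
			x += per_cpu_ptr(zone->per_cpu_zonestats,
					 cpu)->vm_stat_diff[item];

		return x < 0 ? 0 : x;
	}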



* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-27  8:39                       ` Michal Hocko
@ 2023-04-27 16:25                         ` Marcelo Tosatti
  0 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-04-27 16:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Frederic Weisbecker, Andrew Morton,
	Christoph Lameter, Aaron Tomlin, linux-kernel, linux-mm,
	Russell King, Huacai Chen, Heiko Carstens, x86

On Thu, Apr 27, 2023 at 10:39:29AM +0200, Michal Hocko wrote:
> On Wed 26-04-23 13:10:54, Marcelo Tosatti wrote:
> [...]
> > "To test the performance difference, a page allocator microbenchmark:
> > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
> > with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.
> > 
> > For the single_page_alloc_free test, which does
> > 
> >        	/** Loop to measure **/
> >        	for (i = 0; i < rec->loops; i++) {
> >                	my_page = alloc_page(gfp_mask);
> >                 if (unlikely(my_page == NULL))
> >                        	return 0;
> >                 __free_page(my_page);
> >         }                                                                                                           
> > 
> > Unit is cycles.
> > 
> > Vanilla                 Patched         Diff
> > 115.25                  117             1.4%"
> > 
> > To be honest, that 1.4% difference was not stable but fluctuated between
> > positive and negative percentages (so the performance difference was in
> > the noise).
> > 
> > So performance is not a decisive factor in this case.
> 
> It is not neglible considering that majority worklods will not benefit
> from this change. You are clearly ignoring that vmstat code has been
> highly optimized for local per-cpu access exactly to avoid locked
> operations and cache line bouncing.
> -- 
> Michal Hocko
> SUSE Labs

Again, the values fluctuated between positive and negative performance
differences (I happened to copy a positive value).

So the performance difference is in the noise (it is not stable at
1.4%), but rather close to 0%.

So the data shows that there is no negative performance impact.



* Re: [PATCH v7 00/13] fold per-CPU vmstats remotely
  2023-04-19 19:15         ` Marcelo Tosatti
@ 2023-05-03 13:51           ` Marcelo Tosatti
  0 siblings, 0 replies; 56+ messages in thread
From: Marcelo Tosatti @ 2023-05-03 13:51 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Christoph Lameter, Aaron Tomlin,
	Frederic Weisbecker, linux-kernel, linux-mm, Russell King,
	Huacai Chen, Heiko Carstens, x86, Michal Hocko

On Wed, Apr 19, 2023 at 04:15:50PM -0300, Marcelo Tosatti wrote:
> On Wed, Apr 19, 2023 at 06:47:30PM +0200, Vlastimil Babka wrote:
> > On 4/19/23 13:29, Marcelo Tosatti wrote:
> > > On Wed, Apr 19, 2023 at 08:14:09AM -0300, Marcelo Tosatti wrote:
> > >> This was tried before:
> > >> https://lore.kernel.org/lkml/20220127173037.318440631@fedora.localdomain/
> > >> 
> > >> My conclusion from that discussion (and work) is that a special system
> > >> call:
> > >> 
> > >> 1) Does not allow the benefits to be widely applied (only modified
> > >> applications will benefit). Is not portable across different operating systems. 
> > >> 
> > >> Removing the vmstat_work interruption is a benefit for HPC workloads, 
> > >> for example (in fact, it is a benefit for any kind of application, 
> > >> since the interruption causes cache misses).
> > >> 
> > >> 2) Increases the system call cost for applications which would use
> > >> the interface.
> > >> 
> > >> So avoiding the vmstat_update update interruption, without userspace 
> > >> knowledge and modifications, is a better than solution than a modified
> > >> userspace.
> > > 
> > > Another important point is this: if an application dirties
> > > its own per-CPU vmstat cache, while performing a system call,
> > > and a vmstat sync event is triggered on a different CPU, you'd have to:
> > > 
> > > 1) Wait for that CPU to return to userspace and sync its stats
> > > (unfeasible).
> > > 
> > > 2) Queue work to execute on that CPU (undesirable, as that causes
> > > an interruption).
> > 
> > So you're saying the application might do a syscall from the isolcpu, so
> > IIUC it cannot expect any latency guarantees at that very moment, 
> 
> Why not? cyclictest uses nanosleep and it's the main tool for measuring
> latency.
> 
> > but then
> > it immediately starts expecting them again after returning to userspace, 
> 
> No, the expectation more generally is this:
> 
> For certain types of applications (for example PLC software or
> RAN processing), upon occurrence of an event, it is necessary to
> complete a certain task in a maximum amount of time (deadline).
> 
> One way to express this requirement is with a pair of numbers,
> deadline time and execution time, where:
> 
>         * deadline time: length of time between event and deadline.
>         * execution time: length of time it takes for processing of event
>                           to occur on a particular hardware platform
>                           (uninterrupted).
> 
> The particular values depend on use-case. For the case
> where the realtime application executes in a virtualized
> guest, an interruption which must be serviced in the host will cause
> the following sequence of events:
> 
>         1) VM-exit
>         2) execution of IPI (and function call) (or switch to kwork
> 	thread to execute some work item).
>         3) VM-entry
> 
> Which causes an excess of 50us latency as observed by cyclictest
> (this violates the latency requirement of a vRAN application with 1ms TTI,
> for example).
> 
> > and
> > a single interruption for a one-time flush after the syscall would be too
> > intrusive?
> 
> Generally, if you can't complete the task (which involves executing a
> number of instructions) before the deadline, then its a problem.
> 
> One-time flush? You mean to switch between:
> 
> rt-app -> kworker (to execute vmstat_update flush) -> rt-app
> 
> My measurement, which probably had vmstat_update code/data in cache, took 7us.
> It might be the case that the code to execute must be brought in from
> memory, which takes even longer.
> 
> > (elsewhere in the thread you described an RT app initialization that may
> > generate vmstats to flush and then entry userspace loop, again, would a
> > single interruption soon after entering the loop be so critical?)
> 
> 1) It depends on the application. For the use-case above, where < 50us
> interruption is desired, yes it is critical.
> 
> 2) The interruptions can come from different sources.
> 
> Time
> 0			rt-app executing instruction 1
> 1			rt-app executing instruction 2
> 2			scheduler switches between rt-app and kworker
> 3			kworker runs vmstat_work
> 4			scheduler switches between kworker and rt-app
> 5			rt-app executing instruction 3
> 6			ipi to handle a KVM request IPI
> 7			fill in your preferred IPI handler
> 
> So the argument "a single interruption might not cause your deadline
> to be exceeded" fails (because the times to handle the
> different interruptions may add up).
> 
> Does that make sense?

Ping? (Just want to double-check that the reasoning above makes sense.)




end of thread  [~2023-05-03 13:53 UTC]

Thread overview: 56+ messages
2023-03-20 18:03 [PATCH v7 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
2023-03-20 18:03 ` [PATCH v7 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot Marcelo Tosatti
2023-03-20 18:21   ` Michal Hocko
2023-03-20 18:32     ` Marcelo Tosatti
2023-03-22 10:03       ` Michal Hocko
2023-03-20 18:03 ` [PATCH v7 02/13] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
2023-03-20 18:03 ` [PATCH v7 03/13] this_cpu_cmpxchg: loongarch: " Marcelo Tosatti
2023-03-20 18:03 ` [PATCH v7 04/13] this_cpu_cmpxchg: S390: " Marcelo Tosatti
2023-03-20 18:03 ` [PATCH v7 05/13] this_cpu_cmpxchg: x86: " Marcelo Tosatti
2023-03-20 18:03 ` [PATCH v7 06/13] add this_cpu_cmpxchg_local and asm-generic definitions Marcelo Tosatti
2023-03-20 18:03 ` [PATCH v7 07/13] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
2023-03-20 18:03 ` [PATCH v7 08/13] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
2023-03-20 18:03 ` [PATCH v7 09/13] vmstat: switch per-cpu vmstat counters to 32-bits Marcelo Tosatti
2023-03-20 18:03 ` [PATCH v7 10/13] mm/vmstat: use xchg in cpu_vm_stats_fold Marcelo Tosatti
2023-03-20 18:03 ` [PATCH v7 11/13] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
2023-03-20 18:03 ` [PATCH v7 12/13] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
2023-03-20 18:03 ` [PATCH v7 13/13] vmstat: add pcp remote node draining via cpu_vm_stats_fold Marcelo Tosatti
2023-03-20 20:43   ` Tim Chen
2023-03-22  1:20     ` Marcelo Tosatti
2023-03-20 18:25 ` [PATCH v7 00/13] fold per-CPU vmstats remotely Michal Hocko
2023-03-20 19:07   ` Marcelo Tosatti
2023-03-22 10:13     ` Michal Hocko
2023-03-22 11:23       ` Marcelo Tosatti
2023-03-22 13:35         ` Michal Hocko
2023-03-22 14:20           ` Marcelo Tosatti
2023-03-23  7:51             ` Michal Hocko
2023-03-23 10:52               ` Marcelo Tosatti
2023-03-23 10:59                 ` Marcelo Tosatti
2023-03-23 12:17                 ` Michal Hocko
2023-03-23 13:30                   ` Marcelo Tosatti
2023-03-23 13:32                     ` Marcelo Tosatti
2023-04-18 22:02 ` Andrew Morton
2023-04-19 11:14   ` Marcelo Tosatti
2023-04-19 11:15     ` Marcelo Tosatti
2023-04-19 13:44       ` Andrew Theurer
2023-04-20  7:55         ` Michal Hocko
2023-04-23  1:25           ` Marcelo Tosatti
2023-04-19 11:29     ` Marcelo Tosatti
2023-04-19 11:59       ` Marcelo Tosatti
2023-04-19 12:24         ` Frederic Weisbecker
2023-04-19 13:48           ` Marcelo Tosatti
2023-04-19 14:35             ` Michal Hocko
2023-04-19 16:35               ` Marcelo Tosatti
2023-04-20  8:40                 ` Michal Hocko
2023-04-23  1:10                   ` Marcelo Tosatti
2023-04-20 13:45                 ` Marcelo Tosatti
2023-04-26 14:34                   ` Marcelo Tosatti
2023-04-27  8:31                     ` Michal Hocko
2023-04-27 14:59                       ` Marcelo Tosatti
2023-04-26 15:04                   ` Vlastimil Babka
2023-04-26 16:10                     ` Marcelo Tosatti
2023-04-27  8:39                       ` Michal Hocko
2023-04-27 16:25                         ` Marcelo Tosatti
2023-04-19 16:47       ` Vlastimil Babka
2023-04-19 19:15         ` Marcelo Tosatti
2023-05-03 13:51           ` Marcelo Tosatti
