linux-kernel.vger.kernel.org archive mirror
* [PATCH v8 00/13] fold per-CPU vmstats remotely
@ 2023-05-15 18:00 Marcelo Tosatti
  2023-05-15 18:00 ` [PATCH v8 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot Marcelo Tosatti
                   ` (14 more replies)
  0 siblings, 15 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2023-05-15 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko


This patch series addresses the following two problems:

1. A customer provided evidence indicating that a process
   was stalled in direct reclaim:

 - The process was trapped in throttle_direct_reclaim().
   wait_event_killable() was called to wait for the condition
   allow_direct_reclaim(pgdat) to become true for the current node.
   allow_direct_reclaim(pgdat) examined the number of free pages
   on the node via zone_page_state(), which simply returns the value of
   zone->vm_stat[NR_FREE_PAGES].

 - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
   However, the freelist on this node was not empty.

 - This inconsistency in the vmstat value was caused by the percpu
   vmstat counters on nohz_full cpus. Every increment/decrement of a
   vmstat counter is first performed on the percpu counter; the pooled
   diffs are then folded into the zone's vmstat counter periodically.
   However, on nohz_full cpus (48 of the 52 cpus on this customer's
   system) these pooled diffs were never folded once the cpu stopped
   receiving events and went to sleep indefinitely.
   I checked the percpu vmstat counters and found a total of 69 counts
   not yet folded into the zone's vmstat counter.

 - In this situation, kswapd did not help the trapped process.
   In pgdat_balanced(), zone_watermark_ok_safe() examined the number
   of free pages on the node via zone_page_state_snapshot(), which
   accounts for the pending percpu vmstat diffs.
   Therefore kswapd correctly saw the 69 free pages.
   Since zone->_watermark = {8, 20, 32}, kswapd did not run because
   69 was above the high watermark of 32.

 2. With a task that busy loops on a given CPU,
    the kworker interruption to execute vmstat_update
    is undesired and may exceed latency thresholds
    for certain applications.

Both problems are addressed by having vmstat_shepherd flush the per-CPU
counters to the global counters from a remote CPU.

This is done using cmpxchg to manipulate the counters,
both CPU-locally (via the accounting functions)
and remotely (via cpu_vm_stats_fold).
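
For illustration, a minimal sketch of the two sides of the scheme
(names simplified; this is not the exact mm/vmstat.c code, which is in
the patches below):

/*
 * Minimal sketch: both the CPU-local accounting path and the remote
 * folder use atomic read-modify-write operations on the same per-CPU
 * diff slot, so no update can be lost.
 */

/* CPU-local side: retry until the per-CPU diff is updated atomically. */
static inline void account_delta(s32 __percpu *p, long delta)
{
	s32 o, n;

	do {
		o = this_cpu_read(*p);
		n = o + delta;
	} while (this_cpu_cmpxchg(*p, o, n) != o);
}

/* Remote side (vmstat_shepherd / cpu_vm_stats_fold): atomically steal
 * the pending diff of another CPU and fold it into the global counter. */
static inline long fold_remote(s32 __percpu *p, int cpu)
{
	return xchg(per_cpu_ptr(p, cpu), 0);
}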

Thanks to Aaron Tomlin for diagnosing issue 1 and writing
the initial patch series.


Performance details for the kworker interruption:

oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
 
The example above shows an additional 7us for the

        oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold for certain applications.

A summary of the objections raised against this patchset, and the
responses to them, follows. Objections are numbered "O-x)"; a line
containing "*** Response ***" precedes the response to each numbered
objection.


O-1) `echo 1 > /proc/sys/vm/stat_refresh' achieves essentially the same
without any kernel changes.

*** Response ***

The stat_refresh interface is not reliable: it triggers queue_work_on()
for every CPU that has dirty per-CPU vmstat counters. So if you have two
threads that must not be interrupted, starting one of the threads can
interrupt an already executing, isolated thread.

O-2) Why not always loop through all CPUs when reading the counters? 
(that is, replace zone_page_state with zone_page_state_snapshot).

*** Response ***

Consider zone_watermark_fast, called from get_page_from_freelist.

Quoting from https://lwn.net/Articles/684616/ :

On x86 systems, DMA is not usually the problem; instead, memory
allocation is. He is working with a target of 14.8 million packets (full
wire speed on a 10Gb/s link) per second; on a 3GHz system, that gives
the kernel about 200 cycles in which to process each packet. Allocating
a single page, though, takes 277 cycles on that system, making the
14.8Mpps rate unattainable. He pointed out Mel Gorman's recent work to
reduce memory-allocator overhead; that work reduced the cost to 230
cycles — a significant improvement, but nowhere near enough.
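
For reference, a simplified sketch of the two readers (close to, but not
identical to, the mainline helpers); the snapshot variant walks every
online CPU, which is what makes it too expensive for this allocator fast
path:

/* Reads only the global counter; pending per-CPU diffs are invisible. */
static unsigned long zone_page_state_sketch(struct zone *zone,
					    enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);

	return x < 0 ? 0 : x;
}

/* Also folds in the not-yet-flushed per-CPU diffs: precise, but O(nr_cpus). */
static unsigned long zone_page_state_snapshot_sketch(struct zone *zone,
						     enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);
	int cpu;

	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_stat_diff[item];

	return x < 0 ? 0 : x;
}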

O-3) Also vmstat already has a concept of silencing - i.e. quiet_vmstat. IIRC
this is used by NOHZ. 

*** Response ***:

The quiet_vmstat mechanism is not reliable: it is only invoked by the
NOHZ code to synchronize the per-CPU vmstat counters to the global
counters when entering NOHZ mode.

Any subsequent use of the per-CPU counters re-dirties them, which allows
the vmstat shepherd to queue work (and therefore wake a kworker thread)
on that CPU.

/*
 * Switch off vmstat processing and then fold all the remaining differentials
 * until the diffs stay at zero. The function is used by NOHZ and can only be
 * invoked when tick processing is not active.
 */
void quiet_vmstat(void)

O-4) The only applications of interest are those that do not enter the kernel.

> > >  2. With a task that busy loops on a given CPU,
> > >     the kworker interruption to execute vmstat_update
> > >     is undesired and may exceed latency thresholds
> > >     for certain applications.
> >
> > Yes it can but why does that matter?
>
> It matters for the application that is executing and expects
> not to be interrupted.                                                                                                                                                      
Those workloads shouldn't enter the kernel in the first place, no?
Otherwise the in kernel execution with all the direct or indirect
dependencies (e.g. via locks) can throw any latency expectations off the
window.

*** Response ***: a common counterexample is latency-sensitive
applications that call sys_nanosleep (cyclictest and PLC programs do
that, for example).

O-5) Why not implement a syscall to flush all per-cpu caches?

> The practical problem we have been seeing is -RT app initialization.                                                                                                        
> For example:                                                                                                                                                                
>                                                                                                                                                                             
> 1) mlock();                                                                                                                                                                 
> 2) enter loop without system calls

OK, that is what I have kinda expected. Would have been better to
mention it explicitly.

I expect this to be a very common pattern and vmstat might not be the
only subsystem that could interfere later on. Would it make more sense
to address this by a more generic solution? E.g. a syscall to flush all
per-cpu caches so they won't interfere later unless userspace hits the
kernel path in some way (e.g. flush_cpu_caches(cpu_set_t cpumask, int flags)?
The above pattern could then be implemented as

        do_initial_setup()
        sched_setaffinity(getpid(), cpumask);
        flush_cpu_caches(cpumask, 0);
        do_userspace_loop()

*** Response ***:

A special mode in which caches are flushed upfront has been attempted
before:

https://lwn.net/Articles/883940/

However, it has a number of drawbacks:

1) If the application is in kernel mode, the interruption
will not be avoided (this patchset will avoid the interruption
even in kernel space).

2) It requires modification of applications.

3) Applications which attempt to use this mode in combination with
periods of system calls will suffer performance slowdowns. For example:

https://lore.kernel.org/linux-mm/87im9d4ezq.fsf@nanos.tec.linutronix.de/

"In a real-world usecase we had the situation of compute bursts and an
unfortunate hw enforced requirement to go into the kernel between them
for synchronization between the compute threads and hardware (A quick
hardware assisted save/load).

Unmodified NOHZ full accumulated to more than 6% loss compared to a
fully undisturbed run. Most of it was caused by cache effects and not by
the actually used CPU time.

A full enforced quiescing upfront gained ~2-3%, but a lazy approach of
accepting that some stuff might happen once and does not happen again
gained almost 5%. In that particular scenario 5% _is_ a huge win."


O-7) There is a lack of examples where this change is required.

*** Response ***:

Example-1: MAC scheduler processing must occur every 1ms, and a certain
amount of computation takes place (and must finish before the next 1ms
timeframe). A > 50us latency spike as observed by cyclictest is
considered a "failure".

Performance details for the kworker interruption being solved are the
same as in the oslat trace shown earlier in this cover letter: an
additional 7us for the oslat -> kworker -> oslat switches, and more than
50us observed in the guest when the vmstat_update interruption hits the
host side of a qemu-kvm vcpu.

Example-2: Based on the tracing data above, but a slightly different
use case:

Nearly every telco we work with for 5G RAN is demanding <20 usec CPU latency
as measured by cyclictest & oslat.  We cannot achieve under 20 usec with the vmstats
interruption.

Example-3: the 7us above was measured on recent Intel Xeon processors.
There are use cases on less powerful processors, such as embedded
ARM boards, where switching to the kworker and back is much more expensive
(causing problems for a larger range of applications). For example, 3D printing:
https://scholarworks.sjsu.edu/cgi/viewcontent.cgi?article=2077&context=etd_projects

O-8) This is a general problem, therefore requires a general solution.

> But let me repeat, this is not just about vmstats. Just have a look at                                                                                                                        
> other queue_work_on users. You do not want to handy pick each and every                                                                                                                       
> one and do so in the future as well.                                                                                                                                                          

*** Response ***:

The problematic ones have been getting fixed for some time now. For example:

commit 2de79ee27fdb52626ac4ac48ec6d8d52ba6f9047
Author: Paolo Abeni <pabeni@redhat.com>

    net: try to avoid unneeded backlog flush

    flush_all_backlogs() may cause deadlock on systems
    running processes with FIFO scheduling policy.

    The above is critical in -RT scenarios, where user-space
    specifically ensure no network activity is scheduled on
    the CPU running the mentioned FIFO process, but still get
    stuck.

    This commit tries to address the problem checking the
    backlog status on the remote CPUs before scheduling the
    flush operation. If the backlog is empty, we can skip it.

    v1 -> v2:
     - explicitly clear flushed cpu mask - Eric

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>


And this has been the normal process so far.

I think what needs to be done is to prevent new queue_work_on()
users from being introduced into the tree (the number of
existing ones is finite and can therefore be fixed).

I agree with the criticism here; however, I can't see options
other than the following:

	1) Given an activity consisting of a sequence of instructions
	   to execute on a CPU, change the algorithm so the code is
	   executed remotely (therefore avoiding the interruption of the
	   isolated CPU), or avoid the interruption in some other way
	   (which must be dealt with on a case-by-case basis).

	2) Block that activity from happening in the first place,
	   at the sites where it can be blocked (by returning errors to
	   userspace, for example).

	3) Completely isolate the CPU from the kernel (offline it).

Working on patches to improve #2.

v8:
- Add summary of discussion on -v7 to cover letter
- rebase

v7:
- Fix allow_direct_reclaim issue by using
  zone_page_state_snapshot				(Michal Hocko)

v6:
- Add more information on throttle_direct_reclaim problem 
  to commit logs	       			       (Michal Hocko)

v5:
- Drop "mm/vmstat: remove remote node draining"        (Vlastimil Babka)
- Implement remote node draining for cpu_vm_stats_fold (Vlastimil Babka)

v4:
- Switch per-CPU vmstat counters to s32, required 
  by RISC-V, ARC architectures			

v3:
- Removed unused drain_zone_pages and changes variable (David Hildenbrand)
- Use xchg instead of cmpxchg in refresh_cpu_vm_stats  (Peter Xu)
- Add drain_all_pages to vmstat_refresh to make
  stats more accurate				       (Peter Xu)
- Improve changelog of
  "mm/vmstat: switch counter modification to cmpxchg"  (Peter Xu / David)
- Improve changelog of
  "mm/vmstat: remove remote node draining"	       (David Hildenbrand)


v2:
- actually use LOCK CMPXCHG on counter mod/inc/dec functions
  (Christoph Lameter)
- use try_cmpxchg for cmpxchg loops
  (Uros Bizjak / Matthew Wilcox)

 arch/arm64/include/asm/percpu.h     |   16 ++
 arch/loongarch/include/asm/percpu.h |   23 +++-
 arch/s390/include/asm/percpu.h      |    5 
 arch/x86/include/asm/percpu.h       |   39 +++---
 include/asm-generic/percpu.h        |   17 ++
 include/linux/mmzone.h              |    8 -
 include/linux/percpu-defs.h         |    2 
 include/linux/vmstat.h              |    2 
 kernel/fork.c                       |    2 
 kernel/scs.c                        |    2 
 mm/page_alloc.c                     |    5 
 mm/vmscan.c                         |    2 
 mm/vmstat.c                         |  440 +++++++++++++++++++++++++++++++++++++++++++++++------------------------------
 13 files changed, 361 insertions(+), 202 deletions(-)




* [PATCH v8 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot
  2023-05-15 18:00 [PATCH v8 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
@ 2023-05-15 18:00 ` Marcelo Tosatti
  2023-05-24 12:34   ` Michal Hocko
  2023-05-25  7:05   ` Aaron Tomlin
  2023-05-15 18:00 ` [PATCH v8 02/13] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2023-05-15 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

A customer provided evidence indicating that a process
was stalled in direct reclaim:

 - The process was trapped in throttle_direct_reclaim().
   wait_event_killable() was called to wait for the condition
   allow_direct_reclaim(pgdat) to become true for the current node.
   allow_direct_reclaim(pgdat) examined the number of free pages
   on the node via zone_page_state(), which simply returns the value of
   zone->vm_stat[NR_FREE_PAGES].

 - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.
   However, the freelist on this node was not empty.

 - This inconsistency in the vmstat value was caused by the percpu
   vmstat counters on nohz_full cpus. Every increment/decrement of a
   vmstat counter is first performed on the percpu counter; the pooled
   diffs are then folded into the zone's vmstat counter periodically.
   However, on nohz_full cpus (48 of the 52 cpus on this customer's
   system) these pooled diffs were never folded once the cpu stopped
   receiving events and went to sleep indefinitely.
   I checked the percpu vmstat counters and found a total of 69 counts
   not yet folded into the zone's vmstat counter.

 - In this situation, kswapd did not help the trapped process.
   In pgdat_balanced(), zone_watermark_ok_safe() examined the number
   of free pages on the node via zone_page_state_snapshot(), which
   accounts for the pending percpu vmstat diffs.
   Therefore kswapd correctly saw the 69 free pages.
   Since zone->_watermark = {8, 20, 32}, kswapd did not run because
   69 was above the high watermark of 32.

Change allow_direct_reclaim to use zone_page_state_snapshot, which
allows a more precise version of the vmstat counters to be used.

allow_direct_reclaim is only called from try_to_free_pages,
which is not a hot path.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/mm/vmscan.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmscan.c
+++ linux-vmstat-remote/mm/vmscan.c
@@ -6886,7 +6886,7 @@ static bool allow_direct_reclaim(pg_data
 			continue;
 
 		pfmemalloc_reserve += min_wmark_pages(zone);
-		free_pages += zone_page_state(zone, NR_FREE_PAGES);
+		free_pages += zone_page_state_snapshot(zone, NR_FREE_PAGES);
 	}
 
 	/* If there are no reserves (unexpected config) then do not throttle */




* [PATCH v8 02/13] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function
  2023-05-15 18:00 [PATCH v8 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
  2023-05-15 18:00 ` [PATCH v8 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot Marcelo Tosatti
@ 2023-05-15 18:00 ` Marcelo Tosatti
  2023-05-15 18:00 ` [PATCH v8 03/13] this_cpu_cmpxchg: loongarch: " Marcelo Tosatti
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2023-05-15 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer the per-CPU counters to
the global counters remotely. For this, an atomic this_cpu_cmpxchg is
necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
change ARM64's this_cpu_cmpxchg_ helpers to be atomic,
and add this_cpu_cmpxchg_local_ helpers which are not atomic.
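
For illustration, a minimal usage sketch under the new convention
(demo_counter is a hypothetical per-CPU variable, not something in the
tree):

/* Illustrative only: demo_counter is hypothetical. */
static DEFINE_PER_CPU(int, demo_counter);

static void update_private_slot(int oval, int nval)
{
	/* Only ever modified by its owning CPU: the cheaper,
	 * non-atomic (relaxed) variant is sufficient. */
	this_cpu_cmpxchg_local(demo_counter, oval, nval);
}

static void update_shared_slot(int oval, int nval)
{
	/* May also be modified remotely (e.g. by vmstat_shepherd):
	 * the fully atomic variant is required. */
	this_cpu_cmpxchg(demo_counter, oval, nval);
}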

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/arch/arm64/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/arm64/include/asm/percpu.h
+++ linux-vmstat-remote/arch/arm64/include/asm/percpu.h
@@ -232,13 +232,23 @@ PERCPU_RET_OP(add, add, ldadd)
 	_pcp_protect_return(xchg_relaxed, pcp, val)
 
 #define this_cpu_cmpxchg_1(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_2(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_4(pcp, o, n)	\
-	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+	_pcp_protect_return(cmpxchg, pcp, o, n)
 #define this_cpu_cmpxchg_8(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg, pcp, o, n)
+
+#define this_cpu_cmpxchg_local_1(pcp, o, n)	\
 	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_2(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_4(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+#define this_cpu_cmpxchg_local_8(pcp, o, n)	\
+	_pcp_protect_return(cmpxchg_relaxed, pcp, o, n)
+
 
 #ifdef __KVM_NVHE_HYPERVISOR__
 extern unsigned long __hyp_per_cpu_offset(unsigned int cpu);




* [PATCH v8 03/13] this_cpu_cmpxchg: loongarch: switch this_cpu_cmpxchg to locked, add _local function
  2023-05-15 18:00 [PATCH v8 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
  2023-05-15 18:00 ` [PATCH v8 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot Marcelo Tosatti
  2023-05-15 18:00 ` [PATCH v8 02/13] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
@ 2023-05-15 18:00 ` Marcelo Tosatti
  2023-05-15 18:00 ` [PATCH v8 04/13] this_cpu_cmpxchg: S390: " Marcelo Tosatti
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2023-05-15 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer the per-CPU counters to
the global counters remotely. For this, an atomic this_cpu_cmpxchg is
necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
add this_cpu_cmpxchg_local helpers to LoongArch.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/arch/loongarch/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/loongarch/include/asm/percpu.h
+++ linux-vmstat-remote/arch/loongarch/include/asm/percpu.h
@@ -150,6 +150,16 @@ static inline unsigned long __percpu_xch
 }
 
 /* this_cpu_cmpxchg */
+#define _protect_cmpxchg(pcp, o, n)				\
+({								\
+	typeof(*raw_cpu_ptr(&(pcp))) __ret;			\
+	preempt_disable_notrace();				\
+	__ret = cmpxchg(raw_cpu_ptr(&(pcp)), o, n);		\
+	preempt_enable_notrace();				\
+	__ret;							\
+})
+
+/* this_cpu_cmpxchg_local */
 #define _protect_cmpxchg_local(pcp, o, n)			\
 ({								\
 	typeof(*raw_cpu_ptr(&(pcp))) __ret;			\
@@ -222,10 +232,15 @@ do {									\
 #define this_cpu_xchg_4(pcp, val) _percpu_xchg(pcp, val)
 #define this_cpu_xchg_8(pcp, val) _percpu_xchg(pcp, val)
 
-#define this_cpu_cmpxchg_1(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_2(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_4(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
-#define this_cpu_cmpxchg_8(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_1(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_2(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_4(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+#define this_cpu_cmpxchg_local_8(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
+
+#define this_cpu_cmpxchg_1(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_2(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_4(ptr, o, n) _protect_cmpxchg(ptr, o, n)
+#define this_cpu_cmpxchg_8(ptr, o, n) _protect_cmpxchg(ptr, o, n)
 
 #include <asm-generic/percpu.h>
 




* [PATCH v8 04/13] this_cpu_cmpxchg: S390: switch this_cpu_cmpxchg to locked, add _local function
  2023-05-15 18:00 [PATCH v8 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2023-05-15 18:00 ` [PATCH v8 03/13] this_cpu_cmpxchg: loongarch: " Marcelo Tosatti
@ 2023-05-15 18:00 ` Marcelo Tosatti
  2023-05-15 18:00 ` [PATCH v8 05/13] this_cpu_cmpxchg: x86: " Marcelo Tosatti
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2023-05-15 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer the per-CPU counters to
the global counters remotely. For this, an atomic this_cpu_cmpxchg is
necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
add S390's this_cpu_cmpxchg_local helpers.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/arch/s390/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/s390/include/asm/percpu.h
+++ linux-vmstat-remote/arch/s390/include/asm/percpu.h
@@ -148,6 +148,11 @@
 #define this_cpu_cmpxchg_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 #define this_cpu_cmpxchg_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+
 #define arch_this_cpu_xchg(pcp, nval)					\
 ({									\
 	typeof(pcp) *ptr__;						\




* [PATCH v8 05/13] this_cpu_cmpxchg: x86: switch this_cpu_cmpxchg to locked, add _local function
  2023-05-15 18:00 [PATCH v8 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2023-05-15 18:00 ` [PATCH v8 04/13] this_cpu_cmpxchg: S390: " Marcelo Tosatti
@ 2023-05-15 18:00 ` Marcelo Tosatti
  2023-05-15 18:00 ` [PATCH v8 06/13] add this_cpu_cmpxchg_local and asm-generic definitions Marcelo Tosatti
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2023-05-15 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer the per-CPU counters to
the global counters remotely. For this, an atomic this_cpu_cmpxchg is
necessary.

Following the kernel convention for cmpxchg/cmpxchg_local,
change x86's this_cpu_cmpxchg_ helpers to be atomic,
and add this_cpu_cmpxchg_local_ helpers which are not atomic.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/arch/x86/include/asm/percpu.h
===================================================================
--- linux-vmstat-remote.orig/arch/x86/include/asm/percpu.h
+++ linux-vmstat-remote/arch/x86/include/asm/percpu.h
@@ -197,11 +197,11 @@ do {									\
  * cmpxchg has no such implied lock semantics as a result it is much
  * more efficient for cpu local operations.
  */
-#define percpu_cmpxchg_op(size, qual, _var, _oval, _nval)		\
+#define percpu_cmpxchg_op(size, qual, _var, _oval, _nval, lockp)	\
 ({									\
 	__pcpu_type_##size pco_old__ = __pcpu_cast_##size(_oval);	\
 	__pcpu_type_##size pco_new__ = __pcpu_cast_##size(_nval);	\
-	asm qual (__pcpu_op2_##size("cmpxchg", "%[nval]",		\
+	asm qual (__pcpu_op2_##size(lockp "cmpxchg", "%[nval]",		\
 				    __percpu_arg([var]))		\
 		  : [oval] "+a" (pco_old__),				\
 		    [var] "+m" (_var)					\
@@ -279,16 +279,20 @@ do {									\
 #define raw_cpu_add_return_1(pcp, val)		percpu_add_return_op(1, , pcp, val)
 #define raw_cpu_add_return_2(pcp, val)		percpu_add_return_op(2, , pcp, val)
 #define raw_cpu_add_return_4(pcp, val)		percpu_add_return_op(4, , pcp, val)
-#define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, , pcp, oval, nval)
-#define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, , pcp, oval, nval)
-#define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, , pcp, oval, nval)
+#define raw_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, , pcp, oval, nval, "")
+#define raw_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, , pcp, oval, nval, "")
+#define raw_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, , pcp, oval, nval, "")
 
 #define this_cpu_add_return_1(pcp, val)		percpu_add_return_op(1, volatile, pcp, val)
 #define this_cpu_add_return_2(pcp, val)		percpu_add_return_op(2, volatile, pcp, val)
 #define this_cpu_add_return_4(pcp, val)		percpu_add_return_op(4, volatile, pcp, val)
-#define this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, volatile, pcp, oval, nval)
-#define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval)
-#define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval)
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval)	percpu_cmpxchg_op(1, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval, "")
+
+#define this_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(1, volatile, pcp, oval, nval, LOCK_PREFIX)
+#define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(2, volatile, pcp, oval, nval, LOCK_PREFIX)
+#define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(4, volatile, pcp, oval, nval, LOCK_PREFIX)
 
 #ifdef CONFIG_X86_CMPXCHG64
 #define percpu_cmpxchg8b_double(pcp1, pcp2, o1, o2, n1, n2)		\
@@ -319,16 +323,17 @@ do {									\
 #define raw_cpu_or_8(pcp, val)			percpu_to_op(8, , "or", (pcp), val)
 #define raw_cpu_add_return_8(pcp, val)		percpu_add_return_op(8, , pcp, val)
 #define raw_cpu_xchg_8(pcp, nval)		raw_percpu_xchg_op(pcp, nval)
-#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, , pcp, oval, nval)
+#define raw_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, , pcp, oval, nval, "")
 
-#define this_cpu_read_8(pcp)			percpu_from_op(8, volatile, "mov", pcp)
-#define this_cpu_write_8(pcp, val)		percpu_to_op(8, volatile, "mov", (pcp), val)
-#define this_cpu_add_8(pcp, val)		percpu_add_op(8, volatile, (pcp), val)
-#define this_cpu_and_8(pcp, val)		percpu_to_op(8, volatile, "and", (pcp), val)
-#define this_cpu_or_8(pcp, val)			percpu_to_op(8, volatile, "or", (pcp), val)
-#define this_cpu_add_return_8(pcp, val)		percpu_add_return_op(8, volatile, pcp, val)
-#define this_cpu_xchg_8(pcp, nval)		percpu_xchg_op(8, volatile, pcp, nval)
-#define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(8, volatile, pcp, oval, nval)
+#define this_cpu_read_8(pcp)				percpu_from_op(8, volatile, "mov", pcp)
+#define this_cpu_write_8(pcp, val)			percpu_to_op(8, volatile, "mov", (pcp), val)
+#define this_cpu_add_8(pcp, val)			percpu_add_op(8, volatile, (pcp), val)
+#define this_cpu_and_8(pcp, val)			percpu_to_op(8, volatile, "and", (pcp), val)
+#define this_cpu_or_8(pcp, val)				percpu_to_op(8, volatile, "or", (pcp), val)
+#define this_cpu_add_return_8(pcp, val)			percpu_add_return_op(8, volatile, pcp, val)
+#define this_cpu_xchg_8(pcp, nval)			percpu_xchg_op(8, volatile, pcp, nval)
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval)	percpu_cmpxchg_op(8, volatile, pcp, oval, nval, "")
+#define this_cpu_cmpxchg_8(pcp, oval, nval)		percpu_cmpxchg_op(8, volatile, pcp, oval, nval, LOCK_PREFIX)
 
 /*
  * Pretty complex macro to generate cmpxchg16 instruction.  The instruction




* [PATCH v8 06/13] add this_cpu_cmpxchg_local and asm-generic definitions
  2023-05-15 18:00 [PATCH v8 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (4 preceding siblings ...)
  2023-05-15 18:00 ` [PATCH v8 05/13] this_cpu_cmpxchg: x86: " Marcelo Tosatti
@ 2023-05-15 18:00 ` Marcelo Tosatti
  2023-05-15 18:00 ` [PATCH v8 07/13] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2023-05-15 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

The goal is to have vmstat_shepherd transfer the per-CPU counters to
the global counters remotely. For this, an atomic this_cpu_cmpxchg is
necessary.

Add this_cpu_cmpxchg_local_ helpers to asm-generic/percpu.h.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/include/asm-generic/percpu.h
===================================================================
--- linux-vmstat-remote.orig/include/asm-generic/percpu.h
+++ linux-vmstat-remote/include/asm-generic/percpu.h
@@ -424,6 +424,23 @@ do {									\
 	this_cpu_generic_cmpxchg(pcp, oval, nval)
 #endif
 
+#ifndef this_cpu_cmpxchg_local_1
+#define this_cpu_cmpxchg_local_1(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_2
+#define this_cpu_cmpxchg_local_2(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_4
+#define this_cpu_cmpxchg_local_4(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+#ifndef this_cpu_cmpxchg_local_8
+#define this_cpu_cmpxchg_local_8(pcp, oval, nval) \
+	this_cpu_generic_cmpxchg(pcp, oval, nval)
+#endif
+
 #ifndef this_cpu_cmpxchg_double_1
 #define this_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \
 	this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
Index: linux-vmstat-remote/include/linux/percpu-defs.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/percpu-defs.h
+++ linux-vmstat-remote/include/linux/percpu-defs.h
@@ -513,6 +513,8 @@ do {									\
 #define this_cpu_xchg(pcp, nval)	__pcpu_size_call_return2(this_cpu_xchg_, pcp, nval)
 #define this_cpu_cmpxchg(pcp, oval, nval) \
 	__pcpu_size_call_return2(this_cpu_cmpxchg_, pcp, oval, nval)
+#define this_cpu_cmpxchg_local(pcp, oval, nval) \
+	__pcpu_size_call_return2(this_cpu_cmpxchg_local_, pcp, oval, nval)
 #define this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
 	__pcpu_double_call_return_bool(this_cpu_cmpxchg_double_, pcp1, pcp2, oval1, oval2, nval1, nval2)
 




* [PATCH v8 07/13] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local
  2023-05-15 18:00 [PATCH v8 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (5 preceding siblings ...)
  2023-05-15 18:00 ` [PATCH v8 06/13] add this_cpu_cmpxchg_local and asm-generic definitions Marcelo Tosatti
@ 2023-05-15 18:00 ` Marcelo Tosatti
  2023-05-15 18:00 ` [PATCH v8 08/13] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2023-05-15 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Peter Xu, Marcelo Tosatti

this_cpu_cmpxchg was switched to an atomic version, which
can be more costly than the non-atomic version.

Switch users of this_cpu_cmpxchg to this_cpu_cmpxchg_local
(which preserves the previous, non-atomic this_cpu_cmpxchg behaviour).

Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/kernel/fork.c
===================================================================
--- linux-vmstat-remote.orig/kernel/fork.c
+++ linux-vmstat-remote/kernel/fork.c
@@ -205,7 +205,7 @@ static bool try_release_thread_stack_to_
 	unsigned int i;
 
 	for (i = 0; i < NR_CACHED_STACKS; i++) {
-		if (this_cpu_cmpxchg(cached_stacks[i], NULL, vm) != NULL)
+		if (this_cpu_cmpxchg_local(cached_stacks[i], NULL, vm) != NULL)
 			continue;
 		return true;
 	}
Index: linux-vmstat-remote/kernel/scs.c
===================================================================
--- linux-vmstat-remote.orig/kernel/scs.c
+++ linux-vmstat-remote/kernel/scs.c
@@ -83,7 +83,7 @@ void scs_free(void *s)
 	 */
 
 	for (i = 0; i < NR_CACHED_SCS; i++)
-		if (this_cpu_cmpxchg(scs_cache[i], 0, s) == NULL)
+		if (this_cpu_cmpxchg_local(scs_cache[i], 0, s) == NULL)
 			return;
 
 	kasan_unpoison_vmalloc(s, SCS_SIZE, KASAN_VMALLOC_PROT_NORMAL);




* [PATCH v8 08/13] mm/vmstat: switch counter modification to cmpxchg
  2023-05-15 18:00 [PATCH v8 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (6 preceding siblings ...)
  2023-05-15 18:00 ` [PATCH v8 07/13] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
@ 2023-05-15 18:00 ` Marcelo Tosatti
  2023-05-15 18:00 ` [PATCH v8 09/13] vmstat: switch per-cpu vmstat counters to 32-bits Marcelo Tosatti
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2023-05-15 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

In preparation for switching vmstat shepherd to flush
per-CPU counters remotely, switch the __{mod,inc,dec} functions that
modify the counters to use cmpxchg.

To facilitate review, the functions are ordered in the source file as follows:

__{mod,inc,dec}_{zone,node}_page_state
#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
{mod,inc,dec}_{zone,node}_page_state
#else
{mod,inc,dec}_{zone,node}_page_state
#endif

This patch defines the __ versions for the
CONFIG_HAVE_CMPXCHG_LOCAL case to be their non-"__" counterparts:

#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
{mod,inc,dec}_{zone,node}_page_state
__{mod,inc,dec}_{zone,node}_page_state = {mod,inc,dec}_{zone,node}_page_state
#else
{mod,inc,dec}_{zone,node}_page_state
__{mod,inc,dec}_{zone,node}_page_state
#endif

To test the performance difference, a page allocator microbenchmark:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c 
with loops=1000000 was used, on Intel Core i7-11850H @ 2.50GHz.

For the single_page_alloc_free test, which does

        /** Loop to measure **/
        for (i = 0; i < rec->loops; i++) {
                my_page = alloc_page(gfp_mask);
                if (unlikely(my_page == NULL))
                        return 0;
                __free_page(my_page);
        }

Unit is cycles.

Vanilla			Patched		Diff
115.25			117		1.4%

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -334,6 +334,188 @@ void set_pgdat_percpu_threshold(pg_data_
 	}
 }
 
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/*
+ * If we have cmpxchg_local support then we do not need to incur the overhead
+ * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
+ *
+ * mod_state() modifies the zone counter state through atomic per cpu
+ * operations.
+ *
+ * Overstep mode specifies how overstep should handled:
+ *     0       No overstepping
+ *     1       Overstepping half of threshold
+ *     -1      Overstepping minus half of threshold
+ */
+static inline void mod_zone_state(struct zone *zone, enum zone_stat_item item,
+				  long delta, int overstep_mode)
+{
+	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
+	s8 __percpu *p = pcp->vm_stat_diff + item;
+	long o, n, t, z;
+
+	do {
+		z = 0;  /* overflow to zone counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a zone.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (abs(n) > t) {
+			int os = overstep_mode * (t >> 1);
+
+			/* Overflow must be added to zone counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		zone_page_state_add(z, zone, item);
+}
+
+void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+			 long delta)
+{
+	mod_zone_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_zone_page_state);
+
+void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+			   long delta)
+{
+	mod_zone_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_zone_page_state);
+
+void inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_zone_page_state);
+
+void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(__inc_zone_page_state);
+
+void dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_zone_page_state);
+
+void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(__dec_zone_page_state);
+
+static inline void mod_node_state(struct pglist_data *pgdat,
+				  enum node_stat_item item,
+				  int delta, int overstep_mode)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	long o, n, t, z;
+
+	if (vmstat_item_in_bytes(item)) {
+		/*
+		 * Only cgroups use subpage accounting right now; at
+		 * the global level, these items still change in
+		 * multiples of whole pages. Store them as pages
+		 * internally to keep the per-cpu counters compact.
+		 */
+		VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
+		delta >>= PAGE_SHIFT;
+	}
+
+	do {
+		z = 0;  /* overflow to node counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong the cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a node.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (abs(n) > t) {
+			int os = overstep_mode * (t >> 1);
+
+			/* Overflow must be added to node counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		node_page_state_add(z, pgdat, item);
+}
+
+void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_node_page_state);
+
+void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_node_page_state);
+
+void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	mod_node_state(pgdat, item, 1, 1);
+}
+
+void inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_node_page_state);
+
+void __inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(__inc_node_page_state);
+
+void dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_node_page_state);
+
+void __dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(__dec_node_page_state);
+#else
 /*
  * For use when we know that interrupts are disabled,
  * or when we know that preemption is disabled and that
@@ -541,149 +723,6 @@ void __dec_node_page_state(struct page *
 }
 EXPORT_SYMBOL(__dec_node_page_state);
 
-#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
-/*
- * If we have cmpxchg_local support then we do not need to incur the overhead
- * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
- *
- * mod_state() modifies the zone counter state through atomic per cpu
- * operations.
- *
- * Overstep mode specifies how overstep should handled:
- *     0       No overstepping
- *     1       Overstepping half of threshold
- *     -1      Overstepping minus half of threshold
-*/
-static inline void mod_zone_state(struct zone *zone,
-       enum zone_stat_item item, long delta, int overstep_mode)
-{
-	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	long o, n, t, z;
-
-	do {
-		z = 0;  /* overflow to zone counters */
-
-		/*
-		 * The fetching of the stat_threshold is racy. We may apply
-		 * a counter threshold to the wrong the cpu if we get
-		 * rescheduled while executing here. However, the next
-		 * counter update will apply the threshold again and
-		 * therefore bring the counter under the threshold again.
-		 *
-		 * Most of the time the thresholds are the same anyways
-		 * for all cpus in a zone.
-		 */
-		t = this_cpu_read(pcp->stat_threshold);
-
-		o = this_cpu_read(*p);
-		n = delta + o;
-
-		if (abs(n) > t) {
-			int os = overstep_mode * (t >> 1) ;
-
-			/* Overflow must be added to zone counters */
-			z = n + os;
-			n = -os;
-		}
-	} while (this_cpu_cmpxchg(*p, o, n) != o);
-
-	if (z)
-		zone_page_state_add(z, zone, item);
-}
-
-void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
-			 long delta)
-{
-	mod_zone_state(zone, item, delta, 0);
-}
-EXPORT_SYMBOL(mod_zone_page_state);
-
-void inc_zone_page_state(struct page *page, enum zone_stat_item item)
-{
-	mod_zone_state(page_zone(page), item, 1, 1);
-}
-EXPORT_SYMBOL(inc_zone_page_state);
-
-void dec_zone_page_state(struct page *page, enum zone_stat_item item)
-{
-	mod_zone_state(page_zone(page), item, -1, -1);
-}
-EXPORT_SYMBOL(dec_zone_page_state);
-
-static inline void mod_node_state(struct pglist_data *pgdat,
-       enum node_stat_item item, int delta, int overstep_mode)
-{
-	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	long o, n, t, z;
-
-	if (vmstat_item_in_bytes(item)) {
-		/*
-		 * Only cgroups use subpage accounting right now; at
-		 * the global level, these items still change in
-		 * multiples of whole pages. Store them as pages
-		 * internally to keep the per-cpu counters compact.
-		 */
-		VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
-		delta >>= PAGE_SHIFT;
-	}
-
-	do {
-		z = 0;  /* overflow to node counters */
-
-		/*
-		 * The fetching of the stat_threshold is racy. We may apply
-		 * a counter threshold to the wrong the cpu if we get
-		 * rescheduled while executing here. However, the next
-		 * counter update will apply the threshold again and
-		 * therefore bring the counter under the threshold again.
-		 *
-		 * Most of the time the thresholds are the same anyways
-		 * for all cpus in a node.
-		 */
-		t = this_cpu_read(pcp->stat_threshold);
-
-		o = this_cpu_read(*p);
-		n = delta + o;
-
-		if (abs(n) > t) {
-			int os = overstep_mode * (t >> 1) ;
-
-			/* Overflow must be added to node counters */
-			z = n + os;
-			n = -os;
-		}
-	} while (this_cpu_cmpxchg(*p, o, n) != o);
-
-	if (z)
-		node_page_state_add(z, pgdat, item);
-}
-
-void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
-					long delta)
-{
-	mod_node_state(pgdat, item, delta, 0);
-}
-EXPORT_SYMBOL(mod_node_page_state);
-
-void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
-{
-	mod_node_state(pgdat, item, 1, 1);
-}
-
-void inc_node_page_state(struct page *page, enum node_stat_item item)
-{
-	mod_node_state(page_pgdat(page), item, 1, 1);
-}
-EXPORT_SYMBOL(inc_node_page_state);
-
-void dec_node_page_state(struct page *page, enum node_stat_item item)
-{
-	mod_node_state(page_pgdat(page), item, -1, -1);
-}
-EXPORT_SYMBOL(dec_node_page_state);
-#else
 /*
  * Use interrupt disable to serialize counter updates
  */
Index: linux-vmstat-remote/mm/page_alloc.c
===================================================================
--- linux-vmstat-remote.orig/mm/page_alloc.c
+++ linux-vmstat-remote/mm/page_alloc.c
@@ -6249,9 +6249,6 @@ static int page_alloc_cpu_dead(unsigned
 	/*
 	 * Zero the differential counters of the dead processor
 	 * so that the vm statistics are consistent.
-	 *
-	 * This is only okay since the processor is dead and cannot
-	 * race with what we are doing.
 	 */
 	cpu_vm_stats_fold(cpu);
 




* [PATCH v8 09/13] vmstat: switch per-cpu vmstat counters to 32-bits
  2023-05-15 18:00 [PATCH v8 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (7 preceding siblings ...)
  2023-05-15 18:00 ` [PATCH v8 08/13] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
@ 2023-05-15 18:00 ` Marcelo Tosatti
  2023-05-15 18:00 ` [PATCH v8 10/13] mm/vmstat: use xchg in cpu_vm_stats_fold Marcelo Tosatti
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2023-05-15 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

Some architectures only provide xchg/cmpxchg in 32/64-bit quantities.

Since the next patch is about to use xchg on per-CPU vmstat counters,
switch them to s32.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/include/linux/mmzone.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/mmzone.h
+++ linux-vmstat-remote/include/linux/mmzone.h
@@ -675,8 +675,8 @@ struct per_cpu_pages {
 
 struct per_cpu_zonestat {
 #ifdef CONFIG_SMP
-	s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
-	s8 stat_threshold;
+	s32 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
+	s32 stat_threshold;
 #endif
 #ifdef CONFIG_NUMA
 	/*
@@ -689,8 +689,8 @@ struct per_cpu_zonestat {
 };
 
 struct per_cpu_nodestat {
-	s8 stat_threshold;
-	s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
+	s32 stat_threshold;
+	s32 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
 };
 
 #endif /* !__GENERATING_BOUNDS.H */
Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -351,7 +351,7 @@ static inline void mod_zone_state(struct
 				  long delta, int overstep_mode)
 {
 	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
+	s32 __percpu *p = pcp->vm_stat_diff + item;
 	long o, n, t, z;
 
 	do {
@@ -428,7 +428,7 @@ static inline void mod_node_state(struct
 				  int delta, int overstep_mode)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long o, n, t, z;
 
 	if (vmstat_item_in_bytes(item)) {
@@ -525,7 +525,7 @@ void __mod_zone_page_state(struct zone *
 			   long delta)
 {
 	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
+	s32 __percpu *p = pcp->vm_stat_diff + item;
 	long x;
 	long t;
 
@@ -556,7 +556,7 @@ void __mod_node_page_state(struct pglist
 				long delta)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
 	long x;
 	long t;
 
@@ -614,8 +614,8 @@ EXPORT_SYMBOL(__mod_node_page_state);
 void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_stat_diff + item;
+	s32 v, t;
 
 	/* See __mod_node_page_state */
 	preempt_disable_nested();
@@ -623,7 +623,7 @@ void __inc_zone_state(struct zone *zone,
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		zone_page_state_add(v + overstep, zone, item);
 		__this_cpu_write(*p, -overstep);
@@ -635,8 +635,8 @@ void __inc_zone_state(struct zone *zone,
 void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
 
@@ -646,7 +646,7 @@ void __inc_node_state(struct pglist_data
 	v = __this_cpu_inc_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v > t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v + overstep, pgdat, item);
 		__this_cpu_write(*p, -overstep);
@@ -670,8 +670,8 @@ EXPORT_SYMBOL(__inc_node_page_state);
 void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
 {
 	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_stat_diff + item;
+	s32 v, t;
 
 	/* See __mod_node_page_state */
 	preempt_disable_nested();
@@ -679,7 +679,7 @@ void __dec_zone_state(struct zone *zone,
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		zone_page_state_add(v - overstep, zone, item);
 		__this_cpu_write(*p, overstep);
@@ -691,8 +691,8 @@ void __dec_zone_state(struct zone *zone,
 void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
 {
 	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	s8 v, t;
+	s32 __percpu *p = pcp->vm_node_stat_diff + item;
+	s32 v, t;
 
 	VM_WARN_ON_ONCE(vmstat_item_in_bytes(item));
 
@@ -702,7 +702,7 @@ void __dec_node_state(struct pglist_data
 	v = __this_cpu_dec_return(*p);
 	t = __this_cpu_read(pcp->stat_threshold);
 	if (unlikely(v < - t)) {
-		s8 overstep = t >> 1;
+		s32 overstep = t >> 1;
 
 		node_page_state_add(v - overstep, pgdat, item);
 		__this_cpu_write(*p, overstep);



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v8 10/13] mm/vmstat: use xchg in cpu_vm_stats_fold
  2023-05-15 18:00 [PATCH v8 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (8 preceding siblings ...)
  2023-05-15 18:00 ` [PATCH v8 09/13] vmstat: switch per-cpu vmstat counters to 32-bits Marcelo Tosatti
@ 2023-05-15 18:00 ` Marcelo Tosatti
  2023-05-15 18:00 ` [PATCH v8 11/13] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2023-05-15 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

In preparation for switching the vmstat shepherd to flush
per-CPU counters remotely, use xchg instead of a pair of
read/write instructions.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
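
As an illustration of why the read/write pair is not enough once another
CPU is allowed to do the folding, here is a minimal userspace model (not
part of the patch; C11 atomics stand in for the kernel's cmpxchg/xchg
primitives, and the names diff, total, local_add and remote_fold are made
up for the example):

#include <stdatomic.h>
#include <stdio.h>

static _Atomic int diff;	/* models pzstats->vm_stat_diff[i] */
static _Atomic long total;	/* models zone->vm_stat[i] */

/* Models the local update path: a cmpxchg loop as in mod_zone_state(). */
static void local_add(int delta)
{
	int o, n;

	do {
		o = atomic_load(&diff);
		n = o + delta;
	} while (!atomic_compare_exchange_weak(&diff, &o, n));
}

/* Models cpu_vm_stats_fold() after this patch: claim and zero in one step. */
static void remote_fold(void)
{
	int v = atomic_exchange(&diff, 0);

	atomic_fetch_add(&total, v);
}

/*
 * With the old read-then-write pair, a local_add() slipping in between the
 * read and the write of zero would be lost.  The exchange closes that
 * window: a concurrent update lands either before the fold (and is
 * accounted) or after it (and stays pending for the next fold).
 */
int main(void)
{
	local_add(3);
	remote_fold();
	local_add(2);
	printf("total=%ld pending=%d\n", (long)atomic_load(&total),
	       atomic_load(&diff));	/* prints total=3 pending=2 */
	return 0;
}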

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -924,7 +924,7 @@ static int refresh_cpu_vm_stats(bool do_
 }
 
 /*
- * Fold the data for an offline cpu into the global array.
+ * Fold the data for a cpu into the global array.
  * There cannot be any access by the offline cpu and therefore
  * synchronization is simplified.
  */
@@ -945,8 +945,7 @@ void cpu_vm_stats_fold(int cpu)
 			if (pzstats->vm_stat_diff[i]) {
 				int v;
 
-				v = pzstats->vm_stat_diff[i];
-				pzstats->vm_stat_diff[i] = 0;
+				v = xchg(&pzstats->vm_stat_diff[i], 0);
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
 			}
@@ -956,8 +955,7 @@ void cpu_vm_stats_fold(int cpu)
 			if (pzstats->vm_numa_event[i]) {
 				unsigned long v;
 
-				v = pzstats->vm_numa_event[i];
-				pzstats->vm_numa_event[i] = 0;
+				v = xchg(&pzstats->vm_numa_event[i], 0);
 				zone_numa_event_add(v, zone, i);
 			}
 		}
@@ -973,8 +971,7 @@ void cpu_vm_stats_fold(int cpu)
 			if (p->vm_node_stat_diff[i]) {
 				int v;
 
-				v = p->vm_node_stat_diff[i];
-				p->vm_node_stat_diff[i] = 0;
+				v = xchg(&p->vm_node_stat_diff[i], 0);
 				atomic_long_add(v, &pgdat->vm_stat[i]);
 				global_node_diff[i] += v;
 			}



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v8 11/13] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely
  2023-05-15 18:00 [PATCH v8 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (9 preceding siblings ...)
  2023-05-15 18:00 ` [PATCH v8 10/13] mm/vmstat: use xchg in cpu_vm_stats_fold Marcelo Tosatti
@ 2023-05-15 18:00 ` Marcelo Tosatti
  2023-05-15 18:00 ` [PATCH v8 12/13] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2023-05-15 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

With a task that busy loops on a given CPU,
the kworker interruption to execute vmstat_update
is undesired and may exceed latency thresholds
for certain applications.

Performance details for the kworker interruption:

oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...

The example above shows an additional 7us for the

        oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold for certain applications.

To fix this, now that the counters are modified via cmpxchg
both CPU-locally (via the account functions) and remotely (via
cpu_vm_stats_fold), it is possible to switch vmstat_shepherd to perform
the per-CPU vmstat folding remotely.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -2049,6 +2049,23 @@ static void vmstat_shepherd(struct work_
 
 static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);
 
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
+static void vmstat_shepherd(struct work_struct *w)
+{
+	int cpu;
+
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		cpu_vm_stats_fold(cpu);
+		cond_resched();
+	}
+	cpus_read_unlock();
+
+	schedule_delayed_work(&shepherd,
+		round_jiffies_relative(sysctl_stat_interval));
+}
+#else
 static void vmstat_shepherd(struct work_struct *w)
 {
 	int cpu;
@@ -2068,6 +2085,7 @@ static void vmstat_shepherd(struct work_
 	schedule_delayed_work(&shepherd,
 		round_jiffies_relative(sysctl_stat_interval));
 }
+#endif
 
 static void __init start_shepherd_timer(void)
 {



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v8 12/13] mm/vmstat: refresh stats remotely instead of via work item
  2023-05-15 18:00 [PATCH v8 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (10 preceding siblings ...)
  2023-05-15 18:00 ` [PATCH v8 11/13] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
@ 2023-05-15 18:00 ` Marcelo Tosatti
  2023-05-15 18:00 ` [PATCH v8 13/13] vmstat: add pcp remote node draining via cpu_vm_stats_fold Marcelo Tosatti
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2023-05-15 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

Refresh per-CPU stats remotely, instead of queueing
work items, for the stat_refresh procfs method.

This fixes a sosreport hang (sosreport uses vmstat_refresh) when a
spinning SCHED_FIFO process keeps the queued work item from ever running.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
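
For reference, the procfs method mentioned above is /proc/sys/vm/stat_refresh;
a minimal way to trigger it from userspace (illustrative only, needs root)
looks like this:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/*
	 * Writing any value forces a vmstat refresh.  Before this patch the
	 * kernel queued a work item on every CPU (schedule_on_each_cpu),
	 * which can block forever behind a spinning SCHED_FIFO task; with
	 * this patch the refresh is performed remotely instead.
	 */
	int fd = open("/proc/sys/vm/stat_refresh", O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "1", 1) < 0)
		perror("stat_refresh");	/* fails if a counter went negative */
	close(fd);
	return 0;
}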

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -1907,11 +1907,20 @@ static DEFINE_PER_CPU(struct delayed_wor
 int sysctl_stat_interval __read_mostly = HZ;
 
 #ifdef CONFIG_PROC_FS
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+static int refresh_all_vm_stats(void);
+#else
 static void refresh_vm_stats(struct work_struct *work)
 {
 	refresh_cpu_vm_stats(true);
 }
 
+static int refresh_all_vm_stats(void)
+{
+	return schedule_on_each_cpu(refresh_vm_stats);
+}
+#endif
+
 int vmstat_refresh(struct ctl_table *table, int write,
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
@@ -1931,7 +1940,7 @@ int vmstat_refresh(struct ctl_table *tab
 	 * transiently negative values, report an error here if any of
 	 * the stats is negative, so we know to go looking for imbalance.
 	 */
-	err = schedule_on_each_cpu(refresh_vm_stats);
+	err = refresh_all_vm_stats();
 	if (err)
 		return err;
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
@@ -2051,7 +2060,7 @@ static DECLARE_DEFERRABLE_WORK(shepherd,
 
 #ifdef CONFIG_HAVE_CMPXCHG_LOCAL
 /* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
-static void vmstat_shepherd(struct work_struct *w)
+static int refresh_all_vm_stats(void)
 {
 	int cpu;
 
@@ -2061,7 +2070,12 @@ static void vmstat_shepherd(struct work_
 		cond_resched();
 	}
 	cpus_read_unlock();
+	return 0;
+}
 
+static void vmstat_shepherd(struct work_struct *w)
+{
+	refresh_all_vm_stats();
 	schedule_delayed_work(&shepherd,
 		round_jiffies_relative(sysctl_stat_interval));
 }



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v8 13/13] vmstat: add pcp remote node draining via cpu_vm_stats_fold
  2023-05-15 18:00 [PATCH v8 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (11 preceding siblings ...)
  2023-05-15 18:00 ` [PATCH v8 12/13] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
@ 2023-05-15 18:00 ` Marcelo Tosatti
  2023-05-16  8:09 ` [PATCH v8 00/13] fold per-CPU vmstats remotely Christoph Lameter
  2023-05-24 12:51 ` Michal Hocko
  14 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2023-05-15 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko, Marcelo Tosatti

Large NUMA systems might have significant portions of system memory
trapped in pcp queues. The number of pcps is determined by the number
of processors and nodes in a system. A system with 4 processors and
2 nodes has 8 pcps, which is okay. But a system with 1024 processors
and 512 nodes has 512k pcps, with a high potential for a large amount
of memory being caught in them.

Enable remote node draining for the CONFIG_HAVE_CMPXCHG_LOCAL case,
where vmstat_shepherd will perform the aging and draining via
cpu_vm_stats_fold.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
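
A rough back-of-envelope model of how much memory can be caught in pcp
queues (the pages-per-pcp figure below is an assumed illustrative value,
not something taken from the kernel):

#include <stdio.h>

int main(void)
{
	unsigned long long cpus = 1024, nodes = 512;	/* example from the changelog */
	unsigned long long pcps = cpus * nodes;		/* one pcp per (cpu, node) */
	unsigned long long pages_per_pcp = 1000;	/* assumed pcp capacity */
	unsigned long long page_size = 4096;

	printf("pcps: %lluk\n", pcps >> 10);
	printf("worst-case memory in pcp queues: %llu GiB\n",
	       (pcps * pages_per_pcp * page_size) >> 30);
	return 0;
}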

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -928,7 +928,7 @@ static int refresh_cpu_vm_stats(bool do_
  * There cannot be any access by the offline cpu and therefore
  * synchronization is simplified.
  */
-void cpu_vm_stats_fold(int cpu)
+void cpu_vm_stats_fold(int cpu, bool do_pagesets)
 {
 	struct pglist_data *pgdat;
 	struct zone *zone;
@@ -938,6 +938,9 @@ void cpu_vm_stats_fold(int cpu)
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_zonestat *pzstats;
+#ifdef CONFIG_NUMA
+		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+#endif
 
 		pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
 
@@ -948,6 +951,11 @@ void cpu_vm_stats_fold(int cpu)
 				v = xchg(&pzstats->vm_stat_diff[i], 0);
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
+#ifdef CONFIG_NUMA
+				/* 3 seconds idle till flush */
+				if (do_pagesets)
+					pcp->expire = 3;
+#endif
 			}
 		}
 #ifdef CONFIG_NUMA
@@ -959,6 +967,38 @@ void cpu_vm_stats_fold(int cpu)
 				zone_numa_event_add(v, zone, i);
 			}
 		}
+
+		if (do_pagesets) {
+			cond_resched();
+			/*
+			 * Deal with draining the remote pageset of a
+			 * processor
+			 *
+			 * Check if there are pages remaining in this pageset
+			 * if not then there is nothing to expire.
+			 */
+			if (!pcp->expire || !pcp->count)
+				continue;
+
+			/*
+			 * We never drain zones local to this processor.
+			 */
+			if (zone_to_nid(zone) == cpu_to_node(cpu)) {
+				pcp->expire = 0;
+				continue;
+			}
+
+			WARN_ON(pcp->expire < 0);
+			/*
+			 * pcp->expire is only accessed from vmstat_shepherd context,
+			 * therefore no locking is required.
+			 */
+			if (--pcp->expire)
+				continue;
+
+			if (pcp->count)
+				drain_zone_pages(zone, pcp);
+		}
 #endif
 	}
 
@@ -2066,7 +2106,7 @@ static int refresh_all_vm_stats(void)
 
 	cpus_read_lock();
 	for_each_online_cpu(cpu) {
-		cpu_vm_stats_fold(cpu);
+		cpu_vm_stats_fold(cpu, true);
 		cond_resched();
 	}
 	cpus_read_unlock();
Index: linux-vmstat-remote/include/linux/vmstat.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/vmstat.h
+++ linux-vmstat-remote/include/linux/vmstat.h
@@ -297,7 +297,7 @@ extern void __dec_zone_state(struct zone
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
 void quiet_vmstat(void);
-void cpu_vm_stats_fold(int cpu);
+void cpu_vm_stats_fold(int cpu, bool do_pagesets);
 void refresh_zone_stat_thresholds(void);
 
 struct ctl_table;
Index: linux-vmstat-remote/mm/page_alloc.c
===================================================================
--- linux-vmstat-remote.orig/mm/page_alloc.c
+++ linux-vmstat-remote/mm/page_alloc.c
@@ -6250,7 +6250,7 @@ static int page_alloc_cpu_dead(unsigned
 	 * Zero the differential counters of the dead processor
 	 * so that the vm statistics are consistent.
 	 */
-	cpu_vm_stats_fold(cpu);
+	cpu_vm_stats_fold(cpu, false);
 
 	for_each_populated_zone(zone)
 		zone_pcp_update(zone, 0);



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v8 00/13] fold per-CPU vmstats remotely
  2023-05-15 18:00 [PATCH v8 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (12 preceding siblings ...)
  2023-05-15 18:00 ` [PATCH v8 13/13] vmstat: add pcp remote node draining via cpu_vm_stats_fold Marcelo Tosatti
@ 2023-05-16  8:09 ` Christoph Lameter
  2023-05-16 18:02   ` Marcelo Tosatti
  2023-05-24 12:51 ` Michal Hocko
  14 siblings, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2023-05-16  8:09 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko

The patchset still modifies the semantics of this_cpu operations,
replacing the lockless RMV operations with locked ones. One of the
rationales for the use of this_cpu operations is their efficiency, since
locked RMV atomics are avoided. This patchset destroys that functionality.

If you want locked RMV semantics then use them through cmpxchg() and
friends. Do not modify this_cpu operations by changing the implementation
in the arch code.






^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v8 00/13] fold per-CPU vmstats remotely
  2023-05-16  8:09 ` [PATCH v8 00/13] fold per-CPU vmstats remotely Christoph Lameter
@ 2023-05-16 18:02   ` Marcelo Tosatti
  0 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2023-05-16 18:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Russell King, Huacai Chen, Heiko Carstens, x86,
	Vlastimil Babka, Michal Hocko

Hi Christoph,

On Tue, May 16, 2023 at 10:09:02AM +0200, Christoph Lameter wrote:
> The patchset still modifies the semantics of this_cpu operations,
> replacing the lockless RMV operations with locked ones.

It does that to follow the pre-existing kernel convention:

function-name			LOCK prefix
cmpxchg				YES
cmpxchg_local			NO

So the patchset introduces:

function-name			LOCK prefix
this_cpu_cmpxchg		YES
this_cpu_cmpxchg_local		NO

> One of the
> rationales for the use  this_cpu operations is their efficiency since
> locked RMV atomics are avoided. 

And there is the freedom to choose between this_cpu_cmpxchg and
this_cpu_cmpxchg_local (depending on intended usage).

> This patchset destroys that functionality.

Patch 6 is

Subject: [PATCH v8 06/13] add this_cpu_cmpxchg_local and asm-generic definitions

Which adds this_cpu_cmpxchg_local

Patch 7 converts all other this_cpu_cmpxchg users
(except the vmstat ones)

[PATCH v8 07/13] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local

So the non-LOCK'ed behaviour is maintained for existing users.
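
To make the intended split concrete, a sketch (not taken from the series;
demo_counter and the two helpers are made-up names):

DEFINE_PER_CPU(int, demo_counter);

/*
 * Callers that only race with interrupts/preemption on the local CPU
 * (the patch 07 conversions) keep the cheap variant, free of the LOCK
 * prefix:
 */
static int demo_update_local(int old, int new)
{
	return this_cpu_cmpxchg_local(demo_counter, old, new);
}

/*
 * vmstat-style callers, whose counters may also be folded remotely via
 * cpu_vm_stats_fold(), use the full (LOCK'd) variant:
 */
static int demo_update(int old, int new)
{
	return this_cpu_cmpxchg(demo_counter, old, new);
}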

> If you want locked RMV semantics then use them through cmpxchg() and
> friends. Do not modify this_cpu operations by changing the implementation
> in the arch code.

But then it would be necessary to disable preemption here:

static inline void mod_zone_state(struct zone *zone, enum zone_stat_item item,
                                  long delta, int overstep_mode)
{
        struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
        s32 __percpu *p = pcp->vm_stat_diff + item;
        long o, n, t, z;

        do {
                z = 0;  /* overflow to zone counters */

                /*
                 * The fetching of the stat_threshold is racy. We may apply
                 * a counter threshold to the wrong the cpu if we get
                 * rescheduled while executing here. However, the next
                 * counter update will apply the threshold again and
                 * therefore bring the counter under the threshold again.
                 *
                 * Most of the time the thresholds are the same anyways
                 * for all cpus in a zone.
                 */
                t = this_cpu_read(pcp->stat_threshold);

                o = this_cpu_read(*p);
                n = delta + o;

                if (abs(n) > t) {
                        int os = overstep_mode * (t >> 1);

                        /* Overflow must be added to zone counters */
                        z = n + os;
                        n = -os;
                }
        } while (this_cpu_cmpxchg(*p, o, n) != o);
		 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

        if (z)
                zone_page_state_add(z, zone, item);
}
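
For comparison, a sketch (hypothetical, not part of the series) of what
using plain cmpxchg() here would require:

        /*
         * this_cpu_ptr(p) must keep pointing at the counter of the CPU
         * that did the reads, so the whole sequence has to be pinned:
         */
        preempt_disable();
        do {
                z = 0;
                t = this_cpu_read(pcp->stat_threshold);
                o = this_cpu_read(*p);
                n = delta + o;
                if (abs(n) > t) {
                        int os = overstep_mode * (t >> 1);

                        z = n + os;
                        n = -os;
                }
        } while (cmpxchg(this_cpu_ptr(p), o, n) != o);
        preempt_enable();
        /* followed by the same zone_page_state_add(z, zone, item) as above */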

Earlier you objected to disabling preemption on this codepath
(which is what led to this patchset in the first place):

"Using preemption is a way to make this work correctly. However, 
doing so would sacrifice the performance, low impact and the
scalability of the vm counters."

So it seems a locked this_cpu function, which does a locked cmpxchg,
is desired.

Perhaps you disagree with the this_cpu_cmpxchg_local/this_cpu_cmpxchg
naming?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v8 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot
  2023-05-15 18:00 ` [PATCH v8 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot Marcelo Tosatti
@ 2023-05-24 12:34   ` Michal Hocko
  2023-05-25  7:05   ` Aaron Tomlin
  1 sibling, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2023-05-24 12:34 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Mon 15-05-23 15:00:16, Marcelo Tosatti wrote:
> A customer provided evidence indicating that a process
> was stalled in direct reclaim:
> 
>  - The process was trapped in throttle_direct_reclaim().
>    The function wait_event_killable() was called to wait condition     
>    allow_direct_reclaim(pgdat) for current node to be true.     
>    The allow_direct_reclaim(pgdat) examined the number of free pages     
>    on the node by zone_page_state() which just returns value in     
>    zone->vm_stat[NR_FREE_PAGES].     
>                                                 
>  - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.            
>    However, the freelist on this node was not empty.     
>                            
>  - This inconsistent of vmstat value was caused by percpu vmstat on     
>    nohz_full cpus. Every increment/decrement of vmstat is performed     
>    on percpu vmstat counter at first, then pooled diffs are cumulated     
>    to the zone's vmstat counter in timely manner. However, on nohz_full     
>    cpus (in case of this customer's system, 48 of 52 cpus) these pooled     
>    diffs were not cumulated once the cpu had no event on it so that     
>    the cpu started sleeping infinitely.                       
>    I checked percpu vmstat and found there were total 69 counts not         
>    cumulated to the zone's vmstat counter yet.     
>                                          
>  - In this situation, kswapd did not help the trapped process.     
>    In pgdat_balanced(), zone_wakermark_ok_safe() examined the number     
>    of free pages on the node by zone_page_state_snapshot() which     
>    checks pending counts on percpu vmstat.     
>    Therefore kswapd could know there were 69 free pages correctly.     
>    Since zone->_watermark = {8, 20, 32}, kswapd did not work because     
>    69 was greater than 32 as high watermark.     
> 
> Change allow_direct_reclaim to use zone_page_state_snapshot, which
> allows a more precise version of the vmstat counters to be used.
> 
> allow_direct_reclaim will only be called from try_to_free_pages,
> which is not a hot path.

IIRC I have asked for this patch to contain a note about the testing
done, for future reference. You have said that the above scenario is not
easy to reproduce and therefore not much testing has been done, but that
is valuable information on its own.

> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

I have also acked the patch. So let me repeat
Acked-by: Michal Hocko <mhocko@suse.com>

Btw. the patch should be routed separately as it is a potential fix.
> 
> ---
> 
> Index: linux-vmstat-remote/mm/vmscan.c
> ===================================================================
> --- linux-vmstat-remote.orig/mm/vmscan.c
> +++ linux-vmstat-remote/mm/vmscan.c
> @@ -6886,7 +6886,7 @@ static bool allow_direct_reclaim(pg_data
>  			continue;
>  
>  		pfmemalloc_reserve += min_wmark_pages(zone);
> -		free_pages += zone_page_state(zone, NR_FREE_PAGES);
> +		free_pages += zone_page_state_snapshot(zone, NR_FREE_PAGES);
>  	}
>  
>  	/* If there are no reserves (unexpected config) then do not throttle */
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v8 00/13] fold per-CPU vmstats remotely
  2023-05-15 18:00 [PATCH v8 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (13 preceding siblings ...)
  2023-05-16  8:09 ` [PATCH v8 00/13] fold per-CPU vmstats remotely Christoph Lameter
@ 2023-05-24 12:51 ` Michal Hocko
  2023-05-24 13:53   ` Marcelo Tosatti
  14 siblings, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2023-05-24 12:51 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

[Sorry for a late response but I was conferencing last two weeks and now
catching up]

On Mon 15-05-23 15:00:15, Marcelo Tosatti wrote:
[...]
> v8
> - Add summary of discussion on -v7 to cover letter

Thanks, this is very useful! It helps to frame the further discussion.

I believe the most important question to answer is in fact this:
> I think what needs to be done is to avoid new queue_work_on()
> users from being introduced in the tree (the number of
> existing ones is finite and can therefore be fixed).
> 
> Agree with the criticism here, however, i can't see other
> options than the following:
> 
>         1) Given an activity, which contains a sequence of instructions
>            to execute on a CPU, to change the algorithm
>            to execute that code remotely (therefore avoid interrupting a CPU),
>            or to avoid the interruption somehow (which must be dealt with
>            on a case-by-case basis).
> 
>         2) To block that activity from happening in the first place,
>            for the sites where it can be blocked (that return errors to
>            userspace, for example).
> 
>         3) Completly isolate the CPU from the kernel (off-line it).

I agree that a reliable cpu isolation implementation needs to address
the queue_work_on problem. And it has to do that _reliably_. This cannot
be achieved by an endless whack-a-mole, chasing each new instance. There
must be a more systematic approach. One way would be to change the
semantics of schedule_work_on and fail the call for an isolated CPU. The
caller would have a way to fall back and handle the operation by other
means. E.g. vmstat could simply skip folding pcp data because the
imprecision shouldn't really matter. Other callers might choose to do the
operation remotely. This is a lot of work, no doubt about that, but it
is a long-term maintainable solution that doesn't give you new surprises
with any newly released kernel. There are likely other remote interfaces
that would need to follow that scheme.
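
Just to illustrate the kind of contract I mean (entirely hypothetical,
nothing like this exists today; cpu_is_isolated() and queue_stat_work_on()
are stand-in names):

static bool queue_stat_work_on(int cpu, struct work_struct *work)
{
	if (cpu_is_isolated(cpu)) {
		/*
		 * Refuse to disturb an isolated CPU and let the caller pick
		 * a fallback: vmstat could simply skip the fold, other
		 * callers could perform the operation remotely.
		 */
		return false;
	}
	return schedule_work_on(cpu, work);
}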

If the cpu isolation is not planned to be worth that time investment
then I do not think it is worth slowing down the highly optimized vmstat
code either. These stats are invoked from many hot paths and the per-cpu
implementation has been optimized for that case. If your workload wants
to avoid that kind of disturbance then you already have the quiet_vmstat
precedent, so find a way to use it for your workload instead.
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v8 00/13] fold per-CPU vmstats remotely
  2023-05-24 12:51 ` Michal Hocko
@ 2023-05-24 13:53   ` Marcelo Tosatti
  2023-05-25  6:47     ` Michal Hocko
  0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2023-05-24 13:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Wed, May 24, 2023 at 02:51:55PM +0200, Michal Hocko wrote:
> [Sorry for a late response but I was conferencing last two weeks and now
> catching up]
> 
> On Mon 15-05-23 15:00:15, Marcelo Tosatti wrote:
> [...]
> > v8
> > - Add summary of discussion on -v7 to cover letter
> 
> Thanks this is very useful! This helps to frame the further discussion.
> 
> I believe the most important question to answer is this in fact
> > I think what needs to be done is to avoid new queue_work_on()
> > users from being introduced in the tree (the number of
> > existing ones is finite and can therefore be fixed).
> > 
> > Agree with the criticism here, however, i can't see other
> > options than the following:
> > 
> >         1) Given an activity, which contains a sequence of instructions
> >            to execute on a CPU, to change the algorithm
> >            to execute that code remotely (therefore avoid interrupting a CPU),
> >            or to avoid the interruption somehow (which must be dealt with
> >            on a case-by-case basis).
> > 
> >         2) To block that activity from happening in the first place,
> >            for the sites where it can be blocked (that return errors to
> >            userspace, for example).
> > 
> >         3) Completly isolate the CPU from the kernel (off-line it).
> 
> I agree that a reliable cpu isolation implementation needs to address
> queue_work_on problem. And it has to do that _realiably_. This cannot by
> achieved by an endless whack-a-mole and chasing each new instance. There
> must be a more systematic approach. One way would be to change the
> semantic of schedule_work_on and fail call for an isolated CPU. The
> caller would have a way to fallback and handle the operation by other
> means. E.g. vmstat could simply ignore folding pcp data because an
> imprecision shouldn't really matter. Other callers might chose to do the
> operation remotely. This is a lot of work, no doubt about that, but it
> is a long term maintainable solution that doesn't give you new surprises
> with any new released kernel. There are likely other remote interfaces
> that would need to follow that scheme.
> 
> If the cpu isolation is not planned to be worth that time investment
> then I do not think it is also worth reducing a highly optimized vmstat
> code. These stats are invoked from many hot paths and per-cpu
> implementation has been optimized for that case.

It is exactly the same code, but now with a "LOCK" prefix on the CMPXCHG
instruction, which should not cost much due to cache locking (these are
per-CPU variables anyway).

> If your workload would
> like to avoid that as disturbing then you already have a quiet_vmstat
> precedence so find a way how to use it for your workload instead.
>  
> -- 
> Michal Hocko
> SUSE Labs

OK, so an alternative solution is to completely disable vmstat updates
for isolated CPUs. Are you OK with that?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v8 00/13] fold per-CPU vmstats remotely
  2023-05-24 13:53   ` Marcelo Tosatti
@ 2023-05-25  6:47     ` Michal Hocko
  0 siblings, 0 replies; 21+ messages in thread
From: Michal Hocko @ 2023-05-25  6:47 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka

On Wed 24-05-23 10:53:23, Marcelo Tosatti wrote:
> On Wed, May 24, 2023 at 02:51:55PM +0200, Michal Hocko wrote:
> > [Sorry for a late response but I was conferencing last two weeks and now
> > catching up]
> > 
> > On Mon 15-05-23 15:00:15, Marcelo Tosatti wrote:
> > [...]
> > > v8
> > > - Add summary of discussion on -v7 to cover letter
> > 
> > Thanks this is very useful! This helps to frame the further discussion.
> > 
> > I believe the most important question to answer is this in fact
> > > I think what needs to be done is to avoid new queue_work_on()
> > > users from being introduced in the tree (the number of
> > > existing ones is finite and can therefore be fixed).
> > > 
> > > Agree with the criticism here, however, i can't see other
> > > options than the following:
> > > 
> > >         1) Given an activity, which contains a sequence of instructions
> > >            to execute on a CPU, to change the algorithm
> > >            to execute that code remotely (therefore avoid interrupting a CPU),
> > >            or to avoid the interruption somehow (which must be dealt with
> > >            on a case-by-case basis).
> > > 
> > >         2) To block that activity from happening in the first place,
> > >            for the sites where it can be blocked (that return errors to
> > >            userspace, for example).
> > > 
> > >         3) Completly isolate the CPU from the kernel (off-line it).
> > 
> > I agree that a reliable cpu isolation implementation needs to address
> > queue_work_on problem. And it has to do that _realiably_. This cannot by
> > achieved by an endless whack-a-mole and chasing each new instance. There
> > must be a more systematic approach. One way would be to change the
> > semantic of schedule_work_on and fail call for an isolated CPU. The
> > caller would have a way to fallback and handle the operation by other
> > means. E.g. vmstat could simply ignore folding pcp data because an
> > imprecision shouldn't really matter. Other callers might chose to do the
> > operation remotely. This is a lot of work, no doubt about that, but it
> > is a long term maintainable solution that doesn't give you new surprises
> > with any new released kernel. There are likely other remote interfaces
> > that would need to follow that scheme.
> > 
> > If the cpu isolation is not planned to be worth that time investment
> > then I do not think it is also worth reducing a highly optimized vmstat
> > code. These stats are invoked from many hot paths and per-cpu
> > implementation has been optimized for that case.
> 
> It is exactly the same code, but now with a "LOCK" prefix for CMPXCHG
> instruction. Which should not cost much due to cache locking (these are
> per-CPU variables anyway).

Sorry, but "just a LOCK prefix" is not a serious argument for a hot path.
 
> > If your workload would
> > like to avoid that as disturbing then you already have a quiet_vmstat
> > precedence so find a way how to use it for your workload instead.
> >  
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> OK so an alternative solution is to completly disable vmstat updates
> for isolated CPUs. Are you OK with that ?

Yes, the number of events should be reasonably small and those places in
the kernel which really need a precise value need to do a per-cpu walk
anyway. IIRC /proc/vmstat et al also do accumulate pcp state.
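
(Roughly what the _snapshot() readers already do today, paraphrased; the
function name here is only illustrative:)

static long zone_page_state_precise(struct zone *zone, enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);
	int cpu;

	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_stat_diff[item];

	return x < 0 ? 0 : x;
}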

But let me reiterate. Even with vmstat updates out of the game, you have
so many other sources of disruption that your isolated workload will be
fragile until you actually try to deal with the problem on a more
fundamental level.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v8 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot
  2023-05-15 18:00 ` [PATCH v8 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot Marcelo Tosatti
  2023-05-24 12:34   ` Michal Hocko
@ 2023-05-25  7:05   ` Aaron Tomlin
  1 sibling, 0 replies; 21+ messages in thread
From: Aaron Tomlin @ 2023-05-25  7:05 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Frederic Weisbecker, Andrew Morton,
	linux-kernel, linux-mm, Russell King, Huacai Chen,
	Heiko Carstens, x86, Vlastimil Babka, Michal Hocko, Aaron Tomlin

On Mon, May 15, 2023 at 03:00:16PM -0300, Marcelo Tosatti wrote:
> A customer provided evidence indicating that a process
> was stalled in direct reclaim:
> 
>  - The process was trapped in throttle_direct_reclaim().
>    The function wait_event_killable() was called to wait condition     
>    allow_direct_reclaim(pgdat) for current node to be true.     
>    The allow_direct_reclaim(pgdat) examined the number of free pages     
>    on the node by zone_page_state() which just returns value in     
>    zone->vm_stat[NR_FREE_PAGES].     
>                                                 
>  - On node #1, zone->vm_stat[NR_FREE_PAGES] was 0.            
>    However, the freelist on this node was not empty.     
>                            
>  - This inconsistent of vmstat value was caused by percpu vmstat on     
>    nohz_full cpus. Every increment/decrement of vmstat is performed     
>    on percpu vmstat counter at first, then pooled diffs are cumulated     
>    to the zone's vmstat counter in timely manner. However, on nohz_full     
>    cpus (in case of this customer's system, 48 of 52 cpus) these pooled     
>    diffs were not cumulated once the cpu had no event on it so that     
>    the cpu started sleeping infinitely.                       
>    I checked percpu vmstat and found there were total 69 counts not         
>    cumulated to the zone's vmstat counter yet.     
>                                          
>  - In this situation, kswapd did not help the trapped process.     
>    In pgdat_balanced(), zone_wakermark_ok_safe() examined the number     
>    of free pages on the node by zone_page_state_snapshot() which     
>    checks pending counts on percpu vmstat.     
>    Therefore kswapd could know there were 69 free pages correctly.     
>    Since zone->_watermark = {8, 20, 32}, kswapd did not work because     
>    69 was greater than 32 as high watermark.     
> 
> Change allow_direct_reclaim to use zone_page_state_snapshot, which
> allows a more precise version of the vmstat counters to be used.
> 
> allow_direct_reclaim will only be called from try_to_free_pages,
> which is not a hot path.
> 
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
> 
> Index: linux-vmstat-remote/mm/vmscan.c
> ===================================================================
> --- linux-vmstat-remote.orig/mm/vmscan.c
> +++ linux-vmstat-remote/mm/vmscan.c
> @@ -6886,7 +6886,7 @@ static bool allow_direct_reclaim(pg_data
>  			continue;
>  
>  		pfmemalloc_reserve += min_wmark_pages(zone);
> -		free_pages += zone_page_state(zone, NR_FREE_PAGES);
> +		free_pages += zone_page_state_snapshot(zone, NR_FREE_PAGES);
>  	}
>  
>  	/* If there are no reserves (unexpected config) then do not throttle */
> 
> 

Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>

-- 
Aaron Tomlin

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-05-25  7:07 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-15 18:00 [PATCH v8 00/13] fold per-CPU vmstats remotely Marcelo Tosatti
2023-05-15 18:00 ` [PATCH v8 01/13] vmstat: allow_direct_reclaim should use zone_page_state_snapshot Marcelo Tosatti
2023-05-24 12:34   ` Michal Hocko
2023-05-25  7:05   ` Aaron Tomlin
2023-05-15 18:00 ` [PATCH v8 02/13] this_cpu_cmpxchg: ARM64: switch this_cpu_cmpxchg to locked, add _local function Marcelo Tosatti
2023-05-15 18:00 ` [PATCH v8 03/13] this_cpu_cmpxchg: loongarch: " Marcelo Tosatti
2023-05-15 18:00 ` [PATCH v8 04/13] this_cpu_cmpxchg: S390: " Marcelo Tosatti
2023-05-15 18:00 ` [PATCH v8 05/13] this_cpu_cmpxchg: x86: " Marcelo Tosatti
2023-05-15 18:00 ` [PATCH v8 06/13] add this_cpu_cmpxchg_local and asm-generic definitions Marcelo Tosatti
2023-05-15 18:00 ` [PATCH v8 07/13] convert this_cpu_cmpxchg users to this_cpu_cmpxchg_local Marcelo Tosatti
2023-05-15 18:00 ` [PATCH v8 08/13] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
2023-05-15 18:00 ` [PATCH v8 09/13] vmstat: switch per-cpu vmstat counters to 32-bits Marcelo Tosatti
2023-05-15 18:00 ` [PATCH v8 10/13] mm/vmstat: use xchg in cpu_vm_stats_fold Marcelo Tosatti
2023-05-15 18:00 ` [PATCH v8 11/13] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
2023-05-15 18:00 ` [PATCH v8 12/13] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
2023-05-15 18:00 ` [PATCH v8 13/13] vmstat: add pcp remote node draining via cpu_vm_stats_fold Marcelo Tosatti
2023-05-16  8:09 ` [PATCH v8 00/13] fold per-CPU vmstats remotely Christoph Lameter
2023-05-16 18:02   ` Marcelo Tosatti
2023-05-24 12:51 ` Michal Hocko
2023-05-24 13:53   ` Marcelo Tosatti
2023-05-25  6:47     ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).