* [PATCH v9 0/5] Ensure quiet_vmstat() is called when the idle tick was stopped too
@ 2022-12-06 16:18 Marcelo Tosatti
  2022-12-06 16:18 ` [PATCH v9 1/5] mm/vmstat: Add CPU-specific variable to track a vmstat discrepancy Marcelo Tosatti
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2022-12-06 16:18 UTC (permalink / raw)
  To: atomlin, frederic
  Cc: cl, tglx, mingo, peterz, pauld, neelx, oleksandr, linux-kernel, linux-mm

This patch series addresses the following two problems:

    1. A customer provided evidence indicating that the idle
       tick was stopped, yet CPU-specific vmstat counters
       still remained populated.

       Thus one can only assume quiet_vmstat() was not
       invoked on return to the idle loop. If I understand
       correctly, this divergence might erroneously prevent
       a reclaim attempt by kswapd. If the number of
       zone-specific free pages is below the per-CPU drift
       value, then zone_page_state_snapshot() is used to
       compute a more accurate view of the aforementioned
       statistic (a simplified sketch of it follows this
       list). Thus any task blocked on the NUMA-node-specific
       pfmemalloc_wait queue will be unable to make
       significant progress via direct reclaim unless it is
       killed after being woken up by kswapd
       (see throttle_direct_reclaim()).

    2. With a SCHED_FIFO task that busy loops on a given CPU,
       and the kworker for that CPU running at SCHED_OTHER
       priority, queueing work to sync per-CPU vmstats will
       either cause that work never to execute, or stalld
       (i.e. the stall daemon) will boost the kworker's
       priority, which causes a latency violation.

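For reference, zone_page_state_snapshot() folds the unflushed per-CPU
differentials into the global counter at read time, roughly as follows
(a simplified sketch based on include/linux/vmstat.h, not the literal
kernel code):

	static unsigned long zone_page_state_snapshot(struct zone *zone,
						      enum zone_stat_item item)
	{
		long x = atomic_long_read(&zone->vm_stat[item]);
		int cpu;

		/* fold in the per-CPU deltas that were never flushed */
		for_each_online_cpu(cpu)
			x += per_cpu_ptr(zone->per_cpu_zonestats,
					 cpu)->vm_stat_diff[item];

		return x < 0 ? 0 : x;
	}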

As seen previously, the trivial test program (attached at the end of
this cover letter) was executed inside a KVM VM and used to determine
the impact of the proposed changes relative to vanilla. Firstly, the
mlock(2) and munlock(2) system calls were used solely to modify vmstat
item 'NR_MLOCK'. In another scenario, the nanosleep(2) system call was
used several times to suspend execution for a period of time, in order
to approximately measure the number of CPU cycles spent in the idle
code path. The following is an average count of CPU cycles across the
aforementioned system calls and the idle loop, respectively. I believe
the overhead is negligible:

				  Vanilla                 Modified

  Cycles per idle loop            151858                  153258  (+1.0%)
  Cycles per syscall              8461                    8690    (+2.6%)


Any feedback would be appreciated. Thanks.


Changes since v8 [1]:
- For nohz_full CPUs, manage per-CPU vmstat flushing from CPU context
   (Frederic Weisbecker)
 
Changes since v7 [2]:
 - Added trivial helpers for modification and testing
   (Andrew Morton)
 - Modified comment, since we now cancel any delayed
   work if the tick is stopped in quiet_vmstat()
 - Moved check to ensure vmstat differentials do not
   remain if the tick is stopped on exiting to user-mode
   into a separate patch
   (Frederic Weisbecker)

Changes since v6 [3]:
 - Clean vmstat_dirty before differential sync loop
 - Cancel pending work if tick stopped
 - Do not queue work to remote CPU if tick stopped

Changes since v5 [4]:

 - Introduced __tick_nohz_user_enter_prepare()
 - Switched to EXPORT_SYMBOL_GPL()

Changes since v4 [5]:

 - Moved vmstat_dirty specific changes into a separate patch
   (Marcelo Tosatti)

Changes since v3 [6]:

 - Used EXPORT_SYMBOL() on tick_nohz_user_enter_prepare()
 - Replaced need_update()
 - Introduced CPU-specific variable namely vmstat_dirty
   and mark_vmstat_dirty()

[1]: https://lore.kernel.org/linux-mm/20220924152227.819815-1-atomlin@redhat.com/
[2]: https://lore.kernel.org/lkml/20220817191346.287594886@redhat.com/
[3]: https://lore.kernel.org/linux-mm/20220808194820.676246-1-atomlin@redhat.com/
[4]: https://lore.kernel.org/lkml/20220801234258.134609-1-atomlin@redhat.com/
[5]: https://lore.kernel.org/lkml/20220621172207.1501641-1-atomlin@redhat.com/
[6]: https://lore.kernel.org/lkml/20220422193647.3808657-1-atomlin@redhat.com/


Aaron Tomlin (4):
  mm/vmstat: Add CPU-specific variable to track a vmstat discrepancy
  mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies
  tick/nohz_full: Ensure quiet_vmstat() is called on exit to user-mode
    when the idle tick is stopped
  tick/sched: Ensure quiet_vmstat() is called when the idle tick was
    stopped too

Marcelo Tosatti (1):
  mm/vmstat: Do not queue vmstat_update if tick is stopped

 include/linux/tick.h     |  5 ++-
 kernel/time/tick-sched.c | 18 ++++++++-
 mm/vmstat.c              | 80 +++++++++++++++++++++-------------------
 3 files changed, 62 insertions(+), 41 deletions(-)

-- 
2.37.1

--- test-vmstat-overhead.c ---

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>

typedef unsigned long long cycles_t;
typedef unsigned long long usecs_t;
typedef unsigned long long u64;

#ifdef __x86_64__
#define DECLARE_ARGS(val, low, high)    unsigned long low, high
#define EAX_EDX_VAL(val, low, high)     ((low) | ((u64)(high) << 32))
#define EAX_EDX_ARGS(val, low, high)    "a" (low), "d" (high)
#define EAX_EDX_RET(val, low, high)     "=a" (low), "=d" (high)
#else
#define DECLARE_ARGS(val, low, high)    unsigned long long val
#define EAX_EDX_VAL(val, low, high)     (val)
#define EAX_EDX_ARGS(val, low, high)    "A" (val)
#define EAX_EDX_RET(val, low, high)     "=A" (val)
#endif

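/*
 * Read the time-stamp counter; the leading cpuid acts as a serializing
 * barrier so earlier instructions cannot be reordered past the rdtsc.
 */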
static inline unsigned long long __rdtscll(void)
{
        DECLARE_ARGS(val, low, high);

        /* cpuid serializes the pipeline, but also clobbers ebx/ecx */
        asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high)
                     : : "ebx", "ecx");

        return EAX_EDX_VAL(val, low, high);
}

#define rdtscll(val) do { (val) = __rdtscll(); } while (0)

#define NRSYSCALLS 30000
#define NRSLEEPS   100000

int main(int argc, char *argv[])
{
	cycles_t a, b, cycles;
	int i, syscall = 0;
	void *page = malloc(4096);

	if (mlock(page, 4096))
		perror("mlock");
	if (munlock(page, 4096))
		perror("munlock");

	if (argc != 2) {
		printf("usage: %s {idle,syscall}\n", argv[0]);
		exit(1);
	}

        rdtscll(a);

	if (strncmp("idle", argv[1], 4) == 0)
		syscall = 0;
	else if (strncmp("syscall", argv[1], 7) == 0)
		syscall = 1;
	else {
		printf("usage: %s {idle,syscall}\n", argv[0]);
		exit(1);
	}
	
	if (syscall == 1) {
        	for (i = 0; i < NRSYSCALLS; i++) {
			if (mlock(page, 4096))
				perror("mlock");
			if (munlock(page, 4096))
				perror("munlock");
		}
	} else {
        	for (i = 0; i < NRSLEEPS; i++)
		 	usleep(10);
	}

        rdtscll(b);

        cycles = b - a;

	if (syscall == 1)
		/* each loop iteration issues two system calls */
		printf("cycles per syscall: %llu\n", cycles / (NRSYSCALLS * 2));
	else
		printf("cycles per idle loop: %llu\n", cycles / NRSLEEPS);

	return 0;
}






* [PATCH v9 1/5] mm/vmstat: Add CPU-specific variable to track a vmstat discrepancy
  2022-12-06 16:18 [PATCH v9 0/5] Ensure quiet_vmstat() is called when the idle tick was stopped too Marcelo Tosatti
@ 2022-12-06 16:18 ` Marcelo Tosatti
  2022-12-14 13:06   ` Frederic Weisbecker
  2022-12-06 16:18 ` [PATCH v9 2/5] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies Marcelo Tosatti
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Marcelo Tosatti @ 2022-12-06 16:18 UTC (permalink / raw)
  To: atomlin, frederic
  Cc: cl, tglx, mingo, peterz, pauld, neelx, oleksandr, linux-kernel, linux-mm

From: Aaron Tomlin <atomlin@redhat.com>

Introduce a CPU-specific variable namely vmstat_dirty to indicate
if a vmstat imbalance is present for a given CPU. Therefore, at
the appropriate time, we can fold all the remaining differentials.
This patch also provides trivial helpers for modification and testing.

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
---
 mm/vmstat.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -194,6 +194,22 @@ void fold_vm_numa_events(void)
 #endif
 
 #ifdef CONFIG_SMP
+static DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
+
+static inline void vmstat_mark_dirty(void)
+{
+	this_cpu_write(vmstat_dirty, true);
+}
+
+static inline void vmstat_clear_dirty(void)
+{
+	this_cpu_write(vmstat_dirty, false);
+}
+
+static inline bool is_vmstat_dirty(void)
+{
+	return this_cpu_read(vmstat_dirty);
+}
 
 int calculate_pressure_threshold(struct zone *zone)
 {




* [PATCH v9 2/5] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies
  2022-12-06 16:18 [PATCH v9 0/5] Ensure quiet_vmstat() is called when the idle tick was stopped too Marcelo Tosatti
  2022-12-06 16:18 ` [PATCH v9 1/5] mm/vmstat: Add CPU-specific variable to track a vmstat discrepancy Marcelo Tosatti
@ 2022-12-06 16:18 ` Marcelo Tosatti
  2022-12-06 16:18 ` [PATCH v9 3/5] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full Marcelo Tosatti
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2022-12-06 16:18 UTC (permalink / raw)
  To: atomlin, frederic
  Cc: cl, tglx, mingo, peterz, pauld, neelx, oleksandr, linux-kernel, linux-mm

From: Aaron Tomlin <atomlin@redhat.com>

Use the previously introduced CPU-specific variable, vmstat_dirty, to
indicate whether a vmstat differential (or imbalance) is present for a
given CPU, so that vmstat processing can be initiated at the appropriate
time. The hope is that this approach is "cheaper" than need_update().
The idea is based on Marcelo's patch [1].

[1]: https://lore.kernel.org/lkml/20220204173554.763888172@fedora.localdomain/
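
To illustrate why this should be cheaper (simplified fragments that
mirror the hunks below, not standalone code): need_update() had to scan
every populated zone's per-CPU diff arrays just to learn whether
anything changed,

	for_each_populated_zone(zone) {
		struct per_cpu_zonestat *pzstats =
			per_cpu_ptr(zone->per_cpu_zonestats, cpu);

		if (memchr_inv(pzstats->vm_stat_diff, 0,
			       sizeof(pzstats->vm_stat_diff)))
			return true;
		/* ... and likewise for the node stat diffs ... */
	}

whereas with vmstat_dirty the same question is a single per-CPU read:

	return this_cpu_read(vmstat_dirty);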

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
---
 mm/vmstat.c |   48 ++++++++++++++----------------------------------
 1 file changed, 14 insertions(+), 34 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -381,6 +381,7 @@ void __mod_zone_page_state(struct zone *
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	vmstat_mark_dirty();
 
 	preempt_enable_nested();
 }
@@ -417,6 +418,7 @@ void __mod_node_page_state(struct pglist
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	vmstat_mark_dirty();
 
 	preempt_enable_nested();
 }
@@ -606,6 +608,7 @@ static inline void mod_zone_state(struct
 
 	if (z)
 		zone_page_state_add(z, zone, item);
+	vmstat_mark_dirty();
 }
 
 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
@@ -674,6 +677,7 @@ static inline void mod_node_state(struct
 
 	if (z)
 		node_page_state_add(z, pgdat, item);
+	vmstat_mark_dirty();
 }
 
 void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
@@ -828,6 +832,14 @@ static int refresh_cpu_vm_stats(bool do_
 	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
+	/*
+	 * Clear vmstat_dirty before clearing the percpu vmstats.
+	 * If interrupts are enabled, it is possible that an interrupt
+	 * or another task modifies a percpu vmstat, which will
+	 * set vmstat_dirty to true.
+	 */
+	vmstat_clear_dirty();
+
 	for_each_populated_zone(zone) {
 		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
 #ifdef CONFIG_NUMA
@@ -1957,35 +1969,6 @@ static void vmstat_update(struct work_st
 }
 
 /*
- * Check if the diffs for a certain cpu indicate that
- * an update is needed.
- */
-static bool need_update(int cpu)
-{
-	pg_data_t *last_pgdat = NULL;
-	struct zone *zone;
-
-	for_each_populated_zone(zone) {
-		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
-		struct per_cpu_nodestat *n;
-
-		/*
-		 * The fast way of checking if there are any vmstat diffs.
-		 */
-		if (memchr_inv(pzstats->vm_stat_diff, 0, sizeof(pzstats->vm_stat_diff)))
-			return true;
-
-		if (last_pgdat == zone->zone_pgdat)
-			continue;
-		last_pgdat = zone->zone_pgdat;
-		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
-		if (memchr_inv(n->vm_node_stat_diff, 0, sizeof(n->vm_node_stat_diff)))
-			return true;
-	}
-	return false;
-}
-
-/*
  * Switch off vmstat processing and then fold all the remaining differentials
  * until the diffs stay at zero. The function is used by NOHZ and can only be
  * invoked when tick processing is not active.
@@ -1995,10 +1978,7 @@ void quiet_vmstat(void)
 	if (system_state != SYSTEM_RUNNING)
 		return;
 
-	if (!delayed_work_pending(this_cpu_ptr(&vmstat_work)))
-		return;
-
-	if (!need_update(smp_processor_id()))
+	if (!is_vmstat_dirty())
 		return;
 
 	/*
@@ -2029,7 +2009,7 @@ static void vmstat_shepherd(struct work_
 	for_each_online_cpu(cpu) {
 		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
 
-		if (!delayed_work_pending(dw) && need_update(cpu))
+		if (!delayed_work_pending(dw) && per_cpu(vmstat_dirty, cpu))
 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
 
 		cond_resched();




* [PATCH v9 3/5] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full
  2022-12-06 16:18 [PATCH v9 0/5] Ensure quiet_vmstat() is called when the idle tick was stopped too Marcelo Tosatti
  2022-12-06 16:18 ` [PATCH v9 1/5] mm/vmstat: Add CPU-specific variable to track a vmstat discrepancy Marcelo Tosatti
  2022-12-06 16:18 ` [PATCH v9 2/5] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies Marcelo Tosatti
@ 2022-12-06 16:18 ` Marcelo Tosatti
  2022-12-14 13:18   ` Frederic Weisbecker
  2022-12-14 13:33   ` Frederic Weisbecker
  2022-12-06 16:18 ` [PATCH v9 4/5] tick/nohz_full: Ensure quiet_vmstat() is called on exit to user-mode when the idle tick is stopped Marcelo Tosatti
  2022-12-06 16:18 ` [PATCH v9 5/5] tick/sched: Ensure quiet_vmstat() is called when the idle tick was stopped too Marcelo Tosatti
  4 siblings, 2 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2022-12-06 16:18 UTC (permalink / raw)
  To: atomlin, frederic
  Cc: cl, tglx, mingo, peterz, pauld, neelx, oleksandr, linux-kernel,
	linux-mm, Marcelo Tosatti

For nohz full CPUs, manage per-CPU stat syncing from CPU context:
start delayed work when marking per-CPU vmstat dirty.

When returning to userspace, fold the stats and cancel the delayed work.

When entering idle, only fold the stats.
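
A rough sketch of the intended flow for a nohz_full CPU (pseudocode
only; the real changes are in the diff below):

	vmstat_mark_dirty():
		if this CPU is nohz_full and vmstat_dirty was false:
			queue vmstat_work on this CPU, delayed by
			sysctl_stat_interval	/* fallback flush */
		vmstat_dirty = true

	quiet_vmstat(user):
		refresh_cpu_vm_stats(false)	/* fold differentials */
		if (user)			/* returning to userspace */
			cancel this CPU's pending vmstat_work

	vmstat_shepherd():
		skip nohz_full CPUs; they manage their own flushing
		as above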

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
 include/linux/vmstat.h   |    4 ++--
 kernel/time/tick-sched.c |    2 +-
 mm/vmstat.c              |   41 ++++++++++++++++++++++++++++++++---------
 3 files changed, 35 insertions(+), 12 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -28,6 +28,7 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
+#include <linux/tick.h>
 
 #include "internal.h"
 
@@ -195,9 +196,24 @@ void fold_vm_numa_events(void)
 
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
+static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
+int sysctl_stat_interval __read_mostly = HZ;
 
 static inline void vmstat_mark_dirty(void)
 {
+	int cpu = smp_processor_id();
+
+	if (tick_nohz_full_cpu(cpu) && !this_cpu_read(vmstat_dirty)) {
+		struct delayed_work *dw;
+
+		dw = &per_cpu(vmstat_work, cpu);
+		if (!delayed_work_pending(dw)) {
+			unsigned long delay;
+
+			delay = round_jiffies_relative(sysctl_stat_interval);
+			queue_delayed_work_on(cpu, mm_percpu_wq, dw, delay);
+		}
+	}
 	this_cpu_write(vmstat_dirty, true);
 }
 
@@ -1886,9 +1902,6 @@ static const struct seq_operations vmsta
 #endif /* CONFIG_PROC_FS */
 
 #ifdef CONFIG_SMP
-static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
-int sysctl_stat_interval __read_mostly = HZ;
-
 #ifdef CONFIG_PROC_FS
 static void refresh_vm_stats(struct work_struct *work)
 {
@@ -1973,21 +1986,27 @@ static void vmstat_update(struct work_st
  * until the diffs stay at zero. The function is used by NOHZ and can only be
  * invoked when tick processing is not active.
  */
-void quiet_vmstat(void)
+void quiet_vmstat(bool user)
 {
+	struct delayed_work *dw;
+
 	if (system_state != SYSTEM_RUNNING)
 		return;
 
 	if (!is_vmstat_dirty())
 		return;
 
+	refresh_cpu_vm_stats(false);
+
+	if (!user)
+		return;
 	/*
-	 * Just refresh counters and do not care about the pending delayed
-	 * vmstat_update. It doesn't fire that often to matter and canceling
-	 * it would be too expensive from this path.
-	 * vmstat_shepherd will take care about that for us.
+	 * If the tick is stopped, cancel any delayed work to avoid
+	 * interruptions to this CPU in the future.
 	 */
-	refresh_cpu_vm_stats(false);
+	dw = &per_cpu(vmstat_work, smp_processor_id());
+	if (delayed_work_pending(dw))
+		cancel_delayed_work(dw);
 }
 
 /*
@@ -2009,6 +2028,10 @@ static void vmstat_shepherd(struct work_
 	for_each_online_cpu(cpu) {
 		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
 
+		/* NOHZ full CPUs manage their own vmstat flushing */
+		if (tick_nohz_full_cpu(smp_processor_id()))
+			continue;
+
 		if (!delayed_work_pending(dw) && per_cpu(vmstat_dirty, cpu))
 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
 
Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -290,7 +290,7 @@ extern void dec_zone_state(struct zone *
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
-void quiet_vmstat(void);
+void quiet_vmstat(bool user);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -403,7 +403,7 @@ static inline void __dec_node_page_state
 
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
-static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat(bool user) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_zonestat *pzstats) { }
Index: linux-2.6/kernel/time/tick-sched.c
===================================================================
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -911,7 +911,7 @@ static void tick_nohz_stop_tick(struct t
 	 */
 	if (!ts->tick_stopped) {
 		calc_load_nohz_start();
-		quiet_vmstat();
+		quiet_vmstat(false);
 
 		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 		ts->tick_stopped = 1;




* [PATCH v9 4/5] tick/nohz_full: Ensure quiet_vmstat() is called on exit to user-mode when the idle tick is stopped
  2022-12-06 16:18 [PATCH v9 0/5] Ensure quiet_vmstat() is called when the idle tick was stopped too Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2022-12-06 16:18 ` [PATCH v9 3/5] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full Marcelo Tosatti
@ 2022-12-06 16:18 ` Marcelo Tosatti
  2022-12-14 12:49   ` Frederic Weisbecker
  2022-12-14 13:00   ` Frederic Weisbecker
  2022-12-06 16:18 ` [PATCH v9 5/5] tick/sched: Ensure quiet_vmstat() is called when the idle tick was stopped too Marcelo Tosatti
  4 siblings, 2 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2022-12-06 16:18 UTC (permalink / raw)
  To: atomlin, frederic
  Cc: cl, tglx, mingo, peterz, pauld, neelx, oleksandr, linux-kernel, linux-mm

From: Aaron Tomlin <atomlin@redhat.com>

This patch ensures CPU-specific vmstat differentials do not remain
when the scheduling-tick is stopped and before exiting to user-mode
in the context of nohz_full only.
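
For context, the generic entry code already invokes
tick_nohz_user_enter_prepare() shortly before the final return to user
space, roughly as follows (simplified; the exact call site may differ
between kernel versions):

	/* kernel/entry/common.c */
	static void exit_to_user_mode_prepare(struct pt_regs *regs)
	{
		...
		/* Flush pending rcuog wakeup before the last need_resched() check */
		tick_nohz_user_enter_prepare();
		...
	}

Routing that hook to __tick_nohz_user_enter_prepare(), as below, gives
a natural place to fold the vmstat differentials when the tick is
stopped.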

A trivial test program was used to determine the impact of the proposed
changes relative to vanilla. The mlock(2) and munlock(2) system calls
were used solely to modify vmstat item 'NR_MLOCK'. The following is an
average count of CPU cycles across the aforementioned system calls:

				  Vanilla                 Modified

  Cycles per syscall              8461                    8690    (+2.6%)

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
---
 include/linux/tick.h     |    5 +++--
 kernel/time/tick-sched.c |   15 +++++++++++++++
 2 files changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/tick.h
===================================================================
--- linux-2.6.orig/include/linux/tick.h
+++ linux-2.6/include/linux/tick.h
@@ -11,7 +11,6 @@
 #include <linux/context_tracking_state.h>
 #include <linux/cpumask.h>
 #include <linux/sched.h>
-#include <linux/rcupdate.h>
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 extern void __init tick_init(void);
@@ -272,6 +271,7 @@ static inline void tick_dep_clear_signal
 
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
+void __tick_nohz_user_enter_prepare(void);
 extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
@@ -296,6 +296,7 @@ static inline void tick_dep_clear_signal
 
 static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void __tick_nohz_task_switch(void) { }
+static inline void __tick_nohz_user_enter_prepare(void) { }
 static inline void tick_nohz_full_setup(cpumask_var_t cpumask) { }
 #endif
 
@@ -308,7 +309,7 @@ static inline void tick_nohz_task_switch
 static inline void tick_nohz_user_enter_prepare(void)
 {
 	if (tick_nohz_full_cpu(smp_processor_id()))
-		rcu_nocb_flush_deferred_wakeup();
+		__tick_nohz_user_enter_prepare();
 }
 
 #endif
Index: linux-2.6/kernel/time/tick-sched.c
===================================================================
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -26,6 +26,7 @@
 #include <linux/posix-timers.h>
 #include <linux/context_tracking.h>
 #include <linux/mm.h>
+#include <linux/rcupdate.h>
 
 #include <asm/irq_regs.h>
 
@@ -519,6 +520,20 @@ void __tick_nohz_task_switch(void)
 	}
 }
 
+void __tick_nohz_user_enter_prepare(void)
+{
+	struct tick_sched *ts;
+
+	if (tick_nohz_full_cpu(smp_processor_id())) {
+		ts = this_cpu_ptr(&tick_cpu_sched);
+
+		if (ts->tick_stopped)
+			quiet_vmstat(true);
+		rcu_nocb_flush_deferred_wakeup();
+	}
+}
+EXPORT_SYMBOL_GPL(__tick_nohz_user_enter_prepare);
+
 /* Get the boot-time nohz CPU list from the kernel parameters. */
 void __init tick_nohz_full_setup(cpumask_var_t cpumask)
 {




* [PATCH v9 5/5] tick/sched: Ensure quiet_vmstat() is called when the idle tick was stopped too
  2022-12-06 16:18 [PATCH v9 0/5] Ensure quiet_vmstat() is called when the idle tick was stopped too Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2022-12-06 16:18 ` [PATCH v9 4/5] tick/nohz_full: Ensure quiet_vmstat() is called on exit to user-mode when the idle tick is stopped Marcelo Tosatti
@ 2022-12-06 16:18 ` Marcelo Tosatti
  4 siblings, 0 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2022-12-06 16:18 UTC (permalink / raw)
  To: atomlin, frederic
  Cc: cl, tglx, mingo, peterz, pauld, neelx, oleksandr, linux-kernel, linux-mm

From: Aaron Tomlin <atomlin@redhat.com>

In the context of the idle task on an adaptive-tick mode (or nohz_full)
CPU, quiet_vmstat() can be called before stopping the idle tick, when
entering an idle state and on exit from it. In particular, in the latter
case, when the idle task is required to reschedule, the idle tick can
remain stopped and the timer expiration time can stay endless, i.e.,
KTIME_MAX. Now, before a nohz_full CPU enters an idle state, the
CPU-specific vmstat counters should be processed to ensure the
respective values have been reset and folded into the zone-specific
'vm_stat[]'. That being said, it can only occur when the idle tick was
previously stopped and reprogramming of the timer is not required.

A customer provided evidence indicating that the idle tick was stopped,
yet CPU-specific vmstat counters still remained populated. Thus one can
only assume quiet_vmstat() was not invoked on return to the idle loop.

If I understand correctly, this divergence might erroneously prevent a
reclaim attempt by kswapd. If the number of zone-specific free pages is
below the per-CPU drift value, then zone_page_state_snapshot() is used
to compute a more accurate view of the aforementioned statistic. Thus
any task blocked on the NUMA-node-specific pfmemalloc_wait queue will
be unable to make significant progress via direct reclaim unless it is
killed after being woken up by kswapd (see throttle_direct_reclaim()).

Consider the following theoretical scenario:

 - Note: CPU X is part of 'tick_nohz_full_mask'

    1.      CPU Y migrated running task A to CPU X that
	    was in an idle state i.e. waiting for an IRQ;
	    marked the current task on CPU X to need/or
	    require a reschedule i.e., set TIF_NEED_RESCHED
	    and invoked a reschedule IPI to CPU X
	    (see sched_move_task())

    2.      CPU X acknowledged the reschedule IPI. Generic
	    idle loop code noticed the TIF_NEED_RESCHED flag
	    against the idle task, attempted to exit the
	    loop and called the main scheduler function i.e.
	    __schedule().

	    Since the idle tick was previously stopped, no
	    scheduling-clock tick would occur.
	    So, no deferred timers would be handled.

    3.      After the transition to kernel execution, Task A,
	    running on CPU X, indirectly released a few pages
	    (e.g. see __free_one_page()); CPU X's
	    'vm_stat_diff[NR_FREE_PAGES]' was updated and the
	    zone-specific 'vm_stat[]' update was deferred as
	    per the CPU-specific stat threshold

    4.      Task A does invoke exit(2) and the kernel does
	    remove the task from the run-queue; the idle task
	    was selected to execute next since there are no
	    other runnable tasks assigned to the given CPU
	    (see pick_next_task() and pick_next_task_idle())

    5.      On return to the idle loop, since the idle tick
	    was already stopped and can remain so (see [1]
	    below), e.g. there are no pending soft IRQs, no
	    attempt is made to zero and fold CPU X's vmstat
	    counters, since reprogramming of the
	    scheduling-clock tick is not required (see [2])

		  ...
		    do_idle
		    {

		      __current_set_polling()
		      tick_nohz_idle_enter()

		      while (!need_resched()) {

			local_irq_disable()

			...

			/* No polling or broadcast event */
			cpuidle_idle_call()
			{

			  if (cpuidle_not_available(drv, dev)) {
			    tick_nohz_idle_stop_tick()
			      __tick_nohz_idle_stop_tick(this_cpu_ptr(&tick_cpu_sched))
			      {
				int cpu = smp_processor_id()

				if (ts->timer_expires_base)
				  expires = ts->timer_expires
				else if (can_stop_idle_tick(cpu, ts))
	      (1) ------->        expires = tick_nohz_next_event(ts, cpu)
				else
				  return

				ts->idle_calls++

				if (expires > 0LL) {

				  tick_nohz_stop_tick(ts, cpu)
				  {

				    if (ts->tick_stopped && (expires == ts->next_tick)) {
	      (2) ------->            if (tick == KTIME_MAX || ts->next_tick ==
					hrtimer_get_expires(&ts->sched_timer))
					return
				    }
				    ...
				  }

So, the idea of this patch is to ensure refresh_cpu_vm_stats(false) is
called, when it is appropriate, on return to the idle loop if the idle
tick was previously stopped too.

A trivial test program was used to determine the impact of the proposed
changes relative to vanilla. The nanosleep(2) system call was used
several times to suspend execution for a period of time, in order to
approximately measure the number of CPU cycles spent in the idle code
path. The following is an average count of CPU cycles:

				  Vanilla                 Modified

  Cycles per idle loop            151858                  153258  (+1.0%)

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
---
 kernel/time/tick-sched.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/time/tick-sched.c
===================================================================
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -926,13 +926,14 @@ static void tick_nohz_stop_tick(struct t
 	 */
 	if (!ts->tick_stopped) {
 		calc_load_nohz_start();
-		quiet_vmstat(false);
 
 		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 		ts->tick_stopped = 1;
 		trace_tick_stop(1, TICK_DEP_MASK_NONE);
 	}
 
+	/* Attempt to fold when the idle tick is stopped or not */
+	quiet_vmstat(false);
 	ts->next_tick = tick;
 
 	/*




* Re: [PATCH v9 4/5] tick/nohz_full: Ensure quiet_vmstat() is called on exit to user-mode when the idle tick is stopped
  2022-12-06 16:18 ` [PATCH v9 4/5] tick/nohz_full: Ensure quiet_vmstat() is called on exit to user-mode when the idle tick is stopped Marcelo Tosatti
@ 2022-12-14 12:49   ` Frederic Weisbecker
  2022-12-14 13:00   ` Frederic Weisbecker
  1 sibling, 0 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2022-12-14 12:49 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: atomlin, cl, tglx, mingo, peterz, pauld, neelx, oleksandr,
	linux-kernel, linux-mm

On Tue, Dec 06, 2022 at 01:18:30PM -0300, Marcelo Tosatti wrote:
> From: Aaron Tomlin <atomlin@redhat.com>
> 
> This patch ensures CPU-specific vmstat differentials do not remain
> when the scheduling-tick is stopped and before exiting to user-mode
> in the context of nohz_full only.
> 
> A trivial test program was used to determine the impact of the proposed
> changes and under vanilla. The mlock(2) and munlock(2) system calls was
> used solely to modify vmstat item 'NR_MLOCK'. The following is an average
> count of CPU-cycles across the aforementioned system calls:
> 
> 				  Vanilla                 Modified
> 
>   Cycles per syscall              8461                    8690    (+2.6%)
> 
> Signed-off-by: Aaron Tomlin <atomlin@redhat.com>

This is missing your Signed-off-by:

Also, can you make this (and also the conditional vmstat_shepherd ignoring of
nohz_full CPUs, and the in-place local arming) a Kconfig option perhaps?
Something like CONFIG_FLUSH_WORK_ON_RESUME_USER (depending on NO_HZ_FULL). I'm using
"WORK" instead of "VMSTAT" so that we can even add more stuff there later.

This way I'll stop worrying about the potential HPC users who may not care about
the occasional interrupt and prefer to have reasonably fast syscalls.

Then after some time if nobody complains, we can remove the Kconfig entry.

Thanks.

> ---
>  include/linux/tick.h     |    5 +++--
>  kernel/time/tick-sched.c |   15 +++++++++++++++
>  2 files changed, 18 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/include/linux/tick.h
> ===================================================================
> --- linux-2.6.orig/include/linux/tick.h
> +++ linux-2.6/include/linux/tick.h
> @@ -11,7 +11,6 @@
>  #include <linux/context_tracking_state.h>
>  #include <linux/cpumask.h>
>  #include <linux/sched.h>
> -#include <linux/rcupdate.h>
>  
>  #ifdef CONFIG_GENERIC_CLOCKEVENTS
>  extern void __init tick_init(void);
> @@ -272,6 +271,7 @@ static inline void tick_dep_clear_signal
>  
>  extern void tick_nohz_full_kick_cpu(int cpu);
>  extern void __tick_nohz_task_switch(void);
> +void __tick_nohz_user_enter_prepare(void);
>  extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
>  #else
>  static inline bool tick_nohz_full_enabled(void) { return false; }
> @@ -296,6 +296,7 @@ static inline void tick_dep_clear_signal
>  
>  static inline void tick_nohz_full_kick_cpu(int cpu) { }
>  static inline void __tick_nohz_task_switch(void) { }
> +static inline void __tick_nohz_user_enter_prepare(void) { }
>  static inline void tick_nohz_full_setup(cpumask_var_t cpumask) { }
>  #endif
>  
> @@ -308,7 +309,7 @@ static inline void tick_nohz_task_switch
>  static inline void tick_nohz_user_enter_prepare(void)
>  {
>  	if (tick_nohz_full_cpu(smp_processor_id()))
> -		rcu_nocb_flush_deferred_wakeup();
> +		__tick_nohz_user_enter_prepare();
>  }
>  
>  #endif
> Index: linux-2.6/kernel/time/tick-sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/time/tick-sched.c
> +++ linux-2.6/kernel/time/tick-sched.c
> @@ -26,6 +26,7 @@
>  #include <linux/posix-timers.h>
>  #include <linux/context_tracking.h>
>  #include <linux/mm.h>
> +#include <linux/rcupdate.h>
>  
>  #include <asm/irq_regs.h>
>  
> @@ -519,6 +520,20 @@ void __tick_nohz_task_switch(void)
>  	}
>  }
>  
> +void __tick_nohz_user_enter_prepare(void)
> +{
> +	struct tick_sched *ts;
> +
> +	if (tick_nohz_full_cpu(smp_processor_id())) {
> +		ts = this_cpu_ptr(&tick_cpu_sched);
> +
> +		if (ts->tick_stopped)
> +			quiet_vmstat(true);
> +		rcu_nocb_flush_deferred_wakeup();
> +	}
> +}
> +EXPORT_SYMBOL_GPL(__tick_nohz_user_enter_prepare);
> +
>  /* Get the boot-time nohz CPU list from the kernel parameters. */
>  void __init tick_nohz_full_setup(cpumask_var_t cpumask)
>  {
> 
> 


* Re: [PATCH v9 4/5] tick/nohz_full: Ensure quiet_vmstat() is called on exit to user-mode when the idle tick is stopped
  2022-12-06 16:18 ` [PATCH v9 4/5] tick/nohz_full: Ensure quiet_vmstat() is called on exit to user-mode when the idle tick is stopped Marcelo Tosatti
  2022-12-14 12:49   ` Frederic Weisbecker
@ 2022-12-14 13:00   ` Frederic Weisbecker
  1 sibling, 0 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2022-12-14 13:00 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: atomlin, cl, tglx, mingo, peterz, pauld, neelx, oleksandr,
	linux-kernel, linux-mm

On Tue, Dec 06, 2022 at 01:18:30PM -0300, Marcelo Tosatti wrote:
> From: Aaron Tomlin <atomlin@redhat.com>
> 
> This patch ensures CPU-specific vmstat differentials do not remain
> when the scheduling-tick is stopped and before exiting to user-mode
> in the context of nohz_full only.

Please also don't forget to mention why this change is needed.

Thanks.


* Re: [PATCH v9 1/5] mm/vmstat: Add CPU-specific variable to track a vmstat discrepancy
  2022-12-06 16:18 ` [PATCH v9 1/5] mm/vmstat: Add CPU-specific variable to track a vmstat discrepancy Marcelo Tosatti
@ 2022-12-14 13:06   ` Frederic Weisbecker
  0 siblings, 0 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2022-12-14 13:06 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: atomlin, cl, tglx, mingo, peterz, pauld, neelx, oleksandr,
	linux-kernel, linux-mm

On Tue, Dec 06, 2022 at 01:18:27PM -0300, Marcelo Tosatti wrote:
> From: Aaron Tomlin <atomlin@redhat.com>
> 
> Introduce a CPU-specific variable namely vmstat_dirty to indicate
> if a vmstat imbalance is present for a given CPU. Therefore, at
> the appropriate time, we can fold all the remaining differentials.
> This patch also provides trivial helpers for modification and testing.
> 
> Signed-off-by: Aaron Tomlin <atomlin@redhat.com>

Your SOB is also missing here and on other patches.

Thanks.

> ---
>  mm/vmstat.c |   16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c
> +++ linux-2.6/mm/vmstat.c
> @@ -194,6 +194,22 @@ void fold_vm_numa_events(void)
>  #endif
>  
>  #ifdef CONFIG_SMP
> +static DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
> +
> +static inline void vmstat_mark_dirty(void)
> +{
> +	this_cpu_write(vmstat_dirty, true);
> +}
> +
> +static inline void vmstat_clear_dirty(void)
> +{
> +	this_cpu_write(vmstat_dirty, false);
> +}
> +
> +static inline bool is_vmstat_dirty(void)
> +{
> +	return this_cpu_read(vmstat_dirty);
> +}
>  
>  int calculate_pressure_threshold(struct zone *zone)
>  {
> 
> 


* Re: [PATCH v9 3/5] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full
  2022-12-06 16:18 ` [PATCH v9 3/5] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full Marcelo Tosatti
@ 2022-12-14 13:18   ` Frederic Weisbecker
  2022-12-14 13:33   ` Frederic Weisbecker
  1 sibling, 0 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2022-12-14 13:18 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: atomlin, cl, tglx, mingo, peterz, pauld, neelx, oleksandr,
	linux-kernel, linux-mm

On Tue, Dec 06, 2022 at 01:18:29PM -0300, Marcelo Tosatti wrote:
> For nohz full CPUs, manage per-CPU stat syncing from CPU context:
> start delayed work when marking per-CPU vmstat dirty.
> 
> When returning to userspace, fold the stats and cancel the delayed work.
> 
> When entering idle, only fold the stats.

The changelog still misses the reason behind the changes.

> @@ -195,9 +196,24 @@ void fold_vm_numa_events(void)
>  
>  #ifdef CONFIG_SMP
>  static DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
> +static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
> +int sysctl_stat_interval __read_mostly = HZ;
>  
>  static inline void vmstat_mark_dirty(void)
>  {
> +	int cpu = smp_processor_id();
> +
> +	if (tick_nohz_full_cpu(cpu) && !this_cpu_read(vmstat_dirty)) {
> +		struct delayed_work *dw;
> +
> +		dw = &per_cpu(vmstat_work, cpu);

this_cpu_ptr()

> +		if (!delayed_work_pending(dw)) {
> +			unsigned long delay;
> +
> +			delay = round_jiffies_relative(sysctl_stat_interval);
> +			queue_delayed_work_on(cpu, mm_percpu_wq, dw, delay);
> +		}
> +	}
>  	this_cpu_write(vmstat_dirty, true);
>  }
>  
> @@ -1973,21 +1986,27 @@ static void vmstat_update(struct work_st
>   * until the diffs stay at zero. The function is used by NOHZ and can only be
>   * invoked when tick processing is not active.
>   */
> -void quiet_vmstat(void)
> +void quiet_vmstat(bool user)
>  {
> +	struct delayed_work *dw;
> +
>  	if (system_state != SYSTEM_RUNNING)
>  		return;
>  
>  	if (!is_vmstat_dirty())
>  		return;
>  
> +	refresh_cpu_vm_stats(false);
> +
> +	if (!user)
> +		return;
>  	/*
> -	 * Just refresh counters and do not care about the pending delayed
> -	 * vmstat_update. It doesn't fire that often to matter and canceling
> -	 * it would be too expensive from this path.
> -	 * vmstat_shepherd will take care about that for us.
> +	 * If the tick is stopped, cancel any delayed work to avoid
> +	 * interruptions to this CPU in the future.
>  	 */
> -	refresh_cpu_vm_stats(false);
> +	dw = &per_cpu(vmstat_work, smp_processor_id());

this_cpu_ptr()

Thanks.

> +	if (delayed_work_pending(dw))
> +		cancel_delayed_work(dw);
>  }
>  
>  /*


* Re: [PATCH v9 3/5] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full
  2022-12-06 16:18 ` [PATCH v9 3/5] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full Marcelo Tosatti
  2022-12-14 13:18   ` Frederic Weisbecker
@ 2022-12-14 13:33   ` Frederic Weisbecker
  2022-12-16 16:16     ` Marcelo Tosatti
  1 sibling, 1 reply; 13+ messages in thread
From: Frederic Weisbecker @ 2022-12-14 13:33 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: atomlin, cl, tglx, mingo, peterz, pauld, neelx, oleksandr,
	linux-kernel, linux-mm

On Tue, Dec 06, 2022 at 01:18:29PM -0300, Marcelo Tosatti wrote:
>  static inline void vmstat_mark_dirty(void)
>  {
> +	int cpu = smp_processor_id();
> +
> +	if (tick_nohz_full_cpu(cpu) && !this_cpu_read(vmstat_dirty)) {
> +		struct delayed_work *dw;
> +
> +		dw = &per_cpu(vmstat_work, cpu);
> +		if (!delayed_work_pending(dw)) {
> +			unsigned long delay;
> +
> +			delay = round_jiffies_relative(sysctl_stat_interval);
> +			queue_delayed_work_on(cpu, mm_percpu_wq, dw, delay);

Currently the vmstat_work is flushed on cpu_hotplug (CPUHP_AP_ONLINE_DYN).
vmstat_shepherd makes sure to not rearm it afterward. But now it looks
possible for the above to make that mistake?

> +		}
> +	}
>  	this_cpu_write(vmstat_dirty, true);
>  }
> @@ -2009,6 +2028,10 @@ static void vmstat_shepherd(struct work_
>  	for_each_online_cpu(cpu) {
>  		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
>  
> +		/* NOHZ full CPUs manage their own vmstat flushing */
> +		if (tick_nohz_full_cpu(smp_processor_id()))

It should be the remote CPU instead of the current one.

Thanks.

> +			continue;
> +
>  		if (!delayed_work_pending(dw) && per_cpu(vmstat_dirty, cpu))
>  			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>  


* Re: [PATCH v9 3/5] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full
  2022-12-14 13:33   ` Frederic Weisbecker
@ 2022-12-16 16:16     ` Marcelo Tosatti
  2022-12-16 22:47       ` Frederic Weisbecker
  0 siblings, 1 reply; 13+ messages in thread
From: Marcelo Tosatti @ 2022-12-16 16:16 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: atomlin, cl, tglx, mingo, peterz, pauld, neelx, oleksandr,
	linux-kernel, linux-mm

On Wed, Dec 14, 2022 at 02:33:02PM +0100, Frederic Weisbecker wrote:
> On Tue, Dec 06, 2022 at 01:18:29PM -0300, Marcelo Tosatti wrote:
> >  static inline void vmstat_mark_dirty(void)
> >  {
> > +	int cpu = smp_processor_id();
> > +
> > +	if (tick_nohz_full_cpu(cpu) && !this_cpu_read(vmstat_dirty)) {
> > +		struct delayed_work *dw;
> > +
> > +		dw = &per_cpu(vmstat_work, cpu);
> > +		if (!delayed_work_pending(dw)) {
> > +			unsigned long delay;
> > +
> > +			delay = round_jiffies_relative(sysctl_stat_interval);
> > +			queue_delayed_work_on(cpu, mm_percpu_wq, dw, delay);
> 
> Currently the vmstat_work is flushed on cpu_hotplug (CPUHP_AP_ONLINE_DYN).
> vmstat_shepherd makes sure to not rearm it afterward. But now it looks
> possible for the above to do that mistake?

I don't think the mistake is an issue. In case of a
queue_delayed_work_on being called after cancel_delayed_work_sync,
either vmstat_update executes on the local CPU, or on a
different CPU (after the bound kworkers have been moved).

Each case is fine (see vmstat_update).

> > +		}
> > +	}
> >  	this_cpu_write(vmstat_dirty, true);
> >  }
> > @@ -2009,6 +2028,10 @@ static void vmstat_shepherd(struct work_
> >  	for_each_online_cpu(cpu) {
> >  		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
> >  
> > +		/* NOHZ full CPUs manage their own vmstat flushing */
> > +		if (tick_nohz_full_cpu(smp_processor_id()))
> 
> It should be the remote CPU instead of the current one.

Fixed.



* Re: [PATCH v9 3/5] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full
  2022-12-16 16:16     ` Marcelo Tosatti
@ 2022-12-16 22:47       ` Frederic Weisbecker
  0 siblings, 0 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2022-12-16 22:47 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: atomlin, cl, tglx, mingo, peterz, pauld, neelx, oleksandr,
	linux-kernel, linux-mm

On Fri, Dec 16, 2022 at 01:16:09PM -0300, Marcelo Tosatti wrote:
> On Wed, Dec 14, 2022 at 02:33:02PM +0100, Frederic Weisbecker wrote:
> > On Tue, Dec 06, 2022 at 01:18:29PM -0300, Marcelo Tosatti wrote:
> > >  static inline void vmstat_mark_dirty(void)
> > >  {
> > > +	int cpu = smp_processor_id();
> > > +
> > > +	if (tick_nohz_full_cpu(cpu) && !this_cpu_read(vmstat_dirty)) {
> > > +		struct delayed_work *dw;
> > > +
> > > +		dw = &per_cpu(vmstat_work, cpu);
> > > +		if (!delayed_work_pending(dw)) {
> > > +			unsigned long delay;
> > > +
> > > +			delay = round_jiffies_relative(sysctl_stat_interval);
> > > +			queue_delayed_work_on(cpu, mm_percpu_wq, dw, delay);
> > 
> > Currently the vmstat_work is flushed on cpu_hotplug (CPUHP_AP_ONLINE_DYN).
> > vmstat_shepherd makes sure to not rearm it afterward. But now it looks
> > possible for the above to do that mistake?
> 
> Don't think the mistake is an issue. In case of a
> queue_delayed_work_on being called after cancel_delayed_work_sync,
> either vmstat_update executes on the local CPU, or on a
> different CPU (after the bound kworkers have been moved).

But after the CPU goes offline, its workqueue pool becomes UNBOUND. Which means
that the vmstat_update() from the offline CPU can then execute partly on CPU 0, then
gets preempted and executes halfway on CPU 1, then gets preempted and...

Having a quick look at refresh_cpu_vm_stats(), it doesn't look ready for that...

Thanks.
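
For illustration (simplified, not the literal kernel code),
refresh_cpu_vm_stats() folds the local differentials with this_cpu
operations along the lines of:

	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
		int v = this_cpu_xchg(pzstats->vm_stat_diff[i], 0);

		if (v)
			atomic_long_add(v, &zone->vm_stat[i]);
	}

which assumes the work stays pinned to the CPU whose counters it is
folding for its entire run; once the pool is unbound, consecutive
this_cpu operations may land on different CPUs.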

> 
> Each case is fine (see vmstat_update).
> 
> > > +		}
> > > +	}
> > >  	this_cpu_write(vmstat_dirty, true);
> > >  }
> > > @@ -2009,6 +2028,10 @@ static void vmstat_shepherd(struct work_
> > >  	for_each_online_cpu(cpu) {
> > >  		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
> > >  
> > > +		/* NOHZ full CPUs manage their own vmstat flushing */
> > > +		if (tick_nohz_full_cpu(smp_processor_id()))
> > 
> > It should be the remote CPU instead of the current one.
> 
> Fixed.
> 


