* [PATCH v13 0/6] Ensure quiet_vmstat() is called when returning to userspace and when idle tick is stopped
@ 2023-01-05 12:52 Marcelo Tosatti
  2023-01-05 12:52 ` [PATCH v13 1/6] mm/vmstat: Add CPU-specific variable to track a vmstat discrepancy Marcelo Tosatti
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Marcelo Tosatti @ 2023-01-05 12:52 UTC (permalink / raw)
  To: atomlin, frederic
  Cc: cl, tglx, mingo, peterz, pauld, neelx, oleksandr, linux-kernel, linux-mm

This patch series addresses the following two problems:

    1. A customer provided evidence indicating that the idle
       tick was stopped, yet CPU-specific vmstat counters
       still remained populated.

       One can only assume quiet_vmstat() was not invoked on
       return to the idle loop. I suspect this divergence
       might erroneously prevent a reclaim attempt by kswapd.
       If the number of zone-specific free pages is below the
       per-cpu drift value, then zone_page_state_snapshot()
       is used to compute a more accurate view of the
       aforementioned statistic (see the sketch after this
       list). Any task blocked on the NUMA-node-specific
       pfmemalloc_wait queue will then be unable to make
       significant progress via direct reclaim unless it is
       killed after being woken up by kswapd
       (see throttle_direct_reclaim()).

    2. With a SCHED_FIFO task busy-looping on a given CPU,
       and the kworker for that CPU running at SCHED_OTHER
       priority, queueing work to sync the per-CPU vmstats
       will either cause that work to never execute, or
       stalld (the stall daemon) will boost the kworker's
       priority, which causes a latency violation.
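
For reference, the following is a simplified sketch of the
zone_page_state_snapshot() computation mentioned in item 1 (based on
mm/vmstat.c; exact field names vary between kernel versions). It folds the
not-yet-synced per-CPU differentials into the global counter, which is why
the state of those differentials matters to the reclaim throttling path:

static unsigned long zone_page_state_snapshot(struct zone *zone,
					      enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
	int cpu;

	/* Fold in per-CPU differentials that have not been synced yet. */
	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_stat_diff[item];

	if (x < 0)
		x = 0;
#endif
	return x;
}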


As seen previously, the trivial test program (attached at the end of
this cover letter), executed inside a KVM VM, was used to estimate the
impact of the proposed changes relative to vanilla. Firstly, the
mlock(2) and munlock(2) system calls were used solely to modify vmstat
item 'NR_MLOCK'. In another scenario, the nanosleep(2) system call was
used several times to suspend execution for a period of time, in order
to approximately measure the number of CPU cycles spent in the idle code
path. The following is an average count of CPU cycles across the
aforementioned system calls and the idle loop, respectively. I believe
the differences are negligible:

				  Vanilla                 Modified

  Cycles per idle loop            151858                  153258  (+1.0%)
  Cycles per syscall              8461                    8690    (+2.6%)
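
For illustration only, this is a hypothetical minimal sketch of the kind of
syscall measurement described above (x86-only, rdtsc-based), not the program
attached to the cover letter; the idle-loop figure would require separate
instrumentation:

/* Hypothetical sketch -- not the program attached to the cover letter. */
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <x86intrin.h>

#define LOOPS 100000UL

int main(void)
{
	static char buf[4096] __attribute__((aligned(4096)));
	uint64_t total = 0;

	for (unsigned long i = 0; i < LOOPS; i++) {
		uint64_t t0 = __rdtsc();

		/* Each pair modifies the per-CPU NR_MLOCK differential. */
		mlock(buf, sizeof(buf));
		munlock(buf, sizeof(buf));

		total += __rdtsc() - t0;
	}

	printf("avg cycles per mlock+munlock pair: %llu\n",
	       (unsigned long long)(total / LOOPS));
	return 0;
}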


Any feedback would be appreciated. Thanks.

Changes since v12 [1]:
- Protect vmstat cmpxchg and vmstat dirty bit write 
  by disabling preemption (Frederic Weisbecker)

Changes since v11 [2]:
- Switch back to this_cpu_write/read when appropriate
   (Frederic Weisbecker)
- Avoid ifdeffery in the middle of functions
   (Frederic Weisbecker)
- Clarify down_prep callback comment
   (Frederic Weisbecker)
- Move new Kconfig option close to CPU_ISOLATION option
   (Frederic Weisbecker)

Changes since v10 [3]:
- Close cpu hotplug race with nohz_full CPUs
   (Frederic Weisbecker)

Changes since v9 [4]:
- Add config to enable/disable syncing when returning to userspace
   (Frederic Weisbecker)
- Add missing signed-off-by
   (Frederic Weisbecker)
- Use proper CPU value when skipping nohz_full CPUs
   (Frederic Weisbecker)
- Use this_cpu_ptr when appropriate
   (Frederic Weisbecker)
- Improve changelogs
   (Frederic Weisbecker)
- For stat_refresh sysfs file: avoid queueing work on CPU if stats are clean

Changes since v8 [5]:
- For nohz_full CPUs, manage per-CPU vmstat flushing from CPU context
   (Frederic Weisbecker)
 
Changes since v7 [6]:
 - Added trivial helpers for modification and testing
   (Andrew Morton)
 - Modified comment since we do now cancel any delayed
   work if the tick is stopped in quiet_vmstat()
 - Moved check to ensure vmstat differentials do not
   remain if the tick is stopped on exiting to user-mode
   into a separate patch
   (Frederic Weisbecker)

Changes since v6 [7]:
 - Clean vmstat_dirty before differential sync loop
 - Cancel pending work if tick stopped
 - Do not queue work to remote CPU if tick stopped

Changes since v5 [8]:

 - Introduced __tick_nohz_user_enter_prepare()
 - Switched to EXPORT_SYMBOL_GPL()

Changes since v4 [9]:

 - Moved vmstat_dirty specific changes into a separate patch
   (Marcelo Tosatti)

Changes since v3 [10]:

 - Used EXPORT_SYMBOL() on tick_nohz_user_enter_prepare()
 - Replaced need_update()
 - Introduced CPU-specific variable namely vmstat_dirty
   and mark_vmstat_dirty()

[1]: https://lore.kernel.org/linux-mm/20230104133459.5yaflf3yicpmhbbh@ava.usersys.com/T/
[2]: https://lore.kernel.org/lkml/20221223144150.GA79369@lothringen/T/
[3]: https://lore.kernel.org/linux-mm/20221216194904.075275493@redhat.com/T/
[4]: https://lore.kernel.org/lkml/20221214131839.GE1930067@lothringen/t/
[5]: https://lore.kernel.org/linux-mm/20220924152227.819815-1-atomlin@redhat.com/
[6]: https://lore.kernel.org/lkml/20220817191346.287594886@redhat.com/
[7]: https://lore.kernel.org/linux-mm/20220808194820.676246-1-atomlin@redhat.com/
[8]: https://lore.kernel.org/lkml/20220801234258.134609-1-atomlin@redhat.com/
[9]: https://lore.kernel.org/lkml/20220621172207.1501641-1-atomlin@redhat.com/
[10]: https://lore.kernel.org/lkml/20220422193647.3808657-1-atomlin@redhat.com/


Aaron Tomlin (4):
  mm/vmstat: Add CPU-specific variable to track a vmstat discrepancy
  mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies
  tick/nohz_full: Ensure quiet_vmstat() is called on exit to user-mode
    when the idle tick is stopped
  tick/sched: Ensure quiet_vmstat() is called when the idle tick was
    stopped too

Marcelo Tosatti (2):
  mm/vmstat: Do not queue vmstat_update if tick is stopped
  mm/vmstat: avoid queueing work item if cpu stats are clean


 include/linux/tick.h     |    5 +-
 include/linux/vmstat.h   |    4 -
 init/Kconfig             |   13 +++++
 kernel/time/tick-sched.c |   20 ++++++++-
 mm/vmstat.c              |  223 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------
 5 files changed, 214 insertions(+), 51 deletions(-)



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v13 1/6] mm/vmstat: Add CPU-specific variable to track a vmstat discrepancy
  2023-01-05 12:52 [PATCH v13 0/6] Ensure quiet_vmstat() is called when returning to userspace and when idle tick is stopped Marcelo Tosatti
@ 2023-01-05 12:52 ` Marcelo Tosatti
  2023-01-10 11:58   ` Christoph Lameter
  2023-01-05 12:52 ` [PATCH v13 2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies Marcelo Tosatti
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Marcelo Tosatti @ 2023-01-05 12:52 UTC (permalink / raw)
  To: atomlin, frederic
  Cc: cl, tglx, mingo, peterz, pauld, neelx, oleksandr, linux-kernel,
	linux-mm, Marcelo Tosatti

From: Aaron Tomlin <atomlin@atomlin.com>

Introduce a CPU-specific variable, namely vmstat_dirty, to indicate
whether a vmstat imbalance is present for a given CPU, so that, at
the appropriate time, all the remaining differentials can be folded.
This patch also provides trivial helpers for modifying and testing it.
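
As an illustration of the intended calling convention (based on how patch 2
of this series uses these helpers, not code added by this patch), a
counter-update path looks roughly like:

	preempt_disable();
	/* ... update this CPU's vm_stat_diff ... */
	vmstat_mark_dirty();	/* flag this CPU as having unsynced diffs */
	preempt_enable();

Preemption is disabled so that the counter update and the dirty flag are
guaranteed to target the same CPU.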

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
 mm/vmstat.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -194,6 +194,22 @@ void fold_vm_numa_events(void)
 #endif
 
 #ifdef CONFIG_SMP
+static DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
+
+static inline void vmstat_mark_dirty(void)
+{
+	this_cpu_write(vmstat_dirty, true);
+}
+
+static inline void vmstat_clear_dirty(void)
+{
+	this_cpu_write(vmstat_dirty, false);
+}
+
+static inline bool is_vmstat_dirty(void)
+{
+	return this_cpu_read(vmstat_dirty);
+}
 
 int calculate_pressure_threshold(struct zone *zone)
 {



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v13 2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies
  2023-01-05 12:52 [PATCH v13 0/6] Ensure quiet_vmstat() is called when returning to userspace and when idle tick is stopped Marcelo Tosatti
  2023-01-05 12:52 ` [PATCH v13 1/6] mm/vmstat: Add CPU-specific variable to track a vmstat discrepancy Marcelo Tosatti
@ 2023-01-05 12:52 ` Marcelo Tosatti
  2023-01-10 12:06   ` Christoph Lameter
  2023-01-05 12:52 ` [PATCH v13 3/6] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full Marcelo Tosatti
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Marcelo Tosatti @ 2023-01-05 12:52 UTC (permalink / raw)
  To: atomlin, frederic
  Cc: cl, tglx, mingo, peterz, pauld, neelx, oleksandr, linux-kernel,
	linux-mm, Marcelo Tosatti

From: Aaron Tomlin <atomlin@atomlin.com>

Use the previously introduced CPU-specific variable, vmstat_dirty, to
indicate whether a vmstat differential (or imbalance) is present for a
given CPU, so that vmstat processing can be initiated at the appropriate
time. The hope is that this approach is "cheaper" when compared to
need_update(). The idea is based on Marcelo's patch [1].

[1]: https://lore.kernel.org/lkml/20220204173554.763888172@fedora.localdomain/

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
 mm/vmstat.c |   48 ++++++++++++++----------------------------------
 1 file changed, 14 insertions(+), 34 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -381,6 +381,7 @@ void __mod_zone_page_state(struct zone *
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	vmstat_mark_dirty();
 
 	preempt_enable_nested();
 }
@@ -417,6 +418,7 @@ void __mod_node_page_state(struct pglist
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	vmstat_mark_dirty();
 
 	preempt_enable_nested();
 }
@@ -577,6 +579,9 @@ static inline void mod_zone_state(struct
 	s8 __percpu *p = pcp->vm_stat_diff + item;
 	long o, n, t, z;
 
+	/* cmpxchg and vmstat_mark_dirty should happen on the same CPU */
+	preempt_disable();
+
 	do {
 		z = 0;  /* overflow to zone counters */
 
@@ -606,6 +611,8 @@ static inline void mod_zone_state(struct
 
 	if (z)
 		zone_page_state_add(z, zone, item);
+	vmstat_mark_dirty();
+	preempt_enable();
 }
 
 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
@@ -645,6 +652,8 @@ static inline void mod_node_state(struct
 		delta >>= PAGE_SHIFT;
 	}
 
+	/* cmpxchg and vmstat_mark_dirty should happen on the same CPU */
+	preempt_disable();
 	do {
 		z = 0;  /* overflow to node counters */
 
@@ -674,6 +683,8 @@ static inline void mod_node_state(struct
 
 	if (z)
 		node_page_state_add(z, pgdat, item);
+	vmstat_mark_dirty();
+	preempt_enable();
 }
 
 void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
@@ -828,6 +839,14 @@ static int refresh_cpu_vm_stats(bool do_
 	int global_node_diff[NR_VM_NODE_STAT_ITEMS] = { 0, };
 	int changes = 0;
 
+	/*
+	 * Clear vmstat_dirty before clearing the percpu vmstats.
+	 * If interrupts are enabled, it is possible that an interrupt
+	 * or another task modifies a percpu vmstat, which will
+	 * set vmstat_dirty to true.
+	 */
+	vmstat_clear_dirty();
+
 	for_each_populated_zone(zone) {
 		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
 #ifdef CONFIG_NUMA
@@ -1957,35 +1976,6 @@ static void vmstat_update(struct work_st
 }
 
 /*
- * Check if the diffs for a certain cpu indicate that
- * an update is needed.
- */
-static bool need_update(int cpu)
-{
-	pg_data_t *last_pgdat = NULL;
-	struct zone *zone;
-
-	for_each_populated_zone(zone) {
-		struct per_cpu_zonestat *pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
-		struct per_cpu_nodestat *n;
-
-		/*
-		 * The fast way of checking if there are any vmstat diffs.
-		 */
-		if (memchr_inv(pzstats->vm_stat_diff, 0, sizeof(pzstats->vm_stat_diff)))
-			return true;
-
-		if (last_pgdat == zone->zone_pgdat)
-			continue;
-		last_pgdat = zone->zone_pgdat;
-		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
-		if (memchr_inv(n->vm_node_stat_diff, 0, sizeof(n->vm_node_stat_diff)))
-			return true;
-	}
-	return false;
-}
-
-/*
  * Switch off vmstat processing and then fold all the remaining differentials
  * until the diffs stay at zero. The function is used by NOHZ and can only be
  * invoked when tick processing is not active.
@@ -1995,10 +1985,7 @@ void quiet_vmstat(void)
 	if (system_state != SYSTEM_RUNNING)
 		return;
 
-	if (!delayed_work_pending(this_cpu_ptr(&vmstat_work)))
-		return;
-
-	if (!need_update(smp_processor_id()))
+	if (!is_vmstat_dirty())
 		return;
 
 	/*
@@ -2029,7 +2016,7 @@ static void vmstat_shepherd(struct work_
 	for_each_online_cpu(cpu) {
 		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
 
-		if (!delayed_work_pending(dw) && need_update(cpu))
+		if (!delayed_work_pending(dw) && per_cpu(vmstat_dirty, cpu))
 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
 
 		cond_resched();



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v13 3/6] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full
  2023-01-05 12:52 [PATCH v13 0/6] Ensure quiet_vmstat() is called when returning to userspace and when idle tick is stopped Marcelo Tosatti
  2023-01-05 12:52 ` [PATCH v13 1/6] mm/vmstat: Add CPU-specific variable to track a vmstat discrepancy Marcelo Tosatti
  2023-01-05 12:52 ` [PATCH v13 2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies Marcelo Tosatti
@ 2023-01-05 12:52 ` Marcelo Tosatti
  2023-01-05 12:52 ` [PATCH v13 4/6] tick/nohz_full: Ensure quiet_vmstat() is called on exit to user-mode when the idle tick is stopped Marcelo Tosatti
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Marcelo Tosatti @ 2023-01-05 12:52 UTC (permalink / raw)
  To: atomlin, frederic
  Cc: cl, tglx, mingo, peterz, pauld, neelx, oleksandr, linux-kernel,
	linux-mm, Marcelo Tosatti

For nohz full CPUs, we'd like the per-CPU vm statistics to be
synchronized when userspace is executing. Otherwise,
the vmstat_shepherd might queue a work item to synchronize them,
which is undesired interference for isolated CPUs.

This means that it is necessary to check for, and possibly sync,
the statistics when returning to userspace. As a result, there
are now two execution contexts, on different CPUs,
which require awareness about each other: context switch
and the vmstat shepherd kernel thread.

To avoid shared variables between these two contexts (which
would require atomic accesses), delegate the responsibility
for statistics synchronization from vmstat_shepherd to the local
CPU context, for nohz_full CPUs.

Do that by queueing a delayed work when marking per-CPU vmstat dirty.

When returning to userspace, fold the stats and cancel the delayed work.

When entering idle, only fold the stats.
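
A condensed sketch of the resulting flow on a nohz_full CPU (a restatement
of the diff below, slightly simplified):

static inline void vmstat_mark_dirty(void)
{
	vmstat_queue_local_work();	/* arm this CPU's delayed work once */
	this_cpu_write(vmstat_dirty_pcpu.dirty, true);
}

void quiet_vmstat(bool user)
{
	if (!is_vmstat_dirty())
		return;

	refresh_cpu_vm_stats(false);	/* fold the per-CPU differentials */

	/* Only the return-to-userspace path cancels the delayed work. */
	if (IS_ENABLED(CONFIG_FLUSH_WORK_ON_RESUME_USER) && user)
		cancel_delayed_work(this_cpu_ptr(&vmstat_work));
}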

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
 include/linux/vmstat.h   |    4 ++--
 kernel/time/tick-sched.c |    2 +-
 mm/vmstat.c              |   41 ++++++++++++++++++++++++++++++++---------
 3 files changed, 35 insertions(+), 12 deletions(-)

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -28,6 +28,7 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
+#include <linux/tick.h>
 
 #include "internal.h"
 
@@ -194,21 +195,57 @@ void fold_vm_numa_events(void)
 #endif
 
 #ifdef CONFIG_SMP
-static DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
+
+struct vmstat_dirty {
+	bool dirty;
+#ifdef CONFIG_FLUSH_WORK_ON_RESUME_USER
+	bool cpu_offline;
+#endif
+};
+
+static DEFINE_PER_CPU_ALIGNED(struct vmstat_dirty, vmstat_dirty_pcpu);
+static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
+int sysctl_stat_interval __read_mostly = HZ;
+
+#ifdef CONFIG_FLUSH_WORK_ON_RESUME_USER
+static inline void vmstat_queue_local_work(void)
+{
+	bool vmstat_dirty = this_cpu_read(vmstat_dirty_pcpu.dirty);
+	bool cpu_offline = this_cpu_read(vmstat_dirty_pcpu.cpu_offline);
+	int cpu = smp_processor_id();
+
+	if (tick_nohz_full_cpu(cpu) && !vmstat_dirty) {
+		struct delayed_work *dw;
+
+		dw = this_cpu_ptr(&vmstat_work);
+		if (!delayed_work_pending(dw) && !cpu_offline) {
+			unsigned long delay;
+
+			delay = round_jiffies_relative(sysctl_stat_interval);
+			queue_delayed_work_on(cpu, mm_percpu_wq, dw, delay);
+		}
+	}
+}
+#else
+static inline void vmstat_queue_local_work(void)
+{
+}
+#endif
 
 static inline void vmstat_mark_dirty(void)
 {
-	this_cpu_write(vmstat_dirty, true);
+	vmstat_queue_local_work();
+	this_cpu_write(vmstat_dirty_pcpu.dirty, true);
 }
 
 static inline void vmstat_clear_dirty(void)
 {
-	this_cpu_write(vmstat_dirty, false);
+	this_cpu_write(vmstat_dirty_pcpu.dirty, false);
 }
 
 static inline bool is_vmstat_dirty(void)
 {
-	return this_cpu_read(vmstat_dirty);
+	return this_cpu_read(vmstat_dirty_pcpu.dirty);
 }
 
 int calculate_pressure_threshold(struct zone *zone)
@@ -1893,9 +1930,6 @@ static const struct seq_operations vmsta
 #endif /* CONFIG_PROC_FS */
 
 #ifdef CONFIG_SMP
-static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
-int sysctl_stat_interval __read_mostly = HZ;
-
 #ifdef CONFIG_PROC_FS
 static void refresh_vm_stats(struct work_struct *work)
 {
@@ -1980,7 +2014,7 @@ static void vmstat_update(struct work_st
  * until the diffs stay at zero. The function is used by NOHZ and can only be
  * invoked when tick processing is not active.
  */
-void quiet_vmstat(void)
+void quiet_vmstat(bool user)
 {
 	if (system_state != SYSTEM_RUNNING)
 		return;
@@ -1988,13 +2022,19 @@ void quiet_vmstat(void)
 	if (!is_vmstat_dirty())
 		return;
 
+	refresh_cpu_vm_stats(false);
+
+	if (!IS_ENABLED(CONFIG_FLUSH_WORK_ON_RESUME_USER))
+		return;
+
+	if (!user)
+		return;
 	/*
-	 * Just refresh counters and do not care about the pending delayed
-	 * vmstat_update. It doesn't fire that often to matter and canceling
-	 * it would be too expensive from this path.
-	 * vmstat_shepherd will take care about that for us.
+	 * If the tick is stopped, cancel any delayed work to avoid
+	 * interruptions to this CPU in the future.
 	 */
-	refresh_cpu_vm_stats(false);
+	if (delayed_work_pending(this_cpu_ptr(&vmstat_work)))
+		cancel_delayed_work(this_cpu_ptr(&vmstat_work));
 }
 
 /*
@@ -2015,8 +2055,14 @@ static void vmstat_shepherd(struct work_
 	/* Check processors whose vmstat worker threads have been disabled */
 	for_each_online_cpu(cpu) {
 		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
+		struct vmstat_dirty *vms = per_cpu_ptr(&vmstat_dirty_pcpu, cpu);
+
+		if (IS_ENABLED(CONFIG_FLUSH_WORK_ON_RESUME_USER))
+			/* NOHZ full CPUs manage their own vmstat flushing */
+			if (tick_nohz_full_cpu(cpu))
+				continue;
 
-		if (!delayed_work_pending(dw) && per_cpu(vmstat_dirty, cpu))
+		if (!delayed_work_pending(dw) && vms->dirty)
 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
 
 		cond_resched();
@@ -2049,8 +2095,36 @@ static void __init init_cpu_node_state(v
 	}
 }
 
+#ifdef CONFIG_FLUSH_WORK_ON_RESUME_USER
+static void vmstat_cpu_online_rearm(unsigned int cpu)
+{
+	struct vmstat_dirty *vms = per_cpu_ptr(&vmstat_dirty_pcpu, cpu);
+
+	if (tick_nohz_full_cpu(cpu)) {
+		struct delayed_work *dw;
+
+		vms->cpu_offline = false;
+		vms->dirty = true;
+
+		dw = this_cpu_ptr(&vmstat_work);
+		if (!delayed_work_pending(dw)) {
+			unsigned long delay;
+
+			delay = round_jiffies_relative(sysctl_stat_interval);
+			queue_delayed_work_on(cpu, mm_percpu_wq, dw, delay);
+		}
+	}
+}
+#else
+static void vmstat_cpu_online_rearm(unsigned int cpu)
+{
+}
+#endif
+
 static int vmstat_cpu_online(unsigned int cpu)
 {
+	vmstat_cpu_online_rearm(cpu);
+
 	refresh_zone_stat_thresholds();
 
 	if (!node_state(cpu_to_node(cpu), N_CPU)) {
@@ -2060,8 +2134,28 @@ static int vmstat_cpu_online(unsigned in
 	return 0;
 }
 
+
+#ifdef CONFIG_FLUSH_WORK_ON_RESUME_USER
+static void vmstat_mark_cpu_offline(unsigned int cpu)
+{
+	struct vmstat_dirty *vms = per_cpu_ptr(&vmstat_dirty_pcpu, cpu);
+
+	vms->cpu_offline = true;
+}
+#else
+static void vmstat_mark_cpu_offline(unsigned int cpu)
+{
+}
+#endif
+
+/*
+ * Callbacks in the ONLINE section (CPUHP_AP_ONLINE_DYN is in this section),
+ * are invoked on the hotplugged CPU from the per CPU
+ * hotplug thread with interrupts and preemption enabled.
+ */
 static int vmstat_cpu_down_prep(unsigned int cpu)
 {
+	vmstat_mark_cpu_offline(cpu);
 	cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
 	return 0;
 }
Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h
+++ linux-2.6/include/linux/vmstat.h
@@ -290,7 +290,7 @@ extern void dec_zone_state(struct zone *
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_node_state(struct pglist_data *, enum node_stat_item);
 
-void quiet_vmstat(void);
+void quiet_vmstat(bool user);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);
 
@@ -403,7 +403,7 @@ static inline void __dec_node_page_state
 
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
-static inline void quiet_vmstat(void) { }
+static inline void quiet_vmstat(bool user) { }
 
 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_zonestat *pzstats) { }
Index: linux-2.6/kernel/time/tick-sched.c
===================================================================
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -911,7 +911,7 @@ static void tick_nohz_stop_tick(struct t
 	 */
 	if (!ts->tick_stopped) {
 		calc_load_nohz_start();
-		quiet_vmstat();
+		quiet_vmstat(false);
 
 		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 		ts->tick_stopped = 1;
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -678,6 +678,19 @@ config CPU_ISOLATION
 
 	  Say Y if unsure.
 
+config FLUSH_WORK_ON_RESUME_USER
+	bool "Flush per-CPU vmstats on user return (for nohz full CPUs)"
+	depends on NO_HZ_FULL
+	default y
+
+	help
+	  By default, nohz full CPUs flush per-CPU vm statistics on return
+	  to userspace (to avoid additional interference when executing
+	  userspace code). This has a small but measurable impact on
+	  system call performance. You can disable this to improve system call
+	  performance, at the expense of potential interference with userspace
+	  execution.
+
 source "kernel/rcu/Kconfig"
 
 config BUILD_BIN2C



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v13 4/6] tick/nohz_full: Ensure quiet_vmstat() is called on exit to user-mode when the idle tick is stopped
  2023-01-05 12:52 [PATCH v13 0/6] Ensure quiet_vmstat() is called when returning to userspace and when idle tick is stopped Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2023-01-05 12:52 ` [PATCH v13 3/6] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full Marcelo Tosatti
@ 2023-01-05 12:52 ` Marcelo Tosatti
  2023-01-05 12:52 ` [PATCH v13 5/6] tick/sched: Ensure quiet_vmstat() is called when the idle tick was stopped too Marcelo Tosatti
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Marcelo Tosatti @ 2023-01-05 12:52 UTC (permalink / raw)
  To: atomlin, frederic
  Cc: cl, tglx, mingo, peterz, pauld, neelx, oleksandr, linux-kernel,
	linux-mm, Marcelo Tosatti

From: Aaron Tomlin <atomlin@atomlin.com>

For nohz full CPUs, we'd like the per-CPU vm statistics to be
synchronized when userspace is executing. Otherwise, the vmstat_shepherd
might queue a work item to synchronize them, which is undesired
interference for isolated CPUs.

This patch syncs CPU-specific vmstat differentials, on return to
userspace, if CONFIG_FLUSH_WORK_ON_RESUME_USER is enabled and the tick
is stopped.

A trivial test program was used to determine the impact of the proposed
changes relative to vanilla. The mlock(2) and munlock(2) system calls
were used solely to modify vmstat item 'NR_MLOCK'. The following is an
average count of CPU cycles across the aforementioned system calls:

				  Vanilla                 Modified

  Cycles per syscall              8461                    8690    (+2.6%)

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 include/linux/tick.h     |    5 +++--
 kernel/time/tick-sched.c |   15 +++++++++++++++
 2 files changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/tick.h
===================================================================
--- linux-2.6.orig/include/linux/tick.h
+++ linux-2.6/include/linux/tick.h
@@ -11,7 +11,6 @@
 #include <linux/context_tracking_state.h>
 #include <linux/cpumask.h>
 #include <linux/sched.h>
-#include <linux/rcupdate.h>
 
 #ifdef CONFIG_GENERIC_CLOCKEVENTS
 extern void __init tick_init(void);
@@ -272,6 +271,7 @@ static inline void tick_dep_clear_signal
 
 extern void tick_nohz_full_kick_cpu(int cpu);
 extern void __tick_nohz_task_switch(void);
+void __tick_nohz_user_enter_prepare(void);
 extern void __init tick_nohz_full_setup(cpumask_var_t cpumask);
 #else
 static inline bool tick_nohz_full_enabled(void) { return false; }
@@ -296,6 +296,7 @@ static inline void tick_dep_clear_signal
 
 static inline void tick_nohz_full_kick_cpu(int cpu) { }
 static inline void __tick_nohz_task_switch(void) { }
+static inline void __tick_nohz_user_enter_prepare(void) { }
 static inline void tick_nohz_full_setup(cpumask_var_t cpumask) { }
 #endif
 
@@ -308,7 +309,7 @@ static inline void tick_nohz_task_switch
 static inline void tick_nohz_user_enter_prepare(void)
 {
 	if (tick_nohz_full_cpu(smp_processor_id()))
-		rcu_nocb_flush_deferred_wakeup();
+		__tick_nohz_user_enter_prepare();
 }
 
 #endif
Index: linux-2.6/kernel/time/tick-sched.c
===================================================================
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -26,6 +26,7 @@
 #include <linux/posix-timers.h>
 #include <linux/context_tracking.h>
 #include <linux/mm.h>
+#include <linux/rcupdate.h>
 
 #include <asm/irq_regs.h>
 
@@ -519,6 +520,23 @@ void __tick_nohz_task_switch(void)
 	}
 }
 
+void __tick_nohz_user_enter_prepare(void)
+{
+	if (tick_nohz_full_cpu(smp_processor_id())) {
+		if (IS_ENABLED(CONFIG_FLUSH_WORK_ON_RESUME_USER)) {
+			struct tick_sched *ts;
+
+			ts = this_cpu_ptr(&tick_cpu_sched);
+
+			if (ts->tick_stopped)
+				quiet_vmstat(true);
+		}
+
+		rcu_nocb_flush_deferred_wakeup();
+	}
+}
+EXPORT_SYMBOL_GPL(__tick_nohz_user_enter_prepare);
+
 /* Get the boot-time nohz CPU list from the kernel parameters. */
 void __init tick_nohz_full_setup(cpumask_var_t cpumask)
 {



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v13 5/6] tick/sched: Ensure quiet_vmstat() is called when the idle tick was stopped too
  2023-01-05 12:52 [PATCH v13 0/6] Ensure quiet_vmstat() is called when returning to userspace and when idle tick is stopped Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2023-01-05 12:52 ` [PATCH v13 4/6] tick/nohz_full: Ensure quiet_vmstat() is called on exit to user-mode when the idle tick is stopped Marcelo Tosatti
@ 2023-01-05 12:52 ` Marcelo Tosatti
  2023-01-05 12:52 ` [PATCH v13 6/6] mm/vmstat: avoid queueing work item if cpu stats are clean Marcelo Tosatti
       [not found] ` <20230106001244.4463-1-hdanton@sina.com>
  6 siblings, 0 replies; 24+ messages in thread
From: Marcelo Tosatti @ 2023-01-05 12:52 UTC (permalink / raw)
  To: atomlin, frederic
  Cc: cl, tglx, mingo, peterz, pauld, neelx, oleksandr, linux-kernel,
	linux-mm, Marcelo Tosatti

From: Aaron Tomlin <atomlin@atomlin.com>

In the context of the idle task on an adaptive-tick (nohz_full) CPU,
quiet_vmstat() can be called: before stopping the idle tick, when
entering an idle state, and on exit. In particular, in the latter case,
when the idle task is required to reschedule, the idle tick can remain
stopped and the timer expiration time can be endless, i.e., KTIME_MAX.
Indeed, before a nohz_full CPU enters an idle state, the CPU-specific
vmstat counters should be processed to ensure the respective values have
been reset and folded into the zone-specific 'vm_stat[]'. That said, this
can only occur when the idle tick was previously stopped and
reprogramming of the timer is not required.

A customer provided evidence indicating that the idle tick was stopped,
yet CPU-specific vmstat counters still remained populated. Thus one can
only assume quiet_vmstat() was not invoked on return to the idle loop.

I suspect this divergence might erroneously prevent a reclaim attempt by
kswapd. If the number of zone-specific free pages is below the per-cpu
drift value, then zone_page_state_snapshot() is used to compute a more
accurate view of the aforementioned statistic. Thus any task blocked on
the NUMA-node-specific pfmemalloc_wait queue will be unable to make
significant progress via direct reclaim unless it is killed after being
woken up by kswapd (see throttle_direct_reclaim()).

Consider the following theoretical scenario:

 - Note: CPU X is part of 'tick_nohz_full_mask'

    1.      CPU Y migrated running task A to CPU X that
	    was in an idle state i.e. waiting for an IRQ;
	    marked the current task on CPU X to need/or
	    require a reschedule i.e., set TIF_NEED_RESCHED
	    and invoked a reschedule IPI to CPU X
	    (see sched_move_task())

    2.      CPU X acknowledged the reschedule IPI. Generic
	    idle loop code noticed the TIF_NEED_RESCHED flag
	    against the idle task, attempted to exit the loop
	    and called the main scheduler function, i.e.
	    __schedule().

	    Since the idle tick was previously stopped, no
	    scheduling-clock tick would occur, so no deferred
	    timers would be handled

    3.      After the transition to kernel execution, Task A
	    running on CPU X indirectly released a few pages
	    (e.g. see __free_one_page()); CPU X's
	    'vm_stat_diff[NR_FREE_PAGES]' was updated and the
	    zone specific 'vm_stat[]' update was deferred as
	    per the CPU-specific stat threshold

    4.      Task A then invoked exit(2) and the kernel removed
	    the task from the run-queue; the idle task was
	    selected to execute next since there are no other
	    runnable tasks assigned to the given CPU
	    (see pick_next_task() and pick_next_task_idle())

    5.      On return to the idle loop, since the idle tick
	    was already stopped and can remain so (see [1]
	    below), e.g. no pending soft IRQs, no attempt is
	    made to zero and fold CPU X's vmstat counters,
	    since reprogramming of the scheduling-clock tick
	    is not required (see [2])

		  ...
		    do_idle
		    {

		      __current_set_polling()
		      tick_nohz_idle_enter()

		      while (!need_resched()) {

			local_irq_disable()

			...

			/* No polling or broadcast event */
			cpuidle_idle_call()
			{

			  if (cpuidle_not_available(drv, dev)) {
			    tick_nohz_idle_stop_tick()
			      __tick_nohz_idle_stop_tick(this_cpu_ptr(&tick_cpu_sched))
			      {
				int cpu = smp_processor_id()

				if (ts->timer_expires_base)
				  expires = ts->timer_expires
				else if (can_stop_idle_tick(cpu, ts))
	      (1) ------->        expires = tick_nohz_next_event(ts, cpu)
				else
				  return

				ts->idle_calls++

				if (expires > 0LL) {

				  tick_nohz_stop_tick(ts, cpu)
				  {

				    if (ts->tick_stopped && (expires == ts->next_tick)) {
	      (2) ------->            if (tick == KTIME_MAX || ts->next_tick ==
					hrtimer_get_expires(&ts->sched_timer))
					return
				    }
				    ...
				  }

So, the idea of this patch is to ensure refresh_cpu_vm_stats(false) is
called, when it is appropriate, on return to the idle loop if the idle
tick was previously stopped too.

A trivial test program was used to determine the impact of the proposed
changes relative to vanilla. The nanosleep(2) system call was used several
times to suspend execution for a period of time, in order to approximately
measure the number of CPU cycles spent in the idle code path. The following
is an average count of CPU cycles:

				  Vanilla                 Modified

  Cycles per idle loop            151858                  153258  (+1.0%)

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
 kernel/time/tick-sched.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/time/tick-sched.c
===================================================================
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -929,13 +929,14 @@ static void tick_nohz_stop_tick(struct t
 	 */
 	if (!ts->tick_stopped) {
 		calc_load_nohz_start();
-		quiet_vmstat(false);
 
 		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 		ts->tick_stopped = 1;
 		trace_tick_stop(1, TICK_DEP_MASK_NONE);
 	}
 
+	/* Attempt to fold when the idle tick is stopped or not */
+	quiet_vmstat(false);
 	ts->next_tick = tick;
 
 	/*



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v13 6/6] mm/vmstat: avoid queueing work item if cpu stats are clean
  2023-01-05 12:52 [PATCH v13 0/6] Ensure quiet_vmstat() is called when returning to userspace and when idle tick is stopped Marcelo Tosatti
                   ` (4 preceding siblings ...)
  2023-01-05 12:52 ` [PATCH v13 5/6] tick/sched: Ensure quiet_vmstat() is called when the idle tick was stopped too Marcelo Tosatti
@ 2023-01-05 12:52 ` Marcelo Tosatti
       [not found] ` <20230106001244.4463-1-hdanton@sina.com>
  6 siblings, 0 replies; 24+ messages in thread
From: Marcelo Tosatti @ 2023-01-05 12:52 UTC (permalink / raw)
  To: atomlin, frederic
  Cc: cl, tglx, mingo, peterz, pauld, neelx, oleksandr, linux-kernel,
	linux-mm, Marcelo Tosatti

It is not necessary to queue a work item to run refresh_vm_stats
on a remote CPU if that CPU has no dirty stats and no per-CPU
allocations for remote nodes.

This fixes a sosreport hang (sosreport uses vmstat_refresh) in the
presence of a spinning SCHED_FIFO process.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c
+++ linux-2.6/mm/vmstat.c
@@ -1931,6 +1931,31 @@ static const struct seq_operations vmsta
 
 #ifdef CONFIG_SMP
 #ifdef CONFIG_PROC_FS
+static bool need_drain_remote_zones(int cpu)
+{
+#ifdef CONFIG_NUMA
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_pages *pcp;
+
+		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+		if (!pcp->count)
+			continue;
+
+		if (!pcp->expire)
+			continue;
+
+		if (zone_to_nid(zone) == cpu_to_node(cpu))
+			continue;
+
+		return true;
+	}
+#endif
+
+	return false;
+}
+
 static void refresh_vm_stats(struct work_struct *work)
 {
 	refresh_cpu_vm_stats(true);
@@ -1940,8 +1965,12 @@ int vmstat_refresh(struct ctl_table *tab
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
 	long val;
-	int err;
-	int i;
+	int i, cpu;
+	struct work_struct __percpu *works;
+
+	works = alloc_percpu(struct work_struct);
+	if (!works)
+		return -ENOMEM;
 
 	/*
 	 * The regular update, every sysctl_stat_interval, may come later
@@ -1955,9 +1984,21 @@ int vmstat_refresh(struct ctl_table *tab
 	 * transiently negative values, report an error here if any of
 	 * the stats is negative, so we know to go looking for imbalance.
 	 */
-	err = schedule_on_each_cpu(refresh_vm_stats);
-	if (err)
-		return err;
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		struct work_struct *work = per_cpu_ptr(works, cpu);
+		struct vmstat_dirty *vms = per_cpu_ptr(&vmstat_dirty_pcpu, cpu);
+
+		INIT_WORK(work, refresh_vm_stats);
+
+		if (vms->dirty || need_drain_remote_zones(cpu))
+			schedule_work_on(cpu, work);
+	}
+	for_each_online_cpu(cpu)
+		flush_work(per_cpu_ptr(works, cpu));
+	cpus_read_unlock();
+	free_percpu(works);
+
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		/*
 		 * Skip checking stats known to go negative occasionally.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 3/6] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full
       [not found] ` <20230106001244.4463-1-hdanton@sina.com>
@ 2023-01-06 12:51   ` Marcelo Tosatti
       [not found]   ` <20230106150154.4560-1-hdanton@sina.com>
  1 sibling, 0 replies; 24+ messages in thread
From: Marcelo Tosatti @ 2023-01-06 12:51 UTC (permalink / raw)
  To: Hillf Danton; +Cc: atomlin, frederic, linux-kernel, linux-mm

On Fri, Jan 06, 2023 at 08:12:44AM +0800, Hillf Danton wrote:
> On 05 Jan 2023 09:52:21 -0300 Marcelo Tosatti <mtosatti@redhat.com>
> > For nohz full CPUs, we'd like the per-CPU vm statistics to be
> > synchronized when userspace is executing. Otherwise,
> > the vmstat_shepherd might queue a work item to synchronize them,
> > which is undesired interference for isolated CPUs.
> > 
> > This means that it is necessary to check for, and possibly sync,
> > the statistics when returning to userspace. As a result, there
> > are now two execution contexts, on different CPUs,
> > which require awareness about each other: context switch
> > and the vmstat shepherd kernel thread.
> > 
> > To avoid shared variables between these two contexts (which
> > would require atomic accesses), delegate the responsibility
> > for statistics synchronization from vmstat_shepherd to the local
> > CPU context, for nohz_full CPUs.
> > 
> > Do that by queueing a delayed work when marking per-CPU vmstat dirty.
> > 
> > When returning to userspace, fold the stats and cancel the delayed work.
> > 
> > When entering idle, only fold the stats.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > ---
> >  include/linux/vmstat.h   |    4 ++--
> >  kernel/time/tick-sched.c |    2 +-
> >  mm/vmstat.c              |   41 ++++++++++++++++++++++++++++++++---------
> >  3 files changed, 35 insertions(+), 12 deletions(-)
> > 
> > Index: linux-2.6/mm/vmstat.c
> > ===================================================================
> > --- linux-2.6.orig/mm/vmstat.c
> > +++ linux-2.6/mm/vmstat.c
> > @@ -28,6 +28,7 @@
> >  #include <linux/mm_inline.h>
> >  #include <linux/page_ext.h>
> >  #include <linux/page_owner.h>
> > +#include <linux/tick.h>
> >  
> >  #include "internal.h"
> >  
> > @@ -194,21 +195,57 @@ void fold_vm_numa_events(void)
> >  #endif
> >  
> >  #ifdef CONFIG_SMP
> > -static DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
> > +
> > +struct vmstat_dirty {
> > +	bool dirty;
> > +#ifdef CONFIG_FLUSH_WORK_ON_RESUME_USER
> > +	bool cpu_offline;
> > +#endif
> > +};
> > +
> > +static DEFINE_PER_CPU_ALIGNED(struct vmstat_dirty, vmstat_dirty_pcpu);
> > +static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
> > +int sysctl_stat_interval __read_mostly = HZ;
> > +
> > +#ifdef CONFIG_FLUSH_WORK_ON_RESUME_USER
> > +static inline void vmstat_queue_local_work(void)
> > +{
> > +	bool vmstat_dirty = this_cpu_read(vmstat_dirty_pcpu.dirty);
> > +	bool cpu_offline = this_cpu_read(vmstat_dirty_pcpu.cpu_offline);
> > +	int cpu = smp_processor_id();
> > +
> > +	if (tick_nohz_full_cpu(cpu) && !vmstat_dirty) {
> > +		struct delayed_work *dw;
> > +
> > +		dw = this_cpu_ptr(&vmstat_work);
> > +		if (!delayed_work_pending(dw) && !cpu_offline) {
> > +			unsigned long delay;
> > +
> > +			delay = round_jiffies_relative(sysctl_stat_interval);
> > +			queue_delayed_work_on(cpu, mm_percpu_wq, dw, delay);
> 
> Regression wrt V12 if timer is added on the CPU that is not doing HK_TYPE_TIMER?

Before this change, the timer was managed (and queued on an isolated
CPU) by vmstat_shepherd. Now it is managed (and queued) by the local
CPU, so there is no regression.

Thanks.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 3/6] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full
       [not found]   ` <20230106150154.4560-1-hdanton@sina.com>
@ 2023-01-06 18:16     ` Marcelo Tosatti
       [not found]     ` <20230107001529.4617-1-hdanton@sina.com>
  1 sibling, 0 replies; 24+ messages in thread
From: Marcelo Tosatti @ 2023-01-06 18:16 UTC (permalink / raw)
  To: Hillf Danton; +Cc: atomlin, frederic, linux-kernel, linux-mm

On Fri, Jan 06, 2023 at 11:01:54PM +0800, Hillf Danton wrote:
> On 6 Jan 2023 09:51:00 -0300 Marcelo Tosatti <mtosatti@redhat.com>
> > On Fri, Jan 06, 2023 at 08:12:44AM +0800, Hillf Danton wrote:
> > > 
> > > Regression wrt V12 if timer is added on the CPU that is not doing HK_TYPE_TIMER?
> > 
> > Before this change, the timer was managed (and queued on an isolated
> > CPU) by vmstat_shepherd. Now it is managed (and queued) by the local
> > CPU, so there is no regression.
> 
> Given vm stats folded when returning to userspace, queuing the delayed work
> barely makes sense in the first place. If it can be canceled, queuing it burns
> cycles with nothing earned. Otherwise vm stats got folded already.

Agree, but you can't know whether return to userspace will occur 
before the timer is fired.

So queueing the timer is to _ensure_ that eventually vmstats will be 
synced (which maintains the current timing behaviour wrt vmstat syncs).

Also don't think the queueing cost is significant: it only happens
for the first vmstat dirty item.

> Nor does shepherd even without delay. And the right thing is only make shepherd
> leave isolated CPUs intact.
> 
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 3/6] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full
       [not found]     ` <20230107001529.4617-1-hdanton@sina.com>
@ 2023-01-09 14:12       ` Marcelo Tosatti
       [not found]       ` <20230110024356.336-1-hdanton@sina.com>
  1 sibling, 0 replies; 24+ messages in thread
From: Marcelo Tosatti @ 2023-01-09 14:12 UTC (permalink / raw)
  To: Hillf Danton; +Cc: atomlin, frederic, linux-kernel, linux-mm

Hi Hillf,

On Sat, Jan 07, 2023 at 08:15:29AM +0800, Hillf Danton wrote:
> On 6 Jan 2023 15:16:23 -0300 Marcelo Tosatti <mtosatti@redhat.com>
> > On Fri, Jan 06, 2023 at 11:01:54PM +0800, Hillf Danton wrote:
> > > On 6 Jan 2023 09:51:00 -0300 Marcelo Tosatti <mtosatti@redhat.com>
> > > > On Fri, Jan 06, 2023 at 08:12:44AM +0800, Hillf Danton wrote:
> > > > > 
> > > > > Regression wrt V12 if timer is added on the CPU that is not doing HK_TYPE_TIMER?
> > > > 
> > > > Before this change, the timer was managed (and queued on an isolated
> > > > CPU) by vmstat_shepherd. Now it is managed (and queued) by the local
> > > > CPU, so there is no regression.
> > > 
> > > Given vm stats folded when returning to userspace, queuing the delayed work
> > > barely makes sense in the first place. If it can be canceled, queuing it burns
> > > cycles with nothing earned. Otherwise vm stats got folded already.
> > 
> > Agree, but you can't know whether return to userspace will occur 
> > before the timer is fired.
> 
> No way to predict a random timer expiration, no?

Right.

> > 
> > So queueing the timer is to _ensure_ that eventually vmstats will be 
> > synced (which maintains the current timing behaviour wrt vmstat syncs).
> 
> After this change,
> 
> > > > > > @@ -1988,13 +2022,19 @@ void quiet_vmstat(void)
> > > > > >  	if (!is_vmstat_dirty())
> > > > > >  		return;
> > > > > >  
> 
> it is only ensured eventually by this check instead.

Yes, but if you do not return to userspace, then the per-CPU vm
statistics can be dirty indefinitely.

> > > > > > +	refresh_cpu_vm_stats(false);
> > > > > > +
> > > > > > +	if (!IS_ENABLED(CONFIG_FLUSH_WORK_ON_RESUME_USER))
> > > > > > +		return;
> > > > > > +
> > > > > > +	if (!user)
> > > > > > +		return;
> 
> 
> > Also don't think the queueing cost is significant: it only happens
> > for the first vmstat dirty item.
> 
> Cost is considered only if it is needed.

Not sure I understand what you mean (or whether there is any alternative
to the timer).


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 3/6] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full
       [not found]       ` <20230110024356.336-1-hdanton@sina.com>
@ 2023-01-10 11:50         ` Marcelo Tosatti
       [not found]         ` <20230110151901.402-1-hdanton@sina.com>
  1 sibling, 0 replies; 24+ messages in thread
From: Marcelo Tosatti @ 2023-01-10 11:50 UTC (permalink / raw)
  To: Hillf Danton; +Cc: atomlin, frederic, linux-kernel, linux-mm

On Tue, Jan 10, 2023 at 10:43:56AM +0800, Hillf Danton wrote:
> On 9 Jan 2023 11:12:49 -0300 Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > Yes, but if you do not return to userspace, then the per-CPU vm
> > statistics can be dirty indefinitely.
> 
> Could you specify the reasons for failing to return to userspace,
> given it is undesired interference for the shepherd to queue work
> on the isolated CPUs.

Any system call that takes longer than the threshold to sync vmstats.

Or a long running kernel thread, for example:

https://stackoverflow.com/questions/65111483/long-running-kthread-and-synchronize-net


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 1/6] mm/vmstat: Add CPU-specific variable to track a vmstat discrepancy
  2023-01-05 12:52 ` [PATCH v13 1/6] mm/vmstat: Add CPU-specific variable to track a vmstat discrepancy Marcelo Tosatti
@ 2023-01-10 11:58   ` Christoph Lameter
  2023-01-10 12:12     ` Frederic Weisbecker
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2023-01-10 11:58 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: atomlin, frederic, tglx, mingo, peterz, pauld, neelx, oleksandr,
	linux-kernel, linux-mm

On Thu, 5 Jan 2023, Marcelo Tosatti wrote:

> +static inline void vmstat_mark_dirty(void)
> +{
> +	this_cpu_write(vmstat_dirty, true);
> +}

this_cpu_write() is intended for a per cpu atomic context. You are not
using it in that way. The processor may have changed before or after, and
thus vmstat_dirty for another CPU may have been marked dirty.

I guess this would have to be called __vmstat_mark_dirty() and use
__this_cpu_write(*), with a requirement that preemption be disabled before
using this function.

> +static inline void vmstat_clear_dirty(void)
> +{
> +	this_cpu_write(vmstat_dirty, false);
> +}

Same

> +static inline bool is_vmstat_dirty(void)
> +{
> +	return this_cpu_read(vmstat_dirty);
> +}

This function would only work correctly if preemption is disabled.
Otherwise the processor may change.
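
For illustration, the pattern described above (assuming the helpers from
patch 1, renamed with a __ prefix to signal the preemption requirement)
would look roughly like:

static inline void __vmstat_mark_dirty(void)
{
	/*
	 * Caller must have preemption disabled; with CONFIG_DEBUG_PREEMPT,
	 * __this_cpu_write() warns if that is not the case.
	 */
	__this_cpu_write(vmstat_dirty, true);
}

/* Caller side: */
	preempt_disable();
	__this_cpu_write(*p, x);	/* update the per-CPU counter  */
	__vmstat_mark_dirty();		/* ... and flag this CPU dirty */
	preempt_enable();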

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies
  2023-01-05 12:52 ` [PATCH v13 2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies Marcelo Tosatti
@ 2023-01-10 12:06   ` Christoph Lameter
  2023-01-10 12:18     ` Frederic Weisbecker
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2023-01-10 12:06 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: atomlin, frederic, tglx, mingo, peterz, pauld, neelx, oleksandr,
	linux-kernel, linux-mm

On Thu, 5 Jan 2023, Marcelo Tosatti wrote:

> --- linux-2.6.orig/mm/vmstat.c
> +++ linux-2.6/mm/vmstat.c
> @@ -381,6 +381,7 @@ void __mod_zone_page_state(struct zone *
>  		x = 0;
>  	}
>  	__this_cpu_write(*p, x);
> +	vmstat_mark_dirty();

__vmstat_mark_dirty()? See earlier patch

>
>  	preempt_enable_nested();
>  }
> @@ -417,6 +418,7 @@ void __mod_node_page_state(struct pglist
>  		x = 0;
>  	}
>  	__this_cpu_write(*p, x);
> +	vmstat_mark_dirty();

Ditto.

>
>  	preempt_enable_nested();
>  }
> @@ -577,6 +579,9 @@ static inline void mod_zone_state(struct
>  	s8 __percpu *p = pcp->vm_stat_diff + item;
>  	long o, n, t, z;
>
> +	/* cmpxchg and vmstat_mark_dirty should happen on the same CPU */
> +	preempt_disable();

If you are disabling preemption then why do we still need cmpxchg?
Same again below.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 1/6] mm/vmstat: Add CPU-specific variable to track a vmstat discrepancy
  2023-01-10 11:58   ` Christoph Lameter
@ 2023-01-10 12:12     ` Frederic Weisbecker
  0 siblings, 0 replies; 24+ messages in thread
From: Frederic Weisbecker @ 2023-01-10 12:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Marcelo Tosatti, atomlin, tglx, mingo, peterz, pauld, neelx,
	oleksandr, linux-kernel, linux-mm

On Tue, Jan 10, 2023 at 12:58:32PM +0100, Christoph Lameter wrote:
> On Thu, 5 Jan 2023, Marcelo Tosatti wrote:
> 
> > +static inline void vmstat_mark_dirty(void)
> > +{
> > +	this_cpu_write(vmstat_dirty, true);
> > +}
> 
> this_cpu_write() is intended for a per cpu atomic context. You are not
> using it in that way. The processor may have changed before or after, and
> thus vmstat_dirty for another CPU may have been marked dirty.
> 
> I guess this would have to be called __vmstat_mark_dirty() and use
> __this_cpu_write(*), with a requirement that preemption be disabled before
> using this function.

You're right. So this patchset also arranges for these vmstat functions to be
called with preemption disabled. I'm converting the this_cpu operations
to __this_cpu versions to make sure of that. And I believe the __this_cpu
versions WARN if preemptible().


> 
> > +static inline void vmstat_clear_dirty(void)
> > +{
> > +	this_cpu_write(vmstat_dirty, false);
> > +}
> 
> Same
> 
> > +static inline bool is_vmstat_dirty(void)
> > +{
> > +	return this_cpu_read(vmstat_dirty);
> > +}
> 
> This function would only work correctly if preemption is disabled.
> Otherwise the processor may change.

Indeed that should apply as __this_cpu_read() as well.

Thanks!


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies
  2023-01-10 12:06   ` Christoph Lameter
@ 2023-01-10 12:18     ` Frederic Weisbecker
  2023-01-10 13:39       ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Frederic Weisbecker @ 2023-01-10 12:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Marcelo Tosatti, atomlin, tglx, mingo, peterz, pauld, neelx,
	oleksandr, linux-kernel, linux-mm

On Tue, Jan 10, 2023 at 01:06:37PM +0100, Christoph Lameter wrote:
> On Thu, 5 Jan 2023, Marcelo Tosatti wrote:
> > @@ -577,6 +579,9 @@ static inline void mod_zone_state(struct
> >  	s8 __percpu *p = pcp->vm_stat_diff + item;
> >  	long o, n, t, z;
> >
> > +	/* cmpxchg and vmstat_mark_dirty should happen on the same CPU */
> > +	preempt_disable();
> 
> If you are disabling preemption then why do we still need cmpxchg?
> Same again below.

Note I'm absolutely clueless with vmstat. But I was wondering about it as well
while reviewing Marcelo's series, so git blame pointed me to:

7c83912062c801738d7d19acaf8f7fec25ea663c ("vmstat: User per cpu atomics to avoid
interrupt disable / enable")

And this seem to mention that this can race with IRQs as well, hence the local
cmpxchg operation.
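
For context, a simplified sketch of the local cmpxchg loop referred to above
(based on mod_zone_state(); the overflow-into-the-zone-counter handling is
omitted):

	long o, n, t;

	do {
		t = this_cpu_read(pcp->stat_threshold);
		o = this_cpu_read(*p);
		n = o + delta;

		if (abs(n) > t)
			n = 0;	/* overflow folded into the zone counter (omitted) */
	} while (this_cpu_cmpxchg(*p, o, n) != o);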


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies
  2023-01-10 12:18     ` Frederic Weisbecker
@ 2023-01-10 13:39       ` Christoph Lameter
  2023-01-10 20:09         ` Marcelo Tosatti
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2023-01-10 13:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Marcelo Tosatti, atomlin, tglx, mingo, peterz, pauld, neelx,
	oleksandr, linux-kernel, linux-mm

On Tue, 10 Jan 2023, Frederic Weisbecker wrote:

> Note I'm absolutely clueless with vmstat. But I was wondering about it as well
> while reviewing Marcelo's series, so git blame pointed me to:
>
> 7c83912062c801738d7d19acaf8f7fec25ea663c ("vmstat: User per cpu atomics to avoid
> interrupt disable / enable")
>
> And this seem to mention that this can race with IRQs as well, hence the local
> cmpxchg operation.

The race with irq could be an issue but I thought we avoided that and were
content with disabling preemption.

But this issue illustrates the central problem of the patchset: It makes
the lightweight counters not so lightweight anymore. The basic primitives
add a lot of weight. And the per cpu atomic update operations require
the modification of multiple values. The operation cannot be "atomic" in
that sense anymore, and we need some other form of synchronization that can
span multiple instructions.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 3/6] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full
       [not found]         ` <20230110151901.402-1-hdanton@sina.com>
@ 2023-01-10 16:12           ` Frederic Weisbecker
       [not found]           ` <20230110235822.456-1-hdanton@sina.com>
  1 sibling, 0 replies; 24+ messages in thread
From: Frederic Weisbecker @ 2023-01-10 16:12 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Marcelo Tosatti, atomlin, linux-kernel, linux-mm

On Tue, Jan 10, 2023 at 11:19:01PM +0800, Hillf Danton wrote:
> On Tue, 10 Jan 2023 08:50:28 -0300 Marcelo Tosatti <mtosatti@redhat.com>
> > On Tue, Jan 10, 2023 at 10:43:56AM +0800, Hillf Danton wrote:
> > > On 9 Jan 2023 11:12:49 -0300 Marcelo Tosatti <mtosatti@redhat.com>
> > > > 
> > > > Yes, but if you do not return to userspace, then the per-CPU vm
> > > > statistics can be dirty indefinitely.
> > > 
> > > Could you specify the reasons for failing to return to userspace,
> > > given it is undesired interference for the shepherd to queue work
> > > on the isolated CPUs.
> > 
> > Any system call that takes longer than the threshold to sync vmstats.
> 
> Which ones?
> 
> If schedule() occurs during syscall because of acquiring mutex for instance
> then anything on the isolated runqueue, including workqueue worker shepherd
> wakes up, can burn CPU cycles without undesired intereference produced.

The above confuses me. How would other tasks help with syscalls that take
too long to service?

> > 
> > Or a long running kernel thread, for example:
> 
> It is a buggyyyy example.
> > 
> > https://stackoverflow.com/questions/65111483/long-running-kthread-and-synchronize-net

I can imagine a CPU spending most of its time processing networking packets
through interrupts/softirq within ksoftirqd/NAPI while another CPU processes
these packets in userspace.

In this case the CPU handling the kernel part can theoretically never go to
idle/user. nohz_full isn't optimized toward such a job but there is nothing
to prevent it from doing it.

Thanks.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies
  2023-01-10 13:39       ` Christoph Lameter
@ 2023-01-10 20:09         ` Marcelo Tosatti
  2023-01-11  8:42           ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Marcelo Tosatti @ 2023-01-10 20:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, atomlin, tglx, mingo, peterz, pauld, neelx,
	oleksandr, linux-kernel, linux-mm

On Tue, Jan 10, 2023 at 02:39:08PM +0100, Christoph Lameter wrote:
> On Tue, 10 Jan 2023, Frederic Weisbecker wrote:
> 
> > Note I'm absolutely clueless with vmstat. But I was wondering about it as well
> > while reviewing Marcelo's series, so git blame pointed me to:
> >
> > 7c83912062c801738d7d19acaf8f7fec25ea663c ("vmstat: User per cpu atomics to avoid
> > interrupt disable / enable")
> >
> > And this seem to mention that this can race with IRQs as well, hence the local
> > cmpxchg operation.
> 
> The race with irq could be an issue but I thought we avoided that and were
> content with disabling preemption.
> 
> But this issue illustrates the central problem of the patchset: It makes
> the lightweight counters not so lightweight anymore. 

https://lkml.iu.edu/hypermail/linux/kernel/0903.2/00569.html

With added

static void do_test_preempt(void)
{
        unsigned long flags;
        unsigned int i;
        cycles_t time1, time2, time;
        u32 rem;

        local_irq_save(flags);
        preempt_disable();
        time1 = get_cycles();
        for (i = 0; i < NR_LOOPS; i++) {
                preempt_disable();
                preempt_enable();
        }
        time2 = get_cycles();
        local_irq_restore(flags);
        preempt_enable();
        time = time2 - time1;

        printk(KERN_ALERT "test results: time for disabling/enabling preemption\n");
        printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
        printk(KERN_ALERT "total time: %llu\n", time);
        time = div_u64_rem(time, NR_LOOPS, &rem);
        printk(KERN_ALERT "-> enabling/disabling preemption takes %llu cycles\n",
               time);
        printk(KERN_ALERT "test end\n");
}


model name	: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz

[  423.676079] test init
[  423.676249] test results: time for baseline
[  423.676405] number of loops: 200000
[  423.676676] total time: 104274
[  423.676910] -> baseline takes 0 cycles
[  423.677051] test end
[  423.678150] test results: time for locked cmpxchg
[  423.678353] number of loops: 200000
[  423.678498] total time: 2473839
[  423.678630] -> locked cmpxchg takes 12 cycles
[  423.678810] test end
[  423.679204] test results: time for non locked cmpxchg
[  423.679394] number of loops: 200000
[  423.679527] total time: 740298
[  423.679644] -> non locked cmpxchg takes 3 cycles
[  423.679817] test end
[  423.680755] test results: time for locked add return
[  423.680951] number of loops: 200000
[  423.681089] total time: 2118185
[  423.681229] -> locked add return takes 10 cycles
[  423.681411] test end
[  423.681846] test results: time for enabling interrupts (STI)
[  423.682063] number of loops: 200000
[  423.682209] total time: 861591
[  423.682335] -> enabling interrupts (STI) takes 4 cycles
[  423.682532] test end
[  423.683606] test results: time for disabling interrupts (CLI)
[  423.683852] number of loops: 200000
[  423.684006] total time: 2440756
[  423.684141] -> disabling interrupts (CLI) takes 12 cycles
[  423.684588] test end
[  423.686626] test results: time for disabling/enabling interrupts (STI/CLI)
[  423.686879] number of loops: 200000
[  423.687015] total time: 4802297
[  423.687139] -> enabling/disabling interrupts (STI/CLI) takes 24 cycles
[  423.687389] test end
[  423.688025] test results: time for disabling/enabling preemption
[  423.688258] number of loops: 200000
[  423.688396] total time: 1341001
[  423.688526] -> enabling/disabling preemption takes 6 cycles
[  423.689276] test end

> The basic primitives add a  lot of weight. 

Can't see any alternative given the necessity to avoid interruption
by the work to sync per-CPU vmstats to global vmstats.

> And the per-cpu atomic update operations require the modification
> of multiple values. The operation cannot be "atomic" in that sense
> anymore and we need some other form of synchronization that can
> span multiple instructions.

    So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
    count on preemption being disabled we still have some minor issues.
    The fetching of the counter thresholds is racy.
    A threshold from another cpu may be applied if we happen to be
    rescheduled on another cpu.  However, the following vmstat operation
    will then bring the counter again under the threshold limit.

Those small issues are gone, OTOH.
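For reference, a minimal sketch of the lockless scheme the quoted changelog
describes (simplified; the pcp_diff layout and the global_counter array below
are illustrative stand-ins, not the exact mm/vmstat.c definitions):

struct pcp_diff {
	s8 stat_diff[NR_VM_ZONE_STAT_ITEMS];
	s8 stat_threshold;
};

static atomic_long_t global_counter[NR_VM_ZONE_STAT_ITEMS];

static void mod_state(struct pcp_diff __percpu *pcp, int item, long delta)
{
	s8 __percpu *p = pcp->stat_diff + item;
	long o, n, t, z;

	do {
		z = 0;	/* amount to overflow into the global counter */
		t = this_cpu_read(pcp->stat_threshold);
		o = this_cpu_read(*p);
		n = delta + o;
		if (abs(n) > t) {
			/* racy: t may be another CPU's threshold if we
			 * were migrated, as the changelog notes */
			z = n;
			n = 0;
		}
		/* retry if an interrupt or migration changed *p meanwhile */
	} while (this_cpu_cmpxchg(*p, o, n) != o);

	if (z)
		atomic_long_add(z, &global_counter[item]);
}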






^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 3/6] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full
       [not found]           ` <20230110235822.456-1-hdanton@sina.com>
@ 2023-01-11  0:09             ` Frederic Weisbecker
  0 siblings, 0 replies; 24+ messages in thread
From: Frederic Weisbecker @ 2023-01-11  0:09 UTC (permalink / raw)
  To: Hillf Danton; +Cc: Marcelo Tosatti, atomlin, linux-kernel, linux-mm

On Wed, Jan 11, 2023 at 07:58:22AM +0800, Hillf Danton wrote:
> On 10 Jan 2023 17:12:22 +0100 Frederic Weisbecker <frederic@kernel.org>
> > On Tue, Jan 10, 2023 at 11:19:01PM +0800, Hillf Danton wrote:
> > > On Tue, 10 Jan 2023 08:50:28 -0300 Marcelo Tosatti <mtosatti@redhat.com>
> > > > On Tue, Jan 10, 2023 at 10:43:56AM +0800, Hillf Danton wrote:
> > > > > On 9 Jan 2023 11:12:49 -0300 Marcelo Tosatti <mtosatti@redhat.com>
> > > > > > 
> > > > > > Yes, but if you do not return to userspace, then the per-CPU vm
> > > > > > statistics can be dirty indefinitely.
> > > > > 
> > > > > Could you specify the reasons for failing to return to userspace,
> > > > > given it is undesired intereference for the shepherd to queue work
> > > > > on the isolated CPUs.
> > > > 
> > > > Any system call that takes longer than the threshold to sync vmstats.
> > > 
> > > Which ones?
> > > 
> > > If schedule() occurs during syscall because of acquiring mutex for instance
> > > then anything on the isolated runqueue, including workqueue worker shepherd
> > > wakes up, can burn CPU cycles without undesired interference produced.
> > 
> > The above confuses me. How would other tasks help with syscalls that take too long to
> > service?
> 
> Given no scheduling in userspace, no chance for other tasks to interfere
> after returning to userspace, on one hand.
> 
> Upon scheduling during syscall on the other hand, it is the right time
> to sync vmstats for example. But no vmstats can be synced without work
> queued by the shepherd.
> 
> In a nutshell, no interference could happen without scheduling, and how
> work is queued does not matter. So the current shepherd behavior is preferred.

I'm still confused... We want to avoid the shepherd because it may queue the
vmstat work while the task wants to run noise-free in userspace.

> > 
> > > > 
> > > > Or a long running kernel thread, for example:
> > > 
> > > It is a buggyyyy example.
> > > > 
> > > > https://stackoverflow.com/questions/65111483/long-running-kthread-and-synchronize-net
> > 
> > I can imagine a CPU spending most of its time processing networking packets
> > through interrupts/softirq within ksoftirqd/NAPI while another CPU processes
> > these packets in userspace.
> > 
> > In this case the CPU handling the kernel part can theoretically never go to
> > idle/user. nohz_full isn't optimized for such a job but there is nothing
> > to prevent it from doing it.
> 
> A simple FIFO task launched by an administrator can get a CPU out of the scheduler's
> control for a week, regardless of isolation.

Sure. But, what do you mean by that exactly?

Thanks.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies
  2023-01-10 20:09         ` Marcelo Tosatti
@ 2023-01-11  8:42           ` Christoph Lameter
  2023-01-11 17:07             ` Marcelo Tosatti
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2023-01-11  8:42 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Frederic Weisbecker, atomlin, tglx, mingo, peterz, pauld, neelx,
	oleksandr, linux-kernel, linux-mm

On Tue, 10 Jan 2023, Marcelo Tosatti wrote:

> > The basic primitives add a  lot of weight.
>
> Can't see any alternative given the necessity to avoid interruption
> by the work to sync per-CPU vmstats to global vmstats.

this_cpu operations are designed to operate on a *single* value (a counter) and can
be run on an arbitrary cpu. There is no preemption or interrupt
disabling required since the counters of all cpus will be added up at the
end.

You want *two* values (the counter and the dirty flag) to be modified
together and want to use the counters/flag to identify the cpu where
these events occurred. this_cpu_xxx operations are not suitable for that
purpose. You would need a way to ensure that both operations occur on the
same cpu.
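To make the problem concrete, a sketch (not kernel code; vmstat_dirty is a
hypothetical per-CPU flag): with bare this_cpu ops, nothing ties the two
writes to one CPU:

static DEFINE_PER_CPU(bool, vmstat_dirty);

static void mod_counter_broken(s8 __percpu *counter, long delta)
{
	this_cpu_add(*counter, delta);		/* may execute on CPU 0 ... */
	/* preemption + migration can happen right here */
	this_cpu_write(vmstat_dirty, true);	/* ... and this on CPU 1 */
}

CPU 0 would then hold a non-zero delta with its dirty flag still clear, so a
sync pass that only checks the flag would skip it.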

> > > And the per-cpu atomic update operations require the modification
> > of multiple values. The operation
> > cannot be "atomic" in that sense anymore and we need some other form of
> > synchronization that can
> > span multiple instructions.
>
>     So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
>     count on preemption being disabled we still have some minor issues.
>     The fetching of the counter thresholds is racy.
>     A threshold from another cpu may be applied if we happen to be
>     rescheduled on another cpu.  However, the following vmstat operation
>     will then bring the counter again under the threshold limit.
>
> Those small issues are gone, OTOH.

Well you could use this_cpu_cmpxchg128 to update a 64 bit counter and a
flag at the same time. Otherwise you will have to switch off preemption or
interrupts when incrementing the counters and updating the dirty flag.

Thus you do not really need the this_cpu operations anymore. It would be
best to use a preempt_disable section and use C operators -- ++ for the
counter and a regular assignment for the flag.
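A minimal sketch of that suggestion (the per_cpu_diff layout and the dirty
field are illustrative names, assuming a structure that groups the per-CPU
deltas and the flag):

struct per_cpu_diff {
	s8 diff[NR_VM_ZONE_STAT_ITEMS];
	bool dirty;
};

static void mod_counter(struct per_cpu_diff __percpu *pcp_base,
			int item, long delta)
{
	struct per_cpu_diff *pcp;

	preempt_disable();
	pcp = this_cpu_ptr(pcp_base);	/* CPU stays stable until preempt_enable() */
	pcp->diff[item] += delta;	/* plain C operator on the local delta */
	pcp->dirty = true;		/* regular assignment for the flag */
	preempt_enable();
}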


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies
  2023-01-11  8:42           ` Christoph Lameter
@ 2023-01-11 17:07             ` Marcelo Tosatti
  2023-01-16  9:51               ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Marcelo Tosatti @ 2023-01-11 17:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, atomlin, tglx, mingo, peterz, pauld, neelx,
	oleksandr, linux-kernel, linux-mm

On Wed, Jan 11, 2023 at 09:42:18AM +0100, Christoph Lameter wrote:
> On Tue, 10 Jan 2023, Marcelo Tosatti wrote:
> 
> > > The basic primitives add a  lot of weight.
> >
> > Can't see any alternative given the necessity to avoid interruption
> > by the work to sync per-CPU vmstats to global vmstats.
> 
> this_cpu operations are designed to operate on a *single* value (a counter) and can
> be run on an arbitrary cpu. There is no preemption or interrupt
> disabling required since the counters of all cpus will be added up at the
> end.
> 
> You want *two* values (the counter and the dirty flag) to be modified
> together and want to use the counters/flag to identify the cpu where
> these events occurred. this_cpu_xxx operations are not suitable for that
> purpose. You would need a way to ensure that both operations occur on the
> same cpu.

Which is either preempt_disable (CONFIG_HAVE_CMPXCHG_LOCAL case), or 
local_irq_disable (!CONFIG_HAVE_CMPXCHG_LOCAL case).
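i.e., roughly the following shape (a sketch of the guard choice only; p,
vmstat_dirty and flags are assumed to be declared elsewhere):

#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
	/* local cmpxchg path: disabling preemption is enough to keep
	 * the delta update and the dirty-bit write on the same CPU */
	preempt_disable();
	__this_cpu_add(*p, delta);
	__this_cpu_write(vmstat_dirty, true);
	preempt_enable();
#else
	/* fallback path: the counter update itself is not IRQ-safe,
	 * so interrupts are disabled instead */
	local_irq_save(flags);
	__this_cpu_add(*p, delta);
	__this_cpu_write(vmstat_dirty, true);
	local_irq_restore(flags);
#endif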

> > > > And the per-cpu atomic update operations require the modification
> > > of multiple values. The operation
> > > cannot be "atomic" in that sense anymore and we need some other form of
> > > synchronization that can
> > > span multiple instructions.
> >
> >     So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
> >     count on preemption being disabled we still have some minor issues.
> >     The fetching of the counter thresholds is racy.
> >     A threshold from another cpu may be applied if we happen to be
> >     rescheduled on another cpu.  However, the following vmstat operation
> >     will then bring the counter again under the threshold limit.
> >
> > Those small issues are gone, OTOH.
> 
> Well you could use this_cpu_cmpxchg128 to update a 64 bit counter and a
> flag at the same time. 

But then you transform the "per-CPU vmstat is dirty" bit (bool) into a
number of flags that must be scanned when returning to userspace,
which increases the overhead of that fast path.
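In sketch form (the function and variable names here are hypothetical, not
the patchset's exact ones), the single bit keeps the exit-to-user check down
to one per-CPU read, while per-counter flags would force a scan:

/* single dirty bit: one per-CPU read on the way out to userspace */
static void sync_vmstat_on_user_return(void)
{
	/* vmstat_dirty assumed to be a per-CPU bool */
	if (this_cpu_read(vmstat_dirty))
		quiet_vmstat();		/* fold deltas, then clear the bit */
}

/* hypothetical per-counter flags: the fast path must scan all of them */
static bool any_counter_dirty(void)
{
	int i;

	/* vmstat_item_dirty[] assumed to be a per-CPU array of flags */
	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
		if (this_cpu_read(vmstat_item_dirty[i]))
			return true;
	return false;
}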

> Otherwise you will have to switch off preemption or
> interrupts when incrementing the counters and updating the dirty flag.
> 
> Thus you do not really need the this_cpu operations anymore. It would be
> best to use a preempt_disable section and use C operators -- ++ for the
> counter and a regular assignment for the flag.

OK, can replace the this_cpu operations with this_cpu_ptr + standard C operators
(and in fact can do that for the interrupt-disabled functions as well, that is,
when CONFIG_HAVE_CMPXCHG_LOCAL is not defined).

Is that it?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies
  2023-01-11 17:07             ` Marcelo Tosatti
@ 2023-01-16  9:51               ` Christoph Lameter
  2023-01-16 16:11                 ` Marcelo Tosatti
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2023-01-16  9:51 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Frederic Weisbecker, atomlin, tglx, mingo, peterz, pauld, neelx,
	oleksandr, linux-kernel, linux-mm

On Wed, 11 Jan 2023, Marcelo Tosatti wrote:

> OK, can replace the this_cpu operations with this_cpu_ptr + standard C operators
> (and in fact can do that for the interrupt-disabled functions as well, that is,
> when CONFIG_HAVE_CMPXCHG_LOCAL is not defined).
>
> Is that it?

No, that was hypothetical.

I do not know how to get out of this dilemma. We surely want to keep fast
vmstat operations working.

The fundamental issue that causes the vmstat discrepancies is likely that
the fast this_cpu ops can increment the counter on any random cpu and that
this is the reason you get vmstat discrepancies.

Give up the assumption that an increment of a this_cpu counter on a
specific cpu means that something occurred on that specific cpu. Maybe
that will get you on a path to resolve the issues you are seeing.




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies
  2023-01-16  9:51               ` Christoph Lameter
@ 2023-01-16 16:11                 ` Marcelo Tosatti
  2023-01-17 12:52                   ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Marcelo Tosatti @ 2023-01-16 16:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, atomlin, tglx, mingo, peterz, pauld, neelx,
	oleksandr, linux-kernel, linux-mm

On Mon, Jan 16, 2023 at 10:51:40AM +0100, Christoph Lameter wrote:
> On Wed, 11 Jan 2023, Marcelo Tosatti wrote:
> 
> > OK, can replace the this_cpu operations with this_cpu_ptr + standard C operators
> > (and in fact can do that for the interrupt-disabled functions as well, that is,
> > when CONFIG_HAVE_CMPXCHG_LOCAL is not defined).
> >
> > Is that it?
> 
> No, that was hypothetical.
> 
> I do not know how to get out of this dilemma. We surely want to keep fast
> vmstat operations working.

Honestly, to me, there is no dilemma:

* There is a requirement from applications to be uninterrupted
by operating system activities. Examples include radio access
network software, software defined PLCs for industrial automation (1).

* There exist vm-statistics counters (which count
the number of pages in different states, for example the number of
free pages, locked pages, pages under writeback, pagetable pages,
file pages, etc).
To reduce the number of accesses to the global counters, each CPU maintains
its own delta relative to the global VM counters
(which can be cached in the local processor cache, and is therefore fast).

The per-CPU deltas are synchronized to global counters:
	1) If the per-CPU counters exceed a given threshold.
	2) Periodically, with a low frequency compared to
	   CPU events (every few seconds).
	3) Upon an event that requires accurate counters.

* The periodic synchronization interrupts a given CPU if it
holds a counter delta relative to the global counters.

To avoid this interruption, due to [1], the proposed patchset
synchronizes any pending per-CPU deltas to global counters,
for nohz_full= CPUs, when returning to userspace
(which is a very fast path).

Since return to userspace is a very fast path, discovering whether
any per-CPU counter delta exists by reading all of them is undesired.

Therefore a single bit is introduced to compact the following
information: does this CPU contain any delta relative to the
global counters that should be written back?

This bit is set when a per-CPU delta is increased.
This bit is cleared when the per-CPU deltas are written back to 
the global counters.

Since for the following two operations:

	modify per-CPU delta (for current CPU) of counter X by Y
	set bit (for current CPU) indicating the per-CPU delta exists

"current CPU" can change, it is necessary to disable CPU preemption
when executing the pair of operations.

vmstat operations still achieve their main goal, which is to
keep accesses local to the CPU when incrementing the counters
(for most counter modifications).
The preempt_disable/enable pair itself only touches the preempt count,
which is also a per-CPU variable.

Now you are objecting to this patchset because:

It increases the number of cycles to execute the functions that modify
the counters by about 6 cycles. Can you mention any benchmark where this
increase is significant?

By searching for mod_zone_page_state/mod_node_page_state one can see
the following: the codepaths that call them touch multiple pages
and other data structures, so the preempt_disable/preempt_enable
pair should be a very small contribution (in terms of percentage) to any
meaningful benchmark.


> The fundamental issue that causes the vmstat discrepancies is likely that
> the fast this_cpu ops can increment the counter on any random cpu and that
> this is the reason you get vmstat discrepancies.

Yes. 

> Give up the assumption that an increment of a this_cpu counter on a
> specific cpu means that something occurred on that specific cpu. Maybe
> that will get you on a path to resolve the issues you are seeing.

But it can't. To be able to condense the information "does a delta exist
on this CPU" from a number of cacheline reads to a single cacheline
read, one bit can be used. And the write to that bit and to the counters
is not atomic.

Alternatives:
	1) Disable periodic synchronization for nohz_full CPUs.
	2) Processor instructions which can modify more than
	   one address in memory.
	3) Synchronize the per-CPU stats remotely (which
	   increases per-CPU and per-node accesses).



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v13 2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies
  2023-01-16 16:11                 ` Marcelo Tosatti
@ 2023-01-17 12:52                   ` Christoph Lameter
  0 siblings, 0 replies; 24+ messages in thread
From: Christoph Lameter @ 2023-01-17 12:52 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Frederic Weisbecker, atomlin, tglx, mingo, peterz, pauld, neelx,
	oleksandr, linux-kernel, linux-mm

On Mon, 16 Jan 2023, Marcelo Tosatti wrote:

> Honestly, to me, there is no dilemma:
>
> * There is a requirement from applications to be uninterrupted
> by operating system activities. Examples include radio access
> network software, software defined PLCs for industrial automation (1).
>
> * There exist vm-statistics counters (which count
> the number of pages in different states, for example the number of
> free pages, locked pages, pages under writeback, pagetable pages,
> file pages, etc).
> To reduce the number of accesses to the global counters, each CPU maintains
> its own delta relative to the global VM counters
> (which can be cached in the local processor cache, and is therefore fast).

The counters only count accurately as a global sum. A counter may be
specific to a zone, in which case it counts uses of that zone from
all processors.

> Now you are objecting to this patchset because:
>
> It increases the number of cycles to execute the functions that modify
> the counters by about 6 cycles. Can you mention any benchmark where this
> increase is significant?

I am objecting because of a fundamental misunderstanding of how these
counters work and because the patchset is incorrect in the way it handles
these counters. Come up with a correct approach and then we can consider
regressions and/or improvements in performance.

> Alternatives:
> 	1) Disable periodic synchronization for nohz_full CPUs.
> 	2) Processor instructions which can modify more than
> 	   one address in memory.
> 	3) Synchronize the per-CPU stats remotely (which
> 	   increases per-CPU and per-node accesses).

Or remove the assumptions that may exist in current code that a delta on a
specific cpu counter means that something occurred on that cpu?

If there is a delta then that *does not* mean that there is something to
do on that processor. The delta could be folded by another processor into
the global counter if that processor is idle or not entering the Kernel
and stays that way throughout the operation.

So I guess that would be #3. The function cpu_vm_stats_fold() already does
this for offline cpus. Can something similar be made to work for idle cpus
or those continually running in user space?
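For illustration, a rough sketch of what folding a remote CPU's deltas could
look like (modeled loosely on cpu_vm_stats_fold(); the vmstat_diff structure
is an assumed stand-in, while vm_zone_stat is the real global array in
mm/vmstat.c):

struct vmstat_diff {
	int diff[NR_VM_ZONE_STAT_ITEMS];
};
static DEFINE_PER_CPU(struct vmstat_diff, vmstat_diff);

/*
 * Called from a housekeeping CPU: move another CPU's pending deltas
 * into the global counters.  xchg() reads and clears each delta in a
 * single step on this side; how to keep the remote CPU's fast local
 * updates safe against this is exactly the open question here.
 */
static void fold_remote_cpu(int cpu)
{
	struct vmstat_diff *pcp = per_cpu_ptr(&vmstat_diff, cpu);
	int item;

	for (item = 0; item < NR_VM_ZONE_STAT_ITEMS; item++) {
		int delta = xchg(&pcp->diff[item], 0);

		if (delta)
			atomic_long_add(delta, &vm_zone_stat[item]);
	}
}

Whether this remote access is safe and cheap enough against the local fast
path is, of course, the trade-off being debated in this subthread.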



^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-01-17 12:54 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-05 12:52 [PATCH v13 0/6] Ensure quiet_vmstat() is called when returning to userpace and when idle tick is stopped Marcelo Tosatti
2023-01-05 12:52 ` [PATCH v13 1/6] mm/vmstat: Add CPU-specific variable to track a vmstat discrepancy Marcelo Tosatti
2023-01-10 11:58   ` Christoph Lameter
2023-01-10 12:12     ` Frederic Weisbecker
2023-01-05 12:52 ` [PATCH v13 2/6] mm/vmstat: Use vmstat_dirty to track CPU-specific vmstat discrepancies Marcelo Tosatti
2023-01-10 12:06   ` Christoph Lameter
2023-01-10 12:18     ` Frederic Weisbecker
2023-01-10 13:39       ` Christoph Lameter
2023-01-10 20:09         ` Marcelo Tosatti
2023-01-11  8:42           ` Christoph Lameter
2023-01-11 17:07             ` Marcelo Tosatti
2023-01-16  9:51               ` Christoph Lameter
2023-01-16 16:11                 ` Marcelo Tosatti
2023-01-17 12:52                   ` Christoph Lameter
2023-01-05 12:52 ` [PATCH v13 3/6] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full Marcelo Tosatti
2023-01-05 12:52 ` [PATCH v13 4/6] tick/nohz_full: Ensure quiet_vmstat() is called on exit to user-mode when the idle tick is stopped Marcelo Tosatti
2023-01-05 12:52 ` [PATCH v13 5/6] tick/sched: Ensure quiet_vmstat() is called when the idle tick was stopped too Marcelo Tosatti
2023-01-05 12:52 ` [PATCH v13 6/6] mm/vmstat: avoid queueing work item if cpu stats are clean Marcelo Tosatti
     [not found] ` <20230106001244.4463-1-hdanton@sina.com>
2023-01-06 12:51   ` [PATCH v13 3/6] mm/vmstat: manage per-CPU stats from CPU context when NOHZ full Marcelo Tosatti
     [not found]   ` <20230106150154.4560-1-hdanton@sina.com>
2023-01-06 18:16     ` Marcelo Tosatti
     [not found]     ` <20230107001529.4617-1-hdanton@sina.com>
2023-01-09 14:12       ` Marcelo Tosatti
     [not found]       ` <20230110024356.336-1-hdanton@sina.com>
2023-01-10 11:50         ` Marcelo Tosatti
     [not found]         ` <20230110151901.402-1-hdanton@sina.com>
2023-01-10 16:12           ` Frederic Weisbecker
     [not found]           ` <20230110235822.456-1-hdanton@sina.com>
2023-01-11  0:09             ` Frederic Weisbecker
