linux-kernel.vger.kernel.org archive mirror
* [patch 0/5] optionally perform deferred actions on return to userspace (v3)
@ 2021-07-14 20:42 Marcelo Tosatti
  2021-07-14 20:42 ` [patch 1/5] sched: isolation: introduce quiesce_on_exit_to_usermode isolcpu flags Marcelo Tosatti
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Marcelo Tosatti @ 2021-07-14 20:42 UTC
  To: linux-kernel
  Cc: Christoph Lameter, Thomas Gleixner, Frederic Weisbecker,
	Juri Lelli, Nitesh Lal, Peter Zijlstra, Nicolas Saenz

Changelog:


-v3: use optimized percpu accessors for hotpath in
     vmstat.c (Christoph Lameter)
     fix !CONFIG_NUMA compilation breakage (kernel robot)

-v2: fix !CONFIG_SMP breakage (kernel robot)
     switch option to generic "quiesce_on_exit_to_usermode"


Summary of what was discussed on -v1:

1) The additional hooks in performance-sensitive callbacks
in mm/vmstat.c are protected by a static key, so workloads
that do not enable this option should not be impacted.

2) People would prefer the prctl() interface, but as noted
in the option documentation (patch 1), the code added by
this patchset should be reused by the prctl() interface,
and the isolcpus option can then be deprecated.

3) Nobody has any other bright ideas for ways to solve this
that would make this patch series obsolete.

4) The isolcpus= interface should switch to a cpuset based
interface.

---


The logic that disables the vmstat worker thread when entering
nohz_full does not cover all scenarios. For example, the
following sequence is possible:

1) The CPU enters nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) The app runs mlock, which increases the counters for mlock'ed pages.
3) The app starts its -RT loop.

Since the refresh_cpu_vm_stats call from the nohz_full logic can happen
_before_ the mlock, the vmstat shepherd can restart the vmstat worker
thread on the CPU in question.

To fix this, optionally quiesce deferred actions when returning
to userspace, controllable by a new "quiesce_on_exit_to_usermode"
isolcpus flag (default off).

See individual patches for details.
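
For illustration only, the application-side sequence from steps 2) and 3)
above might look like the sketch below. It is not part of the patchset:
lock_all_memory() and busy_loop() are hypothetical names, and the loop body
is a placeholder for the real latency-sensitive work.

```c
#include <errno.h>
#include <sys/mman.h>

/*
 * Step 2: lock current and future mappings. This is the syscall that
 * bumps the per-CPU counters for mlock'ed pages.
 * Returns 0 on success or a negative errno value on failure.
 */
int lock_all_memory(void)
{
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
		return -errno;
	return 0;
}

/*
 * Step 3: placeholder for the -RT loop. It performs no syscalls, so
 * any per-CPU counters left dirty by step 2 stay dirty while it runs.
 */
unsigned long busy_loop(unsigned long iterations)
{
	volatile unsigned long acc = 0;

	for (unsigned long i = 0; i < iterations; i++)
		acc += i;
	return acc;
}
```

With the new flag set, the return from the mlockall() syscall is the point
where the per-CPU counters are meant to be folded, so the loop would start
with clean vmstat state.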






* [patch 1/5] sched: isolation: introduce quiesce_on_exit_to_usermode isolcpu flags
  2021-07-14 20:42 [patch 0/5] optionally perform deferred actions on return to userspace (v3) Marcelo Tosatti
@ 2021-07-14 20:42 ` Marcelo Tosatti
  2021-07-19 14:14   ` Frederic Weisbecker
  2021-07-14 20:42 ` [patch 2/5] common entry: add hook for isolation to __syscall_exit_to_user_mode_work Marcelo Tosatti
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 8+ messages in thread
From: Marcelo Tosatti @ 2021-07-14 20:42 UTC
  To: linux-kernel
  Cc: Christoph Lameter, Thomas Gleixner, Frederic Weisbecker,
	Juri Lelli, Nitesh Lal, Peter Zijlstra, Nicolas Saenz,
	Marcelo Tosatti

Add a new isolcpus flag "quiesce_on_exit_to_usermode" to enable
quiescing of deferred actions on return to userspace.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6-vmstat-update/include/linux/sched/isolation.h
===================================================================
--- linux-2.6-vmstat-update.orig/include/linux/sched/isolation.h
+++ linux-2.6-vmstat-update/include/linux/sched/isolation.h
@@ -15,6 +15,7 @@ enum hk_flags {
 	HK_FLAG_WQ		= (1 << 6),
 	HK_FLAG_MANAGED_IRQ	= (1 << 7),
 	HK_FLAG_KTHREAD		= (1 << 8),
+	HK_FLAG_QUIESCE_URET	= (1 << 9),
 };
 
 #ifdef CONFIG_CPU_ISOLATION
Index: linux-2.6-vmstat-update/kernel/sched/isolation.c
===================================================================
--- linux-2.6-vmstat-update.orig/kernel/sched/isolation.c
+++ linux-2.6-vmstat-update/kernel/sched/isolation.c
@@ -173,6 +173,12 @@ static int __init housekeeping_isolcpus_
 			continue;
 		}
 
+		if (!strncmp(str, "quiesce_on_exit_to_usermode,", 28)) {
+			str += 28;
+			flags |= HK_FLAG_QUIESCE_URET;
+			continue;
+		}
+
 		/*
 		 * Skip unknown sub-parameter and validate that it is not
 		 * containing an invalid character.
Index: linux-2.6-vmstat-update/Documentation/admin-guide/kernel-parameters.txt
===================================================================
--- linux-2.6-vmstat-update.orig/Documentation/admin-guide/kernel-parameters.txt
+++ linux-2.6-vmstat-update/Documentation/admin-guide/kernel-parameters.txt
@@ -2124,6 +2124,43 @@
 
 			The format of <cpu-list> is described above.
 
+                         quiesce_on_exit_to_usermode
+
+			  This flag allows userspace to take preventive measures to
+			  avoid deferred actions and create an OS-noise-free environment for
+			  the application, by quiescing such activities on
+			  return from syscalls (that is, perform the necessary
+			  background work on return to userspace, rather than allowing
+			  it to happen when userspace is executing, in the form of
+			  an interruption to the application).
+
+			  There might be a performance degradation from using this,
+			  on system-call-heavy workloads, for the isolated CPUs.
+			  This option is intended to be used by specialized workloads.
+
+			  It should be deprecated in favour of a prctl() interface
+			  to enable this mode (which allows the quiescing to take
+			  place only on select sections of userspace execution, namely
+			  the latency sensitive loops).
+
+			  Note: one of the preventive measures this option
+			  enables is the following.
+
+			  Page counters are maintained in per-CPU counters to
+			  improve performance. When a CPU modifies a page counter,
+			  this modification is kept in the per-CPU counter.
+			  Certain activities require a global count, which
+			  involves requesting each CPU to flush its local counters
+			  to the global VM counters.
+			  This flush is implemented via a workqueue item, which
+			  requires scheduling the workqueue task on isolated CPUs.
+
+			  To avoid this interruption, quiesce_on_exit_to_usermode
+			  syncs the page counters on each return from system calls.
+			  To ensure the application returns to userspace
+			  with no modified per-CPU counters, it is necessary to
+			  use mlockall() in addition to this isolcpus flag.
+
 	iucv=		[HW,NET]
 
 	ivrs_ioapic	[HW,X86-64]
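
The prefix match added to the isolcpus= parser can be tried out in userspace
with the sketch below. It is a simplified model, not the kernel function:
parse_isolcpus_flags() is a hypothetical name, only the new flag is handled,
and sizeof() stands in for the hand-counted length of 28.

```c
#include <string.h>

#define HK_FLAG_QUIESCE_URET	(1u << 9)	/* value taken from the patch */

/*
 * Consume leading "flag," prefixes from an isolcpus= value and return
 * the accumulated flags. *rest is set to the unparsed remainder, which
 * is expected to be the cpu list.
 */
unsigned int parse_isolcpus_flags(const char *str, const char **rest)
{
	static const char quiesce[] = "quiesce_on_exit_to_usermode,";
	unsigned int flags = 0;

	/* sizeof() - 1 is the string length without the trailing NUL */
	while (!strncmp(str, quiesce, sizeof(quiesce) - 1)) {
		str += sizeof(quiesce) - 1;
		flags |= HK_FLAG_QUIESCE_URET;
	}

	*rest = str;
	return flags;
}
```

Passing "quiesce_on_exit_to_usermode,2-3" sets the flag and leaves "2-3" as
the cpu list, which corresponds to booting with
isolcpus=quiesce_on_exit_to_usermode,2-3.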




* [patch 2/5] common entry: add hook for isolation to __syscall_exit_to_user_mode_work
  2021-07-14 20:42 [patch 0/5] optionally perform deferred actions on return to userspace (v3) Marcelo Tosatti
  2021-07-14 20:42 ` [patch 1/5] sched: isolation: introduce quiesce_on_exit_to_usermode isolcpu flags Marcelo Tosatti
@ 2021-07-14 20:42 ` Marcelo Tosatti
  2021-07-14 20:42 ` [patch 3/5] mm: vmstat: optionally flush per-CPU vmstat counters on return to userspace Marcelo Tosatti
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Marcelo Tosatti @ 2021-07-14 20:42 UTC
  To: linux-kernel
  Cc: Christoph Lameter, Thomas Gleixner, Frederic Weisbecker,
	Juri Lelli, Nitesh Lal, Peter Zijlstra, Nicolas Saenz,
	Marcelo Tosatti

This hook will be used by the next patch to perform synchronization
of per-CPU vmstats.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6-vmstat-update/kernel/entry/common.c
===================================================================
--- linux-2.6-vmstat-update.orig/kernel/entry/common.c
+++ linux-2.6-vmstat-update/kernel/entry/common.c
@@ -284,9 +284,18 @@ static void syscall_exit_to_user_mode_pr
 		syscall_exit_work(regs, work);
 }
 
+/*
+ * Isolation-specific exit to user mode preparation. Runs with interrupts
+ * enabled.
+ */
+static void isolation_exit_to_user_mode_prepare(void)
+{
+}
+
 static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs)
 {
 	syscall_exit_to_user_mode_prepare(regs);
+	isolation_exit_to_user_mode_prepare();
 	local_irq_disable_exit_to_user();
 	exit_to_user_mode_prepare(regs);
 }




* [patch 3/5] mm: vmstat: optionally flush per-CPU vmstat counters on return to userspace
  2021-07-14 20:42 [patch 0/5] optionally perform deferred actions on return to userspace (v3) Marcelo Tosatti
  2021-07-14 20:42 ` [patch 1/5] sched: isolation: introduce quiesce_on_exit_to_usermode isolcpu flags Marcelo Tosatti
  2021-07-14 20:42 ` [patch 2/5] common entry: add hook for isolation to __syscall_exit_to_user_mode_work Marcelo Tosatti
@ 2021-07-14 20:42 ` Marcelo Tosatti
  2021-07-14 20:42 ` [patch 4/5] mm: vmstat: move need_update Marcelo Tosatti
  2021-07-14 20:42 ` [patch 5/5] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
  4 siblings, 0 replies; 8+ messages in thread
From: Marcelo Tosatti @ 2021-07-14 20:42 UTC
  To: linux-kernel
  Cc: Christoph Lameter, Thomas Gleixner, Frederic Weisbecker,
	Juri Lelli, Nitesh Lal, Peter Zijlstra, Nicolas Saenz,
	Marcelo Tosatti

The logic that disables the vmstat worker thread when entering
nohz_full does not cover all scenarios. For example, the
following sequence is possible:

1) The CPU enters nohz_full, which calls refresh_cpu_vm_stats, syncing the stats.
2) The app runs mlock, which increases the counters for mlock'ed pages.
3) The app starts its -RT loop.

Since the refresh_cpu_vm_stats call from the nohz_full logic can happen
_before_ the mlock, the vmstat shepherd can restart the vmstat worker
thread on the CPU in question.

To fix this, optionally sync the vmstat counters when returning
to userspace, controllable by a new "quiesce_on_exit_to_usermode"
isolcpus flag (default off).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6-vmstat-update/kernel/sched/isolation.c
===================================================================
--- linux-2.6-vmstat-update.orig/kernel/sched/isolation.c
+++ linux-2.6-vmstat-update/kernel/sched/isolation.c
@@ -8,6 +8,7 @@
  *
  */
 #include "sched.h"
+#include <linux/vmstat.h>
 
 DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
 EXPORT_SYMBOL_GPL(housekeeping_overridden);
@@ -129,6 +130,11 @@ static int __init housekeeping_setup(cha
 		}
 	}
 
+#ifdef CONFIG_SMP
+	if (flags & HK_FLAG_QUIESCE_URET)
+		static_branch_enable(&vmstat_sync_enabled);
+#endif
+
 	housekeeping_flags |= flags;
 
 	free_bootmem_cpumask_var(non_housekeeping_mask);
Index: linux-2.6-vmstat-update/include/linux/vmstat.h
===================================================================
--- linux-2.6-vmstat-update.orig/include/linux/vmstat.h
+++ linux-2.6-vmstat-update/include/linux/vmstat.h
@@ -21,6 +21,23 @@ int sysctl_vm_numa_stat_handler(struct c
 		void *buffer, size_t *length, loff_t *ppos);
 #endif
 
+#ifdef CONFIG_SMP
+DECLARE_STATIC_KEY_FALSE(vmstat_sync_enabled);
+
+extern void __sync_vmstat(void);
+static inline void sync_vmstat(void)
+{
+	if (static_branch_unlikely(&vmstat_sync_enabled))
+		__sync_vmstat();
+}
+#else
+
+static inline void sync_vmstat(void)
+{
+}
+
+#endif
+
 struct reclaim_stat {
 	unsigned nr_dirty;
 	unsigned nr_unqueued_dirty;
Index: linux-2.6-vmstat-update/mm/vmstat.c
===================================================================
--- linux-2.6-vmstat-update.orig/mm/vmstat.c
+++ linux-2.6-vmstat-update/mm/vmstat.c
@@ -28,6 +28,7 @@
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
+#include <linux/sched/isolation.h>
 
 #include "internal.h"
 
@@ -308,6 +309,17 @@ void set_pgdat_percpu_threshold(pg_data_
 	}
 }
 
+DEFINE_STATIC_KEY_FALSE(vmstat_sync_enabled);
+static DEFINE_PER_CPU_ALIGNED(bool, vmstat_dirty);
+
+static inline void mark_vmstat_dirty(void)
+{
+	if (!static_branch_unlikely(&vmstat_sync_enabled))
+		return;
+
+	raw_cpu_write(vmstat_dirty, true);
+}
+
 /*
  * For use when we know that interrupts are disabled,
  * or when we know that preemption is disabled and that
@@ -330,6 +342,7 @@ void __mod_zone_page_state(struct zone *
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	mark_vmstat_dirty();
 }
 EXPORT_SYMBOL(__mod_zone_page_state);
 
@@ -361,6 +374,7 @@ void __mod_node_page_state(struct pglist
 		x = 0;
 	}
 	__this_cpu_write(*p, x);
+	mark_vmstat_dirty();
 }
 EXPORT_SYMBOL(__mod_node_page_state);
 
@@ -401,6 +415,7 @@ void __inc_zone_state(struct zone *zone,
 		zone_page_state_add(v + overstep, zone, item);
 		__this_cpu_write(*p, -overstep);
 	}
+	mark_vmstat_dirty();
 }
 
 void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
@@ -419,6 +434,7 @@ void __inc_node_state(struct pglist_data
 		node_page_state_add(v + overstep, pgdat, item);
 		__this_cpu_write(*p, -overstep);
 	}
+	mark_vmstat_dirty();
 }
 
 void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
@@ -447,6 +463,7 @@ void __dec_zone_state(struct zone *zone,
 		zone_page_state_add(v - overstep, zone, item);
 		__this_cpu_write(*p, overstep);
 	}
+	mark_vmstat_dirty();
 }
 
 void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
@@ -465,6 +482,7 @@ void __dec_node_state(struct pglist_data
 		node_page_state_add(v - overstep, pgdat, item);
 		__this_cpu_write(*p, overstep);
 	}
+	mark_vmstat_dirty();
 }
 
 void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
@@ -528,6 +546,7 @@ static inline void mod_zone_state(struct
 
 	if (z)
 		zone_page_state_add(z, zone, item);
+	mark_vmstat_dirty();
 }
 
 void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
@@ -596,6 +615,7 @@ static inline void mod_node_state(struct
 
 	if (z)
 		node_page_state_add(z, pgdat, item);
+	mark_vmstat_dirty();
 }
 
 void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
@@ -2006,6 +2026,37 @@ static void vmstat_shepherd(struct work_
 		round_jiffies_relative(sysctl_stat_interval));
 }
 
+void __sync_vmstat(void)
+{
+	int cpu;
+
+	cpu = get_cpu();
+	if (housekeeping_cpu(cpu, HK_FLAG_QUIESCE_URET)) {
+		put_cpu();
+		return;
+	}
+
+	if (!raw_cpu_read(vmstat_dirty)) {
+		put_cpu();
+		return;
+	}
+
+	refresh_cpu_vm_stats(false);
+	raw_cpu_write(vmstat_dirty, false);
+	put_cpu();
+
+	/*
+	 * If task is migrated to another CPU between put_cpu
+	 * and cancel_delayed_work_sync, the code below might
+	 * cancel vmstat_update work for a different cpu
+	 * (than the one from which the vmstats were flushed).
+	 *
+	 * However, vmstat shepherd will re-enable it later,
+	 * so it's harmless.
+	 */
+	cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
+}
+
 static void __init start_shepherd_timer(void)
 {
 	int cpu;
Index: linux-2.6-vmstat-update/kernel/entry/common.c
===================================================================
--- linux-2.6-vmstat-update.orig/kernel/entry/common.c
+++ linux-2.6-vmstat-update/kernel/entry/common.c
@@ -6,6 +6,7 @@
 #include <linux/livepatch.h>
 #include <linux/audit.h>
 #include <linux/tick.h>
+#include <linux/vmstat.h>
 
 #include "common.h"
 
@@ -290,6 +291,7 @@ static void syscall_exit_to_user_mode_pr
  */
 static void isolation_exit_to_user_mode_prepare(void)
 {
+	sync_vmstat();
 }
 
 static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs)
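
The dirty-flag scheme in this patch can be modelled in plain C as below. This
is a single-threaded sketch for illustration, not kernel code: all model_*
names and MODEL_CPUS are hypothetical, a plain bool stands in for the static
key, and arrays indexed by CPU stand in for per-CPU variables.

```c
#include <stdbool.h>

#define MODEL_CPUS 4	/* hypothetical CPU count, for this model only */

static long global_count;		/* stands in for a global VM counter */
static long cpu_diff[MODEL_CPUS];	/* per-CPU deltas not yet folded */
static bool cpu_dirty[MODEL_CPUS];	/* models the per-CPU vmstat_dirty flag */
static bool sync_enabled;		/* models the vmstat_sync_enabled static key */

void model_enable_sync(bool on)
{
	sync_enabled = on;
}

long model_global_count(void)
{
	return global_count;
}

/* Counter update: keep the delta local and, if enabled, mark the CPU dirty. */
void model_mod_state(int cpu, long delta)
{
	cpu_diff[cpu] += delta;
	if (sync_enabled)
		cpu_dirty[cpu] = true;
}

/*
 * Return-to-userspace hook: fold the local delta into the global
 * counter only if this CPU was marked dirty. Returns true if a fold
 * took place.
 */
bool model_sync_on_exit(int cpu)
{
	if (!sync_enabled || !cpu_dirty[cpu])
		return false;

	global_count += cpu_diff[cpu];
	cpu_diff[cpu] = 0;
	cpu_dirty[cpu] = false;
	return true;
}
```

The point of the flag is that a syscall which touched no counters pays only
for one per-CPU read on exit, and with the key disabled the update path pays
nothing beyond the patched-out branch.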




* [patch 4/5] mm: vmstat: move need_update
  2021-07-14 20:42 [patch 0/5] optionally perform deferred actions on return to userspace (v3) Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2021-07-14 20:42 ` [patch 3/5] mm: vmstat: optionally flush per-CPU vmstat counters on return to userspace Marcelo Tosatti
@ 2021-07-14 20:42 ` Marcelo Tosatti
  2021-07-14 20:42 ` [patch 5/5] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
  4 siblings, 0 replies; 8+ messages in thread
From: Marcelo Tosatti @ 2021-07-14 20:42 UTC
  To: linux-kernel
  Cc: Christoph Lameter, Thomas Gleixner, Frederic Weisbecker,
	Juri Lelli, Nitesh Lal, Peter Zijlstra, Nicolas Saenz,
	Marcelo Tosatti

Move the need_update() function up in vmstat.c, where the next patch
needs it. No code changes.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>


Index: linux-2.6-vmstat-update/mm/vmstat.c
===================================================================
--- linux-2.6-vmstat-update.orig/mm/vmstat.c
+++ linux-2.6-vmstat-update/mm/vmstat.c
@@ -1853,6 +1853,40 @@ static const struct seq_operations vmsta
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
 
+/*
+ * Check if the diffs for a certain cpu indicate that
+ * an update is needed.
+ */
+static bool need_update(int cpu)
+{
+	pg_data_t *last_pgdat = NULL;
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);
+		struct per_cpu_nodestat *n;
+		/*
+		 * The fast way of checking if there are any vmstat diffs.
+		 */
+		if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS *
+			       sizeof(p->vm_stat_diff[0])))
+			return true;
+#ifdef CONFIG_NUMA
+		if (memchr_inv(p->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS *
+			       sizeof(p->vm_numa_stat_diff[0])))
+			return true;
+#endif
+		if (last_pgdat == zone->zone_pgdat)
+			continue;
+		last_pgdat = zone->zone_pgdat;
+		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
+		if (memchr_inv(n->vm_node_stat_diff, 0, NR_VM_NODE_STAT_ITEMS *
+			       sizeof(n->vm_node_stat_diff[0])))
+		    return true;
+	}
+	return false;
+}
+
 #ifdef CONFIG_PROC_FS
 static void refresh_vm_stats(struct work_struct *work)
 {
@@ -1938,40 +1972,6 @@ static void vmstat_update(struct work_st
  * invoked when tick processing is not active.
  */
 /*
- * Check if the diffs for a certain cpu indicate that
- * an update is needed.
- */
-static bool need_update(int cpu)
-{
-	pg_data_t *last_pgdat = NULL;
-	struct zone *zone;
-
-	for_each_populated_zone(zone) {
-		struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);
-		struct per_cpu_nodestat *n;
-		/*
-		 * The fast way of checking if there are any vmstat diffs.
-		 */
-		if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS *
-			       sizeof(p->vm_stat_diff[0])))
-			return true;
-#ifdef CONFIG_NUMA
-		if (memchr_inv(p->vm_numa_stat_diff, 0, NR_VM_NUMA_STAT_ITEMS *
-			       sizeof(p->vm_numa_stat_diff[0])))
-			return true;
-#endif
-		if (last_pgdat == zone->zone_pgdat)
-			continue;
-		last_pgdat = zone->zone_pgdat;
-		n = per_cpu_ptr(zone->zone_pgdat->per_cpu_nodestats, cpu);
-		if (memchr_inv(n->vm_node_stat_diff, 0, NR_VM_NODE_STAT_ITEMS *
-			       sizeof(n->vm_node_stat_diff[0])))
-		    return true;
-	}
-	return false;
-}
-
-/*
  * Switch off vmstat processing and then fold all the remaining differentials
  * until the diffs stay at zero. The function is used by NOHZ and can only be
  * invoked when tick processing is not active.




* [patch 5/5] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean
  2021-07-14 20:42 [patch 0/5] optionally perform deferred actions on return to userspace (v3) Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2021-07-14 20:42 ` [patch 4/5] mm: vmstat: move need_update Marcelo Tosatti
@ 2021-07-14 20:42 ` Marcelo Tosatti
  4 siblings, 0 replies; 8+ messages in thread
From: Marcelo Tosatti @ 2021-07-14 20:42 UTC
  To: linux-kernel
  Cc: Christoph Lameter, Thomas Gleixner, Frederic Weisbecker,
	Juri Lelli, Nitesh Lal, Peter Zijlstra, Nicolas Saenz,
	Marcelo Tosatti

It is not necessary to queue a work item to run refresh_vm_stats
on a remote CPU if that CPU has no dirty stats and no per-CPU
allocations for remote nodes.

This fixes a sosreport hang (sosreport uses vmstat_refresh) that occurs
when a SCHED_FIFO process is spinning on a CPU.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-2.6-vmstat-update/mm/vmstat.c
===================================================================
--- linux-2.6-vmstat-update.orig/mm/vmstat.c
+++ linux-2.6-vmstat-update/mm/vmstat.c
@@ -1888,17 +1888,41 @@ static bool need_update(int cpu)
 }
 
 #ifdef CONFIG_PROC_FS
-static void refresh_vm_stats(struct work_struct *work)
+static bool need_drain_remote_zones(int cpu)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_pageset *p;
+
+		p = per_cpu_ptr(zone->pageset, cpu);
+
+		if (!p->pcp.count)
+			continue;
+#ifdef CONFIG_NUMA
+		if (!p->expire)
+			continue;
+#endif
+		if (zone_to_nid(zone) == cpu_to_node(cpu))
+			continue;
+
+		return true;
+	}
+
+	return false;
+}
+
+static long refresh_vm_stats(void *arg)
 {
 	refresh_cpu_vm_stats(true);
+	return 0;
 }
 
 int vmstat_refresh(struct ctl_table *table, int write,
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
 	long val;
-	int err;
-	int i;
+	int i, cpu;
 
 	/*
 	 * The regular update, every sysctl_stat_interval, may come later
@@ -1912,9 +1936,15 @@ int vmstat_refresh(struct ctl_table *tab
 	 * transiently negative values, report an error here if any of
 	 * the stats is negative, so we know to go looking for imbalance.
 	 */
-	err = schedule_on_each_cpu(refresh_vm_stats);
-	if (err)
-		return err;
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		if (need_update(cpu) || need_drain_remote_zones(cpu))
+			work_on_cpu(cpu, refresh_vm_stats, NULL);
+
+		cond_resched();
+	}
+	put_online_cpus();
+
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		/*
 		 * Skip checking stats known to go negative occasionally.
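
The idea of visiting only CPUs with pending diffs can be sketched in
userspace as follows. This is a simplified model under stated assumptions:
any_nonzero() plays the role of the memchr_inv() test in need_update(),
refresh_dirty_cpus() and MODEL_ITEMS are hypothetical names, and the
need_drain_remote_zones() condition is left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ITEMS 3	/* hypothetical number of counters per CPU */

/* True if any byte in buf is non-zero. */
bool any_nonzero(const void *buf, size_t len)
{
	const unsigned char *p = buf;

	for (size_t i = 0; i < len; i++)
		if (p[i])
			return true;
	return false;
}

/*
 * Fold the diffs of dirty CPUs into the global counters and skip
 * clean CPUs entirely. Returns the number of CPUs that were visited,
 * i.e. the number that would have had work queued on them.
 */
int refresh_dirty_cpus(signed char diffs[][MODEL_ITEMS], int ncpus,
		       long global[MODEL_ITEMS])
{
	int refreshed = 0;

	for (int cpu = 0; cpu < ncpus; cpu++) {
		if (!any_nonzero(diffs[cpu], sizeof(diffs[cpu])))
			continue;	/* clean CPU: nothing queued */

		for (int i = 0; i < MODEL_ITEMS; i++) {
			global[i] += diffs[cpu][i];
			diffs[cpu][i] = 0;
		}
		refreshed++;
	}
	return refreshed;
}
```

In this model a CPU running a spinning SCHED_FIFO task with clean stats is
never visited, so the caller is not left waiting on a work item that cannot
run there.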




* Re: [patch 1/5] sched: isolation: introduce quiesce_on_exit_to_usermode isolcpu flags
  2021-07-14 20:42 ` [patch 1/5] sched: isolation: introduce quiesce_on_exit_to_usermode isolcpu flags Marcelo Tosatti
@ 2021-07-19 14:14   ` Frederic Weisbecker
  0 siblings, 0 replies; 8+ messages in thread
From: Frederic Weisbecker @ 2021-07-19 14:14 UTC
  To: Marcelo Tosatti
  Cc: linux-kernel, Christoph Lameter, Thomas Gleixner, Juri Lelli,
	Nitesh Lal, Peter Zijlstra, Nicolas Saenz

On Wed, Jul 14, 2021 at 05:42:06PM -0300, Marcelo Tosatti wrote:
> Add a new isolcpus flag "quiesce_on_exit_to_usermode" to enable
> quiescing of deferred actions on return to userspace.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> Index: linux-2.6-vmstat-update/include/linux/sched/isolation.h
> ===================================================================
> --- linux-2.6-vmstat-update.orig/include/linux/sched/isolation.h
> +++ linux-2.6-vmstat-update/include/linux/sched/isolation.h
> Index: linux-2.6-vmstat-update/Documentation/admin-guide/kernel-parameters.txt
> ===================================================================
> --- linux-2.6-vmstat-update.orig/Documentation/admin-guide/kernel-parameters.txt
> +++ linux-2.6-vmstat-update/Documentation/admin-guide/kernel-parameters.txt
> @@ -2124,6 +2124,43 @@
>  
>  			The format of <cpu-list> is described above.
>  
> +                         quiesce_on_exit_to_usermode
> +
> +			  This flag allows userspace to take preventive measures to
> +			  avoid deferred actions and create a OS noise free environment for
> +			  the application, by quiescing such activities on
> +			  return from syscalls (that is, perform the necessary
> +			  background work on return to userspace, rather than allowing
> +			  it to happen when userspace is executing, in the form of
> +			  an interruption to the application).
> +
> +			  There might be a performance degradation from using this,
> +			  on systemcall heavy workloads, for the isolated CPUs.
> +			  This option is intended to be used by specialized workloads.
> +
> +			  It should be deprecated in favour of a prctl() interface
> +			  to enable this mode (which allows the quiescing to take
> +			  place only on select sections of userspace execution, namely
> +			  the latency sensitive loops).

So I don't believe in that. If boot parameters were deprecatable, isolcpus would
have been removed already. And now that it's here we have to support it forever
and even fight for keeping it usable with modern interfaces like cpuset.

Besides, such (very costly) quiescence on kernel exit should be only useful on
specific sections of a workload. No need to kill the performance everywhere.

It's a new feature, not a fix, so let's introduce a proper prctl() interface
once and for all. We can't postpone that step forever.

Thanks.

> +
> +			  Note: one of the preventive measures this option
> +			  enables is the following.
> +
> +			  Page counters are maintained in per-CPU counters to
> +			  improve performance. When a CPU modifies a page counter,
> +			  this modification is kept in the per-CPU counter.
> +			  Certain activities require a global count, which
> +			  involves requesting each CPU to flush its local counters
> +			  to the global VM counters.
> +			  This flush is implemented via a workqueue item, which
> +			  requires scheduling the workqueue task on isolated CPUs.
> +
> +			  To avoid this interruption, quiesce_on_exit_to_usermode
> +			  syncs the page counters on each return from system calls.
> +			  To ensure the application returns to userspace
> +			  with no modified per-CPU counters, its necessary to
> +			  use mlockall() in addition to this isolcpus flag.
> +
>  	iucv=		[HW,NET]
>  
>  	ivrs_ioapic	[HW,X86-64]
> 
> 



end of thread, other threads:[~2021-07-19 14:14 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
2021-07-14 20:42 [patch 0/5] optionally perform deferred actions on return to userspace (v3) Marcelo Tosatti
2021-07-14 20:42 ` [patch 1/5] sched: isolation: introduce quiesce_on_exit_to_usermode isolcpu flags Marcelo Tosatti
2021-07-19 14:14   ` Frederic Weisbecker
2021-07-14 20:42 ` [patch 2/5] common entry: add hook for isolation to __syscall_exit_to_user_mode_work Marcelo Tosatti
2021-07-14 20:42 ` [patch 3/5] mm: vmstat: optionally flush per-CPU vmstat counters on return to userspace Marcelo Tosatti
2021-07-14 20:42 ` [patch 4/5] mm: vmstat: move need_update Marcelo Tosatti
2021-07-14 20:42 ` [patch 5/5] mm: vmstat_refresh: avoid queueing work item if cpu stats are clean Marcelo Tosatti
  -- strict thread matches above, loose matches on Subject: below --
2021-07-09 17:37 [patch 0/5] optionally perform deferred actions on return to userspace Marcelo Tosatti
2021-07-09 17:37 ` [patch 1/5] sched: isolation: introduce quiesce_on_exit_to_usermode isolcpu flags Marcelo Tosatti
