linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] fold per-CPU vmstats remotely
@ 2023-02-01 19:50 Marcelo Tosatti
  2023-02-01 19:50 ` [PATCH 1/5] mm/vmstat: remove remote node draining Marcelo Tosatti
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2023-02-01 19:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

This patch series addresses the following two problems:

    1. A customer provided evidence indicating that the idle
       tick was stopped while CPU-specific vmstat counters
       still remained populated.

       Thus one can only assume quiet_vmstat() was not
       invoked on return to the idle loop. If I understand
       correctly, this divergence might erroneously prevent
       a reclaim attempt by kswapd. If the number of zone
       specific free pages is below the per-cpu drift
       value, then zone_page_state_snapshot() is used to
       compute a more accurate view of the aforementioned
       statistic. Thus any task blocked on the NUMA node
       specific pfmemalloc_wait queue will be unable to make
       significant progress via direct reclaim unless it is
       killed after being woken up by kswapd
       (see throttle_direct_reclaim()).

    2. With a SCHED_FIFO task that busy loops on a given CPU,
       and the kworker for that CPU at SCHED_OTHER priority,
       queuing work to sync per-CPU vmstats will either cause
       that work to never execute, or cause stalld (i.e. the
       stall daemon) to boost the kworker priority, which
       causes a latency violation.

Both are addressed by having vmstat_shepherd flush the per-CPU
counters to the global counters from remote CPUs.

This is done using cmpxchg to manipulate the counters,
both CPU locally (via the account functions),
and remotely (via cpu_vm_stats_fold).
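
As a rough illustration of the scheme (a userspace model, not the
kernel code), the sketch below uses C11 atomics in place of
this_cpu_cmpxchg()/cmpxchg(). The names cpu_diff, global_count,
account_local() and shepherd_fold_all() are hypothetical and exist
only for this example; thresholds and overstepping are left out.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for the model */

/* Stand-ins for the per-CPU diffs and the global counter. */
static _Atomic signed char cpu_diff[NR_CPUS_SKETCH];
static atomic_long global_count;

/* Local accounting path: update this CPU's diff with a CAS loop. */
static void account_local(int cpu, int delta)
{
	signed char o;

	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	o = atomic_load(&cpu_diff[cpu]);
	/* On failure, o is reloaded with the current value and we retry. */
	while (!atomic_compare_exchange_weak(&cpu_diff[cpu], &o,
					     (signed char)(o + delta)))
		;
}

/* Remote path: fold every CPU's diff into the global counter. */
static int shepherd_fold_all(void)
{
	int cpu, changes = 0;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		signed char v = atomic_load(&cpu_diff[cpu]);

		/* Swap exactly the value we observed with 0. */
		while (!atomic_compare_exchange_weak(&cpu_diff[cpu], &v, 0))
			;
		if (v) {
			atomic_fetch_add(&global_count, v);
			changes++;
		}
	}
	return changes;
}
```

Because both paths use an atomic compare-and-exchange on the same
location, an update made on one CPU while another CPU folds is either
included in the fold or left in the diff for the next pass.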

Thanks to Aaron Tomlin for diagnosing issue 1 and writing
the initial patch series.

 include/linux/mmzone.h |    3 
 mm/vmstat.c            |  424 ++++++++++++++++++++++++++++++++++++++++++++++++------------------------------------------
 2 files changed, 230 insertions(+), 197 deletions(-)





* [PATCH 1/5] mm/vmstat: remove remote node draining
  2023-02-01 19:50 [PATCH 0/5] fold per-CPU vmstats remotely Marcelo Tosatti
@ 2023-02-01 19:50 ` Marcelo Tosatti
  2023-02-01 19:50 ` [PATCH 2/5] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2023-02-01 19:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Marcelo Tosatti

Draining of pages from the local pcp for a remote zone was necessary
since:

"Note that remote node draining is a somewhat esoteric feature that is
required on large NUMA systems because otherwise significant portions
of system memory can become trapped in pcp queues. The number of pcp is
determined by the number of processors and nodes in a system. A system
with 4 processors and 2 nodes has 8 pcps which is okay. But a system
with 1024 processors and 512 nodes has 512k pcps with a high potential
for large amount of memory being caught in them."

Since commit 443c2accd1b6679a1320167f8f56eed6536b806e
("mm/page_alloc: remotely drain per-cpu lists"), drain_all_pages() is able
to remotely free those pages when necessary, so the remote node draining
done from refresh_cpu_vm_stats() can be removed.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/include/linux/mmzone.h
===================================================================
--- linux-vmstat-remote.orig/include/linux/mmzone.h
+++ linux-vmstat-remote/include/linux/mmzone.h
@@ -577,9 +577,6 @@ struct per_cpu_pages {
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
 	short free_factor;	/* batch scaling factor during free */
-#ifdef CONFIG_NUMA
-	short expire;		/* When 0, remote pagesets are drained */
-#endif
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
 	struct list_head lists[NR_PCP_LISTS];
Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -803,7 +803,7 @@ static int fold_diff(int *zone_diff, int
  *
  * The function returns the number of global counters updated.
  */
-static int refresh_cpu_vm_stats(bool do_pagesets)
+static int refresh_cpu_vm_stats(void)
 {
 	struct pglist_data *pgdat;
 	struct zone *zone;
@@ -814,9 +814,6 @@ static int refresh_cpu_vm_stats(bool do_
 
 	for_each_populated_zone(zone) {
 		struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
-#ifdef CONFIG_NUMA
-		struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
-#endif
 
 		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 			int v;
@@ -826,44 +823,8 @@ static int refresh_cpu_vm_stats(bool do_
 
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
-#ifdef CONFIG_NUMA
-				/* 3 seconds idle till flush */
-				__this_cpu_write(pcp->expire, 3);
-#endif
 			}
 		}
-#ifdef CONFIG_NUMA
-
-		if (do_pagesets) {
-			cond_resched();
-			/*
-			 * Deal with draining the remote pageset of this
-			 * processor
-			 *
-			 * Check if there are pages remaining in this pageset
-			 * if not then there is nothing to expire.
-			 */
-			if (!__this_cpu_read(pcp->expire) ||
-			       !__this_cpu_read(pcp->count))
-				continue;
-
-			/*
-			 * We never drain zones local to this processor.
-			 */
-			if (zone_to_nid(zone) == numa_node_id()) {
-				__this_cpu_write(pcp->expire, 0);
-				continue;
-			}
-
-			if (__this_cpu_dec_return(pcp->expire))
-				continue;
-
-			if (__this_cpu_read(pcp->count)) {
-				drain_zone_pages(zone, this_cpu_ptr(pcp));
-				changes++;
-			}
-		}
-#endif
 	}
 
 	for_each_online_pgdat(pgdat) {
@@ -1864,7 +1825,7 @@ int sysctl_stat_interval __read_mostly =
 #ifdef CONFIG_PROC_FS
 static void refresh_vm_stats(struct work_struct *work)
 {
-	refresh_cpu_vm_stats(true);
+	refresh_cpu_vm_stats();
 }
 
 int vmstat_refresh(struct ctl_table *table, int write,
@@ -1928,7 +1889,7 @@ int vmstat_refresh(struct ctl_table *tab
 
 static void vmstat_update(struct work_struct *w)
 {
-	if (refresh_cpu_vm_stats(true)) {
+	if (refresh_cpu_vm_stats()) {
 		/*
 		 * Counters were updated so we expect more updates
 		 * to occur in the future. Keep on running the
@@ -1991,7 +1952,7 @@ void quiet_vmstat(void)
 	 * it would be too expensive from this path.
 	 * vmstat_shepherd will take care about that for us.
 	 */
-	refresh_cpu_vm_stats(false);
+	refresh_cpu_vm_stats();
 }
 
 /*




* [PATCH 2/5] mm/vmstat: switch counter modification to cmpxchg
  2023-02-01 19:50 [PATCH 0/5] fold per-CPU vmstats remotely Marcelo Tosatti
  2023-02-01 19:50 ` [PATCH 1/5] mm/vmstat: remove remote node draining Marcelo Tosatti
@ 2023-02-01 19:50 ` Marcelo Tosatti
  2023-02-01 19:50 ` [PATCH 3/5] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold Marcelo Tosatti
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2023-02-01 19:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Marcelo Tosatti

In preparation for switching the vmstat shepherd to flush
per-CPU counters remotely, switch all functions that
modify the counters to use cmpxchg.
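
For readers unfamiliar with the update loop being generalized here,
the following is a minimal userspace model of it, assuming C11 atomics
as a substitute for this_cpu_cmpxchg(). struct counter_sketch and
mod_state_sketch() are hypothetical names for this illustration only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for one per-CPU diff and its global counter. */
struct counter_sketch {
	_Atomic signed char diff;	/* models vm_stat_diff[item] */
	atomic_long global;		/* models zone->vm_stat[item] */
	long threshold;			/* models stat_threshold */
};

/*
 * overstep_mode: 0 = no overstepping, 1 = overstep by half the
 * threshold, -1 = overstep by minus half the threshold.
 */
static void mod_state_sketch(struct counter_sketch *c, long delta,
			     int overstep_mode)
{
	signed char o;
	long n, z;

	assert(c != NULL);
	o = atomic_load(&c->diff);
	do {
		z = 0;	/* overflow destined for the global counter */
		n = delta + o;

		if (labs(n) > c->threshold) {
			long os = overstep_mode * (c->threshold >> 1);

			/* Overflow must be added to the global counter */
			z = n + os;
			n = -os;
		}
		/* On failure, o is refreshed and the loop recomputes n, z. */
	} while (!atomic_compare_exchange_weak(&c->diff, &o, (signed char)n));

	if (z)
		atomic_fetch_add(&c->global, z);
}
```

With a threshold of 2 and overstep mode 1, three increments leave the
diff at -1 and the global counter at 4, so the sum (3) is preserved
while the diff is pre-biased against the direction of travel.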

To test the performance difference, a page allocator microbenchmark
(https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c)
was used with loops=1000000 on an Intel Core i7-11850H @ 2.50GHz.

The single_page_alloc_free test runs the following loop:

        /** Loop to measure **/
        for (i = 0; i < rec->loops; i++) {
                my_page = alloc_page(gfp_mask);
                if (unlikely(my_page == NULL))
                        return 0;
                __free_page(my_page);
        }

Results (unit is cycles):

Vanilla			Patched		Diff
159			156		-1.9%
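
The Diff column is the relative change of the patched result versus
vanilla. As a small worked check (pct_diff() is a hypothetical helper,
not part of the benchmark):

```c
#include <assert.h>

/* Relative change of the patched result versus vanilla, in percent. */
static double pct_diff(double vanilla, double patched)
{
	assert(vanilla != 0.0);
	return (patched - vanilla) / vanilla * 100.0;
}
```

pct_diff(159, 156) evaluates to (156 - 159) / 159 * 100, which rounds
to -1.9.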

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -334,6 +334,188 @@ void set_pgdat_percpu_threshold(pg_data_
 	}
 }
 
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/*
+ * If we have cmpxchg_local support then we do not need to incur the overhead
+ * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
+ *
+ * mod_state() modifies the zone counter state through atomic per cpu
+ * operations.
+ *
+ * Overstep mode specifies how overstep should be handled:
+ *     0       No overstepping
+ *     1       Overstepping half of threshold
+ *     -1      Overstepping minus half of threshold
+ */
+static inline void mod_zone_state(struct zone *zone, enum zone_stat_item item,
+				  long delta, int overstep_mode)
+{
+	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
+	s8 __percpu *p = pcp->vm_stat_diff + item;
+	long o, n, t, z;
+
+	do {
+		z = 0;  /* overflow to zone counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a zone.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (abs(n) > t) {
+			int os = overstep_mode * (t >> 1);
+
+			/* Overflow must be added to zone counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		zone_page_state_add(z, zone, item);
+}
+
+void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+			 long delta)
+{
+	mod_zone_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_zone_page_state);
+
+void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+			   long delta)
+{
+	mod_zone_state(zone, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_zone_page_state);
+
+void inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_zone_page_state);
+
+void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, 1, 1);
+}
+EXPORT_SYMBOL(__inc_zone_page_state);
+
+void dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_zone_page_state);
+
+void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+	mod_zone_state(page_zone(page), item, -1, -1);
+}
+EXPORT_SYMBOL(__dec_zone_page_state);
+
+static inline void mod_node_state(struct pglist_data *pgdat,
+				  enum node_stat_item item,
+				  int delta, int overstep_mode)
+{
+	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
+	s8 __percpu *p = pcp->vm_node_stat_diff + item;
+	long o, n, t, z;
+
+	if (vmstat_item_in_bytes(item)) {
+		/*
+		 * Only cgroups use subpage accounting right now; at
+		 * the global level, these items still change in
+		 * multiples of whole pages. Store them as pages
+		 * internally to keep the per-cpu counters compact.
+		 */
+		VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
+		delta >>= PAGE_SHIFT;
+	}
+
+	do {
+		z = 0;  /* overflow to node counters */
+
+		/*
+		 * The fetching of the stat_threshold is racy. We may apply
+		 * a counter threshold to the wrong cpu if we get
+		 * rescheduled while executing here. However, the next
+		 * counter update will apply the threshold again and
+		 * therefore bring the counter under the threshold again.
+		 *
+		 * Most of the time the thresholds are the same anyways
+		 * for all cpus in a node.
+		 */
+		t = this_cpu_read(pcp->stat_threshold);
+
+		o = this_cpu_read(*p);
+		n = delta + o;
+
+		if (abs(n) > t) {
+			int os = overstep_mode * (t >> 1);
+
+			/* Overflow must be added to node counters */
+			z = n + os;
+			n = -os;
+		}
+	} while (this_cpu_cmpxchg(*p, o, n) != o);
+
+	if (z)
+		node_page_state_add(z, pgdat, item);
+}
+
+void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(mod_node_page_state);
+
+void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
+					long delta)
+{
+	mod_node_state(pgdat, item, delta, 0);
+}
+EXPORT_SYMBOL(__mod_node_page_state);
+
+void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
+{
+	mod_node_state(pgdat, item, 1, 1);
+}
+
+void inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(inc_node_page_state);
+
+void __inc_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, 1, 1);
+}
+EXPORT_SYMBOL(__inc_node_page_state);
+
+void dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(dec_node_page_state);
+
+void __dec_node_page_state(struct page *page, enum node_stat_item item)
+{
+	mod_node_state(page_pgdat(page), item, -1, -1);
+}
+EXPORT_SYMBOL(__dec_node_page_state);
+#else
 /*
  * For use when we know that interrupts are disabled,
  * or when we know that preemption is disabled and that
@@ -541,149 +723,6 @@ void __dec_node_page_state(struct page *
 }
 EXPORT_SYMBOL(__dec_node_page_state);
 
-#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
-/*
- * If we have cmpxchg_local support then we do not need to incur the overhead
- * that comes with local_irq_save/restore if we use this_cpu_cmpxchg.
- *
- * mod_state() modifies the zone counter state through atomic per cpu
- * operations.
- *
- * Overstep mode specifies how overstep should handled:
- *     0       No overstepping
- *     1       Overstepping half of threshold
- *     -1      Overstepping minus half of threshold
-*/
-static inline void mod_zone_state(struct zone *zone,
-       enum zone_stat_item item, long delta, int overstep_mode)
-{
-	struct per_cpu_zonestat __percpu *pcp = zone->per_cpu_zonestats;
-	s8 __percpu *p = pcp->vm_stat_diff + item;
-	long o, n, t, z;
-
-	do {
-		z = 0;  /* overflow to zone counters */
-
-		/*
-		 * The fetching of the stat_threshold is racy. We may apply
-		 * a counter threshold to the wrong the cpu if we get
-		 * rescheduled while executing here. However, the next
-		 * counter update will apply the threshold again and
-		 * therefore bring the counter under the threshold again.
-		 *
-		 * Most of the time the thresholds are the same anyways
-		 * for all cpus in a zone.
-		 */
-		t = this_cpu_read(pcp->stat_threshold);
-
-		o = this_cpu_read(*p);
-		n = delta + o;
-
-		if (abs(n) > t) {
-			int os = overstep_mode * (t >> 1) ;
-
-			/* Overflow must be added to zone counters */
-			z = n + os;
-			n = -os;
-		}
-	} while (this_cpu_cmpxchg(*p, o, n) != o);
-
-	if (z)
-		zone_page_state_add(z, zone, item);
-}
-
-void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
-			 long delta)
-{
-	mod_zone_state(zone, item, delta, 0);
-}
-EXPORT_SYMBOL(mod_zone_page_state);
-
-void inc_zone_page_state(struct page *page, enum zone_stat_item item)
-{
-	mod_zone_state(page_zone(page), item, 1, 1);
-}
-EXPORT_SYMBOL(inc_zone_page_state);
-
-void dec_zone_page_state(struct page *page, enum zone_stat_item item)
-{
-	mod_zone_state(page_zone(page), item, -1, -1);
-}
-EXPORT_SYMBOL(dec_zone_page_state);
-
-static inline void mod_node_state(struct pglist_data *pgdat,
-       enum node_stat_item item, int delta, int overstep_mode)
-{
-	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
-	s8 __percpu *p = pcp->vm_node_stat_diff + item;
-	long o, n, t, z;
-
-	if (vmstat_item_in_bytes(item)) {
-		/*
-		 * Only cgroups use subpage accounting right now; at
-		 * the global level, these items still change in
-		 * multiples of whole pages. Store them as pages
-		 * internally to keep the per-cpu counters compact.
-		 */
-		VM_WARN_ON_ONCE(delta & (PAGE_SIZE - 1));
-		delta >>= PAGE_SHIFT;
-	}
-
-	do {
-		z = 0;  /* overflow to node counters */
-
-		/*
-		 * The fetching of the stat_threshold is racy. We may apply
-		 * a counter threshold to the wrong the cpu if we get
-		 * rescheduled while executing here. However, the next
-		 * counter update will apply the threshold again and
-		 * therefore bring the counter under the threshold again.
-		 *
-		 * Most of the time the thresholds are the same anyways
-		 * for all cpus in a node.
-		 */
-		t = this_cpu_read(pcp->stat_threshold);
-
-		o = this_cpu_read(*p);
-		n = delta + o;
-
-		if (abs(n) > t) {
-			int os = overstep_mode * (t >> 1) ;
-
-			/* Overflow must be added to node counters */
-			z = n + os;
-			n = -os;
-		}
-	} while (this_cpu_cmpxchg(*p, o, n) != o);
-
-	if (z)
-		node_page_state_add(z, pgdat, item);
-}
-
-void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
-					long delta)
-{
-	mod_node_state(pgdat, item, delta, 0);
-}
-EXPORT_SYMBOL(mod_node_page_state);
-
-void inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
-{
-	mod_node_state(pgdat, item, 1, 1);
-}
-
-void inc_node_page_state(struct page *page, enum node_stat_item item)
-{
-	mod_node_state(page_pgdat(page), item, 1, 1);
-}
-EXPORT_SYMBOL(inc_node_page_state);
-
-void dec_node_page_state(struct page *page, enum node_stat_item item)
-{
-	mod_node_state(page_pgdat(page), item, -1, -1);
-}
-EXPORT_SYMBOL(dec_node_page_state);
-#else
 /*
  * Use interrupt disable to serialize counter updates
  */




* [PATCH 3/5] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold
  2023-02-01 19:50 [PATCH 0/5] fold per-CPU vmstats remotely Marcelo Tosatti
  2023-02-01 19:50 ` [PATCH 1/5] mm/vmstat: remove remote node draining Marcelo Tosatti
  2023-02-01 19:50 ` [PATCH 2/5] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
@ 2023-02-01 19:50 ` Marcelo Tosatti
  2023-02-02 14:38   ` Christoph Lameter
  2023-02-06 19:19   ` Matthew Wilcox
  2023-02-01 19:50 ` [PATCH 4/5] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
  2023-02-01 19:50 ` [PATCH 5/5] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
  4 siblings, 2 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2023-02-01 19:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Marcelo Tosatti

In preparation for switching the vmstat shepherd to flush
per-CPU counters remotely, use a cmpxchg loop
instead of a pair of read/write instructions.
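
To illustrate why the read/write pair is not enough once the owning
CPU can update the counter concurrently, here is a deterministic
userspace model using C11 atomics. fold_plain(), fold_cmpxchg(),
bump() and the hook parameter are hypothetical; the hook simulates an
update that lands between the fold's read and its write.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef void (*hook_fn)(_Atomic signed char *);

/* Simulated concurrent update from the owning CPU. */
static void bump(_Atomic signed char *diff)
{
	atomic_fetch_add(diff, 1);
}

/* Read, then write 0: an update landing in between is lost. */
static long fold_plain(_Atomic signed char *diff, hook_fn between)
{
	signed char v;

	assert(diff != NULL);
	v = atomic_load(diff);
	if (between)
		between(diff);
	atomic_store(diff, 0);
	return v;
}

/* Read, then compare-and-exchange with 0, retrying on mismatch. */
static long fold_cmpxchg(_Atomic signed char *diff, hook_fn between)
{
	signed char v;

	assert(diff != NULL);
	v = atomic_load(diff);
	if (between)
		between(diff);
	/* On failure, v is reloaded with the current value. */
	while (!atomic_compare_exchange_weak(diff, &v, 0))
		;
	return v;
}
```

Starting from a diff of 5 with one interleaved bump(), fold_plain()
returns 5 and zeroes the counter, dropping the increment, whereas
fold_cmpxchg() notices the change, retries, and returns 6.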

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -885,7 +885,7 @@ static int refresh_cpu_vm_stats(void)
 }
 
 /*
- * Fold the data for an offline cpu into the global array.
+ * Fold the data for a cpu into the global array.
  * There cannot be any access by the offline cpu and therefore
  * synchronization is simplified.
  */
@@ -906,8 +906,9 @@ void cpu_vm_stats_fold(int cpu)
 			if (pzstats->vm_stat_diff[i]) {
 				int v;
 
-				v = pzstats->vm_stat_diff[i];
-				pzstats->vm_stat_diff[i] = 0;
+				do {
+					v = pzstats->vm_stat_diff[i];
+				} while (cmpxchg(&pzstats->vm_stat_diff[i], v, 0) != v);
 				atomic_long_add(v, &zone->vm_stat[i]);
 				global_zone_diff[i] += v;
 			}
@@ -917,8 +918,9 @@ void cpu_vm_stats_fold(int cpu)
 			if (pzstats->vm_numa_event[i]) {
 				unsigned long v;
 
-				v = pzstats->vm_numa_event[i];
-				pzstats->vm_numa_event[i] = 0;
+				do {
+					v = pzstats->vm_numa_event[i];
+				} while (cmpxchg(&pzstats->vm_numa_event[i], v, 0) != v);
 				zone_numa_event_add(v, zone, i);
 			}
 		}
@@ -934,8 +936,9 @@ void cpu_vm_stats_fold(int cpu)
 			if (p->vm_node_stat_diff[i]) {
 				int v;
 
-				v = p->vm_node_stat_diff[i];
-				p->vm_node_stat_diff[i] = 0;
+				do {
+					v = p->vm_node_stat_diff[i];
+				} while (cmpxchg(&p->vm_node_stat_diff[i], v, 0) != v);
 				atomic_long_add(v, &pgdat->vm_stat[i]);
 				global_node_diff[i] += v;
 			}




* [PATCH 4/5] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely
  2023-02-01 19:50 [PATCH 0/5] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2023-02-01 19:50 ` [PATCH 3/5] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold Marcelo Tosatti
@ 2023-02-01 19:50 ` Marcelo Tosatti
  2023-02-01 19:50 ` [PATCH 5/5] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
  4 siblings, 0 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2023-02-01 19:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Marcelo Tosatti

Now that the counters are modified via cmpxchg both CPU locally
(via the account functions) and remotely (via cpu_vm_stats_fold),
it is possible to switch vmstat_shepherd to perform the per-CPU
vmstats folding remotely.

This fixes the following two problems:

 1. A customer provided evidence indicating that the idle
    tick was stopped while CPU-specific vmstat counters
    still remained populated.

    Thus one can only assume quiet_vmstat() was not
    invoked on return to the idle loop. If I understand
    correctly, this divergence might erroneously prevent
    a reclaim attempt by kswapd. If the number of zone
    specific free pages is below the per-cpu drift
    value, then zone_page_state_snapshot() is used to
    compute a more accurate view of the aforementioned
    statistic. Thus any task blocked on the NUMA node
    specific pfmemalloc_wait queue will be unable to make
    significant progress via direct reclaim unless it is
    killed after being woken up by kswapd
    (see throttle_direct_reclaim()).

 2. With a SCHED_FIFO task that busy loops on a given CPU,
    and the kworker for that CPU at SCHED_OTHER priority,
    queuing work to sync per-CPU vmstats will either cause
    that work to never execute, or cause stalld (i.e. the
    stall daemon) to boost the kworker priority, which
    causes a latency violation.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -2007,6 +2007,23 @@ static void vmstat_shepherd(struct work_
 
 static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);
 
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+/* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
+static void vmstat_shepherd(struct work_struct *w)
+{
+	int cpu;
+
+	cpus_read_lock();
+	for_each_online_cpu(cpu) {
+		cpu_vm_stats_fold(cpu);
+		cond_resched();
+	}
+	cpus_read_unlock();
+
+	schedule_delayed_work(&shepherd,
+		round_jiffies_relative(sysctl_stat_interval));
+}
+#else
 static void vmstat_shepherd(struct work_struct *w)
 {
 	int cpu;
@@ -2026,6 +2043,7 @@ static void vmstat_shepherd(struct work_
 	schedule_delayed_work(&shepherd,
 		round_jiffies_relative(sysctl_stat_interval));
 }
+#endif
 
 static void __init start_shepherd_timer(void)
 {




* [PATCH 5/5] mm/vmstat: refresh stats remotely instead of via work item
  2023-02-01 19:50 [PATCH 0/5] fold per-CPU vmstats remotely Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2023-02-01 19:50 ` [PATCH 4/5] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
@ 2023-02-01 19:50 ` Marcelo Tosatti
  4 siblings, 0 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2023-02-01 19:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-mm, Marcelo Tosatti

Refresh per-CPU stats remotely, instead of queueing
work items, for the stat_refresh procfs method.

This fixes a sosreport hang (sosreport uses vmstat_refresh)
when a SCHED_FIFO process is spinning on a CPU.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Index: linux-vmstat-remote/mm/vmstat.c
===================================================================
--- linux-vmstat-remote.orig/mm/vmstat.c
+++ linux-vmstat-remote/mm/vmstat.c
@@ -1865,11 +1865,21 @@ static DEFINE_PER_CPU(struct delayed_wor
 int sysctl_stat_interval __read_mostly = HZ;
 
 #ifdef CONFIG_PROC_FS
+
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+static int refresh_all_vm_stats(void);
+#else
 static void refresh_vm_stats(struct work_struct *work)
 {
 	refresh_cpu_vm_stats();
 }
 
+static int refresh_all_vm_stats(void)
+{
+	return schedule_on_each_cpu(refresh_vm_stats);
+}
+#endif
+
 int vmstat_refresh(struct ctl_table *table, int write,
 		   void *buffer, size_t *lenp, loff_t *ppos)
 {
@@ -1889,7 +1899,7 @@ int vmstat_refresh(struct ctl_table *tab
 	 * transiently negative values, report an error here if any of
 	 * the stats is negative, so we know to go looking for imbalance.
 	 */
-	err = schedule_on_each_cpu(refresh_vm_stats);
+	err = refresh_all_vm_stats();
 	if (err)
 		return err;
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
@@ -2009,7 +2019,7 @@ static DECLARE_DEFERRABLE_WORK(shepherd,
 
 #ifdef CONFIG_HAVE_CMPXCHG_LOCAL
 /* Flush counters remotely if CPU uses cmpxchg to update its per-CPU counters */
-static void vmstat_shepherd(struct work_struct *w)
+static int refresh_all_vm_stats(void)
 {
 	int cpu;
 
@@ -2019,7 +2029,12 @@ static void vmstat_shepherd(struct work_
 		cond_resched();
 	}
 	cpus_read_unlock();
+	return 0;
+}
 
+static void vmstat_shepherd(struct work_struct *w)
+{
+	refresh_all_vm_stats();
 	schedule_delayed_work(&shepherd,
 		round_jiffies_relative(sysctl_stat_interval));
 }




* Re: [PATCH 3/5] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold
  2023-02-01 19:50 ` [PATCH 3/5] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold Marcelo Tosatti
@ 2023-02-02 14:38   ` Christoph Lameter
  2023-02-02 15:54     ` Marcelo Tosatti
  2023-02-06 19:19   ` Matthew Wilcox
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2023-02-02 14:38 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

On Wed, 1 Feb 2023, Marcelo Tosatti wrote:

> In preparation to switch vmstat shepherd to flush
> per-CPU counters remotely, use a cmpxchg loop
> instead of a pair of read/write instructions.

You are mixing full atomic cmpxchg and per cpu atomic cmpxchg? That does
not work.

I thought you would only run this while the kernel is not active on the
remote cpu? Then you don't need any cmpxchg and you can leave the function
as is.


* Re: [PATCH 3/5] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold
  2023-02-02 14:38   ` Christoph Lameter
@ 2023-02-02 15:54     ` Marcelo Tosatti
  2023-02-03  9:34       ` Christoph Lameter
  0 siblings, 1 reply; 13+ messages in thread
From: Marcelo Tosatti @ 2023-02-02 15:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

On Thu, Feb 02, 2023 at 03:38:58PM +0100, Christoph Lameter wrote:
> On Wed, 1 Feb 2023, Marcelo Tosatti wrote:
> 
> > In preparation to switch vmstat shepherd to flush
> > per-CPU counters remotely, use a cmpxchg loop
> > instead of a pair of read/write instructions.
> 
> You are mixing full atomic cmpxchg and per cpu atomic cmpxchg? That does
> not work.

OK, the local functions are missing the LOCK prefix. Can fix that.

> I thought you would only run this while the kernel is not active on the
> remote cpu? Then you don't need any cmpxchg and you can leave the function
> as is.

The remote cpu can enter kernel mode while this function executes.

There is no mode which indicates userspace cannot enter the kernel.




* Re: [PATCH 3/5] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold
  2023-02-02 15:54     ` Marcelo Tosatti
@ 2023-02-03  9:34       ` Christoph Lameter
  2023-02-03 18:52         ` Marcelo Tosatti
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2023-02-03  9:34 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

On Thu, 2 Feb 2023, Marcelo Tosatti wrote:

> > I thought you would only run this while the kernel is not active on the
> > remote cpu? Then you don't need any cmpxchg and you can leave the function
> > as is.
>
> The remote cpu can enter kernel mode while this function executes.

Isn't there some lock/serialization to stall the kernel until you are done?

> There is no mode which indicates userspace cannot enter the kernel.

There are a lot of things that happen upon entry to the kernel. I would
hope that you can do something there. Scheduler?




* Re: [PATCH 3/5] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold
  2023-02-03  9:34       ` Christoph Lameter
@ 2023-02-03 18:52         ` Marcelo Tosatti
  2023-02-06  9:42           ` Christoph Lameter
  0 siblings, 1 reply; 13+ messages in thread
From: Marcelo Tosatti @ 2023-02-03 18:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

On Fri, Feb 03, 2023 at 10:34:22AM +0100, Christoph Lameter wrote:
> On Thu, 2 Feb 2023, Marcelo Tosatti wrote:
> 
> > > I thought you would only run this while the kernel is not active on the
> > > remote cpu? Then you don't need any cmpxchg and you can leave the function
> > > as is.
> >
> > The remote cpu can enter kernel mode while this function executes.
> 
> Isn't there some lock/serialization to stall the kernel until you are done?

Not that I know of. Anyway, an additional data point:

"Software defined PLC" applications
(https://www.redhat.com/en/blog/software-defined-programmable-logic-controller-introduction)
can perform system calls in their time-sensitive loop.

One example of such open source software is OpenPLC.

One would like to avoid interruptions for those cases as well.

> > There is no mode which indicates userspace cannot enter the kernel.
> 
> There are a lot of things that happen upon entry to the kernel. I would
> hope that you can do something there. Scheduler?

The use-case in question is with isolation, where a CPU is dedicated
to a single task. So the scheduler should not be an issue.




* Re: [PATCH 3/5] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold
  2023-02-03 18:52         ` Marcelo Tosatti
@ 2023-02-06  9:42           ` Christoph Lameter
  2023-02-06 19:10             ` Marcelo Tosatti
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2023-02-06  9:42 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

On Fri, 3 Feb 2023, Marcelo Tosatti wrote:

> > Isn't there some lock/serialization to stall the kernel until you are done?
>
> Not that I know of. Anyway, an additional data point is:
>
> "Software defined PLC"
> (https://www.redhat.com/en/blog/software-defined-programmable-logic-controller-introduction),
> applications
> can perform system calls in their time sensitive loop.
>
> One example of an opensource software is OpenPLC.
>
> One would like to avoid interruptions for those cases as well.

Well, allowing system calls during "time sensitiveness" implies
it is not that sensitive to vmstat updates, which have a smaller impact
than system calls.

Unless we are talking about virtual system calls like gettimeofday or
clock_gettime. These do not enter the kernel if configured correctly.


* Re: [PATCH 3/5] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold
  2023-02-06  9:42           ` Christoph Lameter
@ 2023-02-06 19:10             ` Marcelo Tosatti
  0 siblings, 0 replies; 13+ messages in thread
From: Marcelo Tosatti @ 2023-02-06 19:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Aaron Tomlin, Frederic Weisbecker, Andrew Morton, linux-kernel, linux-mm

On Mon, Feb 06, 2023 at 10:42:46AM +0100, Christoph Lameter wrote:
> On Fri, 3 Feb 2023, Marcelo Tosatti wrote:
> 
> > > Isn't there some lock/serialization to stall the kernel until you are done?
> >
> > Not that I know of. Anyway, an additional data point:
> >
> > "Software defined PLC" applications
> > (https://www.redhat.com/en/blog/software-defined-programmable-logic-controller-introduction)
> > can perform system calls in their time-sensitive loop.
> >
> > One open-source example is OpenPLC.
> >
> > One would like to avoid interruptions for those cases as well.
> 
> Well, allowing system calls during "time sensitiveness" implies
> it is not that sensitive to vmstat updates,
> which have a smaller impact than system calls.

Not necessarily. Certain system calls won't touch per-CPU vmstats: nanosleep,
for example. Perhaps I misunderstood your suggestion:

So the patchset in discussion uses (or should use, in v2), in both
vmstat_shepherd and vmstat counter modification, LOCK CMPXCHG.

There is the potential that LOCK CMPXCHG, in the vmstat counter modification
path, incurs a performance degradation.

Note, however, that cache locking should hopefully "hide" the costs.

Do you have any concerns about this patchset other than the performance
degradation due to addition of LOCK in CMPXCHG? 

The other possible concern is that the preempt-disabled functions,
namely:
__inc_node_page_state, __dec_node_page_state, __mod_node_page_state,
__inc_zone_page_state, __dec_zone_page_state, __mod_zone_page_state
have been switched to a cmpxchg loop. Is that a problem?

Would a measurement showing that LOCK CMPXCHG does not incur significant
performance degradation compared to CMPXCHG (using the page allocation
benchmark) address your concerns?
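
To make the scheme above concrete, here is a minimal userspace sketch,
assuming C11 atomics as a stand-in for the kernel's cmpxchg helpers. All
names (sketch_cpu_diff, sketch_global, SKETCH_THRESHOLD, sketch_mod_state,
sketch_fold_remote) are hypothetical and not taken from the patchset, and
the threshold handling is simplified compared to mm/vmstat.c. The point is
only that the local modifier and the remote folder both go through atomic
read-modify-write operations on the same per-CPU diff, so neither can lose
the other's update.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for one per-CPU diff and one global counter. */
#define SKETCH_THRESHOLD 32

static _Atomic signed char sketch_cpu_diff;
static atomic_long sketch_global;

/*
 * Local path: add delta to the per-CPU diff with a compare-and-swap loop.
 * If the new value would exceed the threshold, the whole amount is moved
 * to the global counter and the diff is reset to zero.
 */
static void sketch_mod_state(long delta)
{
	signed char old = atomic_load(&sketch_cpu_diff);
	long n, spill;

	/* Keep old + delta within the range of a signed char. */
	assert(delta >= -SKETCH_THRESHOLD && delta <= SKETCH_THRESHOLD);

	do {
		spill = 0;
		n = old + delta;
		if (n > SKETCH_THRESHOLD || n < -SKETCH_THRESHOLD) {
			spill = n;
			n = 0;
		}
		/* On failure, old is refreshed and the loop recomputes n. */
	} while (!atomic_compare_exchange_weak(&sketch_cpu_diff, &old,
					       (signed char)n));

	if (spill)
		atomic_fetch_add(&sketch_global, spill);
}

/*
 * Remote path: atomically take whatever is pending in the diff and add
 * it to the global counter. Returns the amount that was folded.
 */
static long sketch_fold_remote(void)
{
	long v = atomic_exchange(&sketch_cpu_diff, 0);

	if (v)
		atomic_fetch_add(&sketch_global, v);
	return v;
}
```

On x86 these atomic operations compile to LOCK-prefixed instructions (or
XCHG, which is implicitly locked), which is the cost the benchmark question
above is about.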

Thanks



* Re: [PATCH 3/5] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold
  2023-02-01 19:50 ` [PATCH 3/5] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold Marcelo Tosatti
  2023-02-02 14:38   ` Christoph Lameter
@ 2023-02-06 19:19   ` Matthew Wilcox
  1 sibling, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2023-02-06 19:19 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Christoph Lameter, Aaron Tomlin, Frederic Weisbecker,
	Andrew Morton, linux-kernel, linux-mm

On Wed, Feb 01, 2023 at 04:50:16PM -0300, Marcelo Tosatti wrote:
> In preparation to switch vmstat shepherd to flush
> per-CPU counters remotely, use a cmpxchg loop 
> instead of a pair of read/write instructions.

FYI, try_cmpxchg() is preferred to plain cmpxchg() these days.
Apparently it generates better code on x86.

> -				v = pzstats->vm_stat_diff[i];
> -				pzstats->vm_stat_diff[i] = 0;
> +				do {
> +					v = pzstats->vm_stat_diff[i];
> +				} while (cmpxchg(&pzstats->vm_stat_diff[i], v, 0) != v);

I think this would be:

				do {
					v = pzstats->vm_stat_diff[i];
				} while (!try_cmpxchg(&pzstats->vm_stat_diff[i], &v, 0));
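
For reference, the kernel's try_cmpxchg() takes a pointer to the expected
old value and, on failure, writes the value it actually observed back
through that pointer. C11's atomic_compare_exchange_strong() has the same
shape, so the pattern can be sketched in userspace as below. The names
sketch_try_cmpxchg and sketch_fold_diff are hypothetical, and this is an
illustration of the semantics, not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical userspace analog of try_cmpxchg(): returns nonzero on
 * success; on failure *old is updated to the value currently in *ptr.
 */
static int sketch_try_cmpxchg(atomic_long *ptr, long *old, long new_val)
{
	return atomic_compare_exchange_strong(ptr, old, new_val);
}

/* Atomically replace *diff with 0 and return the value that was there. */
static long sketch_fold_diff(atomic_long *diff)
{
	long v;

	assert(diff != NULL);

	v = atomic_load(diff);
	/* A failed attempt refreshes v, so no explicit re-read is needed. */
	while (!sketch_try_cmpxchg(diff, &v, 0))
		;
	return v;
}
```

Because the failed compare already hands back the current value, the loop
body can stay empty, which is part of why this form tends to generate
tighter code than comparing the return value of cmpxchg() against v.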




end of thread, other threads:[~2023-02-06 19:19 UTC | newest]

Thread overview: 13+ messages
2023-02-01 19:50 [PATCH 0/5] fold per-CPU vmstats remotely Marcelo Tosatti
2023-02-01 19:50 ` [PATCH 1/5] mm/vmstat: remove remote node draining Marcelo Tosatti
2023-02-01 19:50 ` [PATCH 2/5] mm/vmstat: switch counter modification to cmpxchg Marcelo Tosatti
2023-02-01 19:50 ` [PATCH 3/5] mm/vmstat: use cmpxchg loop in cpu_vm_stats_fold Marcelo Tosatti
2023-02-02 14:38   ` Christoph Lameter
2023-02-02 15:54     ` Marcelo Tosatti
2023-02-03  9:34       ` Christoph Lameter
2023-02-03 18:52         ` Marcelo Tosatti
2023-02-06  9:42           ` Christoph Lameter
2023-02-06 19:10             ` Marcelo Tosatti
2023-02-06 19:19   ` Matthew Wilcox
2023-02-01 19:50 ` [PATCH 4/5] mm/vmstat: switch vmstat shepherd to flush per-CPU counters remotely Marcelo Tosatti
2023-02-01 19:50 ` [PATCH 5/5] mm/vmstat: refresh stats remotely instead of via work item Marcelo Tosatti
