* [PATCH v1 0/3] cgroup/rstat: global cgroup_rstat_lock changes
@ 2024-04-16 17:51 Jesper Dangaard Brouer
  2024-04-16 17:51 ` [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints Jesper Dangaard Brouer
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2024-04-16 17:51 UTC (permalink / raw)
  To: tj, hannes, lizefan.x, cgroups, yosryahmed, longman
  Cc: Jesper Dangaard Brouer, netdev, linux-mm, linux-kernel,
	shakeel.butt, kernel-team, linux-kernel,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko

This patchset is focused on the global cgroup_rstat_lock.

 Patch-1: Adds tracepoints to improve measuring lock behavior.
 Patch-2: Converts the global lock into a mutex.
 Patch-3: Limits userspace triggered pressure on the lock.

Background in discussion thread [1].
 [1] https://lore.kernel.org/all/ac4cf07f-52dd-454f-b897-2a4b3796a4d9@kernel.org/

---

Jesper Dangaard Brouer (3):
      cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints
      cgroup/rstat: convert cgroup_rstat_lock back to mutex
      cgroup/rstat: introduce ratelimited rstat flushing


 block/blk-cgroup.c            |   2 +-
 include/linux/cgroup-defs.h   |   1 +
 include/linux/cgroup.h        |   5 +-
 include/trace/events/cgroup.h |  48 +++++++++++++++
 kernel/cgroup/rstat.c         | 111 ++++++++++++++++++++++++++++++----
 mm/memcontrol.c               |   1 +
 6 files changed, 153 insertions(+), 15 deletions(-)

--



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints
  2024-04-16 17:51 [PATCH v1 0/3] cgroup/rstat: global cgroup_rstat_lock changes Jesper Dangaard Brouer
@ 2024-04-16 17:51 ` Jesper Dangaard Brouer
  2024-04-16 21:36   ` Tejun Heo
  2024-04-23 16:53   ` Simon Horman
  2024-04-16 17:51 ` [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2024-04-16 17:51 UTC (permalink / raw)
  To: tj, hannes, lizefan.x, cgroups, yosryahmed, longman
  Cc: Jesper Dangaard Brouer, netdev, linux-mm, linux-kernel,
	shakeel.butt, kernel-team, linux-kernel,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko

This commit enhances the ability to troubleshoot the global
cgroup_rstat_lock by introducing wrapper helper functions for the lock
along with associated tracepoints.

Although global, the cgroup_rstat_lock helper APIs and tracepoints take
arguments such as the cgroup pointer and a cpu_in_loop variable. This
adjustment is made because flushing occurs per cgroup despite the lock
being global. Hence, when troubleshooting, it's important to identify the
relevant cgroup. The cpu_in_loop variable is necessary because the global
lock may be released within the main flushing loop that traverses CPUs.
In the tracepoints, the cpu_in_loop value is set to -1 when acquiring the
main lock; otherwise, it denotes the CPU number processed last.

The new feature in this patchset is detecting when the lock is contended. The
tracepoints are implemented with production in mind. For minimum overhead,
attach to cgroup:cgroup_rstat_lock_contended, which only gets activated
when the trylock detects that the lock is contended. A quick production check
for issues can be done via this perf command:

 perf record -g -e cgroup:cgroup_rstat_lock_contended

The next natural question is how long lock contenders wait to obtain the
lock. This can be answered by measuring the time between
cgroup:cgroup_rstat_lock_contended and cgroup:cgroup_rstat_locked
when args->contended is set.  Like this bpftrace script:

 bpftrace -e '
   tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs}
   tracepoint:cgroup:cgroup_rstat_locked {
     if (args->contended) {
       @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}}
   interval:s:1 {time("%H:%M:%S "); print(@wait_ns); }'

Extending this with the time spent holding the lock is more expensive, as it
also covers all the non-contended cases.
Like this bpftrace script:

 bpftrace -e '
   tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs}
   tracepoint:cgroup:cgroup_rstat_locked { @locked[tid]=nsecs;
     if (args->contended) {
       @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}}
   tracepoint:cgroup:cgroup_rstat_unlock {
       @locked_ns=hist(nsecs-@locked[tid]); delete(@locked[tid]);}
   interval:s:1 {time("%H:%M:%S ");  print(@wait_ns);print(@locked_ns); }'

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 include/linux/cgroup.h        |    2 +-
 include/trace/events/cgroup.h |   48 +++++++++++++++++++++++++++++++++++++++++
 kernel/cgroup/rstat.c         |   47 +++++++++++++++++++++++++++++++++-------
 3 files changed, 88 insertions(+), 9 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 34aaf0e87def..2150ca60394b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -690,7 +690,7 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
 void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
 void cgroup_rstat_flush(struct cgroup *cgrp);
 void cgroup_rstat_flush_hold(struct cgroup *cgrp);
-void cgroup_rstat_flush_release(void);
+void cgroup_rstat_flush_release(struct cgroup *cgrp);
 
 /*
  * Basic resource stats.
diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
index dd7d7c9efecd..13f375800135 100644
--- a/include/trace/events/cgroup.h
+++ b/include/trace/events/cgroup.h
@@ -204,6 +204,54 @@ DEFINE_EVENT(cgroup_event, cgroup_notify_frozen,
 	TP_ARGS(cgrp, path, val)
 );
 
+DECLARE_EVENT_CLASS(cgroup_rstat,
+
+	TP_PROTO(struct cgroup *cgrp, int cpu_in_loop, bool contended),
+
+	TP_ARGS(cgrp, cpu_in_loop, contended),
+
+	TP_STRUCT__entry(
+		__field(	int,		root			)
+		__field(	int,		level			)
+		__field(	u64,		id			)
+		__field(	int,		cpu_in_loop		)
+		__field(	bool,		contended		)
+	),
+
+	TP_fast_assign(
+		__entry->root = cgrp->root->hierarchy_id;
+		__entry->id = cgroup_id(cgrp);
+		__entry->level = cgrp->level;
+		__entry->cpu_in_loop = cpu_in_loop;
+		__entry->contended = contended;
+	),
+
+	TP_printk("root=%d id=%llu level=%d cpu_in_loop=%d lock contended:%d",
+		  __entry->root, __entry->id, __entry->level,
+		  __entry->cpu_in_loop, __entry->contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_lock_contended,
+
+	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+	TP_ARGS(cgrp, cpu, contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_locked,
+
+	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+	TP_ARGS(cgrp, cpu, contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_unlock,
+
+	TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+	TP_ARGS(cgrp, cpu, contended)
+);
+
 #endif /* _TRACE_CGROUP_H */
 
 /* This part must be outside protection */
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 07e2284bb499..ff68c904e647 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -7,6 +7,8 @@
 #include <linux/btf.h>
 #include <linux/btf_ids.h>
 
+#include <trace/events/cgroup.h>
+
 static DEFINE_SPINLOCK(cgroup_rstat_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
 
@@ -222,6 +224,35 @@ __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
 
 __bpf_hook_end();
 
+/*
+ * Helper functions for locking cgroup_rstat_lock.
+ *
+ * This makes it easier to diagnose locking issues and contention in
+ * production environments.  The parameter @cpu_in_loop indicates the lock
+ * was released and re-taken when collecting data from the CPUs. The
+ * value -1 is used when obtaining the main lock; otherwise, this is the
+ * CPU number processed last.
+ */
+static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
+	__acquires(&cgroup_rstat_lock)
+{
+	bool contended;
+
+	contended = !spin_trylock_irq(&cgroup_rstat_lock);
+	if (contended) {
+		trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
+		spin_lock_irq(&cgroup_rstat_lock);
+	}
+	trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
+}
+
+static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
+	__releases(&cgroup_rstat_lock)
+{
+	trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
+	spin_unlock_irq(&cgroup_rstat_lock);
+}
+
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -248,10 +279,10 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 
 		/* play nice and yield if necessary */
 		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
-			spin_unlock_irq(&cgroup_rstat_lock);
+			__cgroup_rstat_unlock(cgrp, cpu);
 			if (!cond_resched())
 				cpu_relax();
-			spin_lock_irq(&cgroup_rstat_lock);
+			__cgroup_rstat_lock(cgrp, cpu);
 		}
 	}
 }
@@ -273,9 +304,9 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 {
 	might_sleep();
 
-	spin_lock_irq(&cgroup_rstat_lock);
+	__cgroup_rstat_lock(cgrp, -1);
 	cgroup_rstat_flush_locked(cgrp);
-	spin_unlock_irq(&cgroup_rstat_lock);
+	__cgroup_rstat_unlock(cgrp, -1);
 }
 
 /**
@@ -291,17 +322,17 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
 	__acquires(&cgroup_rstat_lock)
 {
 	might_sleep();
-	spin_lock_irq(&cgroup_rstat_lock);
+	__cgroup_rstat_lock(cgrp, -1);
 	cgroup_rstat_flush_locked(cgrp);
 }
 
 /**
  * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
  */
-void cgroup_rstat_flush_release(void)
+void cgroup_rstat_flush_release(struct cgroup *cgrp)
 	__releases(&cgroup_rstat_lock)
 {
-	spin_unlock_irq(&cgroup_rstat_lock);
+	__cgroup_rstat_unlock(cgrp, -1);
 }
 
 int cgroup_rstat_init(struct cgroup *cgrp)
@@ -533,7 +564,7 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 #ifdef CONFIG_SCHED_CORE
 		forceidle_time = cgrp->bstat.forceidle_sum;
 #endif
-		cgroup_rstat_flush_release();
+		cgroup_rstat_flush_release(cgrp);
 	} else {
 		root_cgroup_cputime(&bstat);
 		usage = bstat.cputime.sum_exec_runtime;



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex
  2024-04-16 17:51 [PATCH v1 0/3] cgroup/rstat: global cgroup_rstat_lock changes Jesper Dangaard Brouer
  2024-04-16 17:51 ` [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints Jesper Dangaard Brouer
@ 2024-04-16 17:51 ` Jesper Dangaard Brouer
  2024-04-18  2:19   ` Yosry Ahmed
  2024-04-16 17:51 ` [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing Jesper Dangaard Brouer
  2024-04-16 21:38 ` [PATCH v1 0/3] cgroup/rstat: global cgroup_rstat_lock changes Tejun Heo
  3 siblings, 1 reply; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2024-04-16 17:51 UTC (permalink / raw)
  To: tj, hannes, lizefan.x, cgroups, yosryahmed, longman
  Cc: Jesper Dangaard Brouer, netdev, linux-mm, linux-kernel,
	shakeel.butt, kernel-team, linux-kernel,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko

Since kernel v4.18, cgroup_rstat_lock has been an IRQ-disabling spinlock,
as introduced by commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex
with a spinlock").

Despite efforts in cgroup_rstat_flush_locked() to yield the lock when
necessary during the collection of per-CPU stats, this approach has led
to several scaling issues observed in production environments. Holding
this IRQ lock has caused starvation of other critical kernel functions,
such as softirq (e.g., timers and netstack). Although kernel v6.8
introduced optimizations in this area, we continue to observe instances
where the spin_lock is held for 64-128 ms in production.

This patch converts cgroup_rstat_lock back to being a mutex lock. This
change is made possible thanks to the significant effort by Yosry Ahmed
to eliminate all atomic context use-cases through multiple commits,
ending in 0a2dc6ac3329 ("cgroup: remove cgroup_rstat_flush_atomic()"),
included in kernel v6.5.

After this patch lock contention will be less obvious, as converting this
to a mutex avoids multiple CPUs spinning while waiting for the lock, but
it doesn't remove the lock contention. It is recommended to use the
tracepoints to diagnose this.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 kernel/cgroup/rstat.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index ff68c904e647..a90d68a7c27f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -9,7 +9,7 @@
 
 #include <trace/events/cgroup.h>
 
-static DEFINE_SPINLOCK(cgroup_rstat_lock);
+static DEFINE_MUTEX(cgroup_rstat_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
@@ -238,10 +238,10 @@ static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
 {
 	bool contended;
 
-	contended = !spin_trylock_irq(&cgroup_rstat_lock);
+	contended = !mutex_trylock(&cgroup_rstat_lock);
 	if (contended) {
 		trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
-		spin_lock_irq(&cgroup_rstat_lock);
+		mutex_lock(&cgroup_rstat_lock);
 	}
 	trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
 }
@@ -250,7 +250,7 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
 	__releases(&cgroup_rstat_lock)
 {
 	trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
-	spin_unlock_irq(&cgroup_rstat_lock);
+	mutex_unlock(&cgroup_rstat_lock);
 }
 
 /* see cgroup_rstat_flush() */
@@ -278,7 +278,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 		}
 
 		/* play nice and yield if necessary */
-		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+		if (need_resched()) {
 			__cgroup_rstat_unlock(cgrp, cpu);
 			if (!cond_resched())
 				cpu_relax();



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing
  2024-04-16 17:51 [PATCH v1 0/3] cgroup/rstat: global cgroup_rstat_lock changes Jesper Dangaard Brouer
  2024-04-16 17:51 ` [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints Jesper Dangaard Brouer
  2024-04-16 17:51 ` [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex Jesper Dangaard Brouer
@ 2024-04-16 17:51 ` Jesper Dangaard Brouer
  2024-04-18  2:21   ` Yosry Ahmed
  2024-04-16 21:38 ` [PATCH v1 0/3] cgroup/rstat: global cgroup_rstat_lock changes Tejun Heo
  3 siblings, 1 reply; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2024-04-16 17:51 UTC (permalink / raw)
  To: tj, hannes, lizefan.x, cgroups, yosryahmed, longman
  Cc: Jesper Dangaard Brouer, netdev, linux-mm, linux-kernel,
	shakeel.butt, kernel-team, linux-kernel,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko

This patch aims to reduce userspace-triggered pressure on the global
cgroup_rstat_lock by introducing a mechanism to limit how often reading
stat files causes cgroup rstat flushing.

In the memory cgroup subsystem, memcg_vmstats_needs_flush() combined with
mem_cgroup_flush_stats_ratelimited() already limits pressure on the
global lock (cgroup_rstat_lock). As a result, reading memory-related stat
files (such as memory.stat, memory.numa_stat, zswap.current) is already
less of a userspace-triggerable issue.

However, other userspace users of cgroup_rstat_flush(), such as when
reading io.stat (blk-cgroup.c) and cpu.stat, lack a similar system to
limit pressure on the global lock. Furthermore, userspace can easily
trigger this issue by reading those stat files.

Typically, normal userspace stats tools (e.g., cadvisor, nomad, systemd)
spawn threads that read io.stat, cpu.stat, and memory.stat (even from the
same cgroup) without realizing that on the kernel side, they share the
same global lock. This rate limiting also helps prevent malicious userspace
applications from harming the kernel by reading these stat files in a
tight loop.

To address this, the patch introduces cgroup_rstat_flush_ratelimited(),
similar to memcg's mem_cgroup_flush_stats_ratelimited().

Flushing occurs per cgroup (even though the lock remains global). A
variable named rstat_flush_last_time is introduced to track when a given
cgroup was last flushed. This variable, which contains the jiffies of the
flush, shares properties and a cache line with rstat_flush_next and is
updated simultaneously.

For cpu.stat, we need to acquire the lock (via cgroup_rstat_flush_hold)
because other data is read under the lock, but we skip the expensive
flushing if it occurred recently.

Regarding io.stat, there is an opportunity outside the lock to skip the
flush, but inside the lock, we must recheck to handle races.

Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 block/blk-cgroup.c          |    2 +
 include/linux/cgroup-defs.h |    1 +
 include/linux/cgroup.h      |    3 +-
 kernel/cgroup/rstat.c       |   60 ++++++++++++++++++++++++++++++++++++++++++-
 mm/memcontrol.c             |    1 +
 5 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bdbb557feb5a..4156fedbb910 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1162,7 +1162,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
 	if (!seq_css(sf)->parent)
 		blkcg_fill_root_iostats();
 	else
-		cgroup_rstat_flush(blkcg->css.cgroup);
+		cgroup_rstat_flush_ratelimited(blkcg->css.cgroup);
 
 	rcu_read_lock();
 	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ea48c861cd36..366dc644e075 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -509,6 +509,7 @@ struct cgroup {
 	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
 	 */
 	struct cgroup	*rstat_flush_next;
+	unsigned long	rstat_flush_last_time;
 
 	/* cgroup basic resource statistics */
 	struct cgroup_base_stat last_bstat;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2150ca60394b..8622b222453e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -689,7 +689,8 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
  */
 void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
 void cgroup_rstat_flush(struct cgroup *cgrp);
-void cgroup_rstat_flush_hold(struct cgroup *cgrp);
+int cgroup_rstat_flush_hold(struct cgroup *cgrp);
+int cgroup_rstat_flush_ratelimited(struct cgroup *cgrp);
 void cgroup_rstat_flush_release(struct cgroup *cgrp);
 
 /*
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a90d68a7c27f..8c71af67b197 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -193,6 +193,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
 	/* Push @root to the list first before pushing the children */
 	head = root;
 	root->rstat_flush_next = NULL;
+	root->rstat_flush_last_time = jiffies;
 	child = rstatc->updated_children;
 	rstatc->updated_children = root;
 	if (child != root)
@@ -261,12 +262,15 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 
 	lockdep_assert_held(&cgroup_rstat_lock);
 
+	cgrp->rstat_flush_last_time = jiffies;
+
 	for_each_possible_cpu(cpu) {
 		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
 
 		for (; pos; pos = pos->rstat_flush_next) {
 			struct cgroup_subsys_state *css;
 
+			pos->rstat_flush_last_time = jiffies;
 			cgroup_base_stat_flush(pos, cpu);
 			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
 
@@ -309,6 +313,49 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 	__cgroup_rstat_unlock(cgrp, -1);
 }
 
+#define FLUSH_TIME	msecs_to_jiffies(50)
+
+/**
+ * cgroup_rstat_flush_ratelimited - limit pressure on global lock (cgroup_rstat_lock)
+ * @cgrp: target cgroup
+ *
+ * Trade accuracy for less pressure on the global lock. Only use this when the
+ * caller doesn't need 100% up-to-date stats.
+ *
+ * Userspace stats tools spawn threads reading io.stat, cpu.stat and memory.stat
+ * from the same cgroup, but on the kernel side they share the same global lock.
+ * This also limits malicious userspace from hurting the kernel by reading these
+ * stat files in a tight loop.
+ *
+ * This function exits early if a flush happened recently.
+ *
+ * Returns 1 if a flush happened.
+ */
+int cgroup_rstat_flush_ratelimited(struct cgroup *cgrp)
+{
+	int res = 0;
+
+	might_sleep();
+
+	/* Outside lock: check if this flush and lock can be skipped */
+	if (time_before(jiffies,
+			READ_ONCE(cgrp->rstat_flush_last_time) + FLUSH_TIME))
+		return 0;
+
+	__cgroup_rstat_lock(cgrp, -1);
+
+	/* Recheck inside the lock; only flush after enough time has passed */
+	if (time_after(jiffies,
+		       cgrp->rstat_flush_last_time + FLUSH_TIME)) {
+		cgroup_rstat_flush_locked(cgrp);
+		res = 1;
+	}
+
+	__cgroup_rstat_unlock(cgrp, -1);
+
+	return res;
+}
+
 /**
  * cgroup_rstat_flush_hold - flush stats in @cgrp's subtree and hold
  * @cgrp: target cgroup
@@ -317,13 +364,22 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
  * paired with cgroup_rstat_flush_release().
  *
  * This function may block.
+ *
+ * Returns 1 if a flush happened.
  */
-void cgroup_rstat_flush_hold(struct cgroup *cgrp)
+int cgroup_rstat_flush_hold(struct cgroup *cgrp)
 	__acquires(&cgroup_rstat_lock)
 {
 	might_sleep();
 	__cgroup_rstat_lock(cgrp, -1);
-	cgroup_rstat_flush_locked(cgrp);
+
+	/* Only do the expensive flush after enough time has passed */
+	if (time_after(jiffies,
+		       cgrp->rstat_flush_last_time + FLUSH_TIME)) {
+		cgroup_rstat_flush_locked(cgrp);
+		return 1;
+	}
+	return 0;
 }
 
 /**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fabce2b50c69..771261612ec1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -742,6 +742,7 @@ static void do_flush_stats(struct mem_cgroup *memcg)
 	if (mem_cgroup_is_root(memcg))
 		WRITE_ONCE(flush_last_time, jiffies_64);
 
+	/* memcg has its own ratelimit layer */
 	cgroup_rstat_flush(memcg->css.cgroup);
 }
 



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints
  2024-04-16 17:51 ` [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints Jesper Dangaard Brouer
@ 2024-04-16 21:36   ` Tejun Heo
  2024-04-18  8:00     ` Jesper Dangaard Brouer
  2024-04-23 16:53   ` Simon Horman
  1 sibling, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2024-04-16 21:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: hannes, lizefan.x, cgroups, yosryahmed, longman, netdev,
	linux-mm, linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko

On Tue, Apr 16, 2024 at 07:51:26PM +0200, Jesper Dangaard Brouer wrote:
> This commit enhances the ability to troubleshoot the global
> cgroup_rstat_lock by introducing wrapper helper functions for the lock
> along with associated tracepoints.

Applied to cgroup/for-6.10.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 0/3] cgroup/rstat: global cgroup_rstat_lock changes
  2024-04-16 17:51 [PATCH v1 0/3] cgroup/rstat: global cgroup_rstat_lock changes Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2024-04-16 17:51 ` [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing Jesper Dangaard Brouer
@ 2024-04-16 21:38 ` Tejun Heo
  2024-04-18  2:13   ` Yosry Ahmed
  3 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2024-04-16 21:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: hannes, lizefan.x, cgroups, yosryahmed, longman, netdev,
	linux-mm, linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko

On Tue, Apr 16, 2024 at 07:51:19PM +0200, Jesper Dangaard Brouer wrote:
> This patchset is focused on the global cgroup_rstat_lock.
> 
>  Patch-1: Adds tracepoints to improve measuring lock behavior.
>  Patch-2: Converts the global lock into a mutex.
>  Patch-3: Limits userspace triggered pressure on the lock.

Imma wait for people's inputs on patch 2 and 3. ISTR switching the lock to
mutex made some tail latencies really bad for some workloads at google?
Yosry, was that you?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 0/3] cgroup/rstat: global cgroup_rstat_lock changes
  2024-04-16 21:38 ` [PATCH v1 0/3] cgroup/rstat: global cgroup_rstat_lock changes Tejun Heo
@ 2024-04-18  2:13   ` Yosry Ahmed
  0 siblings, 0 replies; 28+ messages in thread
From: Yosry Ahmed @ 2024-04-18  2:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jesper Dangaard Brouer, hannes, lizefan.x, cgroups, longman,
	netdev, linux-mm, linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko,
	Wei Xu

On Tue, Apr 16, 2024 at 2:38 PM Tejun Heo <tj@kernel.org> wrote:
>
> On Tue, Apr 16, 2024 at 07:51:19PM +0200, Jesper Dangaard Brouer wrote:
> > This patchset is focused on the global cgroup_rstat_lock.
> >
> >  Patch-1: Adds tracepoints to improve measuring lock behavior.
> >  Patch-2: Converts the global lock into a mutex.
> >  Patch-3: Limits userspace triggered pressure on the lock.
>
> Imma wait for people's inputs on patch 2 and 3. ISTR switching the lock to
> mutex made some tail latencies really bad for some workloads at google?
> Yosry, was that you?

I spent some time going through the history of my previous patchsets
to find context.

There were two separate instances where concerns were raised about
using a mutex.

(a) Converting the global rstat spinlock to a mutex:

Shakeel had concerns about priority inversion with a global sleepable
lock. So I never actually tested replacing the spinlock with a mutex
based on Shakeel's concerns as priority inversions would be difficult
to reproduce with synthetic tests.

Generally speaking, other than priority inversions, I was depending on
Wei's synthetic test to measure performance for userspace reads, and a
script I wrote with parallel reclaimers to measure performance for
in-kernel flushers.

(b) Adding a mutex on top of the global rstat spinlock for userspace
reads (to limit contention from userspace on the in-kernel lock):

Wei reported that this significantly affects userspace read latency
[2]. I then proceeded to add per-memcg thresholds for flushing, which
resulted in the regressions from that mutex going away. However, at
that point the mutex didn't really provide much value, so I removed it
[3].

[1]https://lore.kernel.org/lkml/CALvZod441xBoXzhqLWTZ+xnqDOFkHmvrzspr9NAr+nybqXgS-A@mail.gmail.com/
[2]https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
[3]https://lore.kernel.org/lkml/CAJD7tkZgP3m-VVPn+fF_YuvXeQYK=tZZjJHj=dzD=CcSSpp2qg@mail.gmail.com/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex
  2024-04-16 17:51 ` [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex Jesper Dangaard Brouer
@ 2024-04-18  2:19   ` Yosry Ahmed
  2024-04-18  9:02     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 28+ messages in thread
From: Yosry Ahmed @ 2024-04-18  2:19 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: tj, hannes, lizefan.x, cgroups, longman, netdev, linux-mm,
	linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko

On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
> Since kernel v4.18, cgroup_rstat_lock has been an IRQ-disabling spinlock,
> as introduced by commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex
> with a spinlock").
>
> Despite efforts in cgroup_rstat_flush_locked() to yield the lock when
> necessary during the collection of per-CPU stats, this approach has led
> to several scaling issues observed in production environments. Holding
> this IRQ lock has caused starvation of other critical kernel functions,
> such as softirq (e.g., timers and netstack). Although kernel v6.8
> introduced optimizations in this area, we continue to observe instances
> where the spin_lock is held for 64-128 ms in production.
>
> This patch converts cgroup_rstat_lock back to being a mutex lock. This
> change is made possible thanks to the significant effort by Yosry Ahmed
> to eliminate all atomic context use-cases through multiple commits,
> ending in 0a2dc6ac3329 ("cgroup: removecgroup_rstat_flush_atomic()"),
> included in kernel v6.5.
>
> After this patch lock contention will be less obvious, as converting this
> to a mutex avoids multiple CPUs spinning while waiting for the lock, but
> it doesn't remove the lock contention. It is recommended to use the
> tracepoints to diagnose this.

I will keep the high-level conversation about using the mutex here in
the cover letter thread, but I am wondering why we are keeping the
lock dropping logic here with the mutex?

If this is to reduce lock contention, why does it depend on
need_resched()? spin_needbreak() is a good indicator for lock
contention, but need_resched() isn't, right?

Also, how was this tested?

When I did previous changes to the flushing logic I used to make sure
that userspace read latency was not impacted, as well as in-kernel
flushers (e.g. reclaim). We should make sure there are no regressions
on both fronts.

>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
>  kernel/cgroup/rstat.c |   10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index ff68c904e647..a90d68a7c27f 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -9,7 +9,7 @@
>
>  #include <trace/events/cgroup.h>
>
> -static DEFINE_SPINLOCK(cgroup_rstat_lock);
> +static DEFINE_MUTEX(cgroup_rstat_lock);
>  static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
>
>  static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
> @@ -238,10 +238,10 @@ static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
>  {
>         bool contended;
>
> -       contended = !spin_trylock_irq(&cgroup_rstat_lock);
> +       contended = !mutex_trylock(&cgroup_rstat_lock);
>         if (contended) {
>                 trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
> -               spin_lock_irq(&cgroup_rstat_lock);
> +               mutex_lock(&cgroup_rstat_lock);
>         }
>         trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
>  }
> @@ -250,7 +250,7 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>         __releases(&cgroup_rstat_lock)
>  {
>         trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
> -       spin_unlock_irq(&cgroup_rstat_lock);
> +       mutex_unlock(&cgroup_rstat_lock);
>  }
>
>  /* see cgroup_rstat_flush() */
> @@ -278,7 +278,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>                 }
>
>                 /* play nice and yield if necessary */
> -               if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
> +               if (need_resched()) {
>                         __cgroup_rstat_unlock(cgrp, cpu);
>                         if (!cond_resched())
>                                 cpu_relax();

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing
  2024-04-16 17:51 ` [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing Jesper Dangaard Brouer
@ 2024-04-18  2:21   ` Yosry Ahmed
  2024-04-18 11:00     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 28+ messages in thread
From: Yosry Ahmed @ 2024-04-18  2:21 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: tj, hannes, lizefan.x, cgroups, longman, netdev, linux-mm,
	linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko

On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
> This patch aims to reduce userspace-triggered pressure on the global
> cgroup_rstat_lock by introducing a mechanism to limit how often reading
> stat files causes cgroup rstat flushing.
>
> In the memory cgroup subsystem, memcg_vmstats_needs_flush() combined with
> mem_cgroup_flush_stats_ratelimited() already limits pressure on the
> global lock (cgroup_rstat_lock). As a result, reading memory-related stat
> files (such as memory.stat, memory.numa_stat, zswap.current) is already
> a less userspace-triggerable issue.
>
> However, other userspace users of cgroup_rstat_flush(), such as when
> reading io.stat (blk-cgroup.c) and cpu.stat, lack a similar system to
> limit pressure on the global lock. Furthermore, userspace can easily
> trigger this issue by reading those stat files.
>
> Typically, normal userspace stats tools (e.g., cadvisor, nomad, systemd)
> spawn threads that read io.stat, cpu.stat, and memory.stat (even from the
> same cgroup) without realizing that on the kernel side, they share the
> same global lock. This limitation also helps prevent malicious userspace
> applications from harming the kernel by reading these stat files in a
> tight loop.
>
> To address this, the patch introduces cgroup_rstat_flush_ratelimited(),
> similar to memcg's mem_cgroup_flush_stats_ratelimited().
>
> Flushing occurs per cgroup (even though the lock remains global) a
> variable named rstat_flush_last_time is introduced to track when a given
> cgroup was last flushed. This variable, which contains the jiffies of the
> flush, shares properties and a cache line with rstat_flush_next and is
> updated simultaneously.
>
> For cpu.stat, we need to acquire the lock (via cgroup_rstat_flush_hold)
> because other data is read under the lock, but we skip the expensive
> flushing if it occurred recently.
>
> Regarding io.stat, there is an opportunity outside the lock to skip the
> flush, but inside the lock, we must recheck to handle races.
>
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>

As I mentioned in another thread, I really don't like time-based
rate-limiting [1]. Would it be possible to generalize the
magnitude-based rate-limiting instead? Have something like
memcg_vmstats_needs_flush() in the core rstat code?
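
To make that concrete, here is a minimal sketch of what such a filter could
look like in core rstat, modeled on memcg_vmstats_needs_flush(). The
per-cgroup rstat_updates counter and the CGROUP_STAT_UPDATE_BATCH threshold
are hypothetical names, not existing fields:

  /* Sketch only: a magnitude-based flush filter in core rstat, modeled
   * on memcg_vmstats_needs_flush().  cgrp->rstat_updates would need to
   * be maintained from cgroup_rstat_updated(); it does not exist today.
   */
  #define CGROUP_STAT_UPDATE_BATCH	64

  static bool cgroup_rstat_needs_flush(struct cgroup *cgrp)
  {
  	return atomic64_read(&cgrp->rstat_updates) >
  		CGROUP_STAT_UPDATE_BATCH * num_online_cpus();
  }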

Also, why do we keep the memcg time rate-limiting with this patch? Is
it because we use a much larger window there (2s)? Having two layers
of time-based rate-limiting is not ideal imo.

[1]https://lore.kernel.org/lkml/CAJD7tkYnSRwJTpXxSnGgo-i3-OdD7cdT-e3_S_yf7dSknPoRKw@mail.gmail.com/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints
  2024-04-16 21:36   ` Tejun Heo
@ 2024-04-18  8:00     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2024-04-18  8:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: hannes, lizefan.x, cgroups, yosryahmed, longman, netdev,
	linux-mm, linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko


On 16/04/2024 23.36, Tejun Heo wrote:
> On Tue, Apr 16, 2024 at 07:51:26PM +0200, Jesper Dangaard Brouer wrote:
>> This commit enhances the ability to troubleshoot the global
>> cgroup_rstat_lock by introducing wrapper helper functions for the lock
>> along with associated tracepoints.
> 
> Applied to cgroup/for-6.10.
> 

Thanks for applying the tracepoint patch. I've backported this to our
main production kernels v6.6 LTS (with the before-mentioned upstream cgroup
work from Yosry and Longman). I have it running in production on two
machines this morning.  I'm doing manual bpftrace script inspection now, but
the plan is to monitor this continuously (ebpf_exporter[1]) and even have
alerts on excessive wait time under contention.

It makes sense to delay applying the next two patches until we have
some production experiments with those two patches and I have fleet
monitoring in place.  I'll be offline next week (on a dive trip), so I'll
resume work on this 29 April, before I start doing prod experiments.

--Jesper
[1] https://github.com/cloudflare/ebpf_exporter

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex
  2024-04-18  2:19   ` Yosry Ahmed
@ 2024-04-18  9:02     ` Jesper Dangaard Brouer
  2024-04-18 14:49       ` Shakeel Butt
  2024-04-18 20:38       ` Yosry Ahmed
  0 siblings, 2 replies; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2024-04-18  9:02 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: tj, hannes, lizefan.x, cgroups, longman, netdev, linux-mm,
	linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko



On 18/04/2024 04.19, Yosry Ahmed wrote:
> On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>
>> Since kernel v4.18, cgroup_rstat_lock has been an IRQ-disabling spinlock,
>> as introduced by commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex
>> with a spinlock").
>>
>> Despite efforts in cgroup_rstat_flush_locked() to yield the lock when
>> necessary during the collection of per-CPU stats, this approach has led
>> to several scaling issues observed in production environments. Holding
>> this IRQ lock has caused starvation of other critical kernel functions,
>> such as softirq (e.g., timers and netstack). Although kernel v6.8
>> introduced optimizations in this area, we continue to observe instances
>> where the spin_lock is held for 64-128 ms in production.
>>
>> This patch converts cgroup_rstat_lock back to being a mutex lock. This
>> change is made possible thanks to the significant effort by Yosry Ahmed
>> to eliminate all atomic context use-cases through multiple commits,
>> ending in 0a2dc6ac3329 ("cgroup: removecgroup_rstat_flush_atomic()"),
>> included in kernel v6.5.
>>
>> After this patch lock contention will be less obvious, as converting this
>> to a mutex avoids multiple CPUs spinning while waiting for the lock, but
>> it doesn't remove the lock contention. It is recommended to use the
>> tracepoints to diagnose this.
> 
> I will keep the high-level conversation about using the mutex here in
> the cover letter thread, but I am wondering why we are keeping the
> lock dropping logic here with the mutex?
> 

I agree that yielding the mutex in the loop makes less sense,
especially since the raw_spin_unlock_irqrestore(cpu_lock, flags) call
will be a preemption point for my softirq.  But I kept it because we
are running a CONFIG_PREEMPT_VOLUNTARY kernel, so I was still worried that
there was no sched point for other userspace processes while holding the
mutex; I don't fully know the scheduling implications of holding a mutex.


> If this is to reduce lock contention, why does it depend on
> need_resched()? spin_needbreak() is a good indicator for lock
> contention, but need_resched() isn't, right?
>

As I said, I'm unsure of the semantics of holding a mutex.


> Also, how was this tested?
> 

I tested this in a testlab, prior to posting upstream, with parallel
readers of the stat files.  As I said in the other mail, I plan to experiment
with these patches (2+3) in production, as micro-benchmarking will not
reveal the corner cases we care about.  With BPF-based measurements of
the lock contention time, I hope we can catch production issues on a
time scale that happens prior to user-visible impacts.


> When I did previous changes to the flushing logic I used to make sure
> that userspace read latency was not impacted, as well as in-kernel
> flushers (e.g. reclaim). We should make sure there are no regressions
> on both fronts.
> 

Agree, we should consider both userspace readers and in-kernel flushers.
Maybe these need separate handling, as they have separate needs.

>>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> ---
>>   kernel/cgroup/rstat.c |   10 +++++-----
>>   1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>> index ff68c904e647..a90d68a7c27f 100644
>> --- a/kernel/cgroup/rstat.c
>> +++ b/kernel/cgroup/rstat.c
>> @@ -9,7 +9,7 @@
>>
>>   #include <trace/events/cgroup.h>
>>
>> -static DEFINE_SPINLOCK(cgroup_rstat_lock);
>> +static DEFINE_MUTEX(cgroup_rstat_lock);
>>   static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
>>
>>   static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
>> @@ -238,10 +238,10 @@ static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
>>   {
>>          bool contended;
>>
>> -       contended = !spin_trylock_irq(&cgroup_rstat_lock);
>> +       contended = !mutex_trylock(&cgroup_rstat_lock);
>>          if (contended) {
>>                  trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
>> -               spin_lock_irq(&cgroup_rstat_lock);
>> +               mutex_lock(&cgroup_rstat_lock);
>>          }
>>          trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
>>   }
>> @@ -250,7 +250,7 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>>          __releases(&cgroup_rstat_lock)
>>   {
>>          trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
>> -       spin_unlock_irq(&cgroup_rstat_lock);
>> +       mutex_unlock(&cgroup_rstat_lock);
>>   }
>>
>>   /* see cgroup_rstat_flush() */
>> @@ -278,7 +278,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>>                  }
>>
>>                  /* play nice and yield if necessary */
>> -               if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
>> +               if (need_resched()) {
>>                          __cgroup_rstat_unlock(cgrp, cpu);
>>                          if (!cond_resched())
>>                                  cpu_relax();

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing
  2024-04-18  2:21   ` Yosry Ahmed
@ 2024-04-18 11:00     ` Jesper Dangaard Brouer
  2024-04-18 15:49       ` Shakeel Butt
  2024-04-18 21:00       ` Yosry Ahmed
  0 siblings, 2 replies; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2024-04-18 11:00 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: tj, hannes, lizefan.x, cgroups, longman, netdev, linux-mm,
	linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko



On 18/04/2024 04.21, Yosry Ahmed wrote:
> On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>
>> This patch aims to reduce userspace-triggered pressure on the global
>> cgroup_rstat_lock by introducing a mechanism to limit how often reading
>> stat files causes cgroup rstat flushing.
>>
>> In the memory cgroup subsystem, memcg_vmstats_needs_flush() combined with
>> mem_cgroup_flush_stats_ratelimited() already limits pressure on the
>> global lock (cgroup_rstat_lock). As a result, reading memory-related stat
>> files (such as memory.stat, memory.numa_stat, zswap.current) is already
>> a less userspace-triggerable issue.
>>
>> However, other userspace users of cgroup_rstat_flush(), such as when
>> reading io.stat (blk-cgroup.c) and cpu.stat, lack a similar system to
>> limit pressure on the global lock. Furthermore, userspace can easily
>> trigger this issue by reading those stat files.
>>
>> Typically, normal userspace stats tools (e.g., cadvisor, nomad, systemd)
>> spawn threads that read io.stat, cpu.stat, and memory.stat (even from the
>> same cgroup) without realizing that on the kernel side, they share the
>> same global lock. This limitation also helps prevent malicious userspace
>> applications from harming the kernel by reading these stat files in a
>> tight loop.
>>
>> To address this, the patch introduces cgroup_rstat_flush_ratelimited(),
>> similar to memcg's mem_cgroup_flush_stats_ratelimited().
>>
>> Flushing occurs per cgroup (even though the lock remains global) a
>> variable named rstat_flush_last_time is introduced to track when a given
>> cgroup was last flushed. This variable, which contains the jiffies of the
>> flush, shares properties and a cache line with rstat_flush_next and is
>> updated simultaneously.
>>
>> For cpu.stat, we need to acquire the lock (via cgroup_rstat_flush_hold)
>> because other data is read under the lock, but we skip the expensive
>> flushing if it occurred recently.
>>
>> Regarding io.stat, there is an opportunity outside the lock to skip the
>> flush, but inside the lock, we must recheck to handle races.
>>
>> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> 
> As I mentioned in another thread, I really don't like time-based
> rate-limiting [1]. Would it be possible to generalize the
> magnitude-based rate-limiting instead? Have something like
> memcg_vmstats_needs_flush() in the core rstat code?
> 

I've taken a closer look at memcg_vmstats_needs_flush(), and I'm
concerned about the overhead of maintaining the stats (that are used as a filter).

   static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
   {
	return atomic64_read(&vmstats->stats_updates) >
		MEMCG_CHARGE_BATCH * num_online_cpus();
   }

I looked at `vmstats->stats_updates` to see how often this is getting
updated.  It is updated in memcg_rstat_updated(), but it gets inlined
into a number of functions (__mod_memcg_state, __mod_memcg_lruvec_state,
__count_memcg_events), plus it calls cgroup_rstat_updated().
Counting invocations per sec (via funccount):

   10:28:09
   FUNC                                    COUNT
   __mod_memcg_state                      377553
   __count_memcg_events                   393078
   __mod_memcg_lruvec_state              1229673
   cgroup_rstat_updated                  2632389


I'm surprised to see how many times per sec this is getting invoked.
Originating from memcg_rstat_updated() = 2,000,304 times per sec.
(On a 128 CPU core machine with 39% idle CPU-load.)
Maintaining these stats seems excessive...

Then how often does the filter lower pressure on the lock:

   MEMCG_CHARGE_BATCH(64) * 128 CPUs = 8192
   2000304/(64*128) = 244 times per sec (every ~4 ms)
   (assuming memcg_rstat_updated val=1)


> Also, why do we keep the memcg time rate-limiting with this patch? Is
> it because we use a much larger window there (2s)? Having two layers
> of time-based rate-limiting is not ideal imo.
>

I'm also not a fan of having two layers of time-based rate-limiting, but
they do operate at different time scales *and* are not active at the same
time with the current patch. If you noticed the details, I excluded
memcg from using this, as I commented "memcg has its own ratelimit layer"
(in do_flush_stats).

I would prefer removing the memcg time rate-limiting and using this more
granular 50ms (20 times/sec) limit for memcg also, and I was planning to do
that in a followup patchset.  The 50ms (20 times/sec) limit will be per
cgroup in the system, which then "scales"/increases with the number of
cgroups, but that is better than unbounded lock accesses per sec.

--Jesper


> [1]https://lore.kernel.org/lkml/CAJD7tkYnSRwJTpXxSnGgo-i3-OdD7cdT-e3_S_yf7dSknPoRKw@mail.gmail.com/


sudo ./bcc/tools/funccount -Ti 1 -d 10 
'__mod_memcg_state|__mod_memcg_lruvec_state|__count_memcg_events|cgroup_rstat_updated'

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex
  2024-04-18  9:02     ` Jesper Dangaard Brouer
@ 2024-04-18 14:49       ` Shakeel Butt
  2024-04-18 20:39         ` Yosry Ahmed
  2024-04-18 20:38       ` Yosry Ahmed
  1 sibling, 1 reply; 28+ messages in thread
From: Shakeel Butt @ 2024-04-18 14:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Yosry Ahmed, tj, hannes, lizefan.x, cgroups, longman, netdev,
	linux-mm, linux-kernel, kernel-team, Arnaldo Carvalho de Melo,
	Sebastian Andrzej Siewior, mhocko

On Thu, Apr 18, 2024 at 11:02:06AM +0200, Jesper Dangaard Brouer wrote:
> 
> 
> On 18/04/2024 04.19, Yosry Ahmed wrote:
[...]
> > 
> > I will keep the high-level conversation about using the mutex here in
> > the cover letter thread, but I am wondering why we are keeping the
> > lock dropping logic here with the mutex?
> > 
> 
> I agree that yielding the mutex in the loop makes less sense.
> Especially since the raw_spin_unlock_irqrestore(cpu_lock, flags) call
> will be a preemption point for my softirq.   But I kept it because, we
> are running a CONFIG_PREEMPT_VOLUNTARY kernel, so I still worried that
> there was no sched point for other userspace processes while holding the
> mutex, but I don't fully know the sched implication when holding a mutex.
> 

Are the softirqs you are interested in raised from the same cpu or a
remote cpu? What about a local_softirq_pending() check in addition to the
need_resched() and spin_needbreak() checks? If the softirq can only be
raised on the local cpu, then convert the spin_lock to a non-irq one (please
correct me if I am wrong, but on return from hard irq, and when not within a
bh- or irq-disabled spin_lock, the kernel will run the pending softirqs,
right?). Did you get the chance to test these two changes or something
similar in your prod environment?
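
To make the first suggestion concrete, a rough sketch (untested), assuming we
keep the spinlock and simply extend the existing yield condition in
cgroup_rstat_flush_locked():

  /* Sketch only: also yield when softirqs are pending on this cpu */
  if (need_resched() || spin_needbreak(&cgroup_rstat_lock) ||
      local_softirq_pending()) {
  	__cgroup_rstat_unlock(cgrp, cpu);
  	if (!cond_resched())
  		cpu_relax();
  	__cgroup_rstat_lock(cgrp, cpu);
  }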


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing
  2024-04-18 11:00     ` Jesper Dangaard Brouer
@ 2024-04-18 15:49       ` Shakeel Butt
  2024-04-18 21:00       ` Yosry Ahmed
  1 sibling, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2024-04-18 15:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Yosry Ahmed, tj, hannes, lizefan.x, cgroups, longman, netdev,
	linux-mm, linux-kernel, kernel-team, Arnaldo Carvalho de Melo,
	Sebastian Andrzej Siewior, mhocko

On Thu, Apr 18, 2024 at 01:00:30PM +0200, Jesper Dangaard Brouer wrote:
> 
> 
> On 18/04/2024 04.21, Yosry Ahmed wrote:
> > On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> > > 
> > > This patch aims to reduce userspace-triggered pressure on the global
> > > cgroup_rstat_lock by introducing a mechanism to limit how often reading
> > > stat files causes cgroup rstat flushing.
> > > 
> > > In the memory cgroup subsystem, memcg_vmstats_needs_flush() combined with
> > > mem_cgroup_flush_stats_ratelimited() already limits pressure on the
> > > global lock (cgroup_rstat_lock). As a result, reading memory-related stat
> > > files (such as memory.stat, memory.numa_stat, zswap.current) is already
> > > a less userspace-triggerable issue.
> > > 
> > > However, other userspace users of cgroup_rstat_flush(), such as when
> > > reading io.stat (blk-cgroup.c) and cpu.stat, lack a similar system to
> > > limit pressure on the global lock. Furthermore, userspace can easily
> > > trigger this issue by reading those stat files.
> > > 
> > > Typically, normal userspace stats tools (e.g., cadvisor, nomad, systemd)
> > > spawn threads that read io.stat, cpu.stat, and memory.stat (even from the
> > > same cgroup) without realizing that on the kernel side, they share the
> > > same global lock. This limitation also helps prevent malicious userspace
> > > applications from harming the kernel by reading these stat files in a
> > > tight loop.
> > > 
> > > To address this, the patch introduces cgroup_rstat_flush_ratelimited(),
> > > similar to memcg's mem_cgroup_flush_stats_ratelimited().
> > > 
> > > Flushing occurs per cgroup (even though the lock remains global) a
> > > variable named rstat_flush_last_time is introduced to track when a given
> > > cgroup was last flushed. This variable, which contains the jiffies of the
> > > flush, shares properties and a cache line with rstat_flush_next and is
> > > updated simultaneously.
> > > 
> > > For cpu.stat, we need to acquire the lock (via cgroup_rstat_flush_hold)
> > > because other data is read under the lock, but we skip the expensive
> > > flushing if it occurred recently.
> > > 
> > > Regarding io.stat, there is an opportunity outside the lock to skip the
> > > flush, but inside the lock, we must recheck to handle races.
> > > 
> > > Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> > 
> > As I mentioned in another thread, I really don't like time-based
> > rate-limiting [1]. Would it be possible to generalize the
> > magnitude-based rate-limiting instead? Have something like
> > memcg_vmstats_needs_flush() in the core rstat code?
> > 
> 
> I've taken a closer look at memcg_vmstats_needs_flush(). And I'm
> concerned about overhead maintaining the stats (that is used as a filter).
> 
>   static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
>   {
> 	return atomic64_read(&vmstats->stats_updates) >
> 		MEMCG_CHARGE_BATCH * num_online_cpus();
>   }
> 
> I looked at `vmstats->stats_updates` to see how often this is getting
> updated.  It is updated in memcg_rstat_updated(), but it gets inlined into a
> number function (__mod_memcg_state, __mod_memcg_lruvec_state,
> __count_memcg_events), plus it calls cgroup_rstat_updated().
> Counting invocations per sec (via funccount):
> 
>   10:28:09
>   FUNC                                    COUNT
>   __mod_memcg_state                      377553
>   __count_memcg_events                   393078
>   __mod_memcg_lruvec_state              1229673
>   cgroup_rstat_updated                  2632389
> 

Is it possible for you to also measure the frequency of the unique
callstacks calling these functions? In addition, the frequency of
each stat item update would be awesome.

> 
> I'm surprised to see how many time per sec this is getting invoked.
> Originating from memcg_rstat_updated() = 2,000,304 times per sec.
> (On a 128 CPU core machine with 39% idle CPU-load.)
> Maintaining these stats seems excessive...
> 
> Then how often does the filter lower pressure on lock:
> 
>   MEMCG_CHARGE_BATCH(64) * 128 CPU = 8192
>   2000304/(64*128) = 244 time per sec (every ~4ms)
>   (assuming memcg_rstat_updated val=1)
> 

It seems like we have opportunities to improve the stat update side, and
we definitely need to improve the stat flush side. One issue from the
memcg side is that the kernel has to do a lot of work, so we should be
reducing that.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex
  2024-04-18  9:02     ` Jesper Dangaard Brouer
  2024-04-18 14:49       ` Shakeel Butt
@ 2024-04-18 20:38       ` Yosry Ahmed
  1 sibling, 0 replies; 28+ messages in thread
From: Yosry Ahmed @ 2024-04-18 20:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: tj, hannes, lizefan.x, cgroups, longman, netdev, linux-mm,
	linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko,
	Wei Xu

On Thu, Apr 18, 2024 at 2:02 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
>
> On 18/04/2024 04.19, Yosry Ahmed wrote:
> > On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> >>
> >> Since kernel v4.18, cgroup_rstat_lock has been an IRQ-disabling spinlock,
> >> as introduced by commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex
> >> with a spinlock").
> >>
> >> Despite efforts in cgroup_rstat_flush_locked() to yield the lock when
> >> necessary during the collection of per-CPU stats, this approach has led
> >> to several scaling issues observed in production environments. Holding
> >> this IRQ lock has caused starvation of other critical kernel functions,
> >> such as softirq (e.g., timers and netstack). Although kernel v6.8
> >> introduced optimizations in this area, we continue to observe instances
> >> where the spin_lock is held for 64-128 ms in production.
> >>
> >> This patch converts cgroup_rstat_lock back to being a mutex lock. This
> >> change is made possible thanks to the significant effort by Yosry Ahmed
> >> to eliminate all atomic context use-cases through multiple commits,
> >> ending in 0a2dc6ac3329 ("cgroup: removecgroup_rstat_flush_atomic()"),
> >> included in kernel v6.5.
> >>
> >> After this patch lock contention will be less obvious, as converting this
> >> to a mutex avoids multiple CPUs spinning while waiting for the lock, but
> >> it doesn't remove the lock contention. It is recommended to use the
> >> tracepoints to diagnose this.
> >
> > I will keep the high-level conversation about using the mutex here in
> > the cover letter thread, but I am wondering why we are keeping the
> > lock dropping logic here with the mutex?
> >
>
> I agree that yielding the mutex in the loop makes less sense.
> Especially since the raw_spin_unlock_irqrestore(cpu_lock, flags) call
> will be a preemption point for my softirq.   But I kept it because, we
> are running a CONFIG_PREEMPT_VOLUNTARY kernel, so I still worried that
> there was no sched point for other userspace processes while holding the
> mutex, but I don't fully know the sched implication when holding a mutex.

I guess dropping the lock before rescheduling would be preferable in
this case, since we do not need to keep holding the lock for
correctness.
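
As a rough, untested sketch of what that could look like inside the
per-CPU flush loop (reusing the __cgroup_rstat_lock /
__cgroup_rstat_unlock helpers from patch 1):

  for_each_possible_cpu(cpu) {
          /* ... flush the updated cgroups for this cpu ... */

          /* Drop the mutex before the resched point, so waiters and
           * other tasks can run while we are off the CPU; nothing
           * requires holding the lock across CPU iterations.
           */
          if (need_resched()) {
                  __cgroup_rstat_unlock(cgrp, cpu);
                  cond_resched();
                  __cgroup_rstat_lock(cgrp, cpu);
          }
  }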

>
> > If this is to reduce lock contention, why does it depend on
> > need_resched()? spin_needbreak() is a good indicator for lock
> > contention, but need_resched() isn't, right?
> >
>
> As I said, I'm unsure of the semantics of holding a mutex.
>
>
> > Also, how was this tested?
> >
>
> I tested this in a testlab, prior to posting upstream, with parallel
> reader of the stat files.

I believe high concurrency is a key point here. CC'ing Wei, who
reported regressions on previous attempts of mine to address the lock
contention from userspace.

> As I said in other mail, I plan to experiment
> with these patches(2+3) in production, as micro-benchmarking will not
> reveal the corner cases we care about.

Right, but micro-benchmarking should give us a signal about
regressions. When I was working on this code before, it was very useful
to use synthetic benchmarks with high concurrency of userspace reads
and/or kernel flushers.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex
  2024-04-18 14:49       ` Shakeel Butt
@ 2024-04-18 20:39         ` Yosry Ahmed
  2024-04-19 13:15           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 28+ messages in thread
From: Yosry Ahmed @ 2024-04-18 20:39 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jesper Dangaard Brouer, tj, hannes, lizefan.x, cgroups, longman,
	netdev, linux-mm, linux-kernel, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko

On Thu, Apr 18, 2024 at 7:49 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Thu, Apr 18, 2024 at 11:02:06AM +0200, Jesper Dangaard Brouer wrote:
> >
> >
> > On 18/04/2024 04.19, Yosry Ahmed wrote:
> [...]
> > >
> > > I will keep the high-level conversation about using the mutex here in
> > > the cover letter thread, but I am wondering why we are keeping the
> > > lock dropping logic here with the mutex?
> > >
> >
> > I agree that yielding the mutex in the loop makes less sense.
> > Especially since the raw_spin_unlock_irqrestore(cpu_lock, flags) call
> > will be a preemption point for my softirq.   But I kept it because, we
> > are running a CONFIG_PREEMPT_VOLUNTARY kernel, so I still worried that
> > there was no sched point for other userspace processes while holding the
> > mutex, but I don't fully know the sched implication when holding a mutex.
> >
>
> Are the softirqs you are interested in, raised from the same cpu or
> remote cpu? What about local_softirq_pending() check in addition to
> need_resched() and spin_needbreak() checks? If softirq can only be
> raised on local cpu then convert the spin_lock to non-irq one (Please
> correct me if I am wrong but on return from hard irq and not within bh
> or irq disabled spin_lock, the kernel will run the pending softirqs,
> right?). Did you get the chance to test these two changes or something
> similar in your prod environment?

I tried making the spinlock a non-irq lock before, but Tejun objected [1].

Perhaps we could experiment with always dropping the lock at CPU
boundaries instead?

[1] https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing
  2024-04-18 11:00     ` Jesper Dangaard Brouer
  2024-04-18 15:49       ` Shakeel Butt
@ 2024-04-18 21:00       ` Yosry Ahmed
  2024-04-18 21:15         ` Tejun Heo
  2024-04-19 10:16         ` Jesper Dangaard Brouer
  1 sibling, 2 replies; 28+ messages in thread
From: Yosry Ahmed @ 2024-04-18 21:00 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: tj, hannes, lizefan.x, cgroups, longman, netdev, linux-mm,
	linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko,
	Wei Xu

On Thu, Apr 18, 2024 at 4:00 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
>
> On 18/04/2024 04.21, Yosry Ahmed wrote:
> > On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> >>
> >> This patch aims to reduce userspace-triggered pressure on the global
> >> cgroup_rstat_lock by introducing a mechanism to limit how often reading
> >> stat files causes cgroup rstat flushing.
> >>
> >> In the memory cgroup subsystem, memcg_vmstats_needs_flush() combined with
> >> mem_cgroup_flush_stats_ratelimited() already limits pressure on the
> >> global lock (cgroup_rstat_lock). As a result, reading memory-related stat
> >> files (such as memory.stat, memory.numa_stat, zswap.current) is already
> >> a less userspace-triggerable issue.
> >>
> >> However, other userspace users of cgroup_rstat_flush(), such as when
> >> reading io.stat (blk-cgroup.c) and cpu.stat, lack a similar system to
> >> limit pressure on the global lock. Furthermore, userspace can easily
> >> trigger this issue by reading those stat files.
> >>
> >> Typically, normal userspace stats tools (e.g., cadvisor, nomad, systemd)
> >> spawn threads that read io.stat, cpu.stat, and memory.stat (even from the
> >> same cgroup) without realizing that on the kernel side, they share the
> >> same global lock. This limitation also helps prevent malicious userspace
> >> applications from harming the kernel by reading these stat files in a
> >> tight loop.
> >>
> >> To address this, the patch introduces cgroup_rstat_flush_ratelimited(),
> >> similar to memcg's mem_cgroup_flush_stats_ratelimited().
> >>
> >> Flushing occurs per cgroup (even though the lock remains global), so a
> >> variable named rstat_flush_last_time is introduced to track when a given
> >> cgroup was last flushed. This variable, which contains the jiffies of the
> >> flush, shares properties and a cache line with rstat_flush_next and is
> >> updated simultaneously.
> >>
> >> For cpu.stat, we need to acquire the lock (via cgroup_rstat_flush_hold)
> >> because other data is read under the lock, but we skip the expensive
> >> flushing if it occurred recently.
> >>
> >> Regarding io.stat, there is an opportunity outside the lock to skip the
> >> flush, but inside the lock, we must recheck to handle races.
> >>
> >> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> >
> > As I mentioned in another thread, I really don't like time-based
> > rate-limiting [1]. Would it be possible to generalize the
> > magnitude-based rate-limiting instead? Have something like
> > memcg_vmstats_needs_flush() in the core rstat code?
> >
>
> I've taken a closer look at memcg_vmstats_needs_flush(). And I'm
> concerned about the overhead of maintaining the stats (that are used as a filter).
>
>    static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
>    {
>         return atomic64_read(&vmstats->stats_updates) >
>                 MEMCG_CHARGE_BATCH * num_online_cpus();
>    }
>
> I looked at `vmstats->stats_updates` to see how often this is getting
> updated.  It is updated in memcg_rstat_updated(), but it gets inlined
> into a number of functions (__mod_memcg_state, __mod_memcg_lruvec_state,
> __count_memcg_events), plus it calls cgroup_rstat_updated().
> Counting invocations per sec (via funccount):
>
>    10:28:09
>    FUNC                                    COUNT
>    __mod_memcg_state                      377553
>    __count_memcg_events                   393078
>    __mod_memcg_lruvec_state              1229673
>    cgroup_rstat_updated                  2632389
>
>
> I'm surprised to see how many times per sec this is getting invoked.
> Originating from memcg_rstat_updated() = 2,000,304 times per sec.
> (On a 128 CPU core machine with 39% idle CPU-load.)
> Maintaining these stats seems excessive...

Well, the number of calls to memcg_rstat_updated() is not affected by
maintaining stats_updates, and this only adds a few percpu updates in
the common path. I did not see any regressions (after all
optimizations) in any benchmarks with this, including will-it-scale
and netperf.

>
> Then how often does the filter lower pressure on lock:
>
>    MEMCG_CHARGE_BATCH(64) * 128 CPU = 8192
>    2000304/(64*128) = 244 time per sec (every ~4ms)
>    (assuming memcg_rstat_updated val=1)

This does not tell the whole story though because:

1. The threshold (8192 in this case) is per-memcg. I am assuming
2,000,304 is the number of calls per second for the entire system. In
this case, the filtering should be more effective.

2. This assumes that updates and flushes are uniform; I am not sure
this applies in practice.

3. In my experiments, this thresholding drastically improved userspace
read latency under heavy contention (100s or 1000s of readers),
especially the tail latencies.

Generally, I think magnitude-based thresholding is better than
time-based, especially in larger systems where a lot can change in a
short amount of time. I did not observe any regressions from this
scheme, and I observed very noticeable improvements in flushing
latency.
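
For illustration, a generalized magnitude-based filter in the rstat
core could look roughly like the sketch below. The names
rstat_pending_updates and UPDATES_BATCH are made up, and finding a
pending-updates counter that makes sense for every controller is
exactly the hard part:

  /* Hypothetical: skip the expensive flush while the magnitude of
   * pending updates for this cgroup is below a per-CPU-scaled batch.
   */
  static bool cgroup_rstat_needs_flush(struct cgroup *cgrp)
  {
          return atomic64_read(&cgrp->rstat_pending_updates) >
                  UPDATES_BATCH * num_online_cpus();
  }

  void cgroup_rstat_flush_filtered(struct cgroup *cgrp)
  {
          if (!cgroup_rstat_needs_flush(cgrp))
                  return;
          cgroup_rstat_flush(cgrp);
  }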

Taking a step back, I think this series is trying to address two
issues in one go: interrupt handling latency and lock contention.
While both are related because reducing flushing reduces irq
disablement, I think it would be better if we can fix that issue
separately with a more fundamental solution (e.g. using a mutex or
dropping the lock at each CPU boundary).

After that, we can more clearly evaluate the lock contention problem
with data purely about flushing latency, without taking into
consideration the irq handling problem.

Does this make sense to you?

>
>
> > Also, why do we keep the memcg time rate-limiting with this patch? Is
> > it because we use a much larger window there (2s)? Having two layers
> > of time-based rate-limiting is not ideal imo.
> >
>
> I'm also not-a-fan of having two layers of time-based rate-limiting, but
> they do operate at different time scales *and* are not active at the same
> time with the current patch; if you noticed the details, I excluded
> memcg from using this as I commented "memcg have own ratelimit layer"
> (in do_flush_stats).

Right, I meant generally having two schemes doing very similar things,
even if they are not active at the same time.

I think this is an artifact of different subsystems sharing the same
rstat tree for no specific reason. I think almost all flushing calls
really need the stats from one subsystem after all.

If we have separate trees, lock contention gets slightly better as
different subsystems do not compete. We can also have different
subsystems "customize" their trees, for e.g. by setting different
time-based or magnitude-based rate-limiting thresholds.

I know this is a bigger lift, just thinking out loud :)

>
> I would prefer removing the memcg time rate-limiting and use this more
> granular 50ms (20 timer/sec) for memcg also.  And I was planning to do
> that in a followup patchset.  The 50ms (20 timer/sec) limit will be per
> cgroup in the system, which then "scales"/increase with the number of
> cgroups, but better than unbounded read/access locks per sec.
>
> --Jesper
>
>
> > [1]https://lore.kernel.org/lkml/CAJD7tkYnSRwJTpXxSnGgo-i3-OdD7cdT-e3_S_yf7dSknPoRKw@mail.gmail.com/
>
>
> sudo ./bcc/tools/funccount -Ti 1 -d 10
> '__mod_memcg_state|__mod_memcg_lruvec_state|__count_memcg_events|cgroup_rstat_updated'

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing
  2024-04-18 21:00       ` Yosry Ahmed
@ 2024-04-18 21:15         ` Tejun Heo
  2024-04-18 21:22           ` Yosry Ahmed
  2024-04-19 10:16         ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2024-04-18 21:15 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Jesper Dangaard Brouer, hannes, lizefan.x, cgroups, longman,
	netdev, linux-mm, linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko,
	Wei Xu

Hello, Yosry.

On Thu, Apr 18, 2024 at 02:00:28PM -0700, Yosry Ahmed wrote:
...
> I think this is an artifact of different subsystems sharing the same
> rstat tree for no specific reason. I think almost all flushing calls
> really need the stats from one subsystem after all.
>
> If we have separate trees, lock contention gets slightly better as
> different subsystems do not compete. We can also have different
> subsystems "customize" their trees, for e.g. by setting different
> time-based or magnitude-based rate-limiting thresholds.
> 
> I know this is a bigger lift, just thinking out loud :)

I have no objection to separating out rstat trees so that it has
per-controller tracking. However, the high frequency sources of updates are
cpu and memory, which tend to fire together, and the only really high
frequency consumer seems to be memory, so I'm not too sure how much benefit
separating the trees out would bring.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing
  2024-04-18 21:15         ` Tejun Heo
@ 2024-04-18 21:22           ` Yosry Ahmed
  2024-04-18 21:32             ` Tejun Heo
  0 siblings, 1 reply; 28+ messages in thread
From: Yosry Ahmed @ 2024-04-18 21:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jesper Dangaard Brouer, hannes, lizefan.x, cgroups, longman,
	netdev, linux-mm, linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko,
	Wei Xu

On Thu, Apr 18, 2024 at 2:15 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Yosry.
>
> On Thu, Apr 18, 2024 at 02:00:28PM -0700, Yosry Ahmed wrote:
> ...
> > I think this is an artifact of different subsystems sharing the same
> > rstat tree for no specific reason. I think almost all flushing calls
> > really need the stats from one subsystem after all.
> >
> > If we have separate trees, lock contention gets slightly better as
> > different subsystems do not compete. We can also have different
> > subsystems "customize" their trees, for e.g. by setting different
> > time-based or magnitude-based rate-limiting thresholds.
> >
> > I know this is a bigger lift, just thinking out loud :)
>
> I have no objection to separating out rstat trees so that it has
> per-controller tracking. However, the high frequency sources of updates are
> cpu and memory, which tend to fire together, and the only really high
> frequency consumer seems to be memory, so I'm not too sure how much benefit
> separating the trees out would bring.

Well, we could split the global lock into multiple ones, which isn't
really a solution, but it would help other controllers not to be
affected by the high frequency of flushing from the memory controller
(which has its own thresholding).

For updates, cpu and memory would use separate percpu locks as well,
which may help slightly.

Outside of this, I think it helps us add controller-specific
optimizations. For example, I tried to generalize the thresholding
code in the memory controller and put it in the rstat code, but I
couldn't really have a single value representing the "pending stats"
from all controllers. It's impossible to compare memory stats (mostly
in pages or bytes) to cpu time stats for instance.

Similarly, with this proposal from Jesper (which I am not saying I am
agreeing with :P), instead of having time-based ratelimiting in both
the rstat code and the memcg code to support different thresholds, we
could have the memory controller set a different threshold for itself.

So perhaps the lock breakdowns are not enough motivation, but if we
start generalizing optimizations in some controllers, we may want to
split the tree for flexibility.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing
  2024-04-18 21:22           ` Yosry Ahmed
@ 2024-04-18 21:32             ` Tejun Heo
  0 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2024-04-18 21:32 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Jesper Dangaard Brouer, hannes, lizefan.x, cgroups, longman,
	netdev, linux-mm, linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko,
	Wei Xu

Hello,

On Thu, Apr 18, 2024 at 02:22:58PM -0700, Yosry Ahmed wrote:
> Outside of this, I think it helps us add controller-specific
> optimizations. For example, I tried to generalize the thresholding
> code in the memory controller and put it in the rstat code, but I
> couldn't really have a single value representing the "pending stats"
> from all controllers. It's impossible to compare memory stats (mostly
> in pages or bytes) to cpu time stats for instance.
> 
> Similarly, with this proposal from Jesper (which I am not saying I am
> agreeing with :P), instead of having time-based ratelimiting in both
> the rstat code and the memcg code to support different thresholds, we
> could have the memory controller set a different threshold for itself.
> 
> So perhaps the lock breakdowns are not enough motivation, but if we
> start generalizing optimizations in some controllers, we may want to
> split the tree for flexibility.

I see. Yeah, that makes more sense to me.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing
  2024-04-18 21:00       ` Yosry Ahmed
  2024-04-18 21:15         ` Tejun Heo
@ 2024-04-19 10:16         ` Jesper Dangaard Brouer
  2024-04-19 19:25           ` Yosry Ahmed
  1 sibling, 1 reply; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2024-04-19 10:16 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: tj, hannes, lizefan.x, cgroups, longman, netdev, linux-mm,
	linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko,
	Wei Xu


On 18/04/2024 23.00, Yosry Ahmed wrote:
> On Thu, Apr 18, 2024 at 4:00 AM Jesper Dangaard Brouer<hawk@kernel.org>  wrote:
>> On 18/04/2024 04.21, Yosry Ahmed wrote:
>>> On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer<hawk@kernel.org>  wrote:
>>>> This patch aims to reduce userspace-triggered pressure on the global
>>>> cgroup_rstat_lock by introducing a mechanism to limit how often reading
>>>> stat files causes cgroup rstat flushing.
>>>>
[...]

> Taking a step back, I think this series is trying to address two
> issues in one go: interrupt handling latency and lock contention.

Yes, patches 2 and 3 are essentially independent and address two different
aspects.

> While both are related because reducing flushing reduces irq
> disablement, I think it would be better if we can fix that issue
> separately with a more fundamental solution (e.g. using a mutex or
> dropping the lock at each CPU boundary).
> 
> After that, we can more clearly evaluate the lock contention problem
> with data purely about flushing latency, without taking into
> consideration the irq handling problem.
> 
> Does this make sense to you?

Yes, makes sense.

So, you are suggesting we start with the mutex change? (patch 2)
(which still needs some adjustments/tuning)

This makes sense to me, as there are likely many solutions to reducing
the pressure on the lock.

With the tracepoint patch in-place, I/we can measure the pressure on the
lock, and I plan to do this across our CF fleet.  Then we can slowly
work on improving lock contention and evaluate this on our fleets.

--Jesper
p.s.
Setting expectations:
  - Going on vacation today, so will resume work after 29th April.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex
  2024-04-18 20:39         ` Yosry Ahmed
@ 2024-04-19 13:15           ` Jesper Dangaard Brouer
  2024-04-19 16:11             ` Shakeel Butt
  0 siblings, 1 reply; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2024-04-19 13:15 UTC (permalink / raw)
  To: Yosry Ahmed, Shakeel Butt
  Cc: tj, hannes, lizefan.x, cgroups, longman, netdev, linux-mm,
	linux-kernel, kernel-team, Arnaldo Carvalho de Melo,
	Sebastian Andrzej Siewior, mhocko



On 18/04/2024 22.39, Yosry Ahmed wrote:
> On Thu, Apr 18, 2024 at 7:49 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>
>> On Thu, Apr 18, 2024 at 11:02:06AM +0200, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 18/04/2024 04.19, Yosry Ahmed wrote:
>> [...]
>>>>
>>>> I will keep the high-level conversation about using the mutex here in
>>>> the cover letter thread, but I am wondering why we are keeping the
>>>> lock dropping logic here with the mutex?
>>>>
>>>
>>> I agree that yielding the mutex in the loop makes less sense.
>>> Especially since the raw_spin_unlock_irqrestore(cpu_lock, flags) call
>>> will be a preemption point for my softirq.   But I kept it because, we
>>> are running a CONFIG_PREEMPT_VOLUNTARY kernel, so I still worried that
>>> there was no sched point for other userspace processes while holding the
>>> mutex, but I don't fully know the sched implication when holding a mutex.
>>>
>>
>> Are the softirqs you are interested in, raised from the same cpu or
>> remote cpu? What about local_softirq_pending() check in addition to
>> need_resched() and spin_needbreak() checks? If softirq can only be
>> raised on local cpu then convert the spin_lock to non-irq one (Please
>> correct me if I am wrong but on return from hard irq and not within bh
>> or irq disabled spin_lock, the kernel will run the pending softirqs,
>> right?). Did you get the chance to test these two changes or something
>> similar in your prod environment?
> 
> I tried making the spinlock a non-irq lock before, but Tejun objected [1].
> [1] https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/
> 

After reading [1], I think using a mutex is a better approach (than 
non-irq spinlock).


> Perhaps we could experiment with always dropping the lock at CPU
> boundaries instead?
> 

I don't think this will be enough (always dropping the lock at CPU
boundaries).  My measured "lock-hold" times show IRQ (and softirq) being
blocked for too long.  Looking at prod with my new cgroup tracepoint
script[2], I see that when contention occurs, many Yields happen with
the same magnitude as Contended. But I still see events with long
"lock-hold" times, even though yields are high.

  [2] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt

Example output:

  12:46:56 High Lock-contention: wait: 739 usec (0 ms) on CPU:56 comm:kswapd7
  12:46:56 Long lock-hold time: 6381 usec (6 ms) on CPU:27 comm:kswapd3
  12:46:56 Long lock-hold time: 18905 usec (18 ms) on CPU:100 comm:kworker/u261:12

  12:46:56  time elapsed: 36 sec (interval = 1 sec)
   Flushes(2051) 15/interval (avg 56/sec)
   Locks(44464) 1340/interval (avg 1235/sec)
   Yields(42413) 1325/interval (avg 1178/sec)
   Contended(42112) 1322/interval (avg 1169/sec)

There is reported 15 flushes/sec, but locks are yielded quickly.

More problematically (for softirq latency) we see a Long lock-hold time
reaching 18 ms.  For network RX softirq I need lower than 0.5ms latency,
to avoid RX-ring HW queue overflows.


--Jesper
p.s. I'm seeing a pattern with kswapdN contending on this lock.

@stack[697, kswapd3]:
         __cgroup_rstat_lock+107
         __cgroup_rstat_lock+107
         cgroup_rstat_flush_locked+851
         cgroup_rstat_flush+35
         shrink_node+226
         balance_pgdat+807
         kswapd+521
         kthread+228
         ret_from_fork+48
         ret_from_fork_asm+27

@stack[698, kswapd4]:
         __cgroup_rstat_lock+107
         __cgroup_rstat_lock+107
         cgroup_rstat_flush_locked+851
         cgroup_rstat_flush+35
         shrink_node+226
         balance_pgdat+807
         kswapd+521
         kthread+228
         ret_from_fork+48
         ret_from_fork_asm+27

@stack[699, kswapd5]:
         __cgroup_rstat_lock+107
         __cgroup_rstat_lock+107
         cgroup_rstat_flush_locked+851
         cgroup_rstat_flush+35
         shrink_node+226
         balance_pgdat+807
         kswapd+521
         kthread+228
         ret_from_fork+48
         ret_from_fork_asm+27


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex
  2024-04-19 13:15           ` Jesper Dangaard Brouer
@ 2024-04-19 16:11             ` Shakeel Butt
  2024-04-19 19:21               ` Yosry Ahmed
  0 siblings, 1 reply; 28+ messages in thread
From: Shakeel Butt @ 2024-04-19 16:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Yosry Ahmed, tj, hannes, lizefan.x, cgroups, longman, netdev,
	linux-mm, linux-kernel, kernel-team, Arnaldo Carvalho de Melo,
	Sebastian Andrzej Siewior, mhocko

On Fri, Apr 19, 2024 at 03:15:01PM +0200, Jesper Dangaard Brouer wrote:
> 
> 
> On 18/04/2024 22.39, Yosry Ahmed wrote:
> > On Thu, Apr 18, 2024 at 7:49 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > 
> > > On Thu, Apr 18, 2024 at 11:02:06AM +0200, Jesper Dangaard Brouer wrote:
> > > > 
> > > > 
> > > > On 18/04/2024 04.19, Yosry Ahmed wrote:
> > > [...]
> > > > > 
> > > > > I will keep the high-level conversation about using the mutex here in
> > > > > the cover letter thread, but I am wondering why we are keeping the
> > > > > lock dropping logic here with the mutex?
> > > > > 
> > > > 
> > > > I agree that yielding the mutex in the loop makes less sense.
> > > > Especially since the raw_spin_unlock_irqrestore(cpu_lock, flags) call
> > > > will be a preemption point for my softirq.   But I kept it because, we
> > > > are running a CONFIG_PREEMPT_VOLUNTARY kernel, so I still worried that
> > > > there was no sched point for other userspace processes while holding the
> > > > mutex, but I don't fully know the sched implication when holding a mutex.
> > > > 
> > > 
> > > Are the softirqs you are interested in, raised from the same cpu or
> > > remote cpu? What about local_softirq_pending() check in addition to
> > > need_resched() and spin_needbreak() checks? If softirq can only be
> > > raised on local cpu then convert the spin_lock to non-irq one (Please
> > > correct me if I am wrong but on return from hard irq and not within bh
> > > or irq disabled spin_lock, the kernel will run the pending softirqs,
> > > right?). Did you get the chance to test these two changes or something
> > > similar in your prod environment?
> > 
> > I tried making the spinlock a non-irq lock before, but Tejun objected [1].
> > [1] https://lore.kernel.org/lkml/ZBz%2FV5a7%2F6PZeM7S@slm.duckdns.org/
> > 
> 
> After reading [1], I think using a mutex is a better approach (than non-irq
> spinlock).
> 
> 
> > Perhaps we could experiment with always dropping the lock at CPU
> > boundaries instead?
> > 
> 
> I don't think this will be enough (always dropping the lock at CPU
> boundaries).  My measured "lock-hold" times show IRQ (and softirq) being
> blocked for too long.  Looking at prod with my new cgroup tracepoint
> script[2], I see that when contention occurs, many Yields happen with
> the same magnitude as Contended. But I still see events with long
> "lock-hold" times, even though yields are high.
> 
>  [2] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt
> 
> Example output:
> 
>  12:46:56 High Lock-contention: wait: 739 usec (0 ms) on CPU:56 comm:kswapd7
>  12:46:56 Long lock-hold time: 6381 usec (6 ms) on CPU:27 comm:kswapd3
>  12:46:56 Long lock-hold time: 18905 usec (18 ms) on CPU:100
> comm:kworker/u261:12
> 
>  12:46:56  time elapsed: 36 sec (interval = 1 sec)
>   Flushes(2051) 15/interval (avg 56/sec)
>   Locks(44464) 1340/interval (avg 1235/sec)
>   Yields(42413) 1325/interval (avg 1178/sec)
>   Contended(42112) 1322/interval (avg 1169/sec)
> 
> There is reported 15 flushes/sec, but locks are yielded quickly.
> 
> More problematically (for softirq latency) we see a Long lock-hold time
> reaching 18 ms.  For network RX softirq I need lower than 0.5ms latency,
> to avoid RX-ring HW queue overflows.
> 
> 
> --Jesper
> p.s. I'm seeing a pattern with kswapdN contending on this lock.
> 
> @stack[697, kswapd3]:
>         __cgroup_rstat_lock+107
>         __cgroup_rstat_lock+107
>         cgroup_rstat_flush_locked+851
>         cgroup_rstat_flush+35
>         shrink_node+226
>         balance_pgdat+807
>         kswapd+521
>         kthread+228
>         ret_from_fork+48
>         ret_from_fork_asm+27
> 
> @stack[698, kswapd4]:
>         __cgroup_rstat_lock+107
>         __cgroup_rstat_lock+107
>         cgroup_rstat_flush_locked+851
>         cgroup_rstat_flush+35
>         shrink_node+226
>         balance_pgdat+807
>         kswapd+521
>         kthread+228
>         ret_from_fork+48
>         ret_from_fork_asm+27
> 
> @stack[699, kswapd5]:
>         __cgroup_rstat_lock+107
>         __cgroup_rstat_lock+107
>         cgroup_rstat_flush_locked+851
>         cgroup_rstat_flush+35
>         shrink_node+226
>         balance_pgdat+807
>         kswapd+521
>         kthread+228
>         ret_from_fork+48
>         ret_from_fork_asm+27
> 

Can you simply replace mem_cgroup_flush_stats() in
prepare_scan_control() with the ratelimited version and see if the issue
still persists for your production traffic?
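
A sketch of that change (from memory, not compile-tested, and assuming
the current signatures where both flush helpers take the memcg
argument):

  --- a/mm/vmscan.c
  +++ b/mm/vmscan.c
  @@ prepare_scan_control() @@
  -        mem_cgroup_flush_stats(sc->target_mem_cgroup);
  +        mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup);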

Also were you able to get which specific stats are getting the most
updates?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex
  2024-04-19 16:11             ` Shakeel Butt
@ 2024-04-19 19:21               ` Yosry Ahmed
  0 siblings, 0 replies; 28+ messages in thread
From: Yosry Ahmed @ 2024-04-19 19:21 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jesper Dangaard Brouer, tj, hannes, lizefan.x, cgroups, longman,
	netdev, linux-mm, linux-kernel, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko

[..]
> > > Perhaps we could experiment with always dropping the lock at CPU
> > > boundaries instead?
> > >
> >
> > I don't think this will be enough (always dropping the lock at CPU
> > boundaries).  My measured "lock-hold" times show IRQ (and softirq) being
> > blocked for too long.  Looking at prod with my new cgroup tracepoint
> > script[2], I see that when contention occurs, many Yields happen with
> > the same magnitude as Contended. But I still see events with long
> > "lock-hold" times, even though yields are high.
> >
> >  [2] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt
> >
> > Example output:
> >
> >  12:46:56 High Lock-contention: wait: 739 usec (0 ms) on CPU:56 comm:kswapd7
> >  12:46:56 Long lock-hold time: 6381 usec (6 ms) on CPU:27 comm:kswapd3
> >  12:46:56 Long lock-hold time: 18905 usec (18 ms) on CPU:100
> > comm:kworker/u261:12
> >
> >  12:46:56  time elapsed: 36 sec (interval = 1 sec)
> >   Flushes(2051) 15/interval (avg 56/sec)
> >   Locks(44464) 1340/interval (avg 1235/sec)
> >   Yields(42413) 1325/interval (avg 1178/sec)
> >   Contended(42112) 1322/interval (avg 1169/sec)
> >
> > There is reported 15 flushes/sec, but locks are yielded quickly.
> >
> > More problematically (for softirq latency) we see a Long lock-hold time
> > reaching 18 ms.  For network RX softirq I need lower than 0.5ms latency,
> > to avoid RX-ring HW queue overflows.

Here we are measuring yields against contention, but the main problem
here is IRQ serving latency, which doesn't have to correlate with
contention, right?

Perhaps contention is causing us to yield the lock every nth cpu
boundary, but apparently this is not enough for IRQ serving latency.
Dropping the lock on each boundary should improve IRQ serving latency,
regardless of the presence of contention.

Let's focus on one problem at a time ;)
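
Concretely, an untested sketch of that (again on top of the patch 1
helpers), dropping and re-taking the lock at every CPU boundary instead
of only when need_resched()/spin_needbreak() fire:

  for_each_possible_cpu(cpu) {
          /* ... flush the updated cgroups for this cpu ... */

          /* Unconditionally open an IRQ/softirq window between CPUs,
           * independent of contention or need_resched().
           */
          __cgroup_rstat_unlock(cgrp, cpu);
          if (!cond_resched())
                  cpu_relax();
          __cgroup_rstat_lock(cgrp, cpu);
  }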

> >
> >
> > --Jesper
> > p.s. I'm seeing a pattern with kswapdN contending on this lock.
> >
> > @stack[697, kswapd3]:
> >         __cgroup_rstat_lock+107
> >         __cgroup_rstat_lock+107
> >         cgroup_rstat_flush_locked+851
> >         cgroup_rstat_flush+35
> >         shrink_node+226
> >         balance_pgdat+807
> >         kswapd+521
> >         kthread+228
> >         ret_from_fork+48
> >         ret_from_fork_asm+27
> >
> > @stack[698, kswapd4]:
> >         __cgroup_rstat_lock+107
> >         __cgroup_rstat_lock+107
> >         cgroup_rstat_flush_locked+851
> >         cgroup_rstat_flush+35
> >         shrink_node+226
> >         balance_pgdat+807
> >         kswapd+521
> >         kthread+228
> >         ret_from_fork+48
> >         ret_from_fork_asm+27
> >
> > @stack[699, kswapd5]:
> >         __cgroup_rstat_lock+107
> >         __cgroup_rstat_lock+107
> >         cgroup_rstat_flush_locked+851
> >         cgroup_rstat_flush+35
> >         shrink_node+226
> >         balance_pgdat+807
> >         kswapd+521
> >         kthread+228
> >         ret_from_fork+48
> >         ret_from_fork_asm+27
> >
>
> Can you simply replace mem_cgroup_flush_stats() in
> prepare_scan_control() with the ratelimited version and see if the issue
> still persists for your production traffic?

With thresholding, the fact that we reach cgroup_rstat_flush() means
that there is a high magnitude of pending updates. I think Jesper
mentioned 128 CPUs before; that means 128 * 64 (MEMCG_CHARGE_BATCH)
page-sized updates. That could be over 33 MBs with 4K page size.

I am not sure if it's fine to ignore such updates in shrink_node(),
especially since it is sometimes called in a loop, so I imagine we may
want to see what changed after the last iteration.

>
> Also were you able to get which specific stats are getting the most
> updates?

This, on the other hand, would be very interesting. I think it is very
possible that we don't actually have 33 MBs of updates, but rather we
keep adding and subtracting from the same stat until we reach the
threshold. This could especially be true for hot stats like slab
allocations.
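
(For context on why churn still trips the filter: as I understand the
current memcg code, the per-CPU pending counter accumulates magnitudes,
roughly like this sketch:

  statc->stats_updates += abs(val);   /* +N and -N both add N */
  if (statc->stats_updates >= MEMCG_CHARGE_BATCH) {
          /* propagate to the memcg-level atomic and reset */
  }

so an alloc/free pair on the same stat counts twice even though the net
change is zero.)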

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing
  2024-04-19 10:16         ` Jesper Dangaard Brouer
@ 2024-04-19 19:25           ` Yosry Ahmed
  0 siblings, 0 replies; 28+ messages in thread
From: Yosry Ahmed @ 2024-04-19 19:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: tj, hannes, lizefan.x, cgroups, longman, netdev, linux-mm,
	linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko,
	Wei Xu

On Fri, Apr 19, 2024 at 3:17 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
> On 18/04/2024 23.00, Yosry Ahmed wrote:
> > On Thu, Apr 18, 2024 at 4:00 AM Jesper Dangaard Brouer<hawk@kernel.org>  wrote:
> >> On 18/04/2024 04.21, Yosry Ahmed wrote:
> >>> On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer<hawk@kernel.org>  wrote:
> >>>> This patch aims to reduce userspace-triggered pressure on the global
> >>>> cgroup_rstat_lock by introducing a mechanism to limit how often reading
> >>>> stat files causes cgroup rstat flushing.
> >>>>
> [...]
>
> > Taking a step back, I think this series is trying to address two
> > issues in one go: interrupt handling latency and lock contention.
>
> Yes, patches 2 and 3 are essentially independent and address two different
> aspects.
>
> > While both are related because reducing flushing reduces irq
> > disablement, I think it would be better if we can fix that issue
> > separately with a more fundamental solution (e.g. using a mutex or
> > dropping the lock at each CPU boundary).
> >
> > After that, we can more clearly evaluate the lock contention problem
> > with data purely about flushing latency, without taking into
> > consideration the irq handling problem.
> >
> > Does this make sense to you?
>
> Yes, makes sense.
>
> So, you are suggesting we start with the mutex change? (patch 2)
> (which still needs some adjustments/tuning)

Yes. Let's focus on (what I assume to be) the more important problem,
IRQ serving latency. Once this is fixed, let's consider the tradeoffs
for improving lock contention separately.

Thanks!

>
> This makes sense to me, as there are likely many solutions to reducing
> the pressure on the lock.
>
> With the tracepoint patch in-place, I/we can measure the pressure on the
> lock, and I plan to do this across our CF fleet.  Then we can slowly
> work on improving lock contention and evaluate this on our fleets.
>
> --Jesper
> p.s.
> Setting expectations:
>   - Going on vacation today, so will resume work after 29th April.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints
  2024-04-16 17:51 ` [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints Jesper Dangaard Brouer
  2024-04-16 21:36   ` Tejun Heo
@ 2024-04-23 16:53   ` Simon Horman
  2024-04-29 11:36     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 28+ messages in thread
From: Simon Horman @ 2024-04-23 16:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: tj, hannes, lizefan.x, cgroups, yosryahmed, longman, netdev,
	linux-mm, linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko

On Tue, Apr 16, 2024 at 07:51:26PM +0200, Jesper Dangaard Brouer wrote:

...

>  /**
>   * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
>   */

Hi Jesper,

as a follow-up could you add an entry for cgrp to the kernel doc above?

> -void cgroup_rstat_flush_release(void)
> +void cgroup_rstat_flush_release(struct cgroup *cgrp)
>  	__releases(&cgroup_rstat_lock)
>  {
> -	spin_unlock_irq(&cgroup_rstat_lock);
> +	__cgroup_rstat_unlock(cgrp, -1);
>  }

...

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints
  2024-04-23 16:53   ` Simon Horman
@ 2024-04-29 11:36     ` Jesper Dangaard Brouer
  2024-04-29 17:48       ` Simon Horman
  0 siblings, 1 reply; 28+ messages in thread
From: Jesper Dangaard Brouer @ 2024-04-29 11:36 UTC (permalink / raw)
  To: Simon Horman
  Cc: tj, hannes, lizefan.x, cgroups, yosryahmed, longman, netdev,
	linux-mm, linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko



On 23/04/2024 18.53, Simon Horman wrote:
> On Tue, Apr 16, 2024 at 07:51:26PM +0200, Jesper Dangaard Brouer wrote:
> 
> ...
> 
>>   /**
>>    * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
>>    */
> 
> Hi Jesper,
> 
> as a follow-up could you add an entry for cgrp to the kernel doc above?
> 

Already fixed in TJ's tree for-6.10
  - https://lore.kernel.org/cgroups/ZiB97v73Fr-wq14f@slm.duckdns.org/
  - https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/commit/?h=for-6.10&id=97a46a66ad7d9126

But thanks Simon for your vigilance in noticing these things :-)
--Jesper


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints
  2024-04-29 11:36     ` Jesper Dangaard Brouer
@ 2024-04-29 17:48       ` Simon Horman
  0 siblings, 0 replies; 28+ messages in thread
From: Simon Horman @ 2024-04-29 17:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: tj, hannes, lizefan.x, cgroups, yosryahmed, longman, netdev,
	linux-mm, linux-kernel, shakeel.butt, kernel-team,
	Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior, mhocko

On Mon, Apr 29, 2024 at 01:36:20PM +0200, Jesper Dangaard Brouer wrote:
> 
> 
> On 23/04/2024 18.53, Simon Horman wrote:
> > On Tue, Apr 16, 2024 at 07:51:26PM +0200, Jesper Dangaard Brouer wrote:
> > 
> > ...
> > 
> > >   /**
> > >    * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
> > >    */
> > 
> > Hi Jesper,
> > 
> > as a follow-up could you add an entry for cgrp to the kernel doc above?
> > 
> 
> Already fixed in TJ's tree for-6.10
>  - https://lore.kernel.org/cgroups/ZiB97v73Fr-wq14f@slm.duckdns.org/
>  - https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/commit/?h=for-6.10&id=97a46a66ad7d9126
> 
> But thanks Simon for your vigilance in noticing these things :-)

Thanks Jesper,

I appreciate you letting me know :)

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2024-04-29 17:50 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-16 17:51 [PATCH v1 0/3] cgroup/rstat: global cgroup_rstat_lock changes Jesper Dangaard Brouer
2024-04-16 17:51 ` [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints Jesper Dangaard Brouer
2024-04-16 21:36   ` Tejun Heo
2024-04-18  8:00     ` Jesper Dangaard Brouer
2024-04-23 16:53   ` Simon Horman
2024-04-29 11:36     ` Jesper Dangaard Brouer
2024-04-29 17:48       ` Simon Horman
2024-04-16 17:51 ` [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex Jesper Dangaard Brouer
2024-04-18  2:19   ` Yosry Ahmed
2024-04-18  9:02     ` Jesper Dangaard Brouer
2024-04-18 14:49       ` Shakeel Butt
2024-04-18 20:39         ` Yosry Ahmed
2024-04-19 13:15           ` Jesper Dangaard Brouer
2024-04-19 16:11             ` Shakeel Butt
2024-04-19 19:21               ` Yosry Ahmed
2024-04-18 20:38       ` Yosry Ahmed
2024-04-16 17:51 ` [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing Jesper Dangaard Brouer
2024-04-18  2:21   ` Yosry Ahmed
2024-04-18 11:00     ` Jesper Dangaard Brouer
2024-04-18 15:49       ` Shakeel Butt
2024-04-18 21:00       ` Yosry Ahmed
2024-04-18 21:15         ` Tejun Heo
2024-04-18 21:22           ` Yosry Ahmed
2024-04-18 21:32             ` Tejun Heo
2024-04-19 10:16         ` Jesper Dangaard Brouer
2024-04-19 19:25           ` Yosry Ahmed
2024-04-16 21:38 ` [PATCH v1 0/3] cgroup/rstat: global cgroup_rstat_lock changes Tejun Heo
2024-04-18  2:13   ` Yosry Ahmed

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.