* [PATCH v2 0/9] memcg: make rstat flushing irq and sleep
@ 2023-03-28 22:16 Yosry Ahmed
  2023-03-28 22:16 ` [PATCH v2 1/9] cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic" Yosry Ahmed
                   ` (8 more replies)
  0 siblings, 9 replies; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-28 22:16 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed

Patches 1 and 2 are cleanups requested during reviews of prior versions
of this series.

Patch 3 makes sure we never try to flush from within an irq context, and
patch 4 adds a WARN_ON_ONCE() to make sure we catch any violations.

Patches 5 to 8 introduce separate variants of mem_cgroup_flush_stats()
for atomic and non-atomic flushing, and make sure we only flush the
stats atomically when necessary.
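
For reference, the memcg flushing API once the whole series is applied,
as declared in the memcontrol.h changes of patches 6 and 7 (semantics
summarized from the commit logs):

	void mem_cgroup_flush_stats(void);		/* non-atomic, may sleep */
	void mem_cgroup_flush_stats_atomic(void);	/* safe in atomic contexts */
	void mem_cgroup_flush_stats_ratelimited(void);	/* non-atomic, ratelimited */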

Patch 9 is a slightly tangential optimization that limits the work done
by rstat flushing in some scenarios.

v1 -> v2:
- Added more context in patch 4's commit log.
- Added atomic_read() before atomic_xchg() in patch 5 to avoid
  needlessly locking the cache line (Shakeel); a sketch of this pattern
  follows this list.
- Refactored patch 6: added a common helper, do_flush_stats(), for
  mem_cgroup_flush_stats{_atomic}() (Johannes).
- Renamed mem_cgroup_flush_stats_ratelimited() to
  mem_cgroup_flush_stats_atomic_ratelimited() in patch 6. It is restored
  in patch 7, but this maintains consistency (Johannes).
- Added line break to keep the lock section visually separated in patch
  7 (Johannes).
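
As a standalone sketch of the atomic_read() + atomic_xchg() pattern
mentioned above (hypothetical helper name; patch 5 has the real code):

	static atomic_t flush_ongoing = ATOMIC_INIT(0);

	static void flush_once(void)
	{
		/*
		 * The plain atomic_read() lets concurrent flushers bail out
		 * without the cache line bouncing that an unconditional xchg
		 * would cause; only one caller wins the xchg and flushes.
		 */
		if (atomic_read(&flush_ongoing) ||
		    atomic_xchg(&flush_ongoing, 1))
			return;
		/* ... do the actual flush ... */
		atomic_set(&flush_ongoing, 0);
	}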

RFC -> v1:
- Dropped patch 1 that attempted to make the global rstat lock a non-irq
  lock; will follow up on that separately (Shakeel).
- Dropped stats_flush_lock entirely, replaced by an atomic (Johannes).
- Renamed cgroup_rstat_flush_irqsafe() to cgroup_rstat_flush_atomic()
  instead of removing it (Johannes).
- Added a patch to rename mem_cgroup_flush_stats_delayed() to
  mem_cgroup_flush_stats_ratelimited() (Johannes).
- Separate APIs for flushing memcg stats in atomic and non-atomic
  contexts instead of a boolean argument (Johannes).
- Added patches 3 & 4 to make sure we never flush from irq context
  (Shakeel & Johannes).

Yosry Ahmed (9):
  cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic"
  memcg: rename mem_cgroup_flush_stats_"delayed" to "ratelimited"
  memcg: do not flush stats in irq context
  cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  memcg: replace stats_flush_lock with an atomic
  memcg: sleep during flushing stats in safe contexts
  workingset: memcg: sleep when flushing stats in workingset_refault()
  vmscan: memcg: sleep when flushing stats during reclaim
  memcg: do not modify rstat tree for zero updates

 include/linux/cgroup.h     |  2 +-
 include/linux/memcontrol.h |  9 ++++-
 kernel/cgroup/rstat.c      |  6 ++-
 mm/memcontrol.c            | 78 ++++++++++++++++++++++++++++++--------
 mm/workingset.c            |  5 ++-
 5 files changed, 78 insertions(+), 22 deletions(-)

-- 
2.40.0.348.gf938b09366-goog



* [PATCH v2 1/9] cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic"
  2023-03-28 22:16 [PATCH v2 0/9] memcg: make rstat flushing irq and sleep Yosry Ahmed
@ 2023-03-28 22:16 ` Yosry Ahmed
  2023-03-28 22:16 ` [PATCH v2 2/9] memcg: rename mem_cgroup_flush_stats_"delayed" to "ratelimited" Yosry Ahmed
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-28 22:16 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed

cgroup_rstat_flush_irqsafe() can be a confusing name. It may read as
"irqs are disabled throughout", which is what the current implementation
does (currently under discussion [1]), but is not the intention. The
intention is that this function is safe to call from atomic contexts.
Name it as such.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/cgroup.h | 2 +-
 kernel/cgroup/rstat.c  | 4 ++--
 mm/memcontrol.c        | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3410aecffdb4..885f5395fcd0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -692,7 +692,7 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
  */
 void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
 void cgroup_rstat_flush(struct cgroup *cgrp);
-void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp);
+void cgroup_rstat_flush_atomic(struct cgroup *cgrp);
 void cgroup_rstat_flush_hold(struct cgroup *cgrp);
 void cgroup_rstat_flush_release(void);
 
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 831f1f472bb8..d3252b0416b6 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -241,12 +241,12 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 }
 
 /**
- * cgroup_rstat_flush_irqsafe - irqsafe version of cgroup_rstat_flush()
+ * cgroup_rstat_flush_atomic - atomic version of cgroup_rstat_flush()
  * @cgrp: target cgroup
  *
  * This function can be called from any context.
  */
-void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp)
+void cgroup_rstat_flush_atomic(struct cgroup *cgrp)
 {
 	unsigned long flags;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5abffe6f8389..0205e58ea430 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -642,7 +642,7 @@ static void __mem_cgroup_flush_stats(void)
 		return;
 
 	flush_next_time = jiffies_64 + 2*FLUSH_TIME;
-	cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
+	cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
 	atomic_set(&stats_flush_threshold, 0);
 	spin_unlock_irqrestore(&stats_flush_lock, flag);
 }
-- 
2.40.0.348.gf938b09366-goog



* [PATCH v2 2/9] memcg: rename mem_cgroup_flush_stats_"delayed" to "ratelimited"
  2023-03-28 22:16 [PATCH v2 0/9] memcg: make rstat flushing irq and sleep Yosry Ahmed
  2023-03-28 22:16 ` [PATCH v2 1/9] cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic" Yosry Ahmed
@ 2023-03-28 22:16 ` Yosry Ahmed
  2023-03-29 11:56   ` Michal Hocko
  2023-03-28 22:16 ` [PATCH v2 3/9] memcg: do not flush stats in irq context Yosry Ahmed
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-28 22:16 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed

mem_cgroup_flush_stats_delayed() suggests this is using a delayed_work,
but it actually sometimes flushes directly from the callsite.

What it is really doing is ratelimiting calls. A better name would be
mem_cgroup_flush_stats_ratelimited().

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 4 ++--
 mm/memcontrol.c            | 2 +-
 mm/workingset.c            | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b6eda2ab205d..ac3f3b3a45e2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1037,7 +1037,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 }
 
 void mem_cgroup_flush_stats(void);
-void mem_cgroup_flush_stats_delayed(void);
+void mem_cgroup_flush_stats_ratelimited(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
@@ -1535,7 +1535,7 @@ static inline void mem_cgroup_flush_stats(void)
 {
 }
 
-static inline void mem_cgroup_flush_stats_delayed(void)
+static inline void mem_cgroup_flush_stats_ratelimited(void)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0205e58ea430..c3b6aae78901 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -653,7 +653,7 @@ void mem_cgroup_flush_stats(void)
 		__mem_cgroup_flush_stats();
 }
 
-void mem_cgroup_flush_stats_delayed(void)
+void mem_cgroup_flush_stats_ratelimited(void)
 {
 	if (time_after64(jiffies_64, flush_next_time))
 		mem_cgroup_flush_stats();
diff --git a/mm/workingset.c b/mm/workingset.c
index 00c6f4d9d9be..af862c6738c3 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -462,7 +462,7 @@ void workingset_refault(struct folio *folio, void *shadow)
 
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
-	mem_cgroup_flush_stats_delayed();
+	mem_cgroup_flush_stats_ratelimited();
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't activate pages that couldn't stay resident even if
-- 
2.40.0.348.gf938b09366-goog



* [PATCH v2 3/9] memcg: do not flush stats in irq context
  2023-03-28 22:16 [PATCH v2 0/9] memcg: make rstat flushing irq and sleep Yosry Ahmed
  2023-03-28 22:16 ` [PATCH v2 1/9] cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic" Yosry Ahmed
  2023-03-28 22:16 ` [PATCH v2 2/9] memcg: rename mem_cgroup_flush_stats_"delayed" to "ratelimited" Yosry Ahmed
@ 2023-03-28 22:16 ` Yosry Ahmed
  2023-03-29 11:58   ` Michal Hocko
  2023-03-28 22:16 ` [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context Yosry Ahmed
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-28 22:16 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed

Currently, the only context in which we can invoke an rstat flush from
irq context is through mem_cgroup_usage() on the root memcg when called
from memcg_check_events(). An rstat flush is an expensive operation that
should not be done in irq context, so do not flush stats and use the
stale stats in this case.

Arguably, usage threshold events are not reliable on the root memcg
anyway since its usage is ill-defined.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c3b6aae78901..ff39f78f962e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3669,7 +3669,21 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 	unsigned long val;
 
 	if (mem_cgroup_is_root(memcg)) {
-		mem_cgroup_flush_stats();
+		/*
+		 * We can reach here from irq context through:
+		 * uncharge_batch()
+		 * |--memcg_check_events()
+		 *    |--mem_cgroup_threshold()
+		 *       |--__mem_cgroup_threshold()
+		 *          |--mem_cgroup_usage
+		 *
+		 * rstat flushing is an expensive operation that should not be
+		 * done from irq context; use stale stats in this case.
+		 * Arguably, usage threshold events are not reliable on the root
+		 * memcg anyway since its usage is ill-defined.
+		 */
+		if (in_task())
+			mem_cgroup_flush_stats();
 		val = memcg_page_state(memcg, NR_FILE_PAGES) +
 			memcg_page_state(memcg, NR_ANON_MAPPED);
 		if (swap)
-- 
2.40.0.348.gf938b09366-goog



* [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  2023-03-28 22:16 [PATCH v2 0/9] memcg: make rstat flushing irq and sleep Yosry Ahmed
                   ` (2 preceding siblings ...)
  2023-03-28 22:16 ` [PATCH v2 3/9] memcg: do not flush stats in irq context Yosry Ahmed
@ 2023-03-28 22:16 ` Yosry Ahmed
  2023-03-29 11:22   ` Michal Hocko
  2023-03-28 22:16 ` [PATCH v2 5/9] memcg: replace stats_flush_lock with an atomic Yosry Ahmed
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-28 22:16 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed

rstat flushing is too expensive to perform in irq context.
The previous patch removed the only context that may invoke an rstat
flush from irq context; add a WARN_ON_ONCE() to detect future
violations, or those that we are not aware of.

Ideally, we wouldn't flush with irqs disabled either, but we have one
context today that does so in mem_cgroup_usage(). Forbid callers from
irq context for now, and hopefully we can also forbid callers with irqs
disabled in the future when we can get rid of this callsite.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
---
 kernel/cgroup/rstat.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d3252b0416b6..c2571939139f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -176,6 +176,8 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
 {
 	int cpu;
 
+	/* rstat flushing is too expensive for irq context */
+	WARN_ON_ONCE(!in_task());
 	lockdep_assert_held(&cgroup_rstat_lock);
 
 	for_each_possible_cpu(cpu) {
-- 
2.40.0.348.gf938b09366-goog



* [PATCH v2 5/9] memcg: replace stats_flush_lock with an atomic
  2023-03-28 22:16 [PATCH v2 0/9] memcg: make rstat flushing irq and sleep Yosry Ahmed
                   ` (3 preceding siblings ...)
  2023-03-28 22:16 ` [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context Yosry Ahmed
@ 2023-03-28 22:16 ` Yosry Ahmed
  2023-03-28 22:22   ` Shakeel Butt
  2023-03-29 15:58   ` Michal Hocko
  2023-03-28 22:16 ` [PATCH v2 6/9] memcg: sleep during flushing stats in safe contexts Yosry Ahmed
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-28 22:16 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed

As Johannes notes in [1], stats_flush_lock is currently used to:
(a) Protect updates to stats_flush_threshold.
(b) Protect updates to flush_next_time.
(c) Serialize calls to cgroup_rstat_flush() based on those ratelimits.

However:

1. stats_flush_threshold is already an atomic

2. flush_next_time is not atomic. The writer is locked, but the reader
   is lockless. If the reader races with a flush, you could see this:

                                        if (time_after(jiffies, flush_next_time))
        spin_trylock()
        flush_next_time = now + delay
        flush()
        spin_unlock()
                                        spin_trylock()
                                        flush_next_time = now + delay
                                        flush()
                                        spin_unlock()

   which means we already can get flushes at a higher frequency than
   FLUSH_TIME during races. But it isn't really a problem.

   The reader could also see garbled partial updates, so it needs at
   least READ_ONCE and WRITE_ONCE protection.

3. Serializing cgroup_rstat_flush() calls against the ratelimit
   factors is currently broken because of the race in 2. But the race
   is actually harmless; all we might get is the occasional earlier
   flush. If there is no delta, the flush won't do much. And if there
   is, the flush is justified.

So the lock can be removed altogether. However, the lock also served
the purpose of preventing a thundering herd problem for concurrent
flushers, see [2]. Use an atomic instead to serve the purpose of
unifying concurrent flushers.

[1]https://lore.kernel.org/lkml/20230323172732.GE739026@cmpxchg.org/
[2]https://lore.kernel.org/lkml/20210716212137.1391164-2-shakeelb@google.com/

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ff39f78f962e..65750f8b8259 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -585,8 +585,8 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
  */
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static DEFINE_SPINLOCK(stats_flush_lock);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
+static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_next_time;
 
@@ -636,15 +636,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 
 static void __mem_cgroup_flush_stats(void)
 {
-	unsigned long flag;
-
-	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
+	/*
+	 * We always flush the entire tree, so concurrent flushers can just
+	 * skip. This avoids a thundering herd problem on the rstat global lock
+	 * from memcg flushers (e.g. reclaim, refault, etc).
+	 */
+	if (atomic_read(&stats_flush_ongoing) ||
+	    atomic_xchg(&stats_flush_ongoing, 1))
 		return;
 
-	flush_next_time = jiffies_64 + 2*FLUSH_TIME;
+	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
 	cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
 	atomic_set(&stats_flush_threshold, 0);
-	spin_unlock_irqrestore(&stats_flush_lock, flag);
+	atomic_set(&stats_flush_ongoing, 0);
 }
 
 void mem_cgroup_flush_stats(void)
@@ -655,7 +659,7 @@ void mem_cgroup_flush_stats(void)
 
 void mem_cgroup_flush_stats_ratelimited(void)
 {
-	if (time_after64(jiffies_64, flush_next_time))
+	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
 		mem_cgroup_flush_stats();
 }
 
-- 
2.40.0.348.gf938b09366-goog



* [PATCH v2 6/9] memcg: sleep during flushing stats in safe contexts
  2023-03-28 22:16 [PATCH v2 0/9] memcg: make rstat flushing irq and sleep Yosry Ahmed
                   ` (4 preceding siblings ...)
  2023-03-28 22:16 ` [PATCH v2 5/9] memcg: replace stats_flush_lock with an atomic Yosry Ahmed
@ 2023-03-28 22:16 ` Yosry Ahmed
  2023-03-30  7:35   ` Michal Hocko
  2023-03-28 22:16 ` [PATCH v2 7/9] workingset: memcg: sleep when flushing stats in workingset_refault() Yosry Ahmed
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-28 22:16 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed

Currently, all contexts that flush memcg stats do so with sleeping not
allowed. Some of these contexts are perfectly safe to sleep in, such as
reading cgroup files from userspace or the background periodic flusher.

Refactor the code to make mem_cgroup_flush_stats() non-atomic (aka
sleepable), and provide a separate atomic version. The atomic version is
used in reclaim, refault, writeback, and in mem_cgroup_usage(). All
other code paths are left to use the non-atomic version. This includes
callbacks for userspace reads and the periodic flusher.

Since refault is the only caller of mem_cgroup_flush_stats_ratelimited(),
change it to mem_cgroup_flush_stats_atomic_ratelimited(). Reclaim and
refault code paths are modified to do non-atomic flushing in separate
later patches -- so it will eventually be changed back to
mem_cgroup_flush_stats_ratelimited().

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
---
 include/linux/memcontrol.h |  9 ++++++--
 mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++--------
 mm/vmscan.c                |  2 +-
 mm/workingset.c            |  2 +-
 4 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ac3f3b3a45e2..b424ba3ebd09 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1037,7 +1037,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 }
 
 void mem_cgroup_flush_stats(void);
-void mem_cgroup_flush_stats_ratelimited(void);
+void mem_cgroup_flush_stats_atomic(void);
+void mem_cgroup_flush_stats_atomic_ratelimited(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
@@ -1535,7 +1536,11 @@ static inline void mem_cgroup_flush_stats(void)
 {
 }
 
-static inline void mem_cgroup_flush_stats_ratelimited(void)
+static inline void mem_cgroup_flush_stats_atomic(void)
+{
+}
+
+static inline void mem_cgroup_flush_stats_atomic_ratelimited(void)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 65750f8b8259..a2ce3aa10d94 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -634,7 +634,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
-static void __mem_cgroup_flush_stats(void)
+static void do_flush_stats(bool atomic)
 {
 	/*
 	 * We always flush the entire tree, so concurrent flushers can just
@@ -646,26 +646,46 @@ static void __mem_cgroup_flush_stats(void)
 		return;
 
 	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
-	cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
+
+	if (atomic)
+		cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
+	else
+		cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
+
 	atomic_set(&stats_flush_threshold, 0);
 	atomic_set(&stats_flush_ongoing, 0);
 }
 
+static bool should_flush_stats(void)
+{
+	return atomic_read(&stats_flush_threshold) > num_online_cpus();
+}
+
 void mem_cgroup_flush_stats(void)
 {
-	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
-		__mem_cgroup_flush_stats();
+	if (should_flush_stats())
+		do_flush_stats(false);
 }
 
-void mem_cgroup_flush_stats_ratelimited(void)
+void mem_cgroup_flush_stats_atomic(void)
+{
+	if (should_flush_stats())
+		do_flush_stats(true);
+}
+
+void mem_cgroup_flush_stats_atomic_ratelimited(void)
 {
 	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
-		mem_cgroup_flush_stats();
+		mem_cgroup_flush_stats_atomic();
 }
 
 static void flush_memcg_stats_dwork(struct work_struct *w)
 {
-	__mem_cgroup_flush_stats();
+	/*
+	 * Always flush here so that flushing in latency-sensitive paths is
+	 * as cheap as possible.
+	 */
+	do_flush_stats(false);
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
@@ -3685,9 +3705,12 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 		 * done from irq context; use stale stats in this case.
 		 * Arguably, usage threshold events are not reliable on the root
 		 * memcg anyway since its usage is ill-defined.
+		 *
+		 * Additionally, other call paths through memcg_check_events()
+		 * disable irqs, so make sure we are flushing stats atomically.
 		 */
 		if (in_task())
-			mem_cgroup_flush_stats();
+			mem_cgroup_flush_stats_atomic();
 		val = memcg_page_state(memcg, NR_FILE_PAGES) +
 			memcg_page_state(memcg, NR_ANON_MAPPED);
 		if (swap)
@@ -4610,7 +4633,11 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_flush_stats();
+	/*
+	 * wb_writeback() takes a spinlock and calls
+	 * wb_over_bg_thresh()->mem_cgroup_wb_stats(). Do not sleep.
+	 */
+	mem_cgroup_flush_stats_atomic();
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b24b8..a9511ccb936f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	 * Flush the memory cgroup stats, so that we read accurate per-memcg
 	 * lruvec stats for heuristics.
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_atomic();
 
 	/*
 	 * Determine the scan balance between anon and file LRUs.
diff --git a/mm/workingset.c b/mm/workingset.c
index af862c6738c3..dab0c362b9e3 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -462,7 +462,7 @@ void workingset_refault(struct folio *folio, void *shadow)
 
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
-	mem_cgroup_flush_stats_ratelimited();
+	mem_cgroup_flush_stats_atomic_ratelimited();
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't activate pages that couldn't stay resident even if
-- 
2.40.0.348.gf938b09366-goog



* [PATCH v2 7/9] workingset: memcg: sleep when flushing stats in workingset_refault()
  2023-03-28 22:16 [PATCH v2 0/9] memcg: make rstat flushing irq and sleep Yosry Ahmed
                   ` (5 preceding siblings ...)
  2023-03-28 22:16 ` [PATCH v2 6/9] memcg: sleep during flushing stats in safe contexts Yosry Ahmed
@ 2023-03-28 22:16 ` Yosry Ahmed
  2023-03-30  7:39   ` Michal Hocko
  2023-03-28 22:16 ` [PATCH v2 8/9] vmscan: memcg: sleep when flushing stats during reclaim Yosry Ahmed
  2023-03-28 22:16 ` [PATCH v2 9/9] memcg: do not modify rstat tree for zero updates Yosry Ahmed
  8 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-28 22:16 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed

In workingset_refault(), we call
mem_cgroup_flush_stats_atomic_ratelimited() to flush stats within an
RCU read section and with sleeping disallowed. Move the call above
the RCU read section and allow sleeping, to avoid unnecessarily
performing a lot of work in an atomic context.

Since workingset_refault() is the only caller of
mem_cgroup_flush_stats_atomic_ratelimited(), just make it non-atomic,
and rename it to mem_cgroup_flush_stats_ratelimited().

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 4 ++--
 mm/memcontrol.c            | 4 ++--
 mm/workingset.c            | 5 +++--
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b424ba3ebd09..a4bc3910a2eb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1038,7 +1038,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 
 void mem_cgroup_flush_stats(void);
 void mem_cgroup_flush_stats_atomic(void);
-void mem_cgroup_flush_stats_atomic_ratelimited(void);
+void mem_cgroup_flush_stats_ratelimited(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
@@ -1540,7 +1540,7 @@ static inline void mem_cgroup_flush_stats_atomic(void)
 {
 }
 
-static inline void mem_cgroup_flush_stats_atomic_ratelimited(void)
+static inline void mem_cgroup_flush_stats_ratelimited(void)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a2ce3aa10d94..361c0bbf7283 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -673,10 +673,10 @@ void mem_cgroup_flush_stats_atomic(void)
 		do_flush_stats(true);
 }
 
-void mem_cgroup_flush_stats_atomic_ratelimited(void)
+void mem_cgroup_flush_stats_ratelimited(void)
 {
 	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
-		mem_cgroup_flush_stats_atomic();
+		mem_cgroup_flush_stats();
 }
 
 static void flush_memcg_stats_dwork(struct work_struct *w)
diff --git a/mm/workingset.c b/mm/workingset.c
index dab0c362b9e3..3025beee9b34 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -406,6 +406,9 @@ void workingset_refault(struct folio *folio, void *shadow)
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
 	eviction <<= bucket_order;
 
+	/* Flush stats (and potentially sleep) before holding RCU read lock */
+	mem_cgroup_flush_stats_ratelimited();
+
 	rcu_read_lock();
 	/*
 	 * Look up the memcg associated with the stored ID. It might
@@ -461,8 +464,6 @@ void workingset_refault(struct folio *folio, void *shadow)
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
-
-	mem_cgroup_flush_stats_atomic_ratelimited();
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't activate pages that couldn't stay resident even if
-- 
2.40.0.348.gf938b09366-goog



* [PATCH v2 8/9] vmscan: memcg: sleep when flushing stats during reclaim
  2023-03-28 22:16 [PATCH v2 0/9] memcg: make rstat flushing irq and sleep Yosry Ahmed
                   ` (6 preceding siblings ...)
  2023-03-28 22:16 ` [PATCH v2 7/9] workingset: memcg: sleep when flushing stats in workingset_refault() Yosry Ahmed
@ 2023-03-28 22:16 ` Yosry Ahmed
  2023-03-30  7:40   ` Michal Hocko
  2023-03-28 22:16 ` [PATCH v2 9/9] memcg: do not modify rstat tree for zero updates Yosry Ahmed
  8 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-28 22:16 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed

Memory reclaim is a sleepable context. Allow sleeping when flushing
memcg stats to avoid unnecessarily performing a lot of work in an
atomic context. This can slow down reclaim if flushing the stats takes
too long, but there are already multiple cond_resched() calls in the
reclaim code.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9511ccb936f..9c1c5e8b24b8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	 * Flush the memory cgroup stats, so that we read accurate per-memcg
 	 * lruvec stats for heuristics.
 	 */
-	mem_cgroup_flush_stats_atomic();
+	mem_cgroup_flush_stats();
 
 	/*
 	 * Determine the scan balance between anon and file LRUs.
-- 
2.40.0.348.gf938b09366-goog



* [PATCH v2 9/9] memcg: do not modify rstat tree for zero updates
  2023-03-28 22:16 [PATCH v2 0/9] memcg: make rstat flushing irq and sleep Yosry Ahmed
                   ` (7 preceding siblings ...)
  2023-03-28 22:16 ` [PATCH v2 8/9] vmscan: memcg: sleep when flushing stats during reclaim Yosry Ahmed
@ 2023-03-28 22:16 ` Yosry Ahmed
  2023-03-30  7:43   ` Michal Hocko
  8 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-28 22:16 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed

In some situations, we may end up calling memcg_rstat_updated() with a
value of 0, which means the stat was not actually updated. An example is
if we fail to reclaim any pages in shrink_folio_list().

Do not add the cgroup to the rstat updated tree in this case, to avoid
unnecessarily flushing it.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 361c0bbf7283..a63ee2efa780 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -618,6 +618,9 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
 	unsigned int x;
 
+	if (!val)
+		return;
+
 	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
 
 	x = __this_cpu_add_return(stats_updates, abs(val));
-- 
2.40.0.348.gf938b09366-goog



* Re: [PATCH v2 5/9] memcg: replace stats_flush_lock with an atomic
  2023-03-28 22:16 ` [PATCH v2 5/9] memcg: replace stats_flush_lock with an atomic Yosry Ahmed
@ 2023-03-28 22:22   ` Shakeel Butt
  2023-03-29 15:58   ` Michal Hocko
  1 sibling, 0 replies; 40+ messages in thread
From: Shakeel Butt @ 2023-03-28 22:22 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Tue, Mar 28, 2023 at 3:17 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> As Johannes notes in [1], stats_flush_lock is currently used to:
> (a) Protect updates to stats_flush_threshold.
> (b) Protect updates to flush_next_time.
> (c) Serialize calls to cgroup_rstat_flush() based on those ratelimits.
>
> However:
>
> 1. stats_flush_threshold is already an atomic
>
> 2. flush_next_time is not atomic. The writer is locked, but the reader
>    is lockless. If the reader races with a flush, you could see this:
>
>                                         if (time_after(jiffies, flush_next_time))
>         spin_trylock()
>         flush_next_time = now + delay
>         flush()
>         spin_unlock()
>                                         spin_trylock()
>                                         flush_next_time = now + delay
>                                         flush()
>                                         spin_unlock()
>
>    which means we already can get flushes at a higher frequency than
>    FLUSH_TIME during races. But it isn't really a problem.
>
>    The reader could also see garbled partial updates, so it needs at
>    least READ_ONCE and WRITE_ONCE protection.
>
> 3. Serializing cgroup_rstat_flush() calls against the ratelimit
>    factors is currently broken because of the race in 2. But the race
>    is actually harmless; all we might get is the occasional earlier
>    flush. If there is no delta, the flush won't do much. And if there
>    is, the flush is justified.
>
> So the lock can be removed altogether. However, the lock also served
> the purpose of preventing a thundering herd problem for concurrent
> flushers, see [2]. Use an atomic instead to serve the purpose of
> unifying concurrent flushers.
>
> [1]https://lore.kernel.org/lkml/20230323172732.GE739026@cmpxchg.org/
> [2]https://lore.kernel.org/lkml/20210716212137.1391164-2-shakeelb@google.com/
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Shakeel Butt <shakeelb@google.com>


* Re: [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  2023-03-28 22:16 ` [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context Yosry Ahmed
@ 2023-03-29 11:22   ` Michal Hocko
  2023-03-29 18:41     ` Yosry Ahmed
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2023-03-29 11:22 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Tue 28-03-23 22:16:39, Yosry Ahmed wrote:
> rstat flushing is too expensive to perform in irq context.
> The previous patch removed the only context that may invoke an rstat
> flush from irq context; add a WARN_ON_ONCE() to detect future
> violations, or those that we are not aware of.
> 
> Ideally, we wouldn't flush with irqs disabled either, but we have one
> context today that does so in mem_cgroup_usage(). Forbid callers from
> irq context for now, and hopefully we can also forbid callers with irqs
> disabled in the future when we can get rid of this callsite.

I am sorry to be late to the discussion. I wanted to follow up on
Johannes' reply in the previous version but you are too fast ;)

I do agree that this looks rather arbitrary. You do not explain how the
warning actually helps. Is the intention to be really verbose to the
kernel log when somebody uses this interface from the IRQ context and
get bug reports? What about configurations with panic on warn? Do we
really want to crash their systems for something like that?

> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> Reviewed-by: Shakeel Butt <shakeelb@google.com>
> ---
>  kernel/cgroup/rstat.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index d3252b0416b6..c2571939139f 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -176,6 +176,8 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
>  {
>  	int cpu;
>  
> +	/* rstat flushing is too expensive for irq context */
> +	WARN_ON_ONCE(!in_task());
>  	lockdep_assert_held(&cgroup_rstat_lock);
>  
>  	for_each_possible_cpu(cpu) {
> -- 
> 2.40.0.348.gf938b09366-goog

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 2/9] memcg: rename mem_cgroup_flush_stats_"delayed" to "ratelimited"
  2023-03-28 22:16 ` [PATCH v2 2/9] memcg: rename mem_cgroup_flush_stats_"delayed" to "ratelimited" Yosry Ahmed
@ 2023-03-29 11:56   ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2023-03-29 11:56 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Tue 28-03-23 22:16:37, Yosry Ahmed wrote:
> mem_cgroup_flush_stats_delayed() suggests this is using a delayed_work,
> but it actually sometimes flushes directly from the callsite.
> 
> What it is really doing is ratelimiting calls. A better name would be
> mem_cgroup_flush_stats_ratelimited().
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Shakeel Butt <shakeelb@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h | 4 ++--
>  mm/memcontrol.c            | 2 +-
>  mm/workingset.c            | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b6eda2ab205d..ac3f3b3a45e2 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1037,7 +1037,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>  }
>  
>  void mem_cgroup_flush_stats(void);
> -void mem_cgroup_flush_stats_delayed(void);
> +void mem_cgroup_flush_stats_ratelimited(void);
>  
>  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>  			      int val);
> @@ -1535,7 +1535,7 @@ static inline void mem_cgroup_flush_stats(void)
>  {
>  }
>  
> -static inline void mem_cgroup_flush_stats_delayed(void)
> +static inline void mem_cgroup_flush_stats_ratelimited(void)
>  {
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0205e58ea430..c3b6aae78901 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -653,7 +653,7 @@ void mem_cgroup_flush_stats(void)
>  		__mem_cgroup_flush_stats();
>  }
>  
> -void mem_cgroup_flush_stats_delayed(void)
> +void mem_cgroup_flush_stats_ratelimited(void)
>  {
>  	if (time_after64(jiffies_64, flush_next_time))
>  		mem_cgroup_flush_stats();
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 00c6f4d9d9be..af862c6738c3 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -462,7 +462,7 @@ void workingset_refault(struct folio *folio, void *shadow)
>  
>  	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
>  
> -	mem_cgroup_flush_stats_delayed();
> +	mem_cgroup_flush_stats_ratelimited();
>  	/*
>  	 * Compare the distance to the existing workingset size. We
>  	 * don't activate pages that couldn't stay resident even if
> -- 
> 2.40.0.348.gf938b09366-goog

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 3/9] memcg: do not flush stats in irq context
  2023-03-28 22:16 ` [PATCH v2 3/9] memcg: do not flush stats in irq context Yosry Ahmed
@ 2023-03-29 11:58   ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2023-03-29 11:58 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Tue 28-03-23 22:16:38, Yosry Ahmed wrote:
> Currently, the only context in which we can invoke an rstat flush from
> irq context is through mem_cgroup_usage() on the root memcg when called
> from memcg_check_events(). An rstat flush is an expensive operation that
> should not be done in irq context, so do not flush stats and use the
> stale stats in this case.
> 
> Arguably, usage threshold events are not reliable on the root memcg
> anyway since its usage is ill-defined.
> 
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Suggested-by: Shakeel Butt <shakeelb@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Shakeel Butt <shakeelb@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c3b6aae78901..ff39f78f962e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3669,7 +3669,21 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>  	unsigned long val;
>  
>  	if (mem_cgroup_is_root(memcg)) {
> -		mem_cgroup_flush_stats();
> +		/*
> +		 * We can reach here from irq context through:
> +		 * uncharge_batch()
> +		 * |--memcg_check_events()
> +		 *    |--mem_cgroup_threshold()
> +		 *       |--__mem_cgroup_threshold()
> +		 *          |--mem_cgroup_usage
> +		 *
> +		 * rstat flushing is an expensive operation that should not be
> +		 * done from irq context; use stale stats in this case.
> +		 * Arguably, usage threshold events are not reliable on the root
> +		 * memcg anyway since its usage is ill-defined.
> +		 */
> +		if (in_task())
> +			mem_cgroup_flush_stats();
>  		val = memcg_page_state(memcg, NR_FILE_PAGES) +
>  			memcg_page_state(memcg, NR_ANON_MAPPED);
>  		if (swap)
> -- 
> 2.40.0.348.gf938b09366-goog

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 5/9] memcg: replace stats_flush_lock with an atomic
  2023-03-28 22:16 ` [PATCH v2 5/9] memcg: replace stats_flush_lock with an atomic Yosry Ahmed
  2023-03-28 22:22   ` Shakeel Butt
@ 2023-03-29 15:58   ` Michal Hocko
  2023-03-29 18:45     ` Yosry Ahmed
  1 sibling, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2023-03-29 15:58 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Tue 28-03-23 22:16:40, Yosry Ahmed wrote:
> As Johannes notes in [1], stats_flush_lock is currently used to:
> (a) Protect updates to stats_flush_threshold.
> (b) Protect updates to flush_next_time.
> (c) Serialize calls to cgroup_rstat_flush() based on those ratelimits.
> 
> However:
> 
> 1. stats_flush_threshold is already an atomic
> 
> 2. flush_next_time is not atomic. The writer is locked, but the reader
>    is lockless. If the reader races with a flush, you could see this:
> 
>                                         if (time_after(jiffies, flush_next_time))
>         spin_trylock()
>         flush_next_time = now + delay
>         flush()
>         spin_unlock()
>                                         spin_trylock()
>                                         flush_next_time = now + delay
>                                         flush()
>                                         spin_unlock()
> 
>    which means we already can get flushes at a higher frequency than
>    FLUSH_TIME during races. But it isn't really a problem.
> 
>    The reader could also see garbled partial updates, so it needs at
>    least READ_ONCE and WRITE_ONCE protection.

Just a nit. Sounds more serious than it is actually. This would only
happen if the compiler decides to split the write.

> 3. Serializing cgroup_rstat_flush() calls against the ratelimit
>    factors is currently broken because of the race in 2. But the race
>    is actually harmless; all we might get is the occasional earlier
>    flush. If there is no delta, the flush won't do much. And if there
>    is, the flush is justified.
> 
> So the lock can be removed altogether. However, the lock also served
> the purpose of preventing a thundering herd problem for concurrent
> flushers, see [2]. Use an atomic instead to serve the purpose of
> unifying concurrent flushers.
> 
> [1]https://lore.kernel.org/lkml/20230323172732.GE739026@cmpxchg.org/
> [2]https://lore.kernel.org/lkml/20210716212137.1391164-2-shakeelb@google.com/
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ff39f78f962e..65750f8b8259 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -585,8 +585,8 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>   */
>  static void flush_memcg_stats_dwork(struct work_struct *w);
>  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> -static DEFINE_SPINLOCK(stats_flush_lock);
>  static DEFINE_PER_CPU(unsigned int, stats_updates);
> +static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
>  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
>  static u64 flush_next_time;
>  
> @@ -636,15 +636,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>  
>  static void __mem_cgroup_flush_stats(void)
>  {
> -	unsigned long flag;
> -
> -	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
> +	/*
> +	 * We always flush the entire tree, so concurrent flushers can just
> +	 * skip. This avoids a thundering herd problem on the rstat global lock
> +	 * from memcg flushers (e.g. reclaim, refault, etc).
> +	 */
> +	if (atomic_read(&stats_flush_ongoing) ||
> +	    atomic_xchg(&stats_flush_ongoing, 1))
>  		return;
>  
> -	flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> +	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
>  	cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
>  	atomic_set(&stats_flush_threshold, 0);
> -	spin_unlock_irqrestore(&stats_flush_lock, flag);
> +	atomic_set(&stats_flush_ongoing, 0);
>  }
>  
>  void mem_cgroup_flush_stats(void)
> @@ -655,7 +659,7 @@ void mem_cgroup_flush_stats(void)
>  
>  void mem_cgroup_flush_stats_ratelimited(void)
>  {
> -	if (time_after64(jiffies_64, flush_next_time))
> +	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
>  		mem_cgroup_flush_stats();
>  }
>  
> -- 
> 2.40.0.348.gf938b09366-goog

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  2023-03-29 11:22   ` Michal Hocko
@ 2023-03-29 18:41     ` Yosry Ahmed
  2023-03-29 19:20       ` Shakeel Butt
  0 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-29 18:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Wed, Mar 29, 2023 at 4:22 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 28-03-23 22:16:39, Yosry Ahmed wrote:
> > rstat flushing is too expensive to perform in irq context.
> > The previous patch removed the only context that may invoke an rstat
> > flush from irq context; add a WARN_ON_ONCE() to detect future
> > violations, or those that we are not aware of.
> >
> > Ideally, we wouldn't flush with irqs disabled either, but we have one
> > context today that does so in mem_cgroup_usage(). Forbid callers from
> > irq context for now, and hopefully we can also forbid callers with irqs
> > disabled in the future when we can get rid of this callsite.
>
> I am sorry to be late to the discussion. I wanted to follow up on
> Johannes' reply in the previous version but you are too fast ;)
>
> I do agree that this looks rather arbitrary. You do not explain how the
> warning actually helps. Is the intention to be really verbose to the
> kernel log when somebody uses this interface from the IRQ context and
> get bug reports? What about configurations with panic on warn? Do we
> really want to crash their systems for something like that?

Thanks for taking a look, Michal!

The ultimate goal is not to flush in irq context or with irqs
disabled, since flushing is an expensive operation and in some cases
it keeps irqs disabled for a long time. The previous patch in the
series should have removed the only context that flushes in irq
context, and the purpose of the WARN_ON_ONCE() is to catch future uses
or uses that we might have missed.

There is still one code path that flushes with irqs disabled (also
mem_cgroup_usage()), and we cannot remove this just yet; we need to
deprecate usage threshold events for root to do that. So we cannot
enforce not flushing with irqs disabled yet.

So basically the patch is trying to enforce what we have now, not
flushing in irq context, and hopefully at some point we will also be
able to enforce not flushing with irqs disabled.
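
For illustration only, the eventual stronger check could look something
like this (hypothetical, not part of this series):

	/* hypothetical: also forbid flushing with irqs disabled */
	WARN_ON_ONCE(!in_task() || irqs_disabled());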

If WARN_ON_ONCE() is the wrong tool for this, please let me know.

>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > Reviewed-by: Shakeel Butt <shakeelb@google.com>
> > ---
> >  kernel/cgroup/rstat.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> > index d3252b0416b6..c2571939139f 100644
> > --- a/kernel/cgroup/rstat.c
> > +++ b/kernel/cgroup/rstat.c
> > @@ -176,6 +176,8 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
> >  {
> >       int cpu;
> >
> > +     /* rstat flushing is too expensive for irq context */
> > +     WARN_ON_ONCE(!in_task());
> >       lockdep_assert_held(&cgroup_rstat_lock);
> >
> >       for_each_possible_cpu(cpu) {
> > --
> > 2.40.0.348.gf938b09366-goog
>
> --
> Michal Hocko
> SUSE Labs


* Re: [PATCH v2 5/9] memcg: replace stats_flush_lock with an atomic
  2023-03-29 15:58   ` Michal Hocko
@ 2023-03-29 18:45     ` Yosry Ahmed
  0 siblings, 0 replies; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-29 18:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Wed, Mar 29, 2023 at 8:58 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 28-03-23 22:16:40, Yosry Ahmed wrote:
> > As Johannes notes in [1], stats_flush_lock is currently used to:
> > (a) Protect updates to stats_flush_threshold.
> > (b) Protect updates to flush_next_time.
> > (c) Serialize calls to cgroup_rstat_flush() based on those ratelimits.
> >
> > However:
> >
> > 1. stats_flush_threshold is already an atomic
> >
> > 2. flush_next_time is not atomic. The writer is locked, but the reader
> >    is lockless. If the reader races with a flush, you could see this:
> >
> >                                         if (time_after(jiffies, flush_next_time))
> >         spin_trylock()
> >         flush_next_time = now + delay
> >         flush()
> >         spin_unlock()
> >                                         spin_trylock()
> >                                         flush_next_time = now + delay
> >                                         flush()
> >                                         spin_unlock()
> >
> >    which means we already can get flushes at a higher frequency than
> >    FLUSH_TIME during races. But it isn't really a problem.
> >
> >    The reader could also see garbled partial updates, so it needs at
> >    least READ_ONCE and WRITE_ONCE protection.
>
> Just a nit. Sounds more serious than it is actually. This would only
> happen if the compiler decides to split the write.

Thanks for the note, Michal. I honestly quoted Johannes here as I do
not have much expertise when it comes to this. I will add "if the
compiler decides to split the write" to the commit log if I respin.

>
> > 3. Serializing cgroup_rstat_flush() calls against the ratelimit
> >    factors is currently broken because of the race in 2. But the race
> >    is actually harmless; all we might get is the occasional earlier
> >    flush. If there is no delta, the flush won't do much. And if there
> >    is, the flush is justified.
> >
> > So the lock can be removed altogether. However, the lock also served
> > the purpose of preventing a thundering herd problem for concurrent
> > flushers, see [2]. Use an atomic instead to serve the purpose of
> > unifying concurrent flushers.
> >
> > [1]https://lore.kernel.org/lkml/20230323172732.GE739026@cmpxchg.org/
> > [2]https://lore.kernel.org/lkml/20210716212137.1391164-2-shakeelb@google.com/
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> > ---
> >  mm/memcontrol.c | 18 +++++++++++-------
> >  1 file changed, 11 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ff39f78f962e..65750f8b8259 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -585,8 +585,8 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> >   */
> >  static void flush_memcg_stats_dwork(struct work_struct *w);
> >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> > -static DEFINE_SPINLOCK(stats_flush_lock);
> >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > +static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
> >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> >  static u64 flush_next_time;
> >
> > @@ -636,15 +636,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >
> >  static void __mem_cgroup_flush_stats(void)
> >  {
> > -     unsigned long flag;
> > -
> > -     if (!spin_trylock_irqsave(&stats_flush_lock, flag))
> > +     /*
> > +      * We always flush the entire tree, so concurrent flushers can just
> > +      * skip. This avoids a thundering herd problem on the rstat global lock
> > +      * from memcg flushers (e.g. reclaim, refault, etc).
> > +      */
> > +     if (atomic_read(&stats_flush_ongoing) ||
> > +         atomic_xchg(&stats_flush_ongoing, 1))
> >               return;
> >
> > -     flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> > +     WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
> >       cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
> >       atomic_set(&stats_flush_threshold, 0);
> > -     spin_unlock_irqrestore(&stats_flush_lock, flag);
> > +     atomic_set(&stats_flush_ongoing, 0);
> >  }
> >
> >  void mem_cgroup_flush_stats(void)
> > @@ -655,7 +659,7 @@ void mem_cgroup_flush_stats(void)
> >
> >  void mem_cgroup_flush_stats_ratelimited(void)
> >  {
> > -     if (time_after64(jiffies_64, flush_next_time))
> > +     if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
> >               mem_cgroup_flush_stats();
> >  }
> >
> > --
> > 2.40.0.348.gf938b09366-goog
>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  2023-03-29 18:41     ` Yosry Ahmed
@ 2023-03-29 19:20       ` Shakeel Butt
  2023-03-30  7:06         ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Shakeel Butt @ 2023-03-29 19:20 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Michal Hocko, Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li,
	Johannes Weiner, Roman Gushchin, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Wed, Mar 29, 2023 at 11:41:39AM -0700, Yosry Ahmed wrote:
> On Wed, Mar 29, 2023 at 4:22 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 28-03-23 22:16:39, Yosry Ahmed wrote:
> > > rstat flushing is too expensive to perform in irq context.
> > > The previous patch removed the only context that may invoke an rstat
> > > flush from irq context; add a WARN_ON_ONCE() to detect future
> > > violations, or those that we are not aware of.
> > >
> > > Ideally, we wouldn't flush with irqs disabled either, but we have one
> > > context today that does so in mem_cgroup_usage(). Forbid callers from
> > > irq context for now, and hopefully we can also forbid callers with irqs
> > > disabled in the future when we can get rid of this callsite.
> >
> > I am sorry to be late to the discussion. I wanted to follow up on
> > Johannes' reply in the previous version but you are too fast ;)
> >
> > I do agree that this looks rather arbitrary. You do not explain how the
> > warning actually helps. Is the intention to be really verbose in the
> > kernel log when somebody uses this interface from IRQ context, so that
> > we get bug reports? What about configurations with panic on warn? Do we
> > really want to crash their systems for something like that?
> 
> Thanks for taking a look, Michal!
> 
> The ultimate goal is not to flush in irq context or with irqs
> disabled, as in some cases it causes irqs to be disabled for a long
> time, as flushing is an expensive operation. The previous patch in the
> series should have removed the only context that flushes in irq
> context, and the purpose of the WARN_ON_ONCE() is to catch future uses
> or uses that we might have missed.
> 
> There is still one code path that flushes with irqs disabled (also
> mem_cgroup_usage()), and we cannot remove this just yet; we need to
> deprecate usage threshold events for root to do that. So we cannot
> enforce not flushing with irqs disabled yet.
> 
> So basically the patch is trying to enforce what we have now, not
> flushing in irq context, and hopefully at some point we will also be
> able to enforce not flushing with irqs disabled.
> 
> If WARN_ON_ONCE() is the wrong tool for this, please let me know.
> 

If I understand Michal's concern, the question is whether we should
start with pr_warn_once() instead of WARN_ON_ONCE(), and I think yes,
we should start with pr_warn_once().


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  2023-03-29 19:20       ` Shakeel Butt
@ 2023-03-30  7:06         ` Michal Hocko
  2023-03-30  7:13           ` Yosry Ahmed
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2023-03-30  7:06 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Yosry Ahmed, Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li,
	Johannes Weiner, Roman Gushchin, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Wed 29-03-23 19:20:59, Shakeel Butt wrote:
> On Wed, Mar 29, 2023 at 11:41:39AM -0700, Yosry Ahmed wrote:
> > On Wed, Mar 29, 2023 at 4:22 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 28-03-23 22:16:39, Yosry Ahmed wrote:
> > > > rstat flushing is too expensive to perform in irq context.
> > > > The previous patch removed the only context that may invoke an rstat
> > > > flush from irq context; add a WARN_ON_ONCE() to detect future
> > > > violations, or those that we are not aware of.
> > > >
> > > > Ideally, we wouldn't flush with irqs disabled either, but we have one
> > > > context today that does so in mem_cgroup_usage(). Forbid callers from
> > > > irq context for now, and hopefully we can also forbid callers with irqs
> > > > disabled in the future when we can get rid of this callsite.
> > >
> > > I am sorry to be late to the discussion. I wanted to follow up on
> > > Johannes' reply in the previous version but you are too fast ;)
> > >
> > > I do agree that this looks rather arbitrary. You do not explain how the
> > > warning actually helps. Is the intention to be really verbose in the
> > > kernel log when somebody uses this interface from IRQ context, so that
> > > we get bug reports? What about configurations with panic on warn? Do we
> > > really want to crash their systems for something like that?
> > 
> > Thanks for taking a look, Michal!
> > 
> > The ultimate goal is not to flush in irq context or with irqs
> > disabled, as in some cases it causes irqs to be disabled for a long
> > time, as flushing is an expensive operation. The previous patch in the
> > series should have removed the only context that flushes in irq
> > context, and the purpose of the WARN_ON_ONCE() is to catch future uses
> > or uses that we might have missed.
> > 
> > There is still one code path that flushes with irqs disabled (also
> > mem_cgroup_usage()), and we cannot remove this just yet; we need to
> > deprecate usage threshold events for root to do that. So we cannot
> > enforce not flushing with irqs disabled yet.
> > 
> > So basically the patch is trying to enforce what we have now, not
> > flushing in irq context, and hopefully at some point we will also be
> > able to enforce not flushing with irqs disabled.
> > 
> > If WARN_ON_ONCE() is the wrong tool for this, please let me know.
> > 
> 
> If I understand Michal's concern, the question is whether we should
> start with pr_warn_once() instead of WARN_ON_ONCE(), and I think yes,
> we should start with pr_warn_once().

Yes, I do not really like the WARN_ON here. It is overkill. pr_warn
would be much less intrusive but potentially incomplete, because you
won't know who the offender is. So if you really care about those then
you would need to call dump_stack() as well.
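
E.g. something along these lines, purely as an illustration (the check,
the message and the placement are all made up):

	static bool warned;

	/* racy, but good enough for a one-off debug message */
	if (!in_task() && !warned) {
		warned = true;
		pr_warn("cgroup rstat flush outside task context\n");
		dump_stack();
	}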

So the real question is: do we really care so deeply? After all,
somebody might be calling this from within a spinlock or irq-disabled
section, resulting in a similar situation without noticing.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  2023-03-30  7:06         ` Michal Hocko
@ 2023-03-30  7:13           ` Yosry Ahmed
  2023-03-30  7:49             ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-30  7:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shakeel Butt, Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li,
	Johannes Weiner, Roman Gushchin, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Thu, Mar 30, 2023 at 12:06 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 29-03-23 19:20:59, Shakeel Butt wrote:
> > On Wed, Mar 29, 2023 at 11:41:39AM -0700, Yosry Ahmed wrote:
> > > On Wed, Mar 29, 2023 at 4:22 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 28-03-23 22:16:39, Yosry Ahmed wrote:
> > > > > rstat flushing is too expensive to perform in irq context.
> > > > > The previous patch removed the only context that may invoke an rstat
> > > > > flush from irq context; add a WARN_ON_ONCE() to detect future
> > > > > violations, or those that we are not aware of.
> > > > >
> > > > > Ideally, we wouldn't flush with irqs disabled either, but we have one
> > > > > context today that does so in mem_cgroup_usage(). Forbid callers from
> > > > > irq context for now, and hopefully we can also forbid callers with irqs
> > > > > disabled in the future when we can get rid of this callsite.
> > > >
> > > > I am sorry to be late to the discussion. I wanted to follow up on
> > > > Johannes' reply in the previous version but you are too fast ;)
> > > >
> > > > I do agree that this looks rather arbitrary. You do not explain how the
> > > > warning actually helps. Is the intention to be really verbose in the
> > > > kernel log when somebody uses this interface from IRQ context, so that
> > > > we get bug reports? What about configurations with panic on warn? Do we
> > > > really want to crash their systems for something like that?
> > >
> > > Thanks for taking a look, Michal!
> > >
> > > The ultimate goal is not to flush in irq context or with irqs
> > > disabled, as in some cases it causes irqs to be disabled for a long
> > > time, as flushing is an expensive operation. The previous patch in the
> > > series should have removed the only context that flushes in irq
> > > context, and the purpose of the WARN_ON_ONCE() is to catch future uses
> > > or uses that we might have missed.
> > >
> > > There is still one code path that flushes with irqs disabled (also
> > > mem_cgroup_usage()), and we cannot remove this just yet; we need to
> > > deprecate usage threshold events for root to do that. So we cannot
> > > enforce not flushing with irqs disabled yet.
> > >
> > > So basically the patch is trying to enforce what we have now, not
> > > flushing in irq context, and hopefully at some point we will also be
> > > able to enforce not flushing with irqs disabled.
> > >
> > > If WARN_ON_ONCE() is the wrong tool for this, please let me know.
> > >
> >
> > If I understand Michal's concern, the question is whether we should
> > start with pr_warn_once() instead of WARN_ON_ONCE(), and I think yes,
> > we should start with pr_warn_once().
>
> Yes, I do not really like the WARN_ON here. It is overkill. pr_warn
> would be much less intrusive but potentially incomplete, because you
> won't know who the offender is. So if you really care about those then
> you would need to call dump_stack() as well.
>
> So the real question is: do we really care so deeply? After all,
> somebody might be calling this from within a spinlock or irq-disabled
> section, resulting in a similar situation without noticing.

There are discussions in [1] about making atomic rstat flush not
disable irqs throughout the process, so in that case it would only
result in a similar situation if the caller has irq disabled. The only
caller that might have irq disabled is the same caller that might be
in irq context before this series: mem_cgroup_usage().

On that note, and while I have your attention, I was wondering if we
can eliminate the flush call completely from mem_cgroup_usage(), and
read the global stats counters for root memcg instead of the root
counters. There might be subtle differences, but the root memcg usage
isn't super accurate now anyway (e.g. it doesn't include kernel
memory).
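
Very roughly something like the below (a sketch only -- the swap part in
particular is made up, and whether the global counters are close enough
to the root rstat counters is exactly the question):

	/* hypothetical root-memcg path in mem_cgroup_usage(): skip the
	 * flush and read the global counters instead */
	if (mem_cgroup_is_root(memcg)) {
		val = global_node_page_state(NR_FILE_PAGES) +
		      global_node_page_state(NR_ANON_MAPPED);
		if (swap)
			val += total_swap_pages - get_nr_swap_pages();
	}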

With that removed, no callers to rstat flushing would be from irq
context or have irqs disabled. There will only be one atomic flusher
(mem_cgroup_wb_stats()), and we can proceed with [1] if it causes a
problem.

What do you think?

[1] https://lore.kernel.org/lkml/CAJD7tkZrp=4zWvjE9_010TAG1T_crCbf9P64UzJABspgcrGPKg@mail.gmail.com/

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 6/9] memcg: sleep during flushing stats in safe contexts
  2023-03-28 22:16 ` [PATCH v2 6/9] memcg: sleep during flushing stats in safe contexts Yosry Ahmed
@ 2023-03-30  7:35   ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2023-03-30  7:35 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Tue 28-03-23 22:16:41, Yosry Ahmed wrote:
> Currently, all contexts that flush memcg stats do so with sleeping not
> allowed. Some of these contexts are perfectly safe to sleep in, such as
> reading cgroup files from userspace or the background periodic flusher.
> 
> Refactor the code to make mem_cgroup_flush_stats() non-atomic (aka
> sleepable), and provide a separate atomic version. The atomic version is
> used in reclaim, refault, writeback, and in mem_cgroup_usage(). All
> other code paths are left to use the non-atomic version. This includes
> callbacks for userspace reads and the periodic flusher.
> 
> Since refault is the only caller of mem_cgroup_flush_stats_ratelimited(),
> change it to mem_cgroup_flush_stats_atomic_ratelimited(). Reclaim and
> refault code paths are modified to do non-atomic flushing in separate
> later patches -- so it will eventually be changed back to
> mem_cgroup_flush_stats_ratelimited().
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Shakeel Butt <shakeelb@google.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h |  9 ++++++--
>  mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++--------
>  mm/vmscan.c                |  2 +-
>  mm/workingset.c            |  2 +-
>  4 files changed, 45 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index ac3f3b3a45e2..b424ba3ebd09 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1037,7 +1037,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>  }
>  
>  void mem_cgroup_flush_stats(void);
> -void mem_cgroup_flush_stats_ratelimited(void);
> +void mem_cgroup_flush_stats_atomic(void);
> +void mem_cgroup_flush_stats_atomic_ratelimited(void);
>  
>  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>  			      int val);
> @@ -1535,7 +1536,11 @@ static inline void mem_cgroup_flush_stats(void)
>  {
>  }
>  
> -static inline void mem_cgroup_flush_stats_ratelimited(void)
> +static inline void mem_cgroup_flush_stats_atomic(void)
> +{
> +}
> +
> +static inline void mem_cgroup_flush_stats_atomic_ratelimited(void)
>  {
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 65750f8b8259..a2ce3aa10d94 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -634,7 +634,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>  	}
>  }
>  
> -static void __mem_cgroup_flush_stats(void)
> +static void do_flush_stats(bool atomic)
>  {
>  	/*
>  	 * We always flush the entire tree, so concurrent flushers can just
> @@ -646,26 +646,46 @@ static void __mem_cgroup_flush_stats(void)
>  		return;
>  
>  	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
> -	cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
> +
> +	if (atomic)
> +		cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
> +	else
> +		cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> +
>  	atomic_set(&stats_flush_threshold, 0);
>  	atomic_set(&stats_flush_ongoing, 0);
>  }
>  
> +static bool should_flush_stats(void)
> +{
> +	return atomic_read(&stats_flush_threshold) > num_online_cpus();
> +}
> +
>  void mem_cgroup_flush_stats(void)
>  {
> -	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
> -		__mem_cgroup_flush_stats();
> +	if (should_flush_stats())
> +		do_flush_stats(false);
>  }
>  
> -void mem_cgroup_flush_stats_ratelimited(void)
> +void mem_cgroup_flush_stats_atomic(void)
> +{
> +	if (should_flush_stats())
> +		do_flush_stats(true);
> +}
> +
> +void mem_cgroup_flush_stats_atomic_ratelimited(void)
>  {
>  	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
> -		mem_cgroup_flush_stats();
> +		mem_cgroup_flush_stats_atomic();
>  }
>  
>  static void flush_memcg_stats_dwork(struct work_struct *w)
>  {
> -	__mem_cgroup_flush_stats();
> +	/*
> +	 * Always flush here so that flushing in latency-sensitive paths is
> +	 * as cheap as possible.
> +	 */
> +	do_flush_stats(false);
>  	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
>  }
>  
> @@ -3685,9 +3705,12 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>  		 * done from irq context; use stale stats in this case.
>  		 * Arguably, usage threshold events are not reliable on the root
>  		 * memcg anyway since its usage is ill-defined.
> +		 *
> +		 * Additionally, other call paths through memcg_check_events()
> +		 * disable irqs, so make sure we are flushing stats atomically.
>  		 */
>  		if (in_task())
> -			mem_cgroup_flush_stats();
> +			mem_cgroup_flush_stats_atomic();
>  		val = memcg_page_state(memcg, NR_FILE_PAGES) +
>  			memcg_page_state(memcg, NR_ANON_MAPPED);
>  		if (swap)
> @@ -4610,7 +4633,11 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
>  	struct mem_cgroup *parent;
>  
> -	mem_cgroup_flush_stats();
> +	/*
> +	 * wb_writeback() takes a spinlock and calls
> +	 * wb_over_bg_thresh()->mem_cgroup_wb_stats(). Do not sleep.
> +	 */
> +	mem_cgroup_flush_stats_atomic();
>  
>  	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
>  	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9c1c5e8b24b8..a9511ccb936f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
>  	 * Flush the memory cgroup stats, so that we read accurate per-memcg
>  	 * lruvec stats for heuristics.
>  	 */
> -	mem_cgroup_flush_stats();
> +	mem_cgroup_flush_stats_atomic();
>  
>  	/*
>  	 * Determine the scan balance between anon and file LRUs.
> diff --git a/mm/workingset.c b/mm/workingset.c
> index af862c6738c3..dab0c362b9e3 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -462,7 +462,7 @@ void workingset_refault(struct folio *folio, void *shadow)
>  
>  	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
>  
> -	mem_cgroup_flush_stats_ratelimited();
> +	mem_cgroup_flush_stats_atomic_ratelimited();
>  	/*
>  	 * Compare the distance to the existing workingset size. We
>  	 * don't activate pages that couldn't stay resident even if
> -- 
> 2.40.0.348.gf938b09366-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 7/9] workingset: memcg: sleep when flushing stats in workingset_refault()
  2023-03-28 22:16 ` [PATCH v2 7/9] workingset: memcg: sleep when flushing stats in workingset_refault() Yosry Ahmed
@ 2023-03-30  7:39   ` Michal Hocko
  2023-03-30  7:42     ` Yosry Ahmed
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2023-03-30  7:39 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Tue 28-03-23 22:16:42, Yosry Ahmed wrote:
> In workingset_refault(), we call
> mem_cgroup_flush_stats_atomic_ratelimited() to flush stats within an
> RCU read section and with sleeping disallowed. Move the call above
> the RCU read section and allow sleeping to avoid unnecessarily
> performing a lot of work without sleeping.

Could you say a few words about why the flushing is done before the
counters are updated rather than after (the RCU section)?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 8/9] vmscan: memcg: sleep when flushing stats during reclaim
  2023-03-28 22:16 ` [PATCH v2 8/9] vmscan: memcg: sleep when flushing stats during reclaim Yosry Ahmed
@ 2023-03-30  7:40   ` Michal Hocko
  2023-03-30  7:44     ` Yosry Ahmed
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2023-03-30  7:40 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Tue 28-03-23 22:16:43, Yosry Ahmed wrote:
> Memory reclaim is a sleepable context. Allow sleeping when flushing
> memcg stats to avoid unnecessarily performing a lot of work without
> sleeping. This can slow down reclaim code if flushing stats is taking
> too long, but there are already multiple cond_resched() calls in
> reclaim code.

Why is this preferred? Memory reclaim is surely a slow path but what is
the advantage of calling mem_cgroup_flush_stats here?

> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Shakeel Butt <shakeelb@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a9511ccb936f..9c1c5e8b24b8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
>  	 * Flush the memory cgroup stats, so that we read accurate per-memcg
>  	 * lruvec stats for heuristics.
>  	 */
> -	mem_cgroup_flush_stats_atomic();
> +	mem_cgroup_flush_stats();
>  
>  	/*
>  	 * Determine the scan balance between anon and file LRUs.
> -- 
> 2.40.0.348.gf938b09366-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 7/9] workingset: memcg: sleep when flushing stats in workingset_refault()
  2023-03-30  7:39   ` Michal Hocko
@ 2023-03-30  7:42     ` Yosry Ahmed
  2023-03-30  7:50       ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-30  7:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Thu, Mar 30, 2023 at 12:39 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 28-03-23 22:16:42, Yosry Ahmed wrote:
> > In workingset_refault(), we call
> > mem_cgroup_flush_stats_atomic_ratelimited() to flush stats within an
> > RCU read section and with sleeping disallowed. Move the call above
> > the RCU read section and allow sleeping to avoid unnecessarily
> > performing a lot of work without sleeping.
>
> Could you say a few words about why the flushing is done before the
> counters are updated rather than after (the RCU section)?

It's not about the counters that are updated, it's about the counters
that we read. Stats readers do a flush first to read accurate stats.
We flush before a read, not after an update.
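
I.e. the reader pattern is, conceptually:

	/* readers flush first so the subsequent reads are accurate */
	mem_cgroup_flush_stats();
	val = memcg_page_state(memcg, NR_FILE_PAGES);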

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 9/9] memcg: do not modify rstat tree for zero updates
  2023-03-28 22:16 ` [PATCH v2 9/9] memcg: do not modify rstat tree for zero updates Yosry Ahmed
@ 2023-03-30  7:43   ` Michal Hocko
  0 siblings, 0 replies; 40+ messages in thread
From: Michal Hocko @ 2023-03-30  7:43 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Tue 28-03-23 22:16:44, Yosry Ahmed wrote:
> In some situations, we may end up calling memcg_rstat_updated() with a
> value of 0, which means the stat was not actually updated. An example is
> if we fail to reclaim any pages in shrink_folio_list().
> 
> Do not add the cgroup to the rstat updated tree in this case, to avoid
> unnecessarily flushing it.
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Shakeel Butt <shakeelb@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 361c0bbf7283..a63ee2efa780 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -618,6 +618,9 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>  {
>  	unsigned int x;
>  
> +	if (!val)
> +		return;
> +
>  	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
>  
>  	x = __this_cpu_add_return(stats_updates, abs(val));
> -- 
> 2.40.0.348.gf938b09366-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 8/9] vmscan: memcg: sleep when flushing stats during reclaim
  2023-03-30  7:40   ` Michal Hocko
@ 2023-03-30  7:44     ` Yosry Ahmed
  2023-03-30  7:52       ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-30  7:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Thu, Mar 30, 2023 at 12:40 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 28-03-23 22:16:43, Yosry Ahmed wrote:
> > Memory reclaim is a sleepable context. Allow sleeping when flushing
> > memcg stats to avoid unnecessarily performing a lot of work without
> > sleeping. This can slow down reclaim code if flushing stats is taking
> > too long, but there are already multiple cond_resched() calls in
> > reclaim code.
>
> Why is this preferred? Memory reclaim is surely a slow path but what is
> the advantage of calling mem_cgroup_flush_stats here?

The purpose of this series is to limit calls to atomic flushing as
much as possible, as flushing can become really expensive on systems
with high cpu counts and a lot of cgroups, and performing such an
expensive operation atomically causes problems -- so we'd rather avoid
doing it atomically where possible.

>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > Acked-by: Shakeel Butt <shakeelb@google.com>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/vmscan.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a9511ccb936f..9c1c5e8b24b8 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
> >        * Flush the memory cgroup stats, so that we read accurate per-memcg
> >        * lruvec stats for heuristics.
> >        */
> > -     mem_cgroup_flush_stats_atomic();
> > +     mem_cgroup_flush_stats();
> >
> >       /*
> >        * Determine the scan balance between anon and file LRUs.
> > --
> > 2.40.0.348.gf938b09366-goog
>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  2023-03-30  7:13           ` Yosry Ahmed
@ 2023-03-30  7:49             ` Michal Hocko
  2023-03-30  8:06               ` Yosry Ahmed
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2023-03-30  7:49 UTC (permalink / raw)
  To: Yosry Ahmed, Johannes Weiner
  Cc: Shakeel Butt, Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li,
	Roman Gushchin, Muchun Song, Andrew Morton, Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Thu 30-03-23 00:13:16, Yosry Ahmed wrote:
> On Thu, Mar 30, 2023 at 12:06 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > So the real question is: do we really care so deeply? After all,
> > somebody might be calling this from within a spinlock or irq-disabled
> > section, resulting in a similar situation without noticing.
> 
> There are discussions in [1] about making atomic rstat flush not
> disable irqs throughout the process, so in that case it would only
> result in a similar situation if the caller has irq disabled. The only
> caller that might have irq disabled is the same caller that might be
> in irq context before this series: mem_cgroup_usage().
> 
> On that note, and while I have your attention, I was wondering if we
> can eliminate the flush call completely from mem_cgroup_usage(), and
> read the global stats counters for root memcg instead of the root
> counters. There might be subtle differences, but the root memcg usage
> isn't super accurate now anyway (e.g. it doesn't include kernel
> memory).

root memcg stats are imprecise indeed, and I have to admit I do not
really know why we are adding more work for that case. I tried to dig
into the git history for that yesterday but couldn't find a satisfying
answer. It goes all the way back to 2d146aa3aa842, which did the switch
to rstat. Maybe Johannes remembers.

Anyway, back to this particular patch. I still do not see strong reasons
to be verbose about the !in_task case, so I would just drop this patch.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 7/9] workingset: memcg: sleep when flushing stats in workingset_refault()
  2023-03-30  7:42     ` Yosry Ahmed
@ 2023-03-30  7:50       ` Michal Hocko
  2023-03-30  7:55         ` Yosry Ahmed
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2023-03-30  7:50 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Thu 30-03-23 00:42:36, Yosry Ahmed wrote:
> On Thu, Mar 30, 2023 at 12:39 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 28-03-23 22:16:42, Yosry Ahmed wrote:
> > > In workingset_refault(), we call
> > > mem_cgroup_flush_stats_atomic_ratelimited() to flush stats within an
> > > RCU read section and with sleeping disallowed. Move the call above
> > > the RCU read section and allow sleeping to avoid unnecessarily
> > > performing a lot of work without sleeping.
> >
> > Could you say a few words about why the flushing is done before the
> > counters are updated rather than after (the RCU section)?
> 
> It's not about the counters that are updated, it's about the counters
> that we read. Stats readers do a flush first to read accurate stats.
> We flush before a read, not after an update.

Right you are, my bad, I misread the intention here.

Acked-by: Michal Hocko <mhocko@suse.com>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 8/9] vmscan: memcg: sleep when flushing stats during reclaim
  2023-03-30  7:44     ` Yosry Ahmed
@ 2023-03-30  7:52       ` Michal Hocko
  2023-03-30  7:54         ` Yosry Ahmed
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2023-03-30  7:52 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Thu 30-03-23 00:44:10, Yosry Ahmed wrote:
> On Thu, Mar 30, 2023 at 12:40 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 28-03-23 22:16:43, Yosry Ahmed wrote:
> > > Memory reclaim is a sleepable context. Allow sleeping when flushing
> > > memcg stats to avoid unnecessarily performing a lot of work without
> > > sleeping. This can slow down reclaim code if flushing stats is taking
> > > too long, but there are already multiple cond_resched() calls in
> > > reclaim code.
> >
> > Why is this preferred? Memory reclaim is surely a slow path but what is
> > the advantage of calling mem_cgroup_flush_stats here?
> 
> The purpose of this series is to limit calls to atomic flushing as
> much as possible, as flushing can become really expensive on systems
> with high cpu counts and a lot of cgroups, and performing such an
> expensive operation atomically causes problems -- so we'd rather avoid
> doing it atomically where possible.

Please add that to the changelog. While the intention might be obvious
now (although the cover letter is not explicit about it either), it can
cause some head scratching in the future when somebody looks at this
commit without the broader context (e.g. previous ML discussions).

with that
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks

> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > > Acked-by: Shakeel Butt <shakeelb@google.com>
> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > ---
> > >  mm/vmscan.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index a9511ccb936f..9c1c5e8b24b8 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
> > >        * Flush the memory cgroup stats, so that we read accurate per-memcg
> > >        * lruvec stats for heuristics.
> > >        */
> > > -     mem_cgroup_flush_stats_atomic();
> > > +     mem_cgroup_flush_stats();
> > >
> > >       /*
> > >        * Determine the scan balance between anon and file LRUs.
> > > --
> > > 2.40.0.348.gf938b09366-goog
> >
> > --
> > Michal Hocko
> > SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 8/9] vmscan: memcg: sleep when flushing stats during reclaim
  2023-03-30  7:52       ` Michal Hocko
@ 2023-03-30  7:54         ` Yosry Ahmed
  0 siblings, 0 replies; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-30  7:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Thu, Mar 30, 2023 at 12:52 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 30-03-23 00:44:10, Yosry Ahmed wrote:
> > On Thu, Mar 30, 2023 at 12:40 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 28-03-23 22:16:43, Yosry Ahmed wrote:
> > > > Memory reclaim is a sleepable context. Allow sleeping when flushing
> > > > memcg stats to avoid unnecessarily performing a lot of work without
> > > > sleeping. This can slow down reclaim code if flushing stats is taking
> > > > too long, but there are already multiple cond_resched() calls in
> > > > reclaim code.
> > >
> > > Why is this preferred? Memory reclaim is surely a slow path but what is
> > > the advantage of calling mem_cgroup_flush_stats here?
> >
> > The purpose of this series is to limit calls to atomic flushing as
> > much as possible, as flushing can become really expensive on systems
> > with high cpu counts and a lot of cgroups, and performing such an
> > expensive operation atomically causes problems -- so we'd rather avoid
> > doing it atomically where possible.
>
> Please add that to the changelog. While the intention might be obvious
> now (although the cover letter is not explicit about it either), it can
> cause some head scratching in the future when somebody looks at this
> commit without the broader context (e.g. previous ML discussions).
>
> with that
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks, will do for the respin.

> Thanks
>
> > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > > > Acked-by: Shakeel Butt <shakeelb@google.com>
> > > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > > ---
> > > >  mm/vmscan.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index a9511ccb936f..9c1c5e8b24b8 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
> > > >        * Flush the memory cgroup stats, so that we read accurate per-memcg
> > > >        * lruvec stats for heuristics.
> > > >        */
> > > > -     mem_cgroup_flush_stats_atomic();
> > > > +     mem_cgroup_flush_stats();
> > > >
> > > >       /*
> > > >        * Determine the scan balance between anon and file LRUs.
> > > > --
> > > > 2.40.0.348.gf938b09366-goog
> > >
> > > --
> > > Michal Hocko
> > > SUSE Labs
>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 7/9] workingset: memcg: sleep when flushing stats in workingset_refault()
  2023-03-30  7:50       ` Michal Hocko
@ 2023-03-30  7:55         ` Yosry Ahmed
  0 siblings, 0 replies; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-30  7:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Thu, Mar 30, 2023 at 12:50 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 30-03-23 00:42:36, Yosry Ahmed wrote:
> > On Thu, Mar 30, 2023 at 12:39 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 28-03-23 22:16:42, Yosry Ahmed wrote:
> > > > In workingset_refault(), we call
> > > > mem_cgroup_flush_stats_atomic_ratelimited() to flush stats within an
> > > > RCU read section and with sleeping disallowed. Move the call above
> > > > the RCU read section and allow sleeping to avoid unnecessarily
> > > > performing a lot of work without sleeping.
> > >
> > > Could you say a few words about why the flushing is done before the
> > > counters are updated rather than after (the RCU section)?
> >
> > It's not about the counters that are updated, it's about the counters
> > that we read. Stats readers do a flush first to read accurate stats.
> > We flush before a read, not after an update.
>
> Right you are, my bad, I misread the intention here.
>
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  2023-03-30  7:49             ` Michal Hocko
@ 2023-03-30  8:06               ` Yosry Ahmed
  2023-03-30  8:14                 ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-30  8:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Shakeel Butt, Tejun Heo, Josef Bacik,
	Jens Axboe, Zefan Li, Roman Gushchin, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Thu, Mar 30, 2023 at 12:49 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 30-03-23 00:13:16, Yosry Ahmed wrote:
> > On Thu, Mar 30, 2023 at 12:06 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > So the real question is: do we really care so deeply? After all,
> > > somebody might be calling this from within a spinlock or irq-disabled
> > > section, resulting in a similar situation without noticing.
> >
> > There are discussions in [1] about making atomic rstat flush not
> > disable irqs throughout the process, so in that case it would only
> > result in a similar situation if the caller has irq disabled. The only
> > caller that might have irq disabled is the same caller that might be
> > in irq context before this series: mem_cgroup_usage().
> >
> > On that note, and while I have your attention, I was wondering if we
> > can eliminate the flush call completely from mem_cgroup_usage(), and
> > read the global stats counters for root memcg instead of the root
> > counters. There might be subtle differences, but the root memcg usage
> > isn't super accurate now anyway (e.g. it doesn't include kernel
> > memory).
>
> root memcg stats are imprecise indeed, and I have to admit I do not
> really know why we are adding more work for that case. I tried to dig
> into the git history for that yesterday but couldn't find a satisfying
> answer. It goes all the way back to 2d146aa3aa842, which did the switch
> to rstat. Maybe Johannes remembers.

I think it goes back even before that. Even before rstat, we used to
get the root usage by iterating over the hierarchy. Maybe we didn't
have the global counters at some point or they weren't in line with
the root memcg (e.g. use_hierarchy = 0). It would be great if we could
just use the global counters here. Hopefully Johannes can confirm or
deny this.

>
> Anyway, back to this particular patch. I still do not see strong reasons
> to be verbose about the !in_task case, so I would just drop this patch.

I will drop this patch in the next iteration. If I implement a patch
that makes rstat flushing not disable irqs all throughout (like [1]),
whether part of this series or not, and we remove flushing from
mem_cgroup_usage(), then at this point:
- No existing flushers will be disabling irqs.
- rstat flushing itself will not be disabling irqs throughout the entire flush.

If we achieve that, do you think it makes sense to add
WARN_ON_ONCE(irqs_disabled()) instead to prevent future users from
flushing while disabling irqs or in irq context?

All I am trying to achieve here is make sure we don't regress. I don't
want to minimize the current atomic flushers now only to have them
increase later.

[1] https://lore.kernel.org/lkml/CAJD7tkZrp=4zWvjE9_010TAG1T_crCbf9P64UzJABspgcrGPKg@mail.gmail.com/

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  2023-03-30  8:06               ` Yosry Ahmed
@ 2023-03-30  8:14                 ` Michal Hocko
  2023-03-30  8:19                   ` Yosry Ahmed
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2023-03-30  8:14 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Shakeel Butt, Tejun Heo, Josef Bacik,
	Jens Axboe, Zefan Li, Roman Gushchin, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Thu 30-03-23 01:06:26, Yosry Ahmed wrote:
[...]
> If we achieve that, do you think it makes sense to add
> WARN_ON_ONCE(irqs_disabled()) instead to prevent future users from
> flushing while disabling irqs or in irq context?

WARN_ON (similar to BUG_ON) will not prevent anybody from doing bad
things. We already have means to shout about sleepable code being
invoked from an atomic context and there is no reason to duplicate that.
As I've said earlier WARN_ON might panic the system in some
configurations (and yes they are used also in production systems - do
not ask me why...). So please be careful about that and use that only
when something really bad (yet recoverable) is going on.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  2023-03-30  8:14                 ` Michal Hocko
@ 2023-03-30  8:19                   ` Yosry Ahmed
  2023-03-30  8:39                     ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-30  8:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Shakeel Butt, Tejun Heo, Josef Bacik,
	Jens Axboe, Zefan Li, Roman Gushchin, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Thu, Mar 30, 2023 at 1:15 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 30-03-23 01:06:26, Yosry Ahmed wrote:
> [...]
> > If we achieve that, do you think it makes sense to add
> > WARN_ON_ONCE(irqs_disabled()) instead to prevent future users from
> > flushing while disabling irqs or in irq context?
>
> WARN_ON (similar to BUG_ON) will not prevent anybody from doing bad
> things. We already have means to shout about sleepable code being
> invoked from an atomic context and there is no reason to duplicate that.
> As I've said earlier WARN_ON might panic the system in some
> configurations (and yes they are used also in production systems - do
> not ask me why...). So please be careful about that and use that only
> when something really bad (yet recoverable) is going on.

Thanks for the information (I was about to ask why that is done on
production systems, but okay...). I will avoid WARN_ON completely. For
the purposes of this series I will drop this patch anyway.

Any idea how to shout about "hey this may take too long, why are you
doing it with irqs disabled?!"?

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  2023-03-30  8:19                   ` Yosry Ahmed
@ 2023-03-30  8:39                     ` Michal Hocko
  2023-03-30  8:53                       ` Yosry Ahmed
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2023-03-30  8:39 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Shakeel Butt, Tejun Heo, Josef Bacik,
	Jens Axboe, Zefan Li, Roman Gushchin, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Thu 30-03-23 01:19:29, Yosry Ahmed wrote:
> On Thu, Mar 30, 2023 at 1:15 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 30-03-23 01:06:26, Yosry Ahmed wrote:
> > [...]
> > > If we achieve that, do you think it makes sense to add
> > > WARN_ON_ONCE(irqs_disabled()) instead to prevent future users from
> > > flushing while disabling irqs or in irq context?
> >
> > WARN_ON (similar to BUG_ON) will not prevent anybody from doing bad
> > things. We already have means to shout about sleepable code being
> > invoked from an atomic context and there is no reason to duplicate that.
> > As I've said earlier WARN_ON might panic the system in some
> > configurations (and yes they are used also in production systems - do
> > not ask me why...). So please be careful about that and use that only
> > when something really bad (yet recoverable) is going on.
> 
> Thanks for the information (I was about to ask why that is done on
> production systems, but okay...). I will avoid WARN_ON completely. For
> the purposes of this series I will drop this patch anyway.

Thanks! People do strange things sometimes...

> Any idea how to shout about "hey this may take too long, why are you
> doing it with irqs disabled?!"?

Well, we have a hard lockup detector. It hits at a much higher stall
threshold by default, but if you care about IRQ latencies in general
then you likely want to lower it. Another thing would be IRQ tracing.
In any case this code path shouldn't be special. Sure, it can take a
long time on large systems, but I bet there are more paths like that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  2023-03-30  8:39                     ` Michal Hocko
@ 2023-03-30  8:53                       ` Yosry Ahmed
  2023-03-31 11:02                         ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-30  8:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Shakeel Butt, Tejun Heo, Josef Bacik,
	Jens Axboe, Zefan Li, Roman Gushchin, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Thu, Mar 30, 2023 at 1:39 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 30-03-23 01:19:29, Yosry Ahmed wrote:
> > On Thu, Mar 30, 2023 at 1:15 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 30-03-23 01:06:26, Yosry Ahmed wrote:
> > > [...]
> > > > If we achieve that, do you think it makes sense to add
> > > > WARN_ON_ONCE(irqs_disabled()) instead to prevent future users from
> > > > flushing while disabling irqs or in irq context?
> > >
> > > WARN_ON (similar to BUG_ON) will not prevent anybody from doing bad
> > > things. We already have means to shout about sleepable code being
> > > invoked from an atomic context and there is no reason to duplicate that.
> > > As I've said earlier WARN_ON might panic the system in some
> > > configurations (and yes they are used also in production systems - do
> > > not ask me why...). So please be careful about that and use that only
> > > when something really bad (yet recoverable) is going on.
> >
> > Thanks for the information (I was about to ask why that is done on
> > production systems, but okay...). I will avoid WARN_ON completely. For
> > the purposes of this series I will drop this patch anyway.
>
> Thanks! People do strange things sometimes...
>
> > Any idea how to shout about "hey this may take too long, why are you
> > doing it with irqs disabled?!"?
>
> Well, we have a hard lockup detector. It hits at a much higher stall
> threshold by default, but if you care about IRQ latencies in general
> then you likely want to lower it. Another thing would be IRQ tracing.
> In any case this code path shouldn't be special. Sure, it can take a
> long time on large systems, but I bet there are more paths like that.

We did see hard lockups in extreme cases, and we do have setups with
"nmi_watchdog=panic" that will panic when this happens, so we would
rather catch the code paths that can lead to that before it actually
happens.

Maybe we can add a primitive like might_sleep() for this, just food for thought.
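
Something like the below, perhaps (the name and shape are entirely made
up):

	/* in the spirit of might_sleep(): annotate paths that are too
	 * expensive to run with IRQs disabled */
	static inline void might_run_long(void)
	{
		static bool warned;	/* racy, fine for a debug aid */

		if (irqs_disabled() && !warned) {
			warned = true;
			pr_warn("expensive path called with IRQs disabled\n");
			dump_stack();
		}
	}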

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  2023-03-30  8:53                       ` Yosry Ahmed
@ 2023-03-31 11:02                         ` Michal Hocko
  2023-03-31 19:03                           ` Yosry Ahmed
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2023-03-31 11:02 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Shakeel Butt, Tejun Heo, Josef Bacik,
	Jens Axboe, Zefan Li, Roman Gushchin, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Thu 30-03-23 01:53:38, Yosry Ahmed wrote:
[...]
> Maybe we can add a primitive like might_sleep() for this, just food for thought.

I do not think it is correct to abuse might_sleep() if the function
itself doesn't sleep. If it does, might_sleep() is already involved.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  2023-03-31 11:02                         ` Michal Hocko
@ 2023-03-31 19:03                           ` Yosry Ahmed
  2023-04-03  8:38                             ` Michal Hocko
  0 siblings, 1 reply; 40+ messages in thread
From: Yosry Ahmed @ 2023-03-31 19:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Shakeel Butt, Tejun Heo, Josef Bacik,
	Jens Axboe, Zefan Li, Roman Gushchin, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Fri, Mar 31, 2023 at 4:02 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 30-03-23 01:53:38, Yosry Ahmed wrote:
> [...]
> > Maybe we can add a primitive like might_sleep() for this, just food for thought.
>
> I do not think it is correct to abuse might_sleep() if the function
> itself doesn't sleep. If it does, might_sleep() is already involved.

Oh, sorry if I wasn't clear, I did not mean to reuse might_sleep() --
I meant introducing a new similar debug primitive that shouts if irqs
are disabled.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  2023-03-31 19:03                           ` Yosry Ahmed
@ 2023-04-03  8:38                             ` Michal Hocko
  2023-04-03 20:39                               ` Yosry Ahmed
  0 siblings, 1 reply; 40+ messages in thread
From: Michal Hocko @ 2023-04-03  8:38 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Shakeel Butt, Tejun Heo, Josef Bacik,
	Jens Axboe, Zefan Li, Roman Gushchin, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Fri 31-03-23 12:03:47, Yosry Ahmed wrote:
> On Fri, Mar 31, 2023 at 4:02 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 30-03-23 01:53:38, Yosry Ahmed wrote:
> > [...]
> > > Maybe we can add a primitive like might_sleep() for this, just food for thought.
> >
> > I do not think it is correct to abuse might_sleep() if the function
> > itself doesn't sleep. If it does, might_sleep() is already involved.
> 
> Oh, sorry if I wasn't clear, I did not mean to reuse might_sleep() --
> I meant introducing a new similar debug primitive that shouts if irqs
> are disabled.

This is circling back to the original concern about the arbitrary
decision to care about IRQs. Is this really any different from spinlocks
or preempt-disabled critical sections preventing any scheduling and
potentially triggering soft lockups?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
  2023-04-03  8:38                             ` Michal Hocko
@ 2023-04-03 20:39                               ` Yosry Ahmed
  0 siblings, 0 replies; 40+ messages in thread
From: Yosry Ahmed @ 2023-04-03 20:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Shakeel Butt, Tejun Heo, Josef Bacik,
	Jens Axboe, Zefan Li, Roman Gushchin, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf

On Mon, Apr 3, 2023 at 1:38 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 31-03-23 12:03:47, Yosry Ahmed wrote:
> > On Fri, Mar 31, 2023 at 4:02 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 30-03-23 01:53:38, Yosry Ahmed wrote:
> > > [...]
> > > > Maybe we can add a primitive like might_sleep() for this, just food for thought.
> > >
> > > I do not think it is correct to abuse might_sleep() if the function
> > > itself doesn't sleep. If it does, might_sleep() is already involved.
> >
> > Oh, sorry if I wasn't clear, I did not mean to reuse might_sleep() --
> > I meant introducing a new similar debug primitive that shouts if irqs
> > are disabled.
>
> This is circling back to the original concern about the arbitrary
> decision to care about IRQs. Is this really any different from spinlocks
> or preempt-disabled critical sections preventing any scheduling and
> potentially triggering soft lockups?

Not really, I am sure there are other code paths that may cause
similar problems. At least we can start annotating them so that we
don't regress them (e.g. by introducing a caller that disables irqs --
converting them into hard lockups).

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread

Thread overview: 40+ messages
2023-03-28 22:16 [PATCH v2 0/9] memcg: make rstat flushing irq and sleep Yosry Ahmed
2023-03-28 22:16 ` [PATCH v2 1/9] cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic" Yosry Ahmed
2023-03-28 22:16 ` [PATCH v2 2/9] memcg: rename mem_cgroup_flush_stats_"delayed" to "ratelimited" Yosry Ahmed
2023-03-29 11:56   ` Michal Hocko
2023-03-28 22:16 ` [PATCH v2 3/9] memcg: do not flush stats in irq context Yosry Ahmed
2023-03-29 11:58   ` Michal Hocko
2023-03-28 22:16 ` [PATCH v2 4/9] cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context Yosry Ahmed
2023-03-29 11:22   ` Michal Hocko
2023-03-29 18:41     ` Yosry Ahmed
2023-03-29 19:20       ` Shakeel Butt
2023-03-30  7:06         ` Michal Hocko
2023-03-30  7:13           ` Yosry Ahmed
2023-03-30  7:49             ` Michal Hocko
2023-03-30  8:06               ` Yosry Ahmed
2023-03-30  8:14                 ` Michal Hocko
2023-03-30  8:19                   ` Yosry Ahmed
2023-03-30  8:39                     ` Michal Hocko
2023-03-30  8:53                       ` Yosry Ahmed
2023-03-31 11:02                         ` Michal Hocko
2023-03-31 19:03                           ` Yosry Ahmed
2023-04-03  8:38                             ` Michal Hocko
2023-04-03 20:39                               ` Yosry Ahmed
2023-03-28 22:16 ` [PATCH v2 5/9] memcg: replace stats_flush_lock with an atomic Yosry Ahmed
2023-03-28 22:22   ` Shakeel Butt
2023-03-29 15:58   ` Michal Hocko
2023-03-29 18:45     ` Yosry Ahmed
2023-03-28 22:16 ` [PATCH v2 6/9] memcg: sleep during flushing stats in safe contexts Yosry Ahmed
2023-03-30  7:35   ` Michal Hocko
2023-03-28 22:16 ` [PATCH v2 7/9] workingset: memcg: sleep when flushing stats in workingset_refault() Yosry Ahmed
2023-03-30  7:39   ` Michal Hocko
2023-03-30  7:42     ` Yosry Ahmed
2023-03-30  7:50       ` Michal Hocko
2023-03-30  7:55         ` Yosry Ahmed
2023-03-28 22:16 ` [PATCH v2 8/9] vmscan: memcg: sleep when flushing stats during reclaim Yosry Ahmed
2023-03-30  7:40   ` Michal Hocko
2023-03-30  7:44     ` Yosry Ahmed
2023-03-30  7:52       ` Michal Hocko
2023-03-30  7:54         ` Yosry Ahmed
2023-03-28 22:16 ` [PATCH v2 9/9] memcg: do not modify rstat tree for zero updates Yosry Ahmed
2023-03-30  7:43   ` Michal Hocko
