bpf.vger.kernel.org archive mirror
* [PATCH v3 0/8] memcg: avoid flushing stats atomically where possible
@ 2023-03-30 19:17 Yosry Ahmed
  2023-03-30 19:17 ` [PATCH v3 1/8] cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic" Yosry Ahmed
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-03-30 19:17 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed

rstat flushing is an expensive operation that scales with the number of
cpus and the number of cgroups in the system. The purpose of this series
is to minimize the contexts where we flush stats atomically.

Patches 1 and 2 are cleanups requested during reviews of prior versions
of this series.

Patch 3 makes sure we never try to flush from within an irq context.

Patches 4 to 7 introduce separate variants of mem_cgroup_flush_stats()
for atomic and non-atomic flushing, and make sure we only flush the
stats atomically when necessary.

Patch 8 is a slightly tangential optimization that limits the work done
by rstat flushing in some scenarios.

v2 -> v3:
- Collected more Acks (thanks everyone!).
- Dropped controversial patch 4 from v2.
- Improved commit logs and cover letter (Michal).
v2: https://lore.kernel.org/linux-mm/20230328221644.803272-1-yosryahmed@google.com/

v1 -> v2:
- Added more context in patch 4's commit log.
- Added atomic_read() before atomic_xchg() in patch 5 to avoid
  needlessly locking the cache line (Shakeel).
- Refactored patch 6: added a common helper, do_flush_stats(), for
  mem_cgroup_flush_stats{_atomic}() (Johannes).
- Renamed mem_cgroup_flush_stats_ratelimited() to
  mem_cgroup_flush_stats_atomic_ratelimited() in patch 6. It is restored
  in patch 7, but this maintains consistency (Johannes).
- Added line break to keep the lock section visually separated in patch
  7 (Johannes).
v1: https://lore.kernel.org/lkml/20230328061638.203420-1-yosryahmed@google.com/

RFC -> v1:
- Dropped patch 1 that attempted to make the global rstat lock a non-irq
  lock; will follow up on that separately (Shakeel).
- Dropped stats_flush_lock entirely, replaced by an atomic (Johannes).
- Renamed cgroup_rstat_flush_irqsafe() to cgroup_rstat_flush_atomic()
  instead of removing it (Johannes).
- Added a patch to rename mem_cgroup_flush_stats_delayed() to
  mem_cgroup_flush_stats_ratelimited() (Johannes).
- Separate APIs for flushing memcg stats in atomic and non-atomic
  contexts instead of a boolean argument (Johannes).
- Added patches 3 & 4 to make sure we never flush from irq context
  (Shakeel & Johannes).
RFC: https://lore.kernel.org/lkml/20230323040037.2389095-1-yosryahmed@google.com/

Yosry Ahmed (8):
  cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic"
  memcg: rename mem_cgroup_flush_stats_"delayed" to "ratelimited"
  memcg: do not flush stats in irq context
  memcg: replace stats_flush_lock with an atomic
  memcg: sleep during flushing stats in safe contexts
  workingset: memcg: sleep when flushing stats in workingset_refault()
  vmscan: memcg: sleep when flushing stats during reclaim
  memcg: do not modify rstat tree for zero updates

 include/linux/cgroup.h     |  2 +-
 include/linux/memcontrol.h |  9 ++++-
 kernel/cgroup/rstat.c      |  4 +-
 mm/memcontrol.c            | 78 ++++++++++++++++++++++++++++++--------
 mm/workingset.c            |  5 ++-
 5 files changed, 76 insertions(+), 22 deletions(-)

-- 
2.40.0.348.gf938b09366-goog



* [PATCH v3 1/8] cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic"
  2023-03-30 19:17 [PATCH v3 0/8] memcg: avoid flushing stats atomically where possible Yosry Ahmed
@ 2023-03-30 19:17 ` Yosry Ahmed
  2023-03-30 19:17 ` [PATCH v3 2/8] memcg: rename mem_cgroup_flush_stats_"delayed" to "ratelimited" Yosry Ahmed
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-03-30 19:17 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed

cgroup_rstat_flush_irqsafe() can be a confusing name. It may read as
"irqs are disabled throughout", which is what the current implementation
does (and is currently under discussion [1]), but that is not the
intention. The intention is that this function is safe to call from
atomic contexts. Name it as such.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/cgroup.h | 2 +-
 kernel/cgroup/rstat.c  | 4 ++--
 mm/memcontrol.c        | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3410aecffdb4..885f5395fcd0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -692,7 +692,7 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
  */
 void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
 void cgroup_rstat_flush(struct cgroup *cgrp);
-void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp);
+void cgroup_rstat_flush_atomic(struct cgroup *cgrp);
 void cgroup_rstat_flush_hold(struct cgroup *cgrp);
 void cgroup_rstat_flush_release(void);
 
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 831f1f472bb8..d3252b0416b6 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -241,12 +241,12 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 }
 
 /**
- * cgroup_rstat_flush_irqsafe - irqsafe version of cgroup_rstat_flush()
+ * cgroup_rstat_flush_atomic - atomic version of cgroup_rstat_flush()
  * @cgrp: target cgroup
  *
  * This function can be called from any context.
  */
-void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp)
+void cgroup_rstat_flush_atomic(struct cgroup *cgrp)
 {
 	unsigned long flags;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5abffe6f8389..0205e58ea430 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -642,7 +642,7 @@ static void __mem_cgroup_flush_stats(void)
 		return;
 
 	flush_next_time = jiffies_64 + 2*FLUSH_TIME;
-	cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
+	cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
 	atomic_set(&stats_flush_threshold, 0);
 	spin_unlock_irqrestore(&stats_flush_lock, flag);
 }
-- 
2.40.0.348.gf938b09366-goog



* [PATCH v3 2/8] memcg: rename mem_cgroup_flush_stats_"delayed" to "ratelimited"
  2023-03-30 19:17 [PATCH v3 0/8] memcg: avoid flushing stats atomically where possible Yosry Ahmed
  2023-03-30 19:17 ` [PATCH v3 1/8] cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic" Yosry Ahmed
@ 2023-03-30 19:17 ` Yosry Ahmed
  2023-03-30 19:17 ` [PATCH v3 3/8] memcg: do not flush stats in irq context Yosry Ahmed
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-03-30 19:17 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed, Michal Hocko

mem_cgroup_flush_stats_delayed() suggests this is using a delayed_work,
but it actually sometimes flushes directly from the callsite.

What it really does is ratelimit calls. A better name would be
mem_cgroup_flush_stats_ratelimited().

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/memcontrol.h | 4 ++--
 mm/memcontrol.c            | 2 +-
 mm/workingset.c            | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b6eda2ab205d..ac3f3b3a45e2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1037,7 +1037,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 }
 
 void mem_cgroup_flush_stats(void);
-void mem_cgroup_flush_stats_delayed(void);
+void mem_cgroup_flush_stats_ratelimited(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
@@ -1535,7 +1535,7 @@ static inline void mem_cgroup_flush_stats(void)
 {
 }
 
-static inline void mem_cgroup_flush_stats_delayed(void)
+static inline void mem_cgroup_flush_stats_ratelimited(void)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0205e58ea430..c3b6aae78901 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -653,7 +653,7 @@ void mem_cgroup_flush_stats(void)
 		__mem_cgroup_flush_stats();
 }
 
-void mem_cgroup_flush_stats_delayed(void)
+void mem_cgroup_flush_stats_ratelimited(void)
 {
 	if (time_after64(jiffies_64, flush_next_time))
 		mem_cgroup_flush_stats();
diff --git a/mm/workingset.c b/mm/workingset.c
index 00c6f4d9d9be..af862c6738c3 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -462,7 +462,7 @@ void workingset_refault(struct folio *folio, void *shadow)
 
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
-	mem_cgroup_flush_stats_delayed();
+	mem_cgroup_flush_stats_ratelimited();
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't activate pages that couldn't stay resident even if
-- 
2.40.0.348.gf938b09366-goog



* [PATCH v3 3/8] memcg: do not flush stats in irq context
  2023-03-30 19:17 [PATCH v3 0/8] memcg: avoid flushing stats atomically where possible Yosry Ahmed
  2023-03-30 19:17 ` [PATCH v3 1/8] cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic" Yosry Ahmed
  2023-03-30 19:17 ` [PATCH v3 2/8] memcg: rename mem_cgroup_flush_stats_"delayed" to "ratelimited" Yosry Ahmed
@ 2023-03-30 19:17 ` Yosry Ahmed
  2023-03-30 19:17 ` [PATCH v3 4/8] memcg: replace stats_flush_lock with an atomic Yosry Ahmed
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-03-30 19:17 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed, Michal Hocko

Currently, the only path that can invoke an rstat flush from irq
context is mem_cgroup_usage() on the root memcg, when called from
memcg_check_events(). An rstat flush is an expensive operation that
should not be done in irq context, so do not flush stats and use the
stale stats in this case.

Arguably, usage threshold events are not reliable on the root memcg
anyway since its usage is ill-defined.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c3b6aae78901..ff39f78f962e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3669,7 +3669,21 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 	unsigned long val;
 
 	if (mem_cgroup_is_root(memcg)) {
-		mem_cgroup_flush_stats();
+		/*
+		 * We can reach here from irq context through:
+		 * uncharge_batch()
+		 * |--memcg_check_events()
+		 *    |--mem_cgroup_threshold()
+		 *       |--__mem_cgroup_threshold()
+		 *          |--mem_cgroup_usage
+		 *
+		 * rstat flushing is an expensive operation that should not be
+		 * done from irq context; use stale stats in this case.
+		 * Arguably, usage threshold events are not reliable on the root
+		 * memcg anyway since its usage is ill-defined.
+		 */
+		if (in_task())
+			mem_cgroup_flush_stats();
 		val = memcg_page_state(memcg, NR_FILE_PAGES) +
 			memcg_page_state(memcg, NR_ANON_MAPPED);
 		if (swap)
-- 
2.40.0.348.gf938b09366-goog



* [PATCH v3 4/8] memcg: replace stats_flush_lock with an atomic
  2023-03-30 19:17 [PATCH v3 0/8] memcg: avoid flushing stats atomically where possible Yosry Ahmed
                   ` (2 preceding siblings ...)
  2023-03-30 19:17 ` [PATCH v3 3/8] memcg: do not flush stats in irq context Yosry Ahmed
@ 2023-03-30 19:17 ` Yosry Ahmed
  2023-04-04 16:52   ` Michal Koutný
  2023-03-30 19:17 ` [PATCH v3 5/8] memcg: sleep during flushing stats in safe contexts Yosry Ahmed
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Yosry Ahmed @ 2023-03-30 19:17 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed, Michal Hocko

As Johannes notes in [1], stats_flush_lock is currently used to:
(a) Protect updates to stats_flush_threshold.
(b) Protect updates to flush_next_time.
(c) Serialize calls to cgroup_rstat_flush() based on those ratelimits.

However:

1. stats_flush_threshold is already an atomic.

2. flush_next_time is not atomic. The writer is locked, but the reader
   is lockless. If the reader races with a flush, you could see this:

        Flusher                         Reader

                                        if (time_after(jiffies, flush_next_time))
        spin_trylock()
        flush_next_time = now + delay
        flush()
        spin_unlock()
                                        spin_trylock()
                                        flush_next_time = now + delay
                                        flush()
                                        spin_unlock()

   which means we can already get flushes at a higher frequency than
   FLUSH_TIME during races. But that isn't really a problem.

   The reader could also see garbled partial updates if the compiler
   decides to split the write, so it needs at least READ_ONCE and
   WRITE_ONCE protection.

3. Serializing cgroup_rstat_flush() calls against the ratelimit
   factors is currently broken because of the race in 2. But the race
   is actually harmless; all we might get is the occasional earlier
   flush. If there is no delta, the flush won't do much. And if there
   is, the flush is justified.

So the lock can be removed altogether. However, the lock also served
the purpose of preventing a thundering herd problem for concurrent
flushers, see [2]. Use an atomic instead to serve that purpose by
unifying concurrent flushers.

[1] https://lore.kernel.org/lkml/20230323172732.GE739026@cmpxchg.org/
[2] https://lore.kernel.org/lkml/20210716212137.1391164-2-shakeelb@google.com/

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ff39f78f962e..65750f8b8259 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -585,8 +585,8 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
  */
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static DEFINE_SPINLOCK(stats_flush_lock);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
+static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_next_time;
 
@@ -636,15 +636,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 
 static void __mem_cgroup_flush_stats(void)
 {
-	unsigned long flag;
-
-	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
+	/*
+	 * We always flush the entire tree, so concurrent flushers can just
+	 * skip. This avoids a thundering herd problem on the rstat global lock
+	 * from memcg flushers (e.g. reclaim, refault, etc).
+	 */
+	if (atomic_read(&stats_flush_ongoing) ||
+	    atomic_xchg(&stats_flush_ongoing, 1))
 		return;
 
-	flush_next_time = jiffies_64 + 2*FLUSH_TIME;
+	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
 	cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
 	atomic_set(&stats_flush_threshold, 0);
-	spin_unlock_irqrestore(&stats_flush_lock, flag);
+	atomic_set(&stats_flush_ongoing, 0);
 }
 
 void mem_cgroup_flush_stats(void)
@@ -655,7 +659,7 @@ void mem_cgroup_flush_stats(void)
 
 void mem_cgroup_flush_stats_ratelimited(void)
 {
-	if (time_after64(jiffies_64, flush_next_time))
+	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
 		mem_cgroup_flush_stats();
 }
 
-- 
2.40.0.348.gf938b09366-goog



* [PATCH v3 5/8] memcg: sleep during flushing stats in safe contexts
  2023-03-30 19:17 [PATCH v3 0/8] memcg: avoid flushing stats atomically where possible Yosry Ahmed
                   ` (3 preceding siblings ...)
  2023-03-30 19:17 ` [PATCH v3 4/8] memcg: replace stats_flush_lock with an atomic Yosry Ahmed
@ 2023-03-30 19:17 ` Yosry Ahmed
  2023-03-30 19:30   ` Johannes Weiner
  2023-04-04 15:09   ` Johannes Weiner
  2023-03-30 19:17 ` [PATCH v3 6/8] workingset: memcg: sleep when flushing stats in workingset_refault() Yosry Ahmed
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-03-30 19:17 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed, Michal Hocko

Currently, all contexts that flush memcg stats do so with sleeping not
allowed. Some of these contexts are perfectly safe to sleep in, such as
reading cgroup files from userspace or the background periodic flusher.
Flushing is an expensive operation that scales with the number of cpus
and the number of cgroups in the system, so avoid doing it atomically
where possible.

Refactor the code to make mem_cgroup_flush_stats() non-atomic (aka
sleepable), and provide a separate atomic version. The atomic version is
used in reclaim, refault, writeback, and in mem_cgroup_usage(). All
other code paths are left to use the non-atomic version. This includes
callbacks for userspace reads and the periodic flusher.

Since refault is the only caller of mem_cgroup_flush_stats_ratelimited(),
change it to mem_cgroup_flush_stats_atomic_ratelimited(). Reclaim and
refault code paths are modified to do non-atomic flushing in separate
later patches -- so it will eventually be changed back to
mem_cgroup_flush_stats_ratelimited().

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/memcontrol.h |  9 ++++++--
 mm/memcontrol.c            | 45 ++++++++++++++++++++++++++++++--------
 mm/vmscan.c                |  2 +-
 mm/workingset.c            |  2 +-
 4 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ac3f3b3a45e2..b424ba3ebd09 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1037,7 +1037,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 }
 
 void mem_cgroup_flush_stats(void);
-void mem_cgroup_flush_stats_ratelimited(void);
+void mem_cgroup_flush_stats_atomic(void);
+void mem_cgroup_flush_stats_atomic_ratelimited(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
@@ -1535,7 +1536,11 @@ static inline void mem_cgroup_flush_stats(void)
 {
 }
 
-static inline void mem_cgroup_flush_stats_ratelimited(void)
+static inline void mem_cgroup_flush_stats_atomic(void)
+{
+}
+
+static inline void mem_cgroup_flush_stats_atomic_ratelimited(void)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 65750f8b8259..a2ce3aa10d94 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -634,7 +634,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
-static void __mem_cgroup_flush_stats(void)
+static void do_flush_stats(bool atomic)
 {
 	/*
 	 * We always flush the entire tree, so concurrent flushers can just
@@ -646,26 +646,46 @@ static void __mem_cgroup_flush_stats(void)
 		return;
 
 	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
-	cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
+
+	if (atomic)
+		cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
+	else
+		cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
+
 	atomic_set(&stats_flush_threshold, 0);
 	atomic_set(&stats_flush_ongoing, 0);
 }
 
+static bool should_flush_stats(void)
+{
+	return atomic_read(&stats_flush_threshold) > num_online_cpus();
+}
+
 void mem_cgroup_flush_stats(void)
 {
-	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
-		__mem_cgroup_flush_stats();
+	if (should_flush_stats())
+		do_flush_stats(false);
 }
 
-void mem_cgroup_flush_stats_ratelimited(void)
+void mem_cgroup_flush_stats_atomic(void)
+{
+	if (should_flush_stats())
+		do_flush_stats(true);
+}
+
+void mem_cgroup_flush_stats_atomic_ratelimited(void)
 {
 	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
-		mem_cgroup_flush_stats();
+		mem_cgroup_flush_stats_atomic();
 }
 
 static void flush_memcg_stats_dwork(struct work_struct *w)
 {
-	__mem_cgroup_flush_stats();
+	/*
+	 * Always flush here so that flushing in latency-sensitive paths is
+	 * as cheap as possible.
+	 */
+	do_flush_stats(false);
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
@@ -3685,9 +3705,12 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 		 * done from irq context; use stale stats in this case.
 		 * Arguably, usage threshold events are not reliable on the root
 		 * memcg anyway since its usage is ill-defined.
+		 *
+		 * Additionally, other call paths through memcg_check_events()
+		 * disable irqs, so make sure we are flushing stats atomically.
 		 */
 		if (in_task())
-			mem_cgroup_flush_stats();
+			mem_cgroup_flush_stats_atomic();
 		val = memcg_page_state(memcg, NR_FILE_PAGES) +
 			memcg_page_state(memcg, NR_ANON_MAPPED);
 		if (swap)
@@ -4610,7 +4633,11 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_flush_stats();
+	/*
+	 * wb_writeback() takes a spinlock and calls
+	 * wb_over_bg_thresh()->mem_cgroup_wb_stats(). Do not sleep.
+	 */
+	mem_cgroup_flush_stats_atomic();
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b24b8..a9511ccb936f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	 * Flush the memory cgroup stats, so that we read accurate per-memcg
 	 * lruvec stats for heuristics.
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_atomic();
 
 	/*
 	 * Determine the scan balance between anon and file LRUs.
diff --git a/mm/workingset.c b/mm/workingset.c
index af862c6738c3..dab0c362b9e3 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -462,7 +462,7 @@ void workingset_refault(struct folio *folio, void *shadow)
 
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
 
-	mem_cgroup_flush_stats_ratelimited();
+	mem_cgroup_flush_stats_atomic_ratelimited();
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't activate pages that couldn't stay resident even if
-- 
2.40.0.348.gf938b09366-goog



* [PATCH v3 6/8] workingset: memcg: sleep when flushing stats in workingset_refault()
  2023-03-30 19:17 [PATCH v3 0/8] memcg: avoid flushing stats atomically where possible Yosry Ahmed
                   ` (4 preceding siblings ...)
  2023-03-30 19:17 ` [PATCH v3 5/8] memcg: sleep during flushing stats in safe contexts Yosry Ahmed
@ 2023-03-30 19:17 ` Yosry Ahmed
  2023-04-04 16:53   ` Michal Koutný
  2023-03-30 19:18 ` [PATCH v3 7/8] vmscan: memcg: sleep when flushing stats during reclaim Yosry Ahmed
  2023-03-30 19:18 ` [PATCH v3 8/8] memcg: do not modify rstat tree for zero updates Yosry Ahmed
  7 siblings, 1 reply; 20+ messages in thread
From: Yosry Ahmed @ 2023-03-30 19:17 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed, Michal Hocko

In workingset_refault(), we call
mem_cgroup_flush_stats_atomic_ratelimited() to read accurate stats
within an RCU read section and with sleeping disallowed. Move the
call above the RCU read section to make it non-atomic.

Flushing is an expensive operation that scales with the number of cpus
and the number of cgroups in the system, so avoid doing it atomically
where possible.

Since workingset_refault() is the only caller of
mem_cgroup_flush_stats_atomic_ratelimited(), just make it non-atomic,
and rename it to mem_cgroup_flush_stats_ratelimited().

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/memcontrol.h | 4 ++--
 mm/memcontrol.c            | 4 ++--
 mm/workingset.c            | 5 +++--
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b424ba3ebd09..a4bc3910a2eb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1038,7 +1038,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 
 void mem_cgroup_flush_stats(void);
 void mem_cgroup_flush_stats_atomic(void);
-void mem_cgroup_flush_stats_atomic_ratelimited(void);
+void mem_cgroup_flush_stats_ratelimited(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
@@ -1540,7 +1540,7 @@ static inline void mem_cgroup_flush_stats_atomic(void)
 {
 }
 
-static inline void mem_cgroup_flush_stats_atomic_ratelimited(void)
+static inline void mem_cgroup_flush_stats_ratelimited(void)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a2ce3aa10d94..361c0bbf7283 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -673,10 +673,10 @@ void mem_cgroup_flush_stats_atomic(void)
 		do_flush_stats(true);
 }
 
-void mem_cgroup_flush_stats_atomic_ratelimited(void)
+void mem_cgroup_flush_stats_ratelimited(void)
 {
 	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
-		mem_cgroup_flush_stats_atomic();
+		mem_cgroup_flush_stats();
 }
 
 static void flush_memcg_stats_dwork(struct work_struct *w)
diff --git a/mm/workingset.c b/mm/workingset.c
index dab0c362b9e3..3025beee9b34 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -406,6 +406,9 @@ void workingset_refault(struct folio *folio, void *shadow)
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
 	eviction <<= bucket_order;
 
+	/* Flush stats (and potentially sleep) before holding RCU read lock */
+	mem_cgroup_flush_stats_ratelimited();
+
 	rcu_read_lock();
 	/*
 	 * Look up the memcg associated with the stored ID. It might
@@ -461,8 +464,6 @@ void workingset_refault(struct folio *folio, void *shadow)
 	lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
 	mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
-
-	mem_cgroup_flush_stats_atomic_ratelimited();
 	/*
 	 * Compare the distance to the existing workingset size. We
 	 * don't activate pages that couldn't stay resident even if
-- 
2.40.0.348.gf938b09366-goog



* [PATCH v3 7/8] vmscan: memcg: sleep when flushing stats during reclaim
  2023-03-30 19:17 [PATCH v3 0/8] memcg: avoid flushing stats atomically where possible Yosry Ahmed
                   ` (5 preceding siblings ...)
  2023-03-30 19:17 ` [PATCH v3 6/8] workingset: memcg: sleep when flushing stats in workingset_refault() Yosry Ahmed
@ 2023-03-30 19:18 ` Yosry Ahmed
  2023-03-30 19:18 ` [PATCH v3 8/8] memcg: do not modify rstat tree for zero updates Yosry Ahmed
  7 siblings, 0 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-03-30 19:18 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed, Michal Hocko

Memory reclaim is a sleepable context. Flushing is an expensive
operation that scales with the number of cpus and the number of cgroups
in the system, so avoid doing it atomically unnecessarily.
This can slow down reclaim if flushing the stats takes too long, but
there are already multiple cond_resched() calls in the reclaim code.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9511ccb936f..9c1c5e8b24b8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	 * Flush the memory cgroup stats, so that we read accurate per-memcg
 	 * lruvec stats for heuristics.
 	 */
-	mem_cgroup_flush_stats_atomic();
+	mem_cgroup_flush_stats();
 
 	/*
 	 * Determine the scan balance between anon and file LRUs.
-- 
2.40.0.348.gf938b09366-goog



* [PATCH v3 8/8] memcg: do not modify rstat tree for zero updates
  2023-03-30 19:17 [PATCH v3 0/8] memcg: avoid flushing stats atomically where possible Yosry Ahmed
                   ` (6 preceding siblings ...)
  2023-03-30 19:18 ` [PATCH v3 7/8] vmscan: memcg: sleep when flushing stats during reclaim Yosry Ahmed
@ 2023-03-30 19:18 ` Yosry Ahmed
  2023-04-04 16:53   ` Michal Koutný
  7 siblings, 1 reply; 20+ messages in thread
From: Yosry Ahmed @ 2023-03-30 19:18 UTC (permalink / raw)
  To: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Michal Koutný
  Cc: Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Yosry Ahmed, Michal Hocko

In some situations, we may end up calling memcg_rstat_updated() with a
value of 0, which means the stat was not actually updated. An example is
if we fail to reclaim any pages in shrink_folio_list().

Do not add the cgroup to the rstat updated tree in this case, to avoid
unnecessarily flushing it.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 361c0bbf7283..a63ee2efa780 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -618,6 +618,9 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
 	unsigned int x;
 
+	if (!val)
+		return;
+
 	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
 
 	x = __this_cpu_add_return(stats_updates, abs(val));
-- 
2.40.0.348.gf938b09366-goog



* Re: [PATCH v3 5/8] memcg: sleep during flushing stats in safe contexts
  2023-03-30 19:17 ` [PATCH v3 5/8] memcg: sleep during flushing stats in safe contexts Yosry Ahmed
@ 2023-03-30 19:30   ` Johannes Weiner
  2023-04-04 15:09   ` Johannes Weiner
  1 sibling, 0 replies; 20+ messages in thread
From: Johannes Weiner @ 2023-03-30 19:30 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Michal Hocko

On Thu, Mar 30, 2023 at 07:17:58PM +0000, Yosry Ahmed wrote:
> Currently, all contexts that flush memcg stats do so with sleeping not
> allowed. Some of these contexts are perfectly safe to sleep in, such as
> reading cgroup files from userspace or the background periodic flusher.
> Flushing is an expensive operation that scales with the number of cpus
> and the number of cgroups in the system, so avoid doing it atomically
> where possible.
> 
> Refactor the code to make mem_cgroup_flush_stats() non-atomic (aka
> sleepable), and provide a separate atomic version. The atomic version is
> used in reclaim, refault, writeback, and in mem_cgroup_usage(). All
> other code paths are left to use the non-atomic version. This includes
> callbacks for userspace reads and the periodic flusher.
> 
> Since refault is the only caller of mem_cgroup_flush_stats_ratelimited(),
> change it to mem_cgroup_flush_stats_atomic_ratelimited(). Reclaim and
> refault code paths are modified to do non-atomic flushing in separate
> later patches -- so it will eventually be changed back to
> mem_cgroup_flush_stats_ratelimited().
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Shakeel Butt <shakeelb@google.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH v3 5/8] memcg: sleep during flushing stats in safe contexts
  2023-03-30 19:17 ` [PATCH v3 5/8] memcg: sleep during flushing stats in safe contexts Yosry Ahmed
  2023-03-30 19:30   ` Johannes Weiner
@ 2023-04-04 15:09   ` Johannes Weiner
  1 sibling, 0 replies; 20+ messages in thread
From: Johannes Weiner @ 2023-04-04 15:09 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Michal Koutný,
	Vasily Averin, cgroups, linux-block, linux-kernel, linux-mm, bpf,
	Michal Hocko

On Thu, Mar 30, 2023 at 07:17:58PM +0000, Yosry Ahmed wrote:
> Currently, all contexts that flush memcg stats do so with sleeping not
> allowed. Some of these contexts are perfectly safe to sleep in, such as
> reading cgroup files from userspace or the background periodic flusher.
> Flushing is an expensive operation that scales with the number of cpus
> and the number of cgroups in the system, so avoid doing it atomically
> where possible.
> 
> Refactor the code to make mem_cgroup_flush_stats() non-atomic (aka
> sleepable), and provide a separate atomic version. The atomic version is
> used in reclaim, refault, writeback, and in mem_cgroup_usage(). All
> other code paths are left to use the non-atomic version. This includes
> callbacks for userspace reads and the periodic flusher.
> 
> Since refault is the only caller of mem_cgroup_flush_stats_ratelimited(),
> change it to mem_cgroup_flush_stats_atomic_ratelimited(). Reclaim and
> refault code paths are modified to do non-atomic flushing in separate
> later patches -- so it will eventually be changed back to
> mem_cgroup_flush_stats_ratelimited().
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Shakeel Butt <shakeelb@google.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH v3 4/8] memcg: replace stats_flush_lock with an atomic
  2023-03-30 19:17 ` [PATCH v3 4/8] memcg: replace stats_flush_lock with an atomic Yosry Ahmed
@ 2023-04-04 16:52   ` Michal Koutný
  2023-04-04 17:13     ` Shakeel Butt
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Koutný @ 2023-04-04 16:52 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Vasily Averin, cgroups, linux-block, linux-kernel,
	linux-mm, bpf, Michal Hocko


Hello.

On Thu, Mar 30, 2023 at 07:17:57PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
>  static void __mem_cgroup_flush_stats(void)
>  {
> -	unsigned long flag;
> -
> -	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
> +	/*
> +	 * We always flush the entire tree, so concurrent flushers can just
> +	 * skip. This avoids a thundering herd problem on the rstat global lock
> +	 * from memcg flushers (e.g. reclaim, refault, etc).
> +	 */
> +	if (atomic_read(&stats_flush_ongoing) ||
> +	    atomic_xchg(&stats_flush_ongoing, 1))
>  		return;

I'm curious about why this instead of

	if (atomic_xchg(&stats_flush_ongoing, 1))
		return;

Is that some microarchitectural cleverness?

Thanks,
Michal



* Re: [PATCH v3 6/8] workingset: memcg: sleep when flushing stats in workingset_refault()
  2023-03-30 19:17 ` [PATCH v3 6/8] workingset: memcg: sleep when flushing stats in workingset_refault() Yosry Ahmed
@ 2023-04-04 16:53   ` Michal Koutný
  2023-04-04 18:09     ` Yosry Ahmed
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Koutný @ 2023-04-04 16:53 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Vasily Averin, cgroups, linux-block, linux-kernel,
	linux-mm, bpf, Michal Hocko


On Thu, Mar 30, 2023 at 07:17:59PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> In workingset_refault(), we call
> mem_cgroup_flush_stats_atomic_ratelimited() to read accurate stats
> within an RCU read section and with sleeping disallowed. Move the
> call above the RCU read section to make it non-atomic.
> 
> Flushing is an expensive operation that scales with the number of cpus
> and the number of cgroups in the system, so avoid doing it atomically
> where possible.

I understand why one does not process the whole flushing load in one go
in general.
However, I remember there were reports of workingset_refault() being
sensitive to latencies (hence the ratelimited call was created).

Is there any consideration of the impact of this here?
(Or are there other cond_resched() precedents on this path? Should it
be mentioned like in the vmscan (7/8) commit?)

Thanks,
Michal



* Re: [PATCH v3 8/8] memcg: do not modify rstat tree for zero updates
  2023-03-30 19:18 ` [PATCH v3 8/8] memcg: do not modify rstat tree for zero updates Yosry Ahmed
@ 2023-04-04 16:53   ` Michal Koutný
  0 siblings, 0 replies; 20+ messages in thread
From: Michal Koutný @ 2023-04-04 16:53 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Vasily Averin, cgroups, linux-block, linux-kernel,
	linux-mm, bpf, Michal Hocko


On Thu, Mar 30, 2023 at 07:18:01PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
>  mm/memcontrol.c | 3 +++
>  1 file changed, 3 insertions(+)

Reviewed-by: Michal Koutný <mkoutny@suse.com>



* Re: [PATCH v3 4/8] memcg: replace stats_flush_lock with an atomic
  2023-04-04 16:52   ` Michal Koutný
@ 2023-04-04 17:13     ` Shakeel Butt
  2023-04-04 17:21       ` Shakeel Butt
  0 siblings, 1 reply; 20+ messages in thread
From: Shakeel Butt @ 2023-04-04 17:13 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Yosry Ahmed, Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton, Vasily Averin, cgroups, linux-block, linux-kernel,
	linux-mm, bpf, Michal Hocko

On Tue, Apr 4, 2023 at 9:53 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Thu, Mar 30, 2023 at 07:17:57PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> >  static void __mem_cgroup_flush_stats(void)
> >  {
> > -     unsigned long flag;
> > -
> > -     if (!spin_trylock_irqsave(&stats_flush_lock, flag))
> > +     /*
> > +      * We always flush the entire tree, so concurrent flushers can just
> > +      * skip. This avoids a thundering herd problem on the rstat global lock
> > +      * from memcg flushers (e.g. reclaim, refault, etc).
> > +      */
> > +     if (atomic_read(&stats_flush_ongoing) ||
> > +         atomic_xchg(&stats_flush_ongoing, 1))
> >               return;
>
> I'm curious about why this instead of
>
>         if (atomic_xchg(&stats_flush_ongoing, 1))
>                 return;
>
> Is that some microarchitectural cleverness?
>

Yes indeed it is. Basically we want to avoid unconditional cache
dirtying. This pattern is also used at other places in the kernel like
qspinlock.
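
For illustration, here is a minimal userspace sketch of the same
check-before-RMW idea, using C11 atomics rather than the kernel's
atomic_t API (the flag name and helpers are made up for the example):

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	/* Illustrative stand-in for stats_flush_ongoing. */
	static atomic_int flush_ongoing;

	static bool try_begin_flush(void)
	{
		/*
		 * Cheap read first: if a flush is already ongoing, bail out
		 * without writing, so the cache line can stay shared between
		 * CPUs. Only attempt the exchange, whose unconditional store
		 * takes the line exclusive, when the flag looked clear.
		 */
		if (atomic_load(&flush_ongoing))
			return false;
		return !atomic_exchange(&flush_ongoing, 1);
	}

	static void end_flush(void)
	{
		atomic_store(&flush_ongoing, 0);
	}

	int main(void)
	{
		if (try_begin_flush()) {
			/* ... do the flush ... */
			end_flush();
		}
		puts("done");
		return 0;
	}

The window between the read and the exchange is benign: two flushers
may both see the flag clear, but only one gets 0 back from the
exchange, so mutual exclusion still holds; the read is purely an
optimization.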


* Re: [PATCH v3 4/8] memcg: replace stats_flush_lock with an atomic
  2023-04-04 17:13     ` Shakeel Butt
@ 2023-04-04 17:21       ` Shakeel Butt
  2023-04-04 17:32         ` Michal Koutný
  0 siblings, 1 reply; 20+ messages in thread
From: Shakeel Butt @ 2023-04-04 17:21 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Yosry Ahmed, Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton, Vasily Averin, cgroups, linux-block, linux-kernel,
	linux-mm, bpf, Michal Hocko

On Tue, Apr 4, 2023 at 10:13 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Tue, Apr 4, 2023 at 9:53 AM Michal Koutný <mkoutny@suse.com> wrote:
> >
> > Hello.
> >
> > On Thu, Mar 30, 2023 at 07:17:57PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > >  static void __mem_cgroup_flush_stats(void)
> > >  {
> > > -     unsigned long flag;
> > > -
> > > -     if (!spin_trylock_irqsave(&stats_flush_lock, flag))
> > > +     /*
> > > +      * We always flush the entire tree, so concurrent flushers can just
> > > +      * skip. This avoids a thundering herd problem on the rstat global lock
> > > +      * from memcg flushers (e.g. reclaim, refault, etc).
> > > +      */
> > > +     if (atomic_read(&stats_flush_ongoing) ||
> > > +         atomic_xchg(&stats_flush_ongoing, 1))
> > >               return;
> >
> > I'm curious about why this instead of
> >
> >         if (atomic_xchg(&stats_flush_ongoing, 1))
> >                 return;
> >
> > Is that some microarchitectural cleverness?
> >
>
> Yes indeed it is. Basically we want to avoid unconditional cache
> dirtying. This pattern is also used at other places in the kernel like
> qspinlock.

Oh also take a look at
https://lore.kernel.org/all/20230404052228.15788-1-feng.tang@intel.com/


* Re: [PATCH v3 4/8] memcg: replace stats_flush_lock with an atomic
  2023-04-04 17:21       ` Shakeel Butt
@ 2023-04-04 17:32         ` Michal Koutný
  0 siblings, 0 replies; 20+ messages in thread
From: Michal Koutný @ 2023-04-04 17:32 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Yosry Ahmed, Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton, Vasily Averin, cgroups, linux-block, linux-kernel,
	linux-mm, bpf, Michal Hocko


On Tue, Apr 04, 2023 at 10:21:33AM -0700, Shakeel Butt <shakeelb@google.com> wrote:
> > Yes indeed it is. Basically we want to avoid unconditional cache
> > dirtying. This pattern is also used at other places in the kernel like
> > qspinlock.

Thanks for confirmation.

(I remembered the commit 873f64b791a2 ("mm/memcontrol.c: remove the
redundant updating of stats_flush_threshold"), but was slightly confused
about why it would be open-coded every time.)

> Oh also take a look at
> https://lore.kernel.org/all/20230404052228.15788-1-feng.tang@intel.com/

Thanks for the link.

Michal



* Re: [PATCH v3 6/8] workingset: memcg: sleep when flushing stats in workingset_refault()
  2023-04-04 16:53   ` Michal Koutný
@ 2023-04-04 18:09     ` Yosry Ahmed
  2023-04-05  8:00       ` Michal Koutný
  0 siblings, 1 reply; 20+ messages in thread
From: Yosry Ahmed @ 2023-04-04 18:09 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Vasily Averin, cgroups, linux-block, linux-kernel,
	linux-mm, bpf, Michal Hocko

On Tue, Apr 4, 2023 at 9:53 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Thu, Mar 30, 2023 at 07:17:59PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > In workingset_refault(), we call
> > mem_cgroup_flush_stats_atomic_ratelimited() to read accurate stats
> > within an RCU read section and with sleeping disallowed. Move the
> > call above the RCU read section to make it non-atomic.
> >
> > Flushing is an expensive operation that scales with the number of cpus
> > and the number of cgroups in the system, so avoid doing it atomically
> > where possible.
>
> I understand why one does not process the whole flushing load in one go
> in general.
> However, I remember there were reports of workingset_refault() being
> sensitive to latencies (hence the ratelimited call was created).
>
> Is there any consideration of the impact of this here?
> (Or are there other cond_resched() precedents on this path? Should it
> be mentioned like in the vmscan (7/8) commit?)

IIUC there are multiple places where we can sleep in this path: we can
sleep waiting for a page to be read from disk, we can sleep while
allocating the page to read into, and IIUC the allocations on the
fault path can even run into reclaim, going into the vmscan code. So
there are precedents, but I am not sure if that's enough of an argument.

I did some light performance testing and did not notice any
regressions (i.e. concurrent processes faulting memory with a lot of
cgroups/cpus), but this change is intentionally done in a separate
patch so that it is easy to revert if regressions are reported.

>
> Thanks,
> Michal


* Re: [PATCH v3 6/8] workingset: memcg: sleep when flushing stats in workingset_refault()
  2023-04-04 18:09     ` Yosry Ahmed
@ 2023-04-05  8:00       ` Michal Koutný
  2023-04-05  8:02         ` Yosry Ahmed
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Koutný @ 2023-04-05  8:00 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Vasily Averin, cgroups, linux-block, linux-kernel,
	linux-mm, bpf, Michal Hocko


On Tue, Apr 04, 2023 at 11:09:02AM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> IIUC there are multiple places where we can sleep in this path: we can
> sleep waiting for a page to be read from disk, we can sleep while
> allocating the page to read into, and IIUC the allocations on the
> fault path can even run into reclaim, going into the vmscan code. So
> there are precedents, but I am not sure if that's enough of an argument.

I expect it'd depend on the proportion of the slow/fast paths.
OK, let's see how it turns out in the wider population.

Thanks,
Michal



* Re: [PATCH v3 6/8] workingset: memcg: sleep when flushing stats in workingset_refault()
  2023-04-05  8:00       ` Michal Koutný
@ 2023-04-05  8:02         ` Yosry Ahmed
  0 siblings, 0 replies; 20+ messages in thread
From: Yosry Ahmed @ 2023-04-05  8:02 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, Josef Bacik, Jens Axboe, Zefan Li, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Vasily Averin, cgroups, linux-block, linux-kernel,
	linux-mm, bpf, Michal Hocko

On Wed, Apr 5, 2023 at 1:00 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Tue, Apr 04, 2023 at 11:09:02AM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> > IIUC there are multiple places where we can sleep in this path: we can
> > sleep waiting for a page to be read from disk, we can sleep while
> > allocating the page to read into, and IIUC the allocations on the
> > fault path can even run into reclaim, going into the vmscan code. So
> > there are precedents, but I am not sure if that's enough of an argument.
>
> I expect it'd depend on the proportion of the slow/fast paths.
> OK, let's see how it turns out in the wider population.

Agreed. It also depends on the number of memcgs and how well the
periodic flusher is keeping up. I think no amount of testing will
cover all or even most workloads.

>
> Thanks,
> Michal

