All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] Revert "memcg: cleanup racy sum avoidance code"
@ 2022-08-17 17:21 ` Shakeel Butt
  0 siblings, 0 replies; 3+ messages in thread
From: Shakeel Butt @ 2022-08-17 17:21 UTC (permalink / raw)
  To: Michal Koutný,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	David Hildenbrand, Yosry Ahmed, Greg Thelen
  Cc: Andrew Morton, cgroups, linux-mm, linux-kernel, Shakeel Butt, stable

This reverts commit 96e51ccf1af33e82f429a0d6baebba29c6448d0f.

Recently we started running the kernel with rstat infrastructure on
production traffic and begin to see negative memcg stats values.
Particularly the 'sock' stat is the one which we observed having
negative value.

$ grep "sock " /mnt/memory/job/memory.stat
sock 253952
total_sock 18446744073708724224

Re-run after couple of seconds

$ grep "sock " /mnt/memory/job/memory.stat
sock 253952
total_sock 53248

For now we are only seeing this issue on large machines (256 CPUs) and
only with 'sock' stat. I think the networking stack increase the stat on
one cpu and decrease it on another cpu much more often. So, this
negative sock is due to rstat flusher flushing the stats on the CPU that
has seen the decrement of sock but missed the CPU that has increments. A
typical race condition.

For easy stable backport, revert is the most simple solution. For long
term solution, I am thinking of two directions. First is just reduce the
race window by optimizing the rstat flusher. Second is if the reader
sees a negative stat value, force flush and restart the stat collection.
Basically retry but limited.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: stable@vger.kernel.org # 5.15
---
 include/linux/memcontrol.h | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4d31ce55b1c0..6257867fbf95 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -987,19 +987,30 @@ static inline void mod_memcg_page_state(struct page *page,
 
 static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
 {
-	return READ_ONCE(memcg->vmstats.state[idx]);
+	long x = READ_ONCE(memcg->vmstats.state[idx]);
+#ifdef CONFIG_SMP
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
 }
 
 static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
 					      enum node_stat_item idx)
 {
 	struct mem_cgroup_per_node *pn;
+	long x;
 
 	if (mem_cgroup_disabled())
 		return node_page_state(lruvec_pgdat(lruvec), idx);
 
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-	return READ_ONCE(pn->lruvec_stats.state[idx]);
+	x = READ_ONCE(pn->lruvec_stats.state[idx]);
+#ifdef CONFIG_SMP
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
 }
 
 static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
-- 
2.37.1.595.g718a3a8f04-goog


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH] Revert "memcg: cleanup racy sum avoidance code"
@ 2022-08-17 17:21 ` Shakeel Butt
  0 siblings, 0 replies; 3+ messages in thread
From: Shakeel Butt @ 2022-08-17 17:21 UTC (permalink / raw)
  To: Michal Koutný,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	David Hildenbrand, Yosry Ahmed, Greg Thelen
  Cc: Andrew Morton, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt,
	stable-u79uwXL29TY76Z2rM5mHXA

This reverts commit 96e51ccf1af33e82f429a0d6baebba29c6448d0f.

Recently we started running the kernel with rstat infrastructure on
production traffic and begin to see negative memcg stats values.
Particularly the 'sock' stat is the one which we observed having
negative value.

$ grep "sock " /mnt/memory/job/memory.stat
sock 253952
total_sock 18446744073708724224

Re-run after couple of seconds

$ grep "sock " /mnt/memory/job/memory.stat
sock 253952
total_sock 53248

For now we are only seeing this issue on large machines (256 CPUs) and
only with 'sock' stat. I think the networking stack increase the stat on
one cpu and decrease it on another cpu much more often. So, this
negative sock is due to rstat flusher flushing the stats on the CPU that
has seen the decrement of sock but missed the CPU that has increments. A
typical race condition.

For easy stable backport, revert is the most simple solution. For long
term solution, I am thinking of two directions. First is just reduce the
race window by optimizing the rstat flusher. Second is if the reader
sees a negative stat value, force flush and restart the stat collection.
Basically retry but limited.

Signed-off-by: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org # 5.15
---
 include/linux/memcontrol.h | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4d31ce55b1c0..6257867fbf95 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -987,19 +987,30 @@ static inline void mod_memcg_page_state(struct page *page,
 
 static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
 {
-	return READ_ONCE(memcg->vmstats.state[idx]);
+	long x = READ_ONCE(memcg->vmstats.state[idx]);
+#ifdef CONFIG_SMP
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
 }
 
 static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
 					      enum node_stat_item idx)
 {
 	struct mem_cgroup_per_node *pn;
+	long x;
 
 	if (mem_cgroup_disabled())
 		return node_page_state(lruvec_pgdat(lruvec), idx);
 
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-	return READ_ONCE(pn->lruvec_stats.state[idx]);
+	x = READ_ONCE(pn->lruvec_stats.state[idx]);
+#ifdef CONFIG_SMP
+	if (x < 0)
+		x = 0;
+#endif
+	return x;
 }
 
 static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
-- 
2.37.1.595.g718a3a8f04-goog


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] Revert "memcg: cleanup racy sum avoidance code"
  2022-08-17 17:21 ` Shakeel Butt
  (?)
@ 2022-08-18 10:04 ` Michal Koutný
  -1 siblings, 0 replies; 3+ messages in thread
From: Michal Koutný @ 2022-08-18 10:04 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	David Hildenbrand, Yosry Ahmed, Greg Thelen, Andrew Morton,
	cgroups, linux-mm, linux-kernel, stable

On Wed, Aug 17, 2022 at 05:21:39PM +0000, Shakeel Butt <shakeelb@google.com> wrote:
> $ grep "sock " /mnt/memory/job/memory.stat
> sock 253952
> total_sock 18446744073708724224
> 
> Re-run after couple of seconds
> 
> $ grep "sock " /mnt/memory/job/memory.stat
> sock 253952
> total_sock 53248
> 
> For now we are only seeing this issue on large machines (256 CPUs) and
> only with 'sock' stat. I think the networking stack increase the stat on
> one cpu and decrease it on another cpu much more often. So, this
> negative sock is due to rstat flusher flushing the stats on the CPU that
> has seen the decrement of sock but missed the CPU that has increments. A
> typical race condition.

This theory adds up :-) (Provided the numbers.)

> For easy stable backport, revert is the most simple solution.

Sounds reasonable.

> For long term solution, I am thinking of two directions. First is just
> reduce the race window by optimizing the rstat flusher. Second is if
> the reader sees a negative stat value, force flush and restart the
> stat collection.  Basically retry but limited.

Or just stick with the revert since it already reduces the observed
error by rounding to zero in simple way.

(Or if the imprecision was worth extra storage, use two-stage flushing
to accumulate (cpus x cgroups) and assign in two steps.)

Thanks,
Michal

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2022-08-18 10:04 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-17 17:21 [PATCH] Revert "memcg: cleanup racy sum avoidance code" Shakeel Butt
2022-08-17 17:21 ` Shakeel Butt
2022-08-18 10:04 ` Michal Koutný

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.