* [PATCH v2 0/2] mm: memcg: fix tracking of pending stats updates values
@ 2023-09-22 17:57 Yosry Ahmed
  2023-09-22 17:57   ` Yosry Ahmed
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Yosry Ahmed @ 2023-09-22 17:57 UTC (permalink / raw)
  To: Andrew Morton, Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Michal Koutný,
	linux-mm, cgroups, linux-kernel, Yosry Ahmed

While working on adjacent code [1], I realized that the values passed
into memcg_rstat_updated() to keep track of the magnitude of pending
updates are inconsistent. They are mostly in pages, but sometimes they
can be in bytes or KBs. Fix that.

Patch 1 reworks memcg_page_state_unit() so that we can reuse it in patch
2 to check and normalize the units of state updates.
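
For context, the flushing heuristic that these values feed looks roughly
like the sketch below (a simplified sketch of memcg_rstat_updated() with
details elided; see mm/memcontrol.c for the exact code): each update adds
its magnitude to a per-cpu counter, and a flush is only considered worth
the work once enough updates have accumulated:

	static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
	{
		unsigned int x;

		if (!val)	/* zero-sized updates are ignored */
			return;

		cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());

		/* accumulate the magnitude of the update on this cpu */
		x = __this_cpu_add_return(stats_updates, abs(val));
		if (x > MEMCG_CHARGE_BATCH) {
			/* enough pending updates to make a future flush worth it */
			atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
			__this_cpu_write(stats_updates, 0);
		}
	}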

[1]https://lore.kernel.org/lkml/20230921081057.3440885-1-yosryahmed@google.com/

v1 -> v2:
- Rebased on top of mm-unstable.

Yosry Ahmed (2):
  mm: memcg: refactor page state unit helpers
  mm: memcg: normalize the value passed into memcg_rstat_updated()

 mm/memcontrol.c | 64 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 51 insertions(+), 13 deletions(-)

-- 
2.42.0.515.g380fc7ccd1-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 1/2] mm: memcg: refactor page state unit helpers
@ 2023-09-22 17:57   ` Yosry Ahmed
  0 siblings, 0 replies; 28+ messages in thread
From: Yosry Ahmed @ 2023-09-22 17:57 UTC (permalink / raw)
  To: Andrew Morton, Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Michal Koutný,
	linux-mm, cgroups, linux-kernel, Yosry Ahmed

memcg_page_state_unit() is currently used to identify the unit of a
memcg state item so that all stats in memory.stat are in bytes. However,
it lies about the units of WORKINGSET_* stats. These stats actually
represent pages, but we present them to userspace as a scalar number of
events. In retrospect, maybe those stats should have been memcg "events"
rather than memcg "state".

In preparation for using memcg_page_state_unit() for other purposes that
need to know the truthful units of different stat items, break it down
into two helpers:
- memcg_page_state_unit() returns the actual unit of the item.
- memcg_page_state_output_unit() returns the unit used for output.

Use the latter instead of the former in memcg_page_state_output() and
lruvec_page_state_output(). While we are at it, let's show cgroup v1
some love and add memcg_page_state_local_output() for consistency.

No functional change intended.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 44 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 33 insertions(+), 11 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 927c64d3cbcb..308cc7353ef0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1535,7 +1535,7 @@ static const struct memory_stat memory_stats[] = {
 	{ "workingset_nodereclaim",	WORKINGSET_NODERECLAIM		},
 };
 
-/* Translate stat items to the correct unit for memory.stat output */
+/* The actual unit of the state item, not the same as the output unit */
 static int memcg_page_state_unit(int item)
 {
 	switch (item) {
@@ -1543,6 +1543,22 @@ static int memcg_page_state_unit(int item)
 	case MEMCG_ZSWAP_B:
 	case NR_SLAB_RECLAIMABLE_B:
 	case NR_SLAB_UNRECLAIMABLE_B:
+		return 1;
+	case NR_KERNEL_STACK_KB:
+		return SZ_1K;
+	default:
+		return PAGE_SIZE;
+	}
+}
+
+/* Translate stat items to the correct unit for memory.stat output */
+static int memcg_page_state_output_unit(int item)
+{
+	/*
+	 * Workingset state is actually in pages, but we export it to userspace
+	 * as a scalar count of events, so special case it here.
+	 */
+	switch (item) {
 	case WORKINGSET_REFAULT_ANON:
 	case WORKINGSET_REFAULT_FILE:
 	case WORKINGSET_ACTIVATE_ANON:
@@ -1551,17 +1567,23 @@ static int memcg_page_state_unit(int item)
 	case WORKINGSET_RESTORE_FILE:
 	case WORKINGSET_NODERECLAIM:
 		return 1;
-	case NR_KERNEL_STACK_KB:
-		return SZ_1K;
 	default:
-		return PAGE_SIZE;
+		return memcg_page_state_unit(item);
 	}
 }
 
 static inline unsigned long memcg_page_state_output(struct mem_cgroup *memcg,
 						    int item)
 {
-	return memcg_page_state(memcg, item) * memcg_page_state_unit(item);
+	return memcg_page_state(memcg, item) *
+		memcg_page_state_output_unit(item);
+}
+
+static inline unsigned long memcg_page_state_local_output(
+		struct mem_cgroup *memcg, int item)
+{
+	return memcg_page_state_local(memcg, item) *
+		memcg_page_state_output_unit(item);
 }
 
 static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
@@ -4106,9 +4128,8 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
 
-		nr = memcg_page_state_local(memcg, memcg1_stats[i]);
-		seq_buf_printf(s, "%s %lu\n", memcg1_stat_names[i],
-			   nr * memcg_page_state_unit(memcg1_stats[i]));
+		nr = memcg_page_state_local_output(memcg, memcg1_stats[i]);
+		seq_buf_printf(s, "%s %lu\n", memcg1_stat_names[i], nr);
 	}
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
@@ -4134,9 +4155,9 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
 
-		nr = memcg_page_state(memcg, memcg1_stats[i]);
+		nr = memcg_page_state_output(memcg, memcg1_stats[i]);
 		seq_buf_printf(s, "total_%s %llu\n", memcg1_stat_names[i],
-			   (u64)nr * memcg_page_state_unit(memcg1_stats[i]));
+			       (u64)nr);
 	}
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
@@ -6614,7 +6635,8 @@ static int memory_stat_show(struct seq_file *m, void *v)
 static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
 						     int item)
 {
-	return lruvec_page_state(lruvec, item) * memcg_page_state_unit(item);
+	return lruvec_page_state(lruvec, item) *
+		memcg_page_state_output_unit(item);
 }
 
 static int memory_numa_stat_show(struct seq_file *m, void *v)
-- 
2.42.0.515.g380fc7ccd1-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 2/2] mm: memcg: normalize the value passed into memcg_rstat_updated()
@ 2023-09-22 17:57   ` Yosry Ahmed
  0 siblings, 0 replies; 28+ messages in thread
From: Yosry Ahmed @ 2023-09-22 17:57 UTC (permalink / raw)
  To: Andrew Morton, Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Michal Koutný,
	linux-mm, cgroups, linux-kernel, Yosry Ahmed

memcg_rstat_updated() uses the value of the state update to keep track
of the magnitude of pending updates, so that we only do a stats flush
when it's worth the work. Most values passed into memcg_rstat_updated()
are in pages; however, a few of them are actually in bytes or KBs.

To put this into perspective, a 512 byte slab allocation today would
look the same as allocating 512 pages. This may result in premature
flushes, which means unnecessary work and latency.

Normalize all the state values passed into memcg_rstat_updated() to
pages. Round up non-zero sub-page updates to 1 page, because
memcg_rstat_updated() ignores 0 page updates.

Fixes: 5b3be698a872 ("memcg: better bounds on the memcg stats updates")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 308cc7353ef0..d1a322a75172 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -763,6 +763,22 @@ unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
 	return x;
 }
 
+static int memcg_page_state_unit(int item);
+
+/*
+ * Normalize the value passed into memcg_rstat_updated() to be in pages. Round
+ * up non-zero sub-page updates to 1 page as zero page updates are ignored.
+ */
+static int memcg_state_val_in_pages(int idx, int val)
+{
+	int unit = memcg_page_state_unit(idx);
+
+	if (!val || unit == PAGE_SIZE)
+		return val;
+	else
+		return max(val * unit / PAGE_SIZE, 1UL);
+}
+
 /**
  * __mod_memcg_state - update cgroup memory statistics
  * @memcg: the memory cgroup
@@ -775,7 +791,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 		return;
 
 	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
-	memcg_rstat_updated(memcg, val);
+	memcg_rstat_updated(memcg, memcg_state_val_in_pages(idx, val));
 }
 
 /* idx can be of type enum memcg_stat_item or node_stat_item. */
@@ -826,7 +842,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	/* Update lruvec */
 	__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
 
-	memcg_rstat_updated(memcg, val);
+	memcg_rstat_updated(memcg, memcg_state_val_in_pages(idx, val));
 	memcg_stats_unlock();
 }
 
-- 
2.42.0.515.g380fc7ccd1-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/2] mm: memcg: fix tracking of pending stats updates values
@ 2023-09-25 13:50   ` Michal Hocko
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Hocko @ 2023-09-25 13:50 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Shakeel Butt, Johannes Weiner, Roman Gushchin,
	Muchun Song, Michal Koutný,
	linux-mm, cgroups, linux-kernel

On Fri 22-09-23 17:57:38, Yosry Ahmed wrote:
> While working on adjacent code [1], I realized that the values passed
> into memcg_rstat_updated() to keep track of the magnitude of pending
> updates are inconsistent. They are mostly in pages, but sometimes they
> can be in bytes or KBs. Fix that.

What kind of practical difference does this change make? Is it worth
additional code?
 
> Patch 1 reworks memcg_page_state_unit() so that we can reuse it in patch
> 2 to check and normalize the units of state updates.
> 
> [1]https://lore.kernel.org/lkml/20230921081057.3440885-1-yosryahmed@google.com/
> 
> v1 -> v2:
> - Rebased on top of mm-unstable.
> 
> Yosry Ahmed (2):
>   mm: memcg: refactor page state unit helpers
>   mm: memcg: normalize the value passed into memcg_rstat_updated()
> 
>  mm/memcontrol.c | 64 +++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 51 insertions(+), 13 deletions(-)
> 
> -- 
> 2.42.0.515.g380fc7ccd1-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/2] mm: memcg: fix tracking of pending stats updates values
@ 2023-09-25 17:11     ` Yosry Ahmed
  0 siblings, 0 replies; 28+ messages in thread
From: Yosry Ahmed @ 2023-09-25 17:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Shakeel Butt, Johannes Weiner, Roman Gushchin,
	Muchun Song, Michal Koutný,
	linux-mm, cgroups, linux-kernel

On Mon, Sep 25, 2023 at 6:50 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 22-09-23 17:57:38, Yosry Ahmed wrote:
> > While working on adjacent code [1], I realized that the values passed
> > into memcg_rstat_updated() to keep track of the magnitude of pending
> > updates are inconsistent. They are mostly in pages, but sometimes they
> > can be in bytes or KBs. Fix that.
>
> What kind of practical difference does this change make? Is it worth
> additional code?

As explained in patch 2's commit message, the value passed into
memcg_rstat_updated() is used for the "flush only if worth it"
heuristic. As we have discussed in different threads in the past few
weeks, unnecessary flushes can cause increased global lock contention
and/or latency.

Byte-sized paths (percpu, slab, zswap, ..) feed bytes into the
heuristic, but those are interpreted as pages, which means we will
flush earlier than we should. This was noticed by code inspection. How
much does this matter in practice? I would say it depends on the
workload: how many percpu/slab allocations are being made vs. how many
flushes are requested.

On a system with 100 cpus, 25M of stat updates are needed for a flush
usually, but ~6K of slab/percpu updates will also (mistakenly) cause a
flush.
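
To spell out the arithmetic behind those numbers (assuming 4K pages and
the MEMCG_CHARGE_BATCH-based threshold of roughly 64 updates per cpu):

	64 updates/cpu * 100 cpus = ~6400 updates before a flush
	6400 page-sized updates   = 6400 * 4KB ~= 25MB of stat updates
	6400 byte-sized updates   = ~6KB of slab/percpu updates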

>
> > Patch 1 reworks memcg_page_state_unit() so that we can reuse it in patch
> > 2 to check and normalize the units of state updates.
> >
> > [1]https://lore.kernel.org/lkml/20230921081057.3440885-1-yosryahmed@google.com/
> >
> > v1 -> v2:
> > - Rebased on top of mm-unstable.
> >
> > Yosry Ahmed (2):
> >   mm: memcg: refactor page state unit helpers
> >   mm: memcg: normalize the value passed into memcg_rstat_updated()
> >
> >  mm/memcontrol.c | 64 +++++++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 51 insertions(+), 13 deletions(-)
> >
> > --
> > 2.42.0.515.g380fc7ccd1-goog
>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/2] mm: memcg: fix tracking of pending stats updates values
  2023-09-25 17:11     ` Yosry Ahmed
  (?)
@ 2023-10-03  7:57     ` Michal Hocko
  2023-10-03  8:03       ` Yosry Ahmed
  -1 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2023-10-03  7:57 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Shakeel Butt, Johannes Weiner, Roman Gushchin,
	Muchun Song, Michal Koutný,
	linux-mm, cgroups, linux-kernel

On Mon 25-09-23 10:11:05, Yosry Ahmed wrote:
> On Mon, Sep 25, 2023 at 6:50 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 22-09-23 17:57:38, Yosry Ahmed wrote:
> > > While working on adjacent code [1], I realized that the values passed
> > > into memcg_rstat_updated() to keep track of the magnitude of pending
> > > updates are inconsistent. They are mostly in pages, but sometimes they
> > > can be in bytes or KBs. Fix that.
> >
> > What kind of practical difference does this change make? Is it worth
> > additional code?
> 
> As explained in patch 2's commit message, the value passed into
> memcg_rstat_updated() is used for the "flush only if not worth it"
> heuristic. As we have discussed in different threads in the past few
> weeks, unnecessary flushes can cause increased global lock contention
> and/or latency.
> 
> Byte-sized paths (percpu, slab, zswap, ..) feed bytes into the
> heuristic, but those are interpreted as pages, which means we will
> flush earlier than we should. This was noticed by code inspection. How
> much does this matter in practice? I would say it depends on the
> workload: how many percpu/slab allocations are being made vs. how many
> flushes are requested.
> 
> On a system with 100 cpus, 25M of stat updates are needed for a flush
> usually, but ~6K of slab/percpu updates will also (mistakenly) cause a
> flush.

This surely depends on workload and that is understandable. But it would
be really nice to provide some numbers for typical workloads which
exercise slab heavily. 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/2] mm: memcg: fix tracking of pending stats updates values
  2023-10-03  7:57     ` Michal Hocko
@ 2023-10-03  8:03       ` Yosry Ahmed
  2023-10-03  8:09         ` Michal Hocko
  0 siblings, 1 reply; 28+ messages in thread
From: Yosry Ahmed @ 2023-10-03  8:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Shakeel Butt, Johannes Weiner, Roman Gushchin,
	Muchun Song, Michal Koutný,
	linux-mm, cgroups, linux-kernel

On Tue, Oct 3, 2023 at 12:57 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 25-09-23 10:11:05, Yosry Ahmed wrote:
> > On Mon, Sep 25, 2023 at 6:50 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 22-09-23 17:57:38, Yosry Ahmed wrote:
> > > > While working on adjacent code [1], I realized that the values passed
> > > > into memcg_rstat_updated() to keep track of the magnitude of pending
> > > > updates are inconsistent. They are mostly in pages, but sometimes they
> > > > can be in bytes or KBs. Fix that.
> > >
> > > What kind of practical difference does this change make? Is it worth
> > > additional code?
> >
> > As explained in patch 2's commit message, the value passed into
> > memcg_rstat_updated() is used for the "flush only if worth it"
> > heuristic. As we have discussed in different threads in the past few
> > weeks, unnecessary flushes can cause increased global lock contention
> > and/or latency.
> >
> > Byte-sized paths (percpu, slab, zswap, ..) feed bytes into the
> > heuristic, but those are interpreted as pages, which means we will
> > flush earlier than we should. This was noticed by code inspection. How
> > much does this matter in practice? I would say it depends on the
> > workload: how many percpu/slab allocations are being made vs. how many
> > flushes are requested.
> >
> > On a system with 100 cpus, 25M of stat updates are needed for a flush
> > usually, but ~6K of slab/percpu updates will also (mistakenly) cause a
> > flush.
>
> This surely depends on workload and that is understandable. But it would
> be really nice to provide some numbers for typical workloads which
> exercise slab heavily.

If you have a workload in mind I can run it and see how many flushes
we get with/without this patch. The first thing that pops into my head
is creating a bunch of empty files but I don't know if that's the best
thing to get numbers from.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/2] mm: memcg: fix tracking of pending stats updates values
  2023-10-03  8:03       ` Yosry Ahmed
@ 2023-10-03  8:09         ` Michal Hocko
  2023-10-03  8:49           ` Yosry Ahmed
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Hocko @ 2023-10-03  8:09 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Shakeel Butt, Johannes Weiner, Roman Gushchin,
	Muchun Song, Michal Koutný,
	linux-mm, cgroups, linux-kernel

On Tue 03-10-23 01:03:53, Yosry Ahmed wrote:
> On Tue, Oct 3, 2023 at 12:57 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 25-09-23 10:11:05, Yosry Ahmed wrote:
> > > On Mon, Sep 25, 2023 at 6:50 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 22-09-23 17:57:38, Yosry Ahmed wrote:
> > > > > While working on adjacent code [1], I realized that the values passed
> > > > > into memcg_rstat_updated() to keep track of the magnitude of pending
> > > > > updates are inconsistent. They are mostly in pages, but sometimes they
> > > > > can be in bytes or KBs. Fix that.
> > > >
> > > > What kind of practical difference does this change make? Is it worth
> > > > additional code?
> > >
> > > As explained in patch 2's commit message, the value passed into
> > > memcg_rstat_updated() is used for the "flush only if worth it"
> > > heuristic. As we have discussed in different threads in the past few
> > > weeks, unnecessary flushes can cause increased global lock contention
> > > and/or latency.
> > >
> > > Byte-sized paths (percpu, slab, zswap, ..) feed bytes into the
> > > heuristic, but those are interpreted as pages, which means we will
> > > flush earlier than we should. This was noticed by code inspection. How
> > > much does this matter in practice? I would say it depends on the
> > > workload: how many percpu/slab allocations are being made vs. how many
> > > flushes are requested.
> > >
> > > On a system with 100 cpus, 25M of stat updates are needed for a flush
> > > usually, but ~6K of slab/percpu updates will also (mistakenly) cause a
> > > flush.
> >
> > This surely depends on workload and that is understandable. But it would
> > be really nice to provide some numbers for typical workloads which
> > exercise slab heavily.
> 
> If you have a workload in mind I can run it and see how many flushes
> we get with/without this patch. The first thing that pops into my head
> is creating a bunch of empty files but I don't know if that's the best
> thing to get numbers from.

Let me remind you that you are proposing a performance optimization and
such a change requires some numbers to actually show it is beneficial.
There are cases where the resulting code is clearly an improvement and
the performance benefit is just a nice side effect. I do not consider
this to be the case. The whole thing is quite convoluted even without
the better precision you are proposing. And let me be clear: I am not
opposing your patch, but I would rather see it based on more than just
hand waving.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/2] mm: memcg: fix tracking of pending stats updates values
  2023-10-03  8:09         ` Michal Hocko
@ 2023-10-03  8:49           ` Yosry Ahmed
  0 siblings, 0 replies; 28+ messages in thread
From: Yosry Ahmed @ 2023-10-03  8:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Shakeel Butt, Johannes Weiner, Roman Gushchin,
	Muchun Song, Michal Koutný,
	linux-mm, cgroups, linux-kernel

On Tue, Oct 3, 2023 at 1:09 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 03-10-23 01:03:53, Yosry Ahmed wrote:
> > On Tue, Oct 3, 2023 at 12:57 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 25-09-23 10:11:05, Yosry Ahmed wrote:
> > > > On Mon, Sep 25, 2023 at 6:50 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Fri 22-09-23 17:57:38, Yosry Ahmed wrote:
> > > > > > While working on adjacent code [1], I realized that the values passed
> > > > > > into memcg_rstat_updated() to keep track of the magnitude of pending
> > > > > > updates are inconsistent. They are mostly in pages, but sometimes they
> > > > > > can be in bytes or KBs. Fix that.
> > > > >
> > > > > What kind of practical difference does this change make? Is it worth
> > > > > additional code?
> > > >
> > > > As explained in patch 2's commit message, the value passed into
> > > > memcg_rstat_updated() is used for the "flush only if worth it"
> > > > heuristic. As we have discussed in different threads in the past few
> > > > weeks, unnecessary flushes can cause increased global lock contention
> > > > and/or latency.
> > > >
> > > > Byte-sized paths (percpu, slab, zswap, ..) feed bytes into the
> > > > heuristic, but those are interpreted as pages, which means we will
> > > > flush earlier than we should. This was noticed by code inspection. How
> > > > much does this matter in practice? I would say it depends on the
> > > > workload: how many percpu/slab allocations are being made vs. how many
> > > > flushes are requested.
> > > >
> > > > On a system with 100 cpus, 25M of stat updates are needed for a flush
> > > > usually, but ~6K of slab/percpu updates will also (mistakenly) cause a
> > > > flush.
> > >
> > > This surely depends on workload and that is understandable. But it would
> > > be really nice to provide some numbers for typical workloads which
> > > exercise slab heavily.
> >
> > If you have a workload in mind I can run it and see how many flushes
> > we get with/without this patch. The first thing that pops into my head
> > is creating a bunch of empty files but I don't know if that's the best
> > thing to get numbers from.
>
> Let me remind you that you are proposing a performance optimization and
> such a change requires some numbers to actually show it is beneficial.
> There are cases where the resulting code is clearly an improvement and
> the performance benefit is just a nice side effect. I do not consider
> this to be the case. The whole thing is quite convoluted even without
> the better precision you are proposing. And let me be clear: I am not
> opposing your patch, but I would rather see it based on more than just
> hand waving.

It is purely based on code inspection, and honestly I don't have
numbers to support it. I saw something wrong with the code and I tried
to fix it; I was working on something else when I noticed it. That
being said, I acknowledge it's not making the code any prettier :)

Feel free to suggest improvements to the code to make it more
bearable; otherwise, if you don't like it, I will just leave it, to be
honest.

Thanks for taking a look!

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 1/2] mm: memcg: refactor page state unit helpers
  2023-09-22 17:57   ` Yosry Ahmed
  (?)
@ 2023-10-03 13:03   ` Johannes Weiner
  -1 siblings, 0 replies; 28+ messages in thread
From: Johannes Weiner @ 2023-10-03 13:03 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Shakeel Butt, Michal Hocko, Roman Gushchin,
	Muchun Song, Michal Koutný,
	linux-mm, cgroups, linux-kernel

On Fri, Sep 22, 2023 at 05:57:39PM +0000, Yosry Ahmed wrote:
> memcg_page_state_unit() is currently used to identify the unit of a
> memcg state item so that all stats in memory.stat are in bytes. However,
> it lies about the units of WORKINGSET_* stats. These stats actually
> represent pages, but we present them to userspace as a scalar number of
> events. In retrospect, maybe those stats should have been memcg "events"
> rather than memcg "state".
> 
> In preparation for using memcg_page_state_unit() for other purposes that
> need to know the truthful units of different stat items, break it down
> into two helpers:
> - memcg_page_state_unit() returns the actual unit of the item.
> - memcg_page_state_output_unit() returns the unit used for output.
> 
> Use the latter instead of the former in memcg_page_state_output() and
> lruvec_page_state_output(). While we are at it, let's show cgroup v1
> some love and add memcg_page_state_local_output() for consistency.
> 
> No functional change intended.
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

That's a nice cleanup in itself.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/2] mm: memcg: normalize the value passed into memcg_rstat_updated()
  2023-09-22 17:57   ` Yosry Ahmed
  (?)
@ 2023-10-03 13:13   ` Johannes Weiner
  2023-10-03 15:53     ` Yosry Ahmed
  -1 siblings, 1 reply; 28+ messages in thread
From: Johannes Weiner @ 2023-10-03 13:13 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Shakeel Butt, Michal Hocko, Roman Gushchin,
	Muchun Song, Michal Koutný,
	linux-mm, cgroups, linux-kernel

On Fri, Sep 22, 2023 at 05:57:40PM +0000, Yosry Ahmed wrote:
> memcg_rstat_updated() uses the value of the state update to keep track
> of the magnitude of pending updates, so that we only do a stats flush
> when it's worth the work. Most values passed into memcg_rstat_updated()
> are in pages; however, a few of them are actually in bytes or KBs.
> 
> To put this into perspective, a 512 byte slab allocation today would
> look the same as allocating 512 pages. This may result in premature
> flushes, which means unnecessary work and latency.

Yikes.

I'm somewhat less concerned about the performance than I am about the
variance in flushing cost, which could be quite difficult to pinpoint.
IMO this is a correctness fix and a code cleanup, not a performance
thing.

> Normalize all the state values passed into memcg_rstat_updated() to
> pages. Round up non-zero sub-page updates to 1 page, because
> memcg_rstat_updated() ignores 0 page updates.
> 
> Fixes: 5b3be698a872 ("memcg: better bounds on the memcg stats updates")
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/2] mm: memcg: normalize the value passed into memcg_rstat_updated()
  2023-10-03 13:13   ` Johannes Weiner
@ 2023-10-03 15:53     ` Yosry Ahmed
  0 siblings, 0 replies; 28+ messages in thread
From: Yosry Ahmed @ 2023-10-03 15:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Shakeel Butt, Michal Hocko, Roman Gushchin,
	Muchun Song, Michal Koutný,
	linux-mm, cgroups, linux-kernel

On Tue, Oct 3, 2023 at 6:13 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Sep 22, 2023 at 05:57:40PM +0000, Yosry Ahmed wrote:
> > memcg_rstat_updated() uses the value of the state update to keep track
> > of the magnitude of pending updates, so that we only do a stats flush
> > when it's worth the work. Most values passed into memcg_rstat_updated()
> > are in pages; however, a few of them are actually in bytes or KBs.
> >
> > To put this into perspective, a 512 byte slab allocation today would
> > look the same as allocating 512 pages. This may result in premature
> > flushes, which means unnecessary work and latency.
>
> Yikes.
>
> I'm somewhat less concerned about the performance than I am about the
> variance in flushing cost, which could be quite difficult to pinpoint.
> IMO this is a correctness fix and a code cleanup, not a performance
> thing.

Agreed, the code right now has a subtle mistake.

>
> > Normalize all the state values passed into memcg_rstat_updated() to
> > pages. Round up non-zero sub-page updates to 1 page, because
> > memcg_rstat_updated() ignores 0 page updates.
> >
> > Fixes: 5b3be698a872 ("memcg: better bounds on the memcg stats updates")
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks for taking a look!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 1/2] mm: memcg: refactor page state unit helpers
  2023-09-22 17:57   ` Yosry Ahmed
  (?)
  (?)
@ 2023-10-03 18:11   ` Michal Koutný
  2023-10-03 19:47     ` Yosry Ahmed
  -1 siblings, 1 reply; 28+ messages in thread
From: Michal Koutný @ 2023-10-03 18:11 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Shakeel Butt, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, linux-mm, cgroups, linux-kernel

On Fri, Sep 22, 2023 at 05:57:39PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> memcg_page_state_unit() is currently used to identify the unit of a
> memcg state item so that all stats in memory.stat are in bytes. However,
> it lies about the units of WORKINGSET_* stats. These stats actually
> represent pages, but we present them to userspace as a scalar number of
> events. In retrospect, maybe those stats should have been memcg "events"
> rather than memcg "state".

Why isn't it possible to move WORKINGSET_* stats under the events now?
(Instead of using internal and external units.)

Thanks,
Michal

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/2] mm: memcg: normalize the value passed into memcg_rstat_updated()
  2023-09-22 17:57   ` Yosry Ahmed
  (?)
  (?)
@ 2023-10-03 18:22   ` Michal Koutný
  2023-10-03 19:51     ` Yosry Ahmed
  -1 siblings, 1 reply; 28+ messages in thread
From: Michal Koutný @ 2023-10-03 18:22 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Shakeel Butt, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, linux-mm, cgroups, linux-kernel

On Fri, Sep 22, 2023 at 05:57:40PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> memcg_rstat_updated() uses the value of the state update to keep track
> of the magnitude of pending updates, so that we only do a stats flush
> when it's worth the work. Most values passed into memcg_rstat_updated()
> are in pages; however, a few of them are actually in bytes or KBs.
> 
> To put this into perspective, a 512 byte slab allocation today would
> look the same as allocating 512 pages. This may result in premature
> flushes, which means unnecessary work and latency.
> 
> Normalize all the state values passed into memcg_rstat_updated() to
> pages.

I've dreamed about such normalization since error estimates were
introduced :-)

(As touched on in the previous patch) it makes me wonder whether it makes
sense to add up state and event counters (apples and oranges).

Shouldn't events, with this approach: a) have a separate counter, or b)
weigh as zero and rely on time-based flushing only?

Thanks,
Michal

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 1/2] mm: memcg: refactor page state unit helpers
  2023-10-03 18:11   ` Michal Koutný
@ 2023-10-03 19:47     ` Yosry Ahmed
  2023-10-04  9:02       ` Michal Koutný
  0 siblings, 1 reply; 28+ messages in thread
From: Yosry Ahmed @ 2023-10-03 19:47 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, Shakeel Butt, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, linux-mm, cgroups, linux-kernel

On Tue, Oct 3, 2023 at 11:11 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Sep 22, 2023 at 05:57:39PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > memcg_page_state_unit() is currently used to identify the unit of a
> > memcg state item so that all stats in memory.stat are in bytes. However,
> > it lies about the units of WORKINGSET_* stats. These stats actually
> > represent pages, but we present them to userspace as a scalar number of
> > events. In retrospect, maybe those stats should have been memcg "events"
> > rather than memcg "state".
>
> Why isn't it possible to move WORKINGSET_* stats under the events now?
> (Instead of using internal and external units.)

Those constants are shared with code outside of memcg, namely enum
node_stat_item and enum vm_event_item, and IIUC they are used
differently outside of memcg. Did I miss something?

>
> Thanks,
> Michal

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/2] mm: memcg: normalize the value passed into memcg_rstat_updated()
  2023-10-03 18:22   ` Michal Koutný
@ 2023-10-03 19:51     ` Yosry Ahmed
  0 siblings, 0 replies; 28+ messages in thread
From: Yosry Ahmed @ 2023-10-03 19:51 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, Shakeel Butt, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, linux-mm, cgroups, linux-kernel

On Tue, Oct 3, 2023 at 11:22 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Sep 22, 2023 at 05:57:40PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > memcg_rstat_updated() uses the value of the state update to keep track
> > of the magnitude of pending updates, so that we only do a stats flush
> > when it's worth the work. Most values passed into memcg_rstat_updated()
> > are in pages; however, a few of them are actually in bytes or KBs.
> >
> > To put this into perspective, a 512 byte slab allocation today would
> > look the same as allocating 512 pages. This may result in premature
> > flushes, which means unnecessary work and latency.
> >
> > Normalize all the state values passed into memcg_rstat_updated() to
> > pages.
>
> I've dreamed about such normalization since error estimates were
> introduced :-)
>
> (As touched on in the previous patch) it makes me wonder whether it makes
> sense to add up state and event counters (apples and oranges).

I conceptually agree that we are adding apples and oranges, but in
practice, if you look at memcg_vm_event_stat, most stat updates
correspond to 1 page worth of something. I am guessing the implicit
assumption here is that event updates correspond roughly to page-sized
state updates. It's not perfect, but perhaps it's acceptable.

>
> Shouldn't events, with this approach: a) have a separate counter, or b)
> weigh as zero and rely on time-based flushing only?

(a) If we have separate counters we need to eventually sum them to
figure out if the total magnitude of pending updates is worth flushing
anyway, right?
(b) I would be more nervous to rely on time-based flushing only as I
don't know how this can affect workloads.

>
> Thanks,
> Michal

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 1/2] mm: memcg: refactor page state unit helpers
  2023-10-03 19:47     ` Yosry Ahmed
@ 2023-10-04  9:02       ` Michal Koutný
  2023-10-04 16:58         ` Yosry Ahmed
  2023-10-04 18:36         ` Johannes Weiner
  0 siblings, 2 replies; 28+ messages in thread
From: Michal Koutný @ 2023-10-04  9:02 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Shakeel Butt, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, linux-mm, cgroups, linux-kernel

On Tue, Oct 03, 2023 at 12:47:25PM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> Those constants are shared with code outside of memcg, namely enum
> node_stat_item and enum vm_event_item, and IIUC they are used
> differently outside of memcg. Did I miss something?

The difference is not big, e.g.
  mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
could be
  __count_memcg_events(
    container_of(lruvec, struct mem_cgroup_per_node, lruvec)->memcg,
    WORKINGSET_ACTIVATE_BASE + type, delta
  );

Yes, it would mean transferring WORKINGSET_* items from enum
node_stat_item to enum vm_event_item.
IOW, I don't know what is the effective difference between
mod_memcg_lruvec_state() and count_memcg_events().
Is it per-memcg vs per-memcg-per-node resolution?
(Is _that_ read by workingset mechanism?)

Thanks,
Michal

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 1/2] mm: memcg: refactor page state unit helpers
  2023-10-04  9:02       ` Michal Koutný
@ 2023-10-04 16:58         ` Yosry Ahmed
  2023-10-04 18:36         ` Johannes Weiner
  1 sibling, 0 replies; 28+ messages in thread
From: Yosry Ahmed @ 2023-10-04 16:58 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, Shakeel Butt, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, linux-mm, cgroups, linux-kernel

On Wed, Oct 4, 2023 at 2:02 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Tue, Oct 03, 2023 at 12:47:25PM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> > Those constants are shared with code outside of memcg, namely enum
> > node_stat_item and enum vm_event_item, and IIUC they are used
> > differently outside of memcg. Did I miss something?
>
> The difference is not big, e.g.
>   mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
> could be
>   __count_memcg_events(
>     container_of(lruvec, struct mem_cgroup_per_node, lruvec)->memcg,
>     WORKINGSET_ACTIVATE_BASE + type, delta
>   );
>
> Yes, it would mean transferring WORKINGSET_* items from enum
> node_stat_item to enum vm_event_item.
> IOW, I don't know what the effective difference is between
> mod_memcg_lruvec_state() and count_memcg_events().
> Is it per-memcg vs per-memcg-per-node resolution?
> (Is _that_ read by the workingset mechanism?)

Even if it is not read, I think it is exposed in memory.numa_stat, right?

Outside of memcg code, if you look at vmstat_start(), you will see
that the items in enum vm_event_item are handled differently (see
all_vm_events()) when reading vmstat. I don't think we can just move
them, unfortunately.

>
> Thanks,
> Michal

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 1/2] mm: memcg: refactor page state unit helpers
  2023-10-04  9:02       ` Michal Koutný
  2023-10-04 16:58         ` Yosry Ahmed
@ 2023-10-04 18:36         ` Johannes Weiner
  2023-10-05  9:06           ` Michal Koutný
  1 sibling, 1 reply; 28+ messages in thread
From: Johannes Weiner @ 2023-10-04 18:36 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Yosry Ahmed, Andrew Morton, Shakeel Butt, Michal Hocko,
	Roman Gushchin, Muchun Song, linux-mm, cgroups, linux-kernel

On Wed, Oct 04, 2023 at 11:02:11AM +0200, Michal Koutný wrote:
> On Tue, Oct 03, 2023 at 12:47:25PM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> > Those constants are shared with code outside of memcg, namely enum
> > node_stat_item and enum vm_event_item, and IIUC they are used
> > differently outside of memcg. Did I miss something?
> 
> The difference is not big, e.g.
>   mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
> could be
>   __count_memcg_events(
>     container_of(lruvec, struct mem_cgroup_per_node, lruvec)->memcg,
>     WORKINGSET_ACTIVATE_BASE + type, delta
>   );
> 
> Yes, it would mean transferring WORKINGSET_* items from enum
> node_stat_item to enum vm_event_item.
> IOW, I don't know what the effective difference is between
> mod_memcg_lruvec_state() and count_memcg_events().
> Is it per-memcg vs per-memcg-per-node resolution?

Yes, it's because of node resolution which event counters generally
don't have. Some of the refault events influence node-local reclaim
decisions, see mm/vmscan.c::snapshot_refaults().

There are a few other event counters in the stat array that people
thought would be useful to have split out in
/sys/devices/system/node/nodeN/vmstat to understand numa behavior
better.

It's a bit messy.

Some events would be useful to move to 'stats' for the numa awareness,
such as the allocator stats and reclaim activity.

Some events would be useful to move to 'stats' for the numa awareness,
but don't have the zone resolution required by them, such as
kswapd/kcompactd wakeups.

Some events aren't numa specific, such as oom kills, drop_pagecache.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 1/2] mm: memcg: refactor page state unit helpers
  2023-10-04 18:36         ` Johannes Weiner
@ 2023-10-05  9:06           ` Michal Koutný
  2023-10-05  9:31             ` Yosry Ahmed
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Koutný @ 2023-10-05  9:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yosry Ahmed, Andrew Morton, Shakeel Butt, Michal Hocko,
	Roman Gushchin, Muchun Song, linux-mm, cgroups, linux-kernel

On Wed, Oct 04, 2023 at 02:36:19PM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Yes, it's because of node resolution which event counters generally
> don't have. Some of the refault events influence node-local reclaim
> decisions, see mm/vmscan.c::snapshot_refaults().
> 
> There are a few other event counters in the stat array that people
> thought would be useful to have split out in
> /sys/devices/system/node/nodeN/vmstat to understand numa behavior
> better.
> 
> It's a bit messy.
> 
> Some events would be useful to move to 'stats' for the numa awareness,
> such as the allocator stats and reclaim activity.
> 
> Some events would be useful to move to 'stats' for the numa awareness,
> but don't have the zone resolution required by them, such as
> kswapd/kcompactd wakeups.

Thanks for the enlightenment.

> Some events aren't numa specific, such as oom kills, drop_pagecache.

These are oddballs indeed. As with the normalization patchset, these are
counted as a PAGE_SIZE^W 1 error, but they should rather be an infinite
error (to warrant a flush).

So my feedback to this series is:
- patch 1/2 -- creating two classes of units is a consequence of unclarity
  between state and events (as in event=Δstate/Δt) and resolution
  (global vs per-node), so the better approach would be to tidy this up,
- patch 2/2 -- it could use the single unit class that exists, 
  it'll bound the error of printed numbers after all (and can be changed
  later depending on how it affects internal consumers).

My 0.02€,
Michal

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 1/2] mm: memcg: refactor page state unit helpers
  2023-10-05  9:06           ` Michal Koutný
@ 2023-10-05  9:31             ` Yosry Ahmed
  2023-10-05 16:30               ` Michal Koutný
  0 siblings, 1 reply; 28+ messages in thread
From: Yosry Ahmed @ 2023-10-05  9:31 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Johannes Weiner, Andrew Morton, Shakeel Butt, Michal Hocko,
	Roman Gushchin, Muchun Song, linux-mm, cgroups, linux-kernel

On Thu, Oct 5, 2023 at 2:06 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Wed, Oct 04, 2023 at 02:36:19PM -0400, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Yes, it's because of node resolution which event counters generally
> > don't have. Some of the refault events influence node-local reclaim
> > decisions, see mm/vmscan.c::snapshot_refaults().
> >
> > There are a few other event counters in the stat array that people
> > thought would be useful to have split out in
> > /sys/devices/system/node/nodeN/vmstat to understand numa behavior
> > better.
> >
> > It's a bit messy.
> >
> > Some events would be useful to move to 'stats' for the numa awareness,
> > such as the allocator stats and reclaim activity.
> >
> > Some events would be useful to move to 'stats' for the numa awareness,
> > but don't have the zone resolution required by them, such as
> > kswapd/kcompactd wakeups.
>
> Thanks for the enlightenment.
>
> > Some events aren't numa specific, such as oom kills, drop_pagecache.
>
> These are oddballs indeed. As with the normalization patchset, these are
> counted as a PAGE_SIZE^W 1 error, but they should rather be an infinite
> error (to warrant a flush).
>
> So my feedback to this series is:
> - patch 1/2 -- creating two classes of units is a consequence of unclarity
>   between state and events (as in event=Δstate/Δt) and resolution
>   (global vs per-node), so the better approach would be to tidy this up,

I am not really sure what you mean here. I understand that this series
fixes the unit normalization for state but leaves events out of it.
Looking at the event items tracked by memcg in memcg_vm_event_stat,
most of them seem to correspond roughly to a page's worth of updates
(all but the THP_* events). We don't track things like OOM_KILL and
DROP_PAGECACHE per memcg as far as I can tell.

Do you mean that we should add something similar to
memcg_page_state_unit() for events as well to get all of them right?
If so, I think that should be easy to add; it would only need to
special-case the THP_* events, something like the sketch below.
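
(A rough sketch of what I have in mind -- the helper name
memcg_event_unit() is made up; THP_FAULT_ALLOC, THP_COLLAPSE_ALLOC and
HPAGE_PMD_NR are real, but nothing below is from the series:)

/*
 * Hypothetical counterpart of memcg_page_state_unit() for events:
 * how many pages' worth of updates a single event represents.
 */
static int memcg_event_unit(int event)
{
	switch (event) {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	case THP_FAULT_ALLOC:
	case THP_COLLAPSE_ALLOC:
		return HPAGE_PMD_NR;	/* one event covers a whole THP */
#endif
	default:
		return 1;		/* roughly a page's worth */
	}
}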

Alternatively, I can add a comment above the call to
memcg_rstat_updated() in __count_memcg_events() explaining why we
don't normalize the event count for now.

> - patch 2/2 -- it could use the single unit class that exists,
>   it'll bound the error of the printed numbers after all (and can be
>   changed later depending on how it affects internal consumers).

This will mean that WORKINGSET_* state will become more stale. We will
need 4096 times as many updates as today to get a flush. These are used
by internal flushers (reclaim), and are exposed to userspace. I am not
sure we want to do that.
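
(Back-of-the-envelope, treating the flush threshold abstractly as T
pages' worth of pending error and assuming 4K pages:)

	patch 2/2 as is:   1 refault -> 1 page of error      -> flush after ~T refaults
	single unit class: 1 refault -> 1/4096 page of error -> flush after ~4096*T refaults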

>
> My 0.02€,
> Michal

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 1/2] mm: memcg: refactor page state unit helpers
  2023-10-05  9:31             ` Yosry Ahmed
@ 2023-10-05 16:30               ` Michal Koutný
  2023-10-05 17:30                 ` Yosry Ahmed
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Koutný @ 2023-10-05 16:30 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Andrew Morton, Shakeel Butt, Michal Hocko,
	Roman Gushchin, Muchun Song, linux-mm, cgroups, linux-kernel

On Thu, Oct 05, 2023 at 02:31:03AM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> I am not really sure what you mean here.

My "vision" is to treat WORKINGSET_ entries as events.
That would mean implementing per-node tracking for vm_event_item
(costlier?).
That would mean node_stat_item and vm_event_item being effectively
equal, so they could be merged in one.
That would be situation to come up with new classification based on use
cases (e.g. precision/timeliness requirements, state vs change
semantics).

(Do not take this as blocker of the patch 1/2, I rather used the
opportunity to discuss a greater possible cleanup.)
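
(Purely to illustrate that direction -- every name below is invented,
nothing here is proposed code:)

/*
 * If node_stat_item and vm_event_item were merged, each counter could
 * carry its semantics explicitly instead of being special-cased in
 * unit helpers.
 */
enum vm_counter_class {
	VM_COUNTER_STATE,	/* level semantics, bounded pending error */
	VM_COUNTER_EVENT,	/* change semantics, event = Δstate/Δt */
};

struct vm_counter_desc {
	enum vm_counter_class	class;
	int			unit;		/* bytes per increment */
	bool			per_node;	/* tracked with node resolution */
};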

> We don't track things like OOM_KILL and DROP_PAGECACHE per memcg as
> far as I can tell.

Ah, good. (I forgot only subset of entries is relevant for memcgs.)

> This will mean that WORKINGSET_* state will become more stale. We will
> need 4096 as many updates as today to get a flush. These are used by
> internal flushers (reclaim), and are exposed to userspace. I am not
> sure we want to do that.

snapshot_refaults() doesn't seem to follow a flush,
and
workingset_refault()'s flush doesn't seem to precede readers.

Is the flush misplaced, or have I overlooked something?
(If the former, it seems to work well enough even with the current
flushing heuristics :-))


Michal

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 1/2] mm: memcg: refactor page state unit helpers
  2023-10-05 16:30               ` Michal Koutný
@ 2023-10-05 17:30                 ` Yosry Ahmed
  2023-10-18 19:27                   ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Yosry Ahmed @ 2023-10-05 17:30 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Johannes Weiner, Andrew Morton, Shakeel Butt, Michal Hocko,
	Roman Gushchin, Muchun Song, linux-mm, cgroups, linux-kernel

On Thu, Oct 5, 2023 at 9:30 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Thu, Oct 05, 2023 at 02:31:03AM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> > I am not really sure what you mean here.
>
> My "vision" is to treat WORKINGSET_ entries as events.
> That would mean implementing per-node tracking for vm_event_item
> (costlier?).
> That would mean node_stat_item and vm_event_item being effectively
> equal, so they could be merged in one.
> That would be situation to come up with new classification based on use
> cases (e.g. precision/timeliness requirements, state vs change
> semantics).
>
> (Do not take this as blocker of the patch 1/2, I rather used the
> opportunity to discuss a greater possible cleanup.)

Yeah ideally we can clean this up separately. I would be careful about
userspace exposure though. It seems like CONFIG_VM_EVENT_COUNTERS is
used to control tracking events and displaying them in vmstat, so
moving items between node_stat_item and vm_event_item (or merging
them) won't be easy.

>
> > We don't track things like OOM_KILL and DROP_PAGECACHE per memcg as
> > far as I can tell.
>
> Ah, good. (I forgot only subset of entries is relevant for memcgs.)
>
> > This will mean that WORKINGSET_* state will become more stale. We will
> > need 4096 times as many updates as today to get a flush. These are used
> > by internal flushers (reclaim), and are exposed to userspace. I am not
> > sure we want to do that.
>
> snapshot_refaults() doesn't seem to follow a flush,
> and
> workingset_refault()'s flush doesn't seem to precede readers.
>
> Is the flush misplaced, or have I overlooked something?
> (If the former, it seems to work well enough even with the current
> flushing heuristics :-))

We flush in prepare_scan_count() before reading WORKINGSET_ACTIVATE_*
state. That flush also implicitly precedes every call to
snapshot_refaults(), which is subtle and not very robust, but we are
effectively flushing before snapshot_refaults() too.
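
(An outline of the ordering as I read mm/vmscan.c -- the functions named
are real, but the body is abbreviated, not verbatim kernel code:)

static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
{
	...
	/*
	 * Calls mem_cgroup_flush_stats() and then reads the
	 * WORKINGSET_ACTIVATE_* counters via lruvec_page_state().
	 */
	prepare_scan_count(pgdat, sc);

	shrink_node_memcgs(pgdat, sc);
	...
	/* only implicitly covered by the flush above */
	snapshot_refaults(sc->target_mem_cgroup, pgdat);
}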


>
>
> Michal

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 1/2] mm: memcg: refactor page state unit helpers
  2023-10-05 17:30                 ` Yosry Ahmed
@ 2023-10-18 19:27                   ` Andrew Morton
  0 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2023-10-18 19:27 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Michal Koutný,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Roman Gushchin,
	Muchun Song, linux-mm, cgroups, linux-kernel

On Thu, 5 Oct 2023 10:30:36 -0700 Yosry Ahmed <yosryahmed@google.com> wrote:

> On Thu, Oct 5, 2023 at 9:30 AM Michal Koutný <mkoutny@suse.com> wrote:
> >
> > On Thu, Oct 05, 2023 at 02:31:03AM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> > > I am not really sure what you mean here.
> >
> > My "vision" is to treat WORKINGSET_ entries as events.
> > That would mean implementing per-node tracking for vm_event_item
> > (costlier?).
> > That would mean node_stat_item and vm_event_item being effectively
> > equal, so they could be merged in one.
> > That would be situation to come up with new classification based on use
> > cases (e.g. precision/timeliness requirements, state vs change
> > semantics).
> >
> > (Do not take this as blocker of the patch 1/2, I rather used the
> > opportunity to discuss a greater possible cleanup.)
> 
> Yeah ideally we can clean this up separately. I would be careful about
> userspace exposure though. It seems like CONFIG_VM_EVENT_COUNTERS is
> used to control tracking events and displaying them in vmstat, so
> moving items between node_stat_item and vm_event_item (or merging
> them) won't be easy.

I like the word "separately".  This series has been in mm-unstable for
nearly a month, so I'll move it into mm-stable as-is.


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2023-10-18 19:27 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-22 17:57 [PATCH v2 0/2] mm: memcg: fix tracking of pending stats updates values Yosry Ahmed
2023-09-22 17:57 ` [PATCH v2 1/2] mm: memcg: refactor page state unit helpers Yosry Ahmed
2023-09-22 17:57   ` Yosry Ahmed
2023-10-03 13:03   ` Johannes Weiner
2023-10-03 18:11   ` Michal Koutný
2023-10-03 19:47     ` Yosry Ahmed
2023-10-04  9:02       ` Michal Koutný
2023-10-04 16:58         ` Yosry Ahmed
2023-10-04 18:36         ` Johannes Weiner
2023-10-05  9:06           ` Michal Koutný
2023-10-05  9:31             ` Yosry Ahmed
2023-10-05 16:30               ` Michal Koutný
2023-10-05 17:30                 ` Yosry Ahmed
2023-10-18 19:27                   ` Andrew Morton
2023-09-22 17:57 ` [PATCH v2 2/2] mm: memcg: normalize the value passed into memcg_rstat_updated() Yosry Ahmed
2023-09-22 17:57   ` Yosry Ahmed
2023-10-03 13:13   ` Johannes Weiner
2023-10-03 15:53     ` Yosry Ahmed
2023-10-03 18:22   ` Michal Koutný
2023-10-03 19:51     ` Yosry Ahmed
2023-09-25 13:50 ` [PATCH v2 0/2] mm: memcg: fix tracking of pending stats updates values Michal Hocko
2023-09-25 13:50   ` Michal Hocko
2023-09-25 17:11   ` Yosry Ahmed
2023-09-25 17:11     ` Yosry Ahmed
2023-10-03  7:57     ` Michal Hocko
2023-10-03  8:03       ` Yosry Ahmed
2023-10-03  8:09         ` Michal Hocko
2023-10-03  8:49           ` Yosry Ahmed
