* [PATCH 0/3] memcg: optimize charge codepath
@ 2022-08-22  0:17 ` Shakeel Butt
  0 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22  0:17 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song
  Cc: Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel,
	Shakeel Butt

Recently, the Linux networking stack moved from very old per-socket
pre-charge caching to per-cpu caching to avoid pre-charge fragmentation
and unwarranted OOMs. One impact of this change is that for network
traffic workloads, the memcg charging codepath can become a bottleneck.
The kernel test robot has also reported this regression. This patch
series tries to improve memcg charging for such workloads.

This patch series implements three optimizations:
(A) Reduce atomic ops in page counter update path.
(B) Change layout of struct page_counter to eliminate false sharing
    between usage and high.
(C) Increase the memcg charge batch to 64.

To evaluate the impact of these optimizations, on a 72-CPU machine, we
ran the following workload in the root memcg and then compared it with
the scenario where the workload runs in a three-level cgroup hierarchy
with the top level having memory.min and memory.low set up appropriately.

 $ netserver -6
 # 36 instances of netperf with following params
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
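
For reference, a minimal sketch of the cgroup setup used for the
non-root runs (assuming cgroup v2 mounted at /sys/fs/cgroup; the
directory names and protection values below are illustrative, with
memory.min roughly the size of the netperf binary and memory.low double
that, as described in the individual patches):

 # three-level hierarchy; names and values are placeholders
 $ mkdir -p /sys/fs/cgroup/a/b/c
 $ echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
 $ echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
 $ echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
 # protections on the top level of the hierarchy
 $ echo 1M > /sys/fs/cgroup/a/memory.min
 $ echo 2M > /sys/fs/cgroup/a/memory.low
 # run netserver/netperf from the leaf cgroup
 $ echo $$ > /sys/fs/cgroup/a/b/c/cgroup.procs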

Results (average netperf throughput, in Mbps):
1. root memcg		21694.8
2. 6.0-rc1		10482.7 (-51.6%)
3. 6.0-rc1 + (A)	14542.5 (-32.9%)
4. 6.0-rc1 + (B)	12413.7 (-42.7%)
5. 6.0-rc1 + (C)	17063.7 (-21.3%)
6. 6.0-rc1 + (A+B+C)	20120.3 (-7.2%)

With all three optimizations, the memcg overhead of this workload has
been reduced from 51.6% to just 7.2%.

Shakeel Butt (3):
  mm: page_counter: remove unneeded atomic ops for low/min
  mm: page_counter: rearrange struct page_counter fields
  memcg: increase MEMCG_CHARGE_BATCH to 64

 include/linux/memcontrol.h   |  7 ++++---
 include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
 mm/page_counter.c            | 13 ++++++-------
 3 files changed, 33 insertions(+), 21 deletions(-)

-- 
2.37.1.595.g718a3a8f04-goog


^ permalink raw reply	[flat|nested] 86+ messages in thread


* [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22  0:17 ` Shakeel Butt
  (?)
@ 2022-08-22  0:17   ` Shakeel Butt
  -1 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22  0:17 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song
  Cc: Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel,
	Shakeel Butt

For cgroups using low or min protections, the function
propagate_protected_usage() was doing an atomic xchg() operation
unconditionally. It only needs to do that operation if the new value of
the protection is different from the old one. This patch does that.

To evaluate the impact of this optimization, on a 72-CPU machine, we
ran the following workload in a three-level cgroup hierarchy with the
top level having memory.min and memory.low set up appropriately. More
specifically, memory.min was set to the size of the netperf binary and
memory.low to double that.

 $ netserver -6
 # 36 instances of netperf with following params
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1)	10482.7 Mbps
With patch		14542.5 Mbps (38.7% improvement)

With the patch, the throughput improved by 38.7%.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
---
 mm/page_counter.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/page_counter.c b/mm/page_counter.c
index eb156ff5d603..47711aa28161 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
 				      unsigned long usage)
 {
 	unsigned long protected, old_protected;
-	unsigned long low, min;
 	long delta;
 
 	if (!c->parent)
 		return;
 
-	min = READ_ONCE(c->min);
-	if (min || atomic_long_read(&c->min_usage)) {
-		protected = min(usage, min);
+	protected = min(usage, READ_ONCE(c->min));
+	old_protected = atomic_long_read(&c->min_usage);
+	if (protected != old_protected) {
 		old_protected = atomic_long_xchg(&c->min_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
 			atomic_long_add(delta, &c->parent->children_min_usage);
 	}
 
-	low = READ_ONCE(c->low);
-	if (low || atomic_long_read(&c->low_usage)) {
-		protected = min(usage, low);
+	protected = min(usage, READ_ONCE(c->low));
+	old_protected = atomic_long_read(&c->low_usage);
+	if (protected != old_protected) {
 		old_protected = atomic_long_xchg(&c->low_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
-- 
2.37.1.595.g718a3a8f04-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread


* [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-22  0:17 ` Shakeel Butt
@ 2022-08-22  0:17   ` Shakeel Butt
  -1 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22  0:17 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song
  Cc: Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel,
	Shakeel Butt

With memcg v2 enabled, memcg->memory.usage is a very hot member for
workloads doing memcg charging on multiple CPUs concurrently,
particularly network-intensive workloads. In addition, there is false
cache sharing between memory.usage and memory.high on the charge path.
This patch moves usage into a separate cacheline and moves all the
read-mostly fields into another separate cacheline.
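
As a side note (not part of this patch), one way to double-check the
resulting field placement is to dump the struct layout with pahole
against a vmlinux built with CONFIG_DEBUG_INFO; pahole annotates the
cacheline boundaries in its output:

 $ pahole -C page_counter vmlinux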

To evaluate the impact of this optimization, on a 72-CPU machine, we
ran the following workload in a three-level cgroup hierarchy with the
top level having memory.min and memory.low set up appropriately. More
specifically, memory.min was set to the size of the netperf binary and
memory.low to double that.

 $ netserver -6
 # 36 instances of netperf with following params
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1)	10482.7 Mbps
With patch		12413.7 Mbps (18.4% improvement)

With the patch, the throughput improved by 18.4%.

One side-effect of this patch is an increase in the size of struct
mem_cgroup. However, the additional size is worth it for the
performance improvement. In addition, there are opportunities to reduce
the size of struct mem_cgroup, such as deprecating the kmem and tcpmem
page counters and better packing.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
---
 include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 679591301994..8ce99bde645f 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -3,15 +3,27 @@
 #define _LINUX_PAGE_COUNTER_H
 
 #include <linux/atomic.h>
+#include <linux/cache.h>
 #include <linux/kernel.h>
 #include <asm/page.h>
 
+#if defined(CONFIG_SMP)
+struct pc_padding {
+	char x[0];
+} ____cacheline_internodealigned_in_smp;
+#define PC_PADDING(name)	struct pc_padding name
+#else
+#define PC_PADDING(name)
+#endif
+
 struct page_counter {
+	/*
+	 * Make sure 'usage' does not share cacheline with any other field. The
+	 * memcg->memory.usage is a hot member of struct mem_cgroup.
+	 */
+	PC_PADDING(_pad1_);
 	atomic_long_t usage;
-	unsigned long min;
-	unsigned long low;
-	unsigned long high;
-	unsigned long max;
+	PC_PADDING(_pad2_);
 
 	/* effective memory.min and memory.min usage tracking */
 	unsigned long emin;
@@ -23,16 +35,16 @@ struct page_counter {
 	atomic_long_t low_usage;
 	atomic_long_t children_low_usage;
 
-	/* legacy */
 	unsigned long watermark;
 	unsigned long failcnt;
 
-	/*
-	 * 'parent' is placed here to be far from 'usage' to reduce
-	 * cache false sharing, as 'usage' is written mostly while
-	 * parent is frequently read for cgroup's hierarchical
-	 * counting nature.
-	 */
+	/* Keep all the read-mostly fields in a separate cacheline. */
+	PC_PADDING(_pad3_);
+
+	unsigned long min;
+	unsigned long low;
+	unsigned long high;
+	unsigned long max;
 	struct page_counter *parent;
 };
 
-- 
2.37.1.595.g718a3a8f04-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread


* [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-22  0:17 ` Shakeel Butt
@ 2022-08-22  0:17   ` Shakeel Butt
  -1 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22  0:17 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song
  Cc: Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel,
	Shakeel Butt

For several years, MEMCG_CHARGE_BATCH was kept at 32, but with bigger
machines and network-intensive workloads requiring throughput in Gbps,
32 is too small and makes the memcg charging path a bottleneck. For
now, increase it to 64 for easy acceptance into 6.0. We will need to
revisit this in the future for the ever-increasing demand for higher
performance.

Please note that the memcg charge path drains the per-cpu memcg charge
stock, so there should not be any OOM behavior change.
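
For context, a heavily simplified sketch of the charge fast path
(loosely modeled on try_charge()/consume_stock()/refill_stock() in
mm/memcontrol.c; not the literal kernel code) shows where the batch
size comes in and why any extra pre-charge stays bounded by the
per-cpu stock:

	/* Simplified sketch only; retries, memsw and reclaim are omitted. */
	if (consume_stock(memcg, nr_pages))	/* per-cpu cache hit: no shared atomics */
		return 0;
	batch = max(MEMCG_CHARGE_BATCH, nr_pages);
	/* charge the shared, contended page counter in one batched step */
	if (!page_counter_try_charge(&memcg->memory, batch, &counter))
		goto reclaim_or_oom;		/* unchanged by this patch */
	if (batch > nr_pages)			/* surplus is cached in the per-cpu stock */
		refill_stock(memcg, batch - nr_pages);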

To evaluate the impact of this optimization, on a 72-CPU machine, we
ran the following workload in a three-level cgroup hierarchy with the
top level having memory.min and memory.low set up appropriately. More
specifically, memory.min was set to the size of the netperf binary and
memory.low to double that.

 $ netserver -6
 # 36 instances of netperf with following params
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1)       10482.7 Mbps
With patch              17064.7 Mbps (62.7% improvement)

With the patch, the throughput improved by 62.7%.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
---
 include/linux/memcontrol.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4d31ce55b1c0..70ae91188e16 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -354,10 +354,11 @@ struct mem_cgroup {
 };
 
 /*
- * size of first charge trial. "32" comes from vmscan.c's magic value.
- * TODO: maybe necessary to use big numbers in big irons.
+ * size of first charge trial.
+ * TODO: maybe necessary to use big numbers in big irons or dynamic based on the
+ * workload.
  */
-#define MEMCG_CHARGE_BATCH 32U
+#define MEMCG_CHARGE_BATCH 64U
 
 extern struct mem_cgroup *root_mem_cgroup;
 
-- 
2.37.1.595.g718a3a8f04-goog


^ permalink raw reply related	[flat|nested] 86+ messages in thread


* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22  0:17   ` Shakeel Butt
  (?)
@ 2022-08-22  0:20     ` Soheil Hassas Yeganeh
  -1 siblings, 0 replies; 86+ messages in thread
From: Soheil Hassas Yeganeh @ 2022-08-22  0:20 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Michal Koutný,
	Eric Dumazet, Feng Tang, Oliver Sang, Andrew Morton, lkp,
	cgroups, linux-mm, netdev, linux-kernel

On Sun, Aug 21, 2022 at 8:17 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> For cgroups using low or min protections, the function
> propagate_protected_usage() was doing an atomic xchg() operation
> irrespectively. It only needs to do that operation if the new value of
> protection is different from older one. This patch does that.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
>
>  $ netserver -6
>  # 36 instances of netperf with following params
>  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1)       10482.7 Mbps
> With patch              14542.5 Mbps (38.7% improvement)
>
> With the patch, the throughput improved by 38.7%
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>

Nice speed up!

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

> ---
>  mm/page_counter.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index eb156ff5d603..47711aa28161 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
>                                       unsigned long usage)
>  {
>         unsigned long protected, old_protected;
> -       unsigned long low, min;
>         long delta;
>
>         if (!c->parent)
>                 return;
>
> -       min = READ_ONCE(c->min);
> -       if (min || atomic_long_read(&c->min_usage)) {
> -               protected = min(usage, min);
> +       protected = min(usage, READ_ONCE(c->min));
> +       old_protected = atomic_long_read(&c->min_usage);
> +       if (protected != old_protected) {
>                 old_protected = atomic_long_xchg(&c->min_usage, protected);
>                 delta = protected - old_protected;
>                 if (delta)
>                         atomic_long_add(delta, &c->parent->children_min_usage);
>         }
>
> -       low = READ_ONCE(c->low);
> -       if (low || atomic_long_read(&c->low_usage)) {
> -               protected = min(usage, low);
> +       protected = min(usage, READ_ONCE(c->low));
> +       old_protected = atomic_long_read(&c->low_usage);
> +       if (protected != old_protected) {
>                 old_protected = atomic_long_xchg(&c->low_usage, protected);
>                 delta = protected - old_protected;
>                 if (delta)
> --
> 2.37.1.595.g718a3a8f04-goog
>

^ permalink raw reply	[flat|nested] 86+ messages in thread


* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-22  0:17   ` Shakeel Butt
  (?)
@ 2022-08-22  0:24     ` Soheil Hassas Yeganeh
  -1 siblings, 0 replies; 86+ messages in thread
From: Soheil Hassas Yeganeh @ 2022-08-22  0:24 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Michal Koutný,
	Eric Dumazet, Feng Tang, Oliver Sang, Andrew Morton, lkp,
	cgroups, linux-mm, netdev, linux-kernel

On Sun, Aug 21, 2022 at 8:18 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> With memcg v2 enabled, memcg->memory.usage is a very hot member for
> the workloads doing memcg charging on multiple CPUs concurrently.
> Particularly the network intensive workloads. In addition, there is a
> false cache sharing between memory.usage and memory.high on the charge
> path. This patch moves the usage into a separate cacheline and move all
> the read most fields into separate cacheline.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
>
>  $ netserver -6
>  # 36 instances of netperf with following params
>  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1)       10482.7 Mbps
> With patch              12413.7 Mbps (18.4% improvement)
>
> With the patch, the throughput improved by 18.4%.

Shakeel, for my understanding: is this on top of the gains from the
previous patch?

> One side-effect of this patch is the increase in the size of struct
> mem_cgroup. However for the performance improvement, this additional
> size is worth it. In addition there are opportunities to reduce the size
> of struct mem_cgroup like deprecation of kmem and tcpmem page counters
> and better packing.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> ---
>  include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
>  1 file changed, 23 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 679591301994..8ce99bde645f 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -3,15 +3,27 @@
>  #define _LINUX_PAGE_COUNTER_H
>
>  #include <linux/atomic.h>
> +#include <linux/cache.h>
>  #include <linux/kernel.h>
>  #include <asm/page.h>
>
> +#if defined(CONFIG_SMP)
> +struct pc_padding {
> +       char x[0];
> +} ____cacheline_internodealigned_in_smp;
> +#define PC_PADDING(name)       struct pc_padding name
> +#else
> +#define PC_PADDING(name)
> +#endif
> +
>  struct page_counter {
> +       /*
> +        * Make sure 'usage' does not share cacheline with any other field. The
> +        * memcg->memory.usage is a hot member of struct mem_cgroup.
> +        */
> +       PC_PADDING(_pad1_);
>         atomic_long_t usage;
> -       unsigned long min;
> -       unsigned long low;
> -       unsigned long high;
> -       unsigned long max;
> +       PC_PADDING(_pad2_);
>
>         /* effective memory.min and memory.min usage tracking */
>         unsigned long emin;
> @@ -23,16 +35,16 @@ struct page_counter {
>         atomic_long_t low_usage;
>         atomic_long_t children_low_usage;
>
> -       /* legacy */
>         unsigned long watermark;
>         unsigned long failcnt;
>
> -       /*
> -        * 'parent' is placed here to be far from 'usage' to reduce
> -        * cache false sharing, as 'usage' is written mostly while
> -        * parent is frequently read for cgroup's hierarchical
> -        * counting nature.
> -        */
> +       /* Keep all the read most fields in a separete cacheline. */
> +       PC_PADDING(_pad3_);
> +
> +       unsigned long min;
> +       unsigned long low;
> +       unsigned long high;
> +       unsigned long max;
>         struct page_counter *parent;
>  };
>
> --
> 2.37.1.595.g718a3a8f04-goog
>

^ permalink raw reply	[flat|nested] 86+ messages in thread


* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-22  0:17   ` Shakeel Butt
  (?)
@ 2022-08-22  0:24     ` Soheil Hassas Yeganeh
  -1 siblings, 0 replies; 86+ messages in thread
From: Soheil Hassas Yeganeh @ 2022-08-22  0:24 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Michal Koutný,
	Eric Dumazet, Feng Tang, Oliver Sang, Andrew Morton, lkp,
	cgroups, linux-mm, netdev, linux-kernel

On Sun, Aug 21, 2022 at 8:18 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger
> machines and the network intensive workloads requiring througput in
> Gbps, 32 is too small and makes the memcg charging path a bottleneck.
> For now, increase it to 64 for easy acceptance to 6.0. We will need to
> revisit this in future for ever increasing demand of higher performance.
>
> Please note that the memcg charge path drain the per-cpu memcg charge
> stock, so there should not be any oom behavior change.
>
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
>
>  $ netserver -6
>  # 36 instances of netperf with following params
>  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Results (average throughput of netperf):
> Without (6.0-rc1)       10482.7 Mbps
> With patch              17064.7 Mbps (62.7% improvement)
>
> With the patch, the throughput improved by 62.7%.
>
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>

Nice!

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

> ---
>  include/linux/memcontrol.h | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4d31ce55b1c0..70ae91188e16 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -354,10 +354,11 @@ struct mem_cgroup {
>  };
>
>  /*
> - * size of first charge trial. "32" comes from vmscan.c's magic value.
> - * TODO: maybe necessary to use big numbers in big irons.
> + * size of first charge trial.
> + * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
> + * workload.
>   */
> -#define MEMCG_CHARGE_BATCH 32U
> +#define MEMCG_CHARGE_BATCH 64U
>
>  extern struct mem_cgroup *root_mem_cgroup;
>
> --
> 2.37.1.595.g718a3a8f04-goog
>

^ permalink raw reply	[flat|nested] 86+ messages in thread


* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-22  0:17   ` Shakeel Butt
  (?)
@ 2022-08-22  2:10     ` Feng Tang
  -1 siblings, 0 replies; 86+ messages in thread
From: Feng Tang @ 2022-08-22  2:10 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Sang, Oliver, Andrew Morton,
	lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon, Aug 22, 2022 at 08:17:36AM +0800, Shakeel Butt wrote:
> With memcg v2 enabled, memcg->memory.usage is a very hot member for
> the workloads doing memcg charging on multiple CPUs concurrently,
> particularly the network-intensive workloads. In addition, there is
> false cache sharing between memory.usage and memory.high on the charge
> path. This patch moves the usage into a separate cacheline and moves
> all the read-mostly fields into another separate cacheline.
> 
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
> 
>  $ netserver -6
>  # 36 instances of netperf with following params
>  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> 
> Results (average throughput of netperf):
> Without (6.0-rc1)	10482.7 Mbps
> With patch		12413.7 Mbps (18.4% improvement)
> 
> With the patch, the throughput improved by 18.4%.
> 
> One side-effect of this patch is the increase in the size of struct
> mem_cgroup. However for the performance improvement, this additional
> size is worth it. In addition there are opportunities to reduce the size
> of struct mem_cgroup like deprecation of kmem and tcpmem page counters
> and better packing.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>

Looks good to me, with one nit below. 

Reviewed-by: Feng Tang <feng.tang@intel.com>

> ---
>  include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
>  1 file changed, 23 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 679591301994..8ce99bde645f 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -3,15 +3,27 @@
>  #define _LINUX_PAGE_COUNTER_H
>  
>  #include <linux/atomic.h>
> +#include <linux/cache.h>
>  #include <linux/kernel.h>
>  #include <asm/page.h>
>  
> +#if defined(CONFIG_SMP)
> +struct pc_padding {
> +	char x[0];
> +} ____cacheline_internodealigned_in_smp;
> +#define PC_PADDING(name)	struct pc_padding name
> +#else
> +#define PC_PADDING(name)
> +#endif

There are 2 similar padding definitions in mmzone.h and memcontrol.h:

	struct memcg_padding {
		char x[0];
	} ____cacheline_internodealigned_in_smp;
	#define MEMCG_PADDING(name)      struct memcg_padding name

	struct zone_padding {
		char x[0];
	} ____cacheline_internodealigned_in_smp;
	#define ZONE_PADDING(name)	struct zone_padding name;

Maybe we can generalize them and lift them into include/cache.h, so
that more places can reuse them in the future?
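
A minimal sketch of what such a shared helper could look like, assuming
it sits next to the existing alignment macros in the cache header (the
CACHELINE_PADDING name is just a placeholder, not something from this
series):

	/* generic zero-size padding member; only meaningful on SMP */
	#if defined(CONFIG_SMP)
	struct cacheline_padding {
		char x[0];
	} ____cacheline_internodealigned_in_smp;
	#define CACHELINE_PADDING(name)	struct cacheline_padding name
	#else
	#define CACHELINE_PADDING(name)
	#endif

PC_PADDING(), MEMCG_PADDING() and ZONE_PADDING() could then be defined
in terms of (or replaced by) this single helper.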

Thanks,
Feng


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-22  0:17   ` Shakeel Butt
@ 2022-08-22  2:30     ` Feng Tang
  -1 siblings, 0 replies; 86+ messages in thread
From: Feng Tang @ 2022-08-22  2:30 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Sang, Oliver, Andrew Morton,
	lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon, Aug 22, 2022 at 08:17:37AM +0800, Shakeel Butt wrote:
> For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger
> machines and the network-intensive workloads requiring throughput in
> Gbps, 32 is too small and makes the memcg charging path a bottleneck.
> For now, increase it to 64 for easy acceptance into 6.0. We will need to
> revisit this in the future for the ever-increasing demand of higher performance.
> 
> Please note that the memcg charge path drains the per-cpu memcg charge
> stock, so there should not be any oom behavior change.
> 
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
> 
>  $ netserver -6
>  # 36 instances of netperf with following params
>  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> 
> Results (average throughput of netperf):
> Without (6.0-rc1)       10482.7 Mbps
> With patch              17064.7 Mbps (62.7% improvement)
> 
> With the patch, the throughput improved by 62.7%.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>

This batch number has long been a pain point :) thanks for the work!

Reviewed-by: Feng Tang <feng.tang@intel.com>

- Feng

> ---
>  include/linux/memcontrol.h | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4d31ce55b1c0..70ae91188e16 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -354,10 +354,11 @@ struct mem_cgroup {
>  };
>  
>  /*
> - * size of first charge trial. "32" comes from vmscan.c's magic value.
> - * TODO: maybe necessary to use big numbers in big irons.
> + * size of first charge trial.
> + * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
> + * workload.
>   */
> -#define MEMCG_CHARGE_BATCH 32U
> +#define MEMCG_CHARGE_BATCH 64U
>  
>  extern struct mem_cgroup *root_mem_cgroup;
>  
> -- 
> 2.37.1.595.g718a3a8f04-goog
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22  0:17   ` Shakeel Butt
  (?)
@ 2022-08-22  2:39     ` Feng Tang
  -1 siblings, 0 replies; 86+ messages in thread
From: Feng Tang @ 2022-08-22  2:39 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Sang, Oliver, Andrew Morton,
	lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon, Aug 22, 2022 at 08:17:35AM +0800, Shakeel Butt wrote:
> For cgroups using low or min protections, the function
> propagate_protected_usage() was doing an atomic xchg() operation
> unconditionally. It only needs to do that operation if the new value of
> protection is different from the older one. This patch does that.
> 
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.
> 
>  $ netserver -6
>  # 36 instances of netperf with following params
>  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> 
> Results (average throughput of netperf):
> Without (6.0-rc1)	10482.7 Mbps
> With patch		14542.5 Mbps (38.7% improvement)
> 
> With the patch, the throughput improved by 38.7%
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>

Reviewed-by: Feng Tang <feng.tang@intel.com>

Thanks!

- Feng

> ---
>  mm/page_counter.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index eb156ff5d603..47711aa28161 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
>  				      unsigned long usage)
>  {
>  	unsigned long protected, old_protected;
> -	unsigned long low, min;
>  	long delta;
>  
>  	if (!c->parent)
>  		return;
>  
> -	min = READ_ONCE(c->min);
> -	if (min || atomic_long_read(&c->min_usage)) {
> -		protected = min(usage, min);
> +	protected = min(usage, READ_ONCE(c->min));
> +	old_protected = atomic_long_read(&c->min_usage);
> +	if (protected != old_protected) {
>  		old_protected = atomic_long_xchg(&c->min_usage, protected);
>  		delta = protected - old_protected;
>  		if (delta)
>  			atomic_long_add(delta, &c->parent->children_min_usage);
>  	}
>  
> -	low = READ_ONCE(c->low);
> -	if (low || atomic_long_read(&c->low_usage)) {
> -		protected = min(usage, low);
> +	protected = min(usage, READ_ONCE(c->low));
> +	old_protected = atomic_long_read(&c->low_usage);
> +	if (protected != old_protected) {
>  		old_protected = atomic_long_xchg(&c->low_usage, protected);
>  		delta = protected - old_protected;
>  		if (delta)
> -- 
> 2.37.1.595.g718a3a8f04-goog
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-22  0:24     ` Soheil Hassas Yeganeh
@ 2022-08-22  4:55       ` Shakeel Butt
  -1 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22  4:55 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Michal Koutný,
	Eric Dumazet, Feng Tang, Oliver Sang, Andrew Morton, lkp,
	Cgroups, linux-mm, netdev, linux-kernel

On Sun, Aug 21, 2022 at 5:24 PM Soheil Hassas Yeganeh <soheil@google.com> wrote:
>
> On Sun, Aug 21, 2022 at 8:18 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > With memcg v2 enabled, memcg->memory.usage is a very hot member for
> > the workloads doing memcg charging on multiple CPUs concurrently,
> > particularly the network-intensive workloads. In addition, there is
> > false cache sharing between memory.usage and memory.high on the charge
> > path. This patch moves the usage into a separate cacheline and moves
> > all the read-mostly fields into another separate cacheline.
> >
> > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > ran the following workload in a three level of cgroup hierarchy with top
> > level having min and low setup appropriately. More specifically
> > memory.min equal to size of netperf binary and memory.low double of
> > that.
> >
> >  $ netserver -6
> >  # 36 instances of netperf with following params
> >  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> >
> > Results (average throughput of netperf):
> > Without (6.0-rc1)       10482.7 Mbps
> > With patch              12413.7 Mbps (18.4% improvement)
> >
> > With the patch, the throughput improved by 18.4%.
>
> Shakeel, for my understanding: is this on top of the gains from the
> previous patch?
>

No, this is independent of the previous patch. The cover letter has
the numbers for all three optimizations applied together.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-22  2:10     ` Feng Tang
  (?)
@ 2022-08-22  4:59       ` Shakeel Butt
  -1 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22  4:59 UTC (permalink / raw)
  To: Feng Tang
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Sang, Oliver, Andrew Morton,
	lkp, cgroups, linux-mm, netdev, linux-kernel

On Sun, Aug 21, 2022 at 7:12 PM Feng Tang <feng.tang@intel.com> wrote:
>
> On Mon, Aug 22, 2022 at 08:17:36AM +0800, Shakeel Butt wrote:
> > With memcg v2 enabled, memcg->memory.usage is a very hot member for
> > the workloads doing memcg charging on multiple CPUs concurrently,
> > particularly the network-intensive workloads. In addition, there is
> > false cache sharing between memory.usage and memory.high on the charge
> > path. This patch moves the usage into a separate cacheline and moves
> > all the read-mostly fields into another separate cacheline.
> >
> > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > ran the following workload in a three level of cgroup hierarchy with top
> > level having min and low setup appropriately. More specifically
> > memory.min equal to size of netperf binary and memory.low double of
> > that.
> >
> >  $ netserver -6
> >  # 36 instances of netperf with following params
> >  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> >
> > Results (average throughput of netperf):
> > Without (6.0-rc1)     10482.7 Mbps
> > With patch            12413.7 Mbps (18.4% improvement)
> >
> > With the patch, the throughput improved by 18.4%.
> >
> > One side-effect of this patch is the increase in the size of struct
> > mem_cgroup. However for the performance improvement, this additional
> > size is worth it. In addition there are opportunities to reduce the size
> > of struct mem_cgroup like deprecation of kmem and tcpmem page counters
> > and better packing.
> >
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Reported-by: kernel test robot <oliver.sang@intel.com>
>
> Looks good to me, with one nit below.
>
> Reviewed-by: Feng Tang <feng.tang@intel.com>

Thanks.

>
> > ---
> >  include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
> >  1 file changed, 23 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> > index 679591301994..8ce99bde645f 100644
> > --- a/include/linux/page_counter.h
> > +++ b/include/linux/page_counter.h
> > @@ -3,15 +3,27 @@
> >  #define _LINUX_PAGE_COUNTER_H
> >
> >  #include <linux/atomic.h>
> > +#include <linux/cache.h>
> >  #include <linux/kernel.h>
> >  #include <asm/page.h>
> >
> > +#if defined(CONFIG_SMP)
> > +struct pc_padding {
> > +     char x[0];
> > +} ____cacheline_internodealigned_in_smp;
> > +#define PC_PADDING(name)     struct pc_padding name
> > +#else
> > +#define PC_PADDING(name)
> > +#endif
>
> There are 2 similar padding definitions in mmzone.h and memcontrol.h:
>
>         struct memcg_padding {
>                 char x[0];
>         } ____cacheline_internodealigned_in_smp;
>         #define MEMCG_PADDING(name)      struct memcg_padding name
>
>         struct zone_padding {
>                 char x[0];
>         } ____cacheline_internodealigned_in_smp;
>         #define ZONE_PADDING(name)      struct zone_padding name;
>
> Maybe we can generalize them, and lift it into include/cache.h? so
> that more places can reuse it in future.
>

This makes sense but let me do that in a separate patch.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22  0:17   ` Shakeel Butt
@ 2022-08-22  9:55     ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-22  9:55 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
> For cgroups using low or min protections, the function
> propagate_protected_usage() was doing an atomic xchg() operation
> unconditionally. It only needs to do that operation if the new value of
> protection is different from the older one. This patch does that.

This doesn't really explain why.

> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.

I have a hard time really grasping what the actual setup is, why it
matters, and why the patch makes any difference. Please elaborate some
more here.

>  $ netserver -6
>  # 36 instances of netperf with following params
>  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> 
> Results (average throughput of netperf):
> Without (6.0-rc1)	10482.7 Mbps
> With patch		14542.5 Mbps (38.7% improvement)
> 
> With the patch, the throughput improved by 38.7%
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> ---
>  mm/page_counter.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index eb156ff5d603..47711aa28161 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
>  				      unsigned long usage)
>  {
>  	unsigned long protected, old_protected;
> -	unsigned long low, min;
>  	long delta;
>  
>  	if (!c->parent)
>  		return;
>  
> -	min = READ_ONCE(c->min);
> -	if (min || atomic_long_read(&c->min_usage)) {
> -		protected = min(usage, min);
> +	protected = min(usage, READ_ONCE(c->min));
> +	old_protected = atomic_long_read(&c->min_usage);
> +	if (protected != old_protected) {

I have to cache that code back into my brain. It is a really subtle thing,
and it is not really obvious why this is still correct. I will think about
it some more, but the changelog could help with that a lot.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22  9:55     ` Michal Hocko
  (?)
@ 2022-08-22 10:18       ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-22 10:18 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon 22-08-22 11:55:33, Michal Hocko wrote:
> On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
[...]
> > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > index eb156ff5d603..47711aa28161 100644
> > --- a/mm/page_counter.c
> > +++ b/mm/page_counter.c
> > @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> >  				      unsigned long usage)
> >  {
> >  	unsigned long protected, old_protected;
> > -	unsigned long low, min;
> >  	long delta;
> >  
> >  	if (!c->parent)
> >  		return;
> >  
> > -	min = READ_ONCE(c->min);
> > -	if (min || atomic_long_read(&c->min_usage)) {
> > -		protected = min(usage, min);
> > +	protected = min(usage, READ_ONCE(c->min));
> > +	old_protected = atomic_long_read(&c->min_usage);
> > +	if (protected != old_protected) {
> 
> I have to cache that code back into my brain. It is a really subtle thing,
> and it is not really obvious why this is still correct. I will think about
> it some more, but the changelog could help with that a lot.

OK, so this patch will be most useful when min > 0 && min < usage,
because then the protection doesn't really change since the last
call. In other words, this happens when the usage grows above the
protection, and your workload benefits from this change because that
happens a lot, as only a part of the workload is protected. Correct?

Unless I have missed anything, this shouldn't break correctness, but I
still have to think about the proportional distribution of the
protection because that adds to the complexity here.
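
To make the "protection doesn't really change" case concrete, here is a
made-up trace through propagate_protected_usage() (the numbers are
purely illustrative, not from any measurement):

	/*
	 * Assume c->min = 50 pages, c->min_usage already 50 from an earlier
	 * call, and usage grows 60 -> 70 -> 80 on successive charges:
	 *
	 *   protected     = min(usage, READ_ONCE(c->min))   = 50 every time
	 *   old_protected = atomic_long_read(&c->min_usage) = 50
	 *
	 * protected == old_protected, so the atomic_long_xchg() and the
	 * atomic_long_add() on the parent's children_min_usage are skipped.
	 * Only when usage drops below min (or min itself changes) does the
	 * protected value move and the atomic work happen again.
	 */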
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-22  0:17   ` Shakeel Butt
  (?)
@ 2022-08-22 10:23     ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-22 10:23 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon 22-08-22 00:17:36, Shakeel Butt wrote:
> With memcg v2 enabled, memcg->memory.usage is a very hot member for
> the workloads doing memcg charging on multiple CPUs concurrently,
> particularly the network-intensive workloads. In addition, there is
> false cache sharing between memory.usage and memory.high on the charge
> path. This patch moves the usage into a separate cacheline and moves
> all the read-mostly fields into another separate cacheline.
> 
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.

Again, the workload description is not particularly useful. I guess the
only important aspects are the netserver part below and the number of
CPUs, because the min and low setup doesn't have much to do with this,
right? At least that is my reading of the memory.high mentioned above.

>  $ netserver -6
>  # 36 instances of netperf with following params
>  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> 
> Results (average throughput of netperf):
> Without (6.0-rc1)	10482.7 Mbps
> With patch		12413.7 Mbps (18.4% improvement)
> 
> With the patch, the throughput improved by 18.4%.
> 
> One side-effect of this patch is the increase in the size of struct
> mem_cgroup. However for the performance improvement, this additional
> size is worth it. In addition there are opportunities to reduce the size
> of struct mem_cgroup like deprecation of kmem and tcpmem page counters
> and better packing.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> ---
>  include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
>  1 file changed, 23 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 679591301994..8ce99bde645f 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -3,15 +3,27 @@
>  #define _LINUX_PAGE_COUNTER_H
>  
>  #include <linux/atomic.h>
> +#include <linux/cache.h>
>  #include <linux/kernel.h>
>  #include <asm/page.h>
>  
> +#if defined(CONFIG_SMP)
> +struct pc_padding {
> +	char x[0];
> +} ____cacheline_internodealigned_in_smp;
> +#define PC_PADDING(name)	struct pc_padding name
> +#else
> +#define PC_PADDING(name)
> +#endif
> +
>  struct page_counter {
> +	/*
> +	 * Make sure 'usage' does not share cacheline with any other field. The
> +	 * memcg->memory.usage is a hot member of struct mem_cgroup.
> +	 */
> +	PC_PADDING(_pad1_);

Why don't you simply require alignment for the structure?
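
In other words, something along these lines (purely an illustrative
sketch of the alternative, not the layout from this patch):

	struct page_counter {
		/* hot: written on every charge/uncharge */
		atomic_long_t usage;

		PC_PADDING(_pad1_);	/* keep read-mostly fields off usage's cacheline */
		unsigned long min;
		unsigned long low;
		unsigned long high;
		unsigned long max;
		/* ... */
	} ____cacheline_internodealigned_in_smp;

with the type-level alignment making sure that an embedded
memcg->memory (and thus usage) starts on a cacheline boundary without a
leading padding member.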

Other than that, looks good to me and it makes sense.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-22  0:17   ` Shakeel Butt
@ 2022-08-22 10:47     ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-22 10:47 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon 22-08-22 00:17:37, Shakeel Butt wrote:
> For several years, MEMCG_CHARGE_BATCH was kept at 32, but with bigger
> machines and network-intensive workloads requiring throughput in
> Gbps, 32 is too small and makes the memcg charging path a bottleneck.
> For now, increase it to 64 for easy acceptance into 6.0. We will need
> to revisit this in the future for the ever-increasing demand for
> higher performance.

Yes, the batch size has always been an arbitrary number. I do not think
there have ever been any solid grounds for the value we have now except
we need something and SWAP_CLUSTER_MAX was a good enough template.

Increasing it to 64 sounds like a reasonable step. It would be great to
have it scale based on the number of CPUs and potentially other factors
but that would be hard to get right and actually hard to evaluate
because it will depend on the specific workload.
 
> Please note that the memcg charge path drains the per-cpu memcg charge
> stock, so there should not be any OOM behavior change.

It will have an effect on other things as well, like the high limit
reclaim backoff and stats flushing.
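
To make the batch's role concrete, here is a simplified, self-contained
userspace model of the batching scheme (illustrative only, not the actual
mm/memcontrol.c code): each CPU consumes single-page charges from a local
pre-charged stock and only touches the shared usage counter when that stock
runs out, charging MEMCG_CHARGE_BATCH pages in one atomic operation.

  /* Simplified model of memcg charge batching; not the kernel code. */
  #include <stdatomic.h>
  #include <stdio.h>

  #define MEMCG_CHARGE_BATCH 64U       /* was 32U before this patch */

  static atomic_ulong shared_usage;    /* models memcg->memory.usage */
  static unsigned long cpu_stock;      /* models one CPU's pre-charged stock */
  static unsigned long atomic_updates; /* hits on the shared hot cacheline */

  static void charge_one_page(void)
  {
      if (!cpu_stock) {
          /* One atomic RMW covers the next MEMCG_CHARGE_BATCH charges. */
          atomic_fetch_add(&shared_usage, MEMCG_CHARGE_BATCH);
          cpu_stock = MEMCG_CHARGE_BATCH;
          atomic_updates++;
      }
      cpu_stock--;
  }

  int main(void)
  {
      for (int i = 0; i < (1 << 20); i++)
          charge_one_page();
      printf("charged %d pages with %lu atomic updates\n",
             1 << 20, atomic_updates);
      return 0;
  }

Doubling the batch roughly halves the number of contended updates on the
shared counter, and, as noted above, it also shifts anything else keyed to
the batch size, such as the high limit reclaim backoff and stats flushing.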
 
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately. More specifically
> memory.min equal to size of netperf binary and memory.low double of
> that.

a similar feedback to the test case description as with other patches.
> 
>  $ netserver -6
>  # 36 instances of netperf with following params
>  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> 
> Results (average throughput of netperf):
> Without (6.0-rc1)       10482.7 Mbps
> With patch              17064.7 Mbps (62.7% improvement)
> 
> With the patch, the throughput improved by 62.7%.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>

Anyway
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  include/linux/memcontrol.h | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4d31ce55b1c0..70ae91188e16 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -354,10 +354,11 @@ struct mem_cgroup {
>  };
>  
>  /*
> - * size of first charge trial. "32" comes from vmscan.c's magic value.
> - * TODO: maybe necessary to use big numbers in big irons.
> + * size of first charge trial.
> + * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
> + * workload.
>   */
> -#define MEMCG_CHARGE_BATCH 32U
> +#define MEMCG_CHARGE_BATCH 64U
>  
>  extern struct mem_cgroup *root_mem_cgroup;
>  
> -- 
> 2.37.1.595.g718a3a8f04-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-22  4:55       ` Shakeel Butt
@ 2022-08-22 13:06         ` Soheil Hassas Yeganeh
  -1 siblings, 0 replies; 86+ messages in thread
From: Soheil Hassas Yeganeh @ 2022-08-22 13:06 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Michal Koutný,
	Eric Dumazet, Feng Tang, Oliver Sang, Andrew Morton, lkp,
	Cgroups, linux-mm, netdev, linux-kernel

On Mon, Aug 22, 2022 at 12:55 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Sun, Aug 21, 2022 at 5:24 PM Soheil Hassas Yeganeh <soheil@google.com> wrote:
> >
> > On Sun, Aug 21, 2022 at 8:18 PM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > With memcg v2 enabled, memcg->memory.usage is a very hot member for
> > > workloads doing memcg charging on multiple CPUs concurrently,
> > > particularly network-intensive workloads. In addition, there is
> > > false cache sharing between memory.usage and memory.high on the charge
> > > path. This patch moves 'usage' into a separate cacheline and moves all
> > > the read-mostly fields into another separate cacheline.
> > >
> > > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > > ran the following workload in a three level of cgroup hierarchy with top
> > > level having min and low setup appropriately. More specifically
> > > memory.min equal to size of netperf binary and memory.low double of
> > > that.
> > >
> > >  $ netserver -6
> > >  # 36 instances of netperf with following params
> > >  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> > >
> > > Results (average throughput of netperf):
> > > Without (6.0-rc1)       10482.7 Mbps
> > > With patch              12413.7 Mbps (18.4% improvement)
> > >
> > > With the patch, the throughput improved by 18.4%.
> >
> > Shakeel, for my understanding: is this on top of the gains from the
> > previous patch?
> >
>
> No, this is independent of the previous patch. The cover letter has
> the numbers for all three optimizations applied together.

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22 10:18       ` Michal Hocko
@ 2022-08-22 14:55         ` Shakeel Butt
  -1 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22 14:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML

On Mon, Aug 22, 2022 at 3:18 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 22-08-22 11:55:33, Michal Hocko wrote:
> > On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
> [...]
> > > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > > index eb156ff5d603..47711aa28161 100644
> > > --- a/mm/page_counter.c
> > > +++ b/mm/page_counter.c
> > > @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> > >                                   unsigned long usage)
> > >  {
> > >     unsigned long protected, old_protected;
> > > -   unsigned long low, min;
> > >     long delta;
> > >
> > >     if (!c->parent)
> > >             return;
> > >
> > > -   min = READ_ONCE(c->min);
> > > -   if (min || atomic_long_read(&c->min_usage)) {
> > > -           protected = min(usage, min);
> > > +   protected = min(usage, READ_ONCE(c->min));
> > > +   old_protected = atomic_long_read(&c->min_usage);
> > > +   if (protected != old_protected) {
> >
> > I have to cache that code back into brain. It is really subtle thing and
> > it is not really obvious why this is still correct. I will think about
> > that some more but the changelog could help with that a lot.
>
> OK, so this patch will be most useful when the min > 0 && min <
> usage because then the protection doesn't really change since the last
> call. In other words when the usage grows above the protection and your
> workload benefits from this change because that happens a lot as only a
> part of the workload is protected. Correct?

Yes, that is correct. I hope the experiment setup is clear now.

>
> Unless I have missed anything this shouldn't break the correctness but I
> still have to think about the proportional distribution of the
> protection because that adds to the complexity here.

The patch is not changing any semantics. It is just removing an
unnecessary atomic xchg() for a specific scenario (min > 0 && min <
usage). I don't think there will be any change related to proportional
distribution of the protection.
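
For reference, reconstructed from the hunk quoted above (so roughly, not
verbatim), the min side of propagate_protected_usage() after this patch
looks like the following; the low side is analogous:

      protected = min(usage, READ_ONCE(c->min));
      old_protected = atomic_long_read(&c->min_usage);
      if (protected != old_protected) {
          /*
           * Only pay for the atomic xchg when the effective protection
           * actually changed. For min > 0 && min < usage the value stays
           * pinned at 'min' across charges, so the common charge path
           * does just the plain atomic read above.
           */
          old_protected = atomic_long_xchg(&c->min_usage, protected);
          delta = protected - old_protected;
          if (delta)
              atomic_long_add(delta, &c->parent->min_usage);
      }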

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-22 10:23     ` Michal Hocko
@ 2022-08-22 15:06       ` Shakeel Butt
  -1 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22 15:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML

On Mon, Aug 22, 2022 at 3:23 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 22-08-22 00:17:36, Shakeel Butt wrote:
> > With memcg v2 enabled, memcg->memory.usage is a very hot member for
> > workloads doing memcg charging on multiple CPUs concurrently,
> > particularly network-intensive workloads. In addition, there is
> > false cache sharing between memory.usage and memory.high on the charge
> > path. This patch moves 'usage' into a separate cacheline and moves all
> > the read-mostly fields into another separate cacheline.
> >
> > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > ran the following workload in a three level of cgroup hierarchy with top
> > level having min and low setup appropriately. More specifically
> > memory.min equal to size of netperf binary and memory.low double of
> > that.
>
> Again the workload description is not particularly useful. I guess the
> only important aspect is the netserver part below and the number of CPUs
> because min and low setup doesn't have much to do with this, right? At
> least that is my reading of the memory.high mentioned above.
>

The experiment numbers below are for this patch alone, i.e. the
unnecessary min/low atomic xchg() is still happening for both setups. I
could run the experiment without setting min and low, but I wanted to
keep the setup exactly the same for all three optimizations.

This patch and the following perf numbers show only the impact of
removing false sharing in struct page_counter for memcg->memory on the
charging code path.

> >  $ netserver -6
> >  # 36 instances of netperf with following params
> >  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> >
> > Results (average throughput of netperf):
> > Without (6.0-rc1)     10482.7 Mbps
> > With patch            12413.7 Mbps (18.4% improvement)
> >
> > With the patch, the throughput improved by 18.4%.
> >
> > One side-effect of this patch is the increase in the size of struct
> > mem_cgroup. However for the performance improvement, this additional
> > size is worth it. In addition there are opportunities to reduce the size
> > of struct mem_cgroup like deprecation of kmem and tcpmem page counters
> > and better packing.
> >
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Reported-by: kernel test robot <oliver.sang@intel.com>
> > ---
> >  include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
> >  1 file changed, 23 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> > index 679591301994..8ce99bde645f 100644
> > --- a/include/linux/page_counter.h
> > +++ b/include/linux/page_counter.h
> > @@ -3,15 +3,27 @@
> >  #define _LINUX_PAGE_COUNTER_H
> >
> >  #include <linux/atomic.h>
> > +#include <linux/cache.h>
> >  #include <linux/kernel.h>
> >  #include <asm/page.h>
> >
> > +#if defined(CONFIG_SMP)
> > +struct pc_padding {
> > +     char x[0];
> > +} ____cacheline_internodealigned_in_smp;
> > +#define PC_PADDING(name)     struct pc_padding name
> > +#else
> > +#define PC_PADDING(name)
> > +#endif
> > +
> >  struct page_counter {
> > +     /*
> > +      * Make sure 'usage' does not share cacheline with any other field. The
> > +      * memcg->memory.usage is a hot member of struct mem_cgroup.
> > +      */
> > +     PC_PADDING(_pad1_);
>
> Why don't you simply require alignment for the structure?

I don't just want the alignment of the structure. I want different
fields of this structure to not share a cache line, more specifically
the 'high' and 'usage' fields. With this change, 'usage' will be on its
own cache line, the read-mostly fields will be on a separate cache
line, and the fields which sometimes get updated on the charge path
based on some condition will be on a different cache line from the
previous two.
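
As a sketch of that grouping (abbreviated and illustrative; padding-marker
names beyond _pad1_ and the exact field placement are assumptions, not the
literal patch):

  struct page_counter {
      PC_PADDING(_pad1_);     /* keep 'usage' off the preceding field's line */
      atomic_long_t usage;    /* hot: written on every charge/uncharge */
      PC_PADDING(_pad2_);

      /* conditionally updated on the charge path */
      atomic_long_t min_usage;
      atomic_long_t low_usage;
      unsigned long watermark;
      unsigned long failcnt;
      PC_PADDING(_pad3_);

      /* read-mostly limits consulted on the charge path */
      unsigned long min;
      unsigned long low;
      unsigned long high;
      unsigned long max;
      struct page_counter *parent;
  };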

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-22 10:47     ` Michal Hocko
@ 2022-08-22 15:09       ` Shakeel Butt
  -1 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22 15:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML

On Mon, Aug 22, 2022 at 3:47 AM Michal Hocko <mhocko@suse.com> wrote:
>
[...]
>
> > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > ran the following workload in a three level of cgroup hierarchy with top
> > level having min and low setup appropriately. More specifically
> > memory.min equal to size of netperf binary and memory.low double of
> > that.
>
> a similar feedback to the test case description as with other patches.

What more info should I add to the description? Why I set up min and
low, or something else?

> >
> >  $ netserver -6
> >  # 36 instances of netperf with following params
> >  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> >
> > Results (average throughput of netperf):
> > Without (6.0-rc1)       10482.7 Mbps
> > With patch              17064.7 Mbps (62.7% improvement)
> >
> > With the patch, the throughput improved by 62.7%.
> >
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Reported-by: kernel test robot <oliver.sang@intel.com>
>
> Anyway
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-22 15:06       ` Shakeel Butt
@ 2022-08-22 15:15         ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-22 15:15 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML

On Mon 22-08-22 08:06:14, Shakeel Butt wrote:
[...]
> > >  struct page_counter {
> > > +     /*
> > > +      * Make sure 'usage' does not share cacheline with any other field. The
> > > +      * memcg->memory.usage is a hot member of struct mem_cgroup.
> > > +      */
> > > +     PC_PADDING(_pad1_);
> >
> > Why don't you simply require alignment for the structure?
> 
> I don't just want the alignment of the structure. I want different
> fields of this structure to not share a cache line, more specifically
> the 'high' and 'usage' fields. With this change, 'usage' will be on its
> own cache line, the read-mostly fields will be on a separate cache
> line, and the fields which sometimes get updated on the charge path
> based on some condition will be on a different cache line from the
> previous two.

I do not follow. If you make an explicit alignment requirement for the
structure, then the first field in the structure will be guaranteed to
have that alignment, and you get the rest onto the other cache line by
adding padding behind that.
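
In other words, something along these lines (a sketch of the suggestion,
not a tested patch):

  struct page_counter {
      atomic_long_t usage;    /* first field: inherits the structure's alignment */
      PC_PADDING(_pad1_);     /* push everything else off usage's cacheline */
      /* ... remaining fields ... */
  } ____cacheline_internodealigned_in_smp;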

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22 14:55         ` Shakeel Butt
@ 2022-08-22 15:20           ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-22 15:20 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML

On Mon 22-08-22 07:55:58, Shakeel Butt wrote:
> On Mon, Aug 22, 2022 at 3:18 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 22-08-22 11:55:33, Michal Hocko wrote:
> > > On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
> > [...]
> > > > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > > > index eb156ff5d603..47711aa28161 100644
> > > > --- a/mm/page_counter.c
> > > > +++ b/mm/page_counter.c
> > > > @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> > > >                                   unsigned long usage)
> > > >  {
> > > >     unsigned long protected, old_protected;
> > > > -   unsigned long low, min;
> > > >     long delta;
> > > >
> > > >     if (!c->parent)
> > > >             return;
> > > >
> > > > -   min = READ_ONCE(c->min);
> > > > -   if (min || atomic_long_read(&c->min_usage)) {
> > > > -           protected = min(usage, min);
> > > > +   protected = min(usage, READ_ONCE(c->min));
> > > > +   old_protected = atomic_long_read(&c->min_usage);
> > > > +   if (protected != old_protected) {
> > >
> > > I have to cache that code back into brain. It is really subtle thing and
> > > it is not really obvious why this is still correct. I will think about
> > > that some more but the changelog could help with that a lot.
> >
> > OK, so this patch will be most useful when the min > 0 && min <
> > usage because then the protection doesn't really change since the last
> > call. In other words when the usage grows above the protection and your
> > workload benefits from this change because that happens a lot as only a
> > part of the workload is protected. Correct?
> 
> Yes, that is correct. I hope the experiment setup is clear now.

Maybe it is just me who took a bit to grasp it, but maybe we want to
save our future selves from going through that mental process again. So
please just be explicit about that in the changelog. It is really the
point that workloads exceeding the protection benefit the most that
would help in understanding this patch.

> > Unless I have missed anything this shouldn't break the correctness but I
> > still have to think about the proportional distribution of the
> > protection because that adds to the complexity here.
> 
> The patch is not changing any semantics. It is just removing an
> unnecessary atomic xchg() for a specific scenario (min > 0 && min <
> usage). I don't think there will be any change related to proportional
> distribution of the protection.

Yes, I suspect you are right. I just remembered previous fixes
like 503970e42325 ("mm: memcontrol: fix memory.low proportional
distribution") which just made me nervous that this is a tricky area.

I will have another look tomorrow with a fresh brain and send an ack.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-22 15:09       ` Shakeel Butt
@ 2022-08-22 15:22         ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-22 15:22 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML

On Mon 22-08-22 08:09:01, Shakeel Butt wrote:
> On Mon, Aug 22, 2022 at 3:47 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> [...]
> >
> > > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > > ran the following workload in a three level of cgroup hierarchy with top
> > > level having min and low setup appropriately. More specifically
> > > memory.min equal to size of netperf binary and memory.low double of
> > > that.
> >
> > a similar feedback to the test case description as with other patches.
> 
> What more info should I add to the description? Why I set up min and
> low, or something else?

I do see why you wanted to keep the test consistent over those three
patches. I would just drop the reference to the protection configuration
because it likely doesn't make much of an impact, does it? It is the
multi-CPU setup and false sharing that make the real difference. Or am
I wrong in assuming that?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-22 15:15         ` Michal Hocko
@ 2022-08-22 16:04           ` Shakeel Butt
  -1 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22 16:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML

On Mon, Aug 22, 2022 at 8:15 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 22-08-22 08:06:14, Shakeel Butt wrote:
> [...]
> > > >  struct page_counter {
> > > > +     /*
> > > > +      * Make sure 'usage' does not share cacheline with any other field. The
> > > > +      * memcg->memory.usage is a hot member of struct mem_cgroup.
> > > > +      */
> > > > +     PC_PADDING(_pad1_);
> > >
> > > Why don't you simply require alignment for the structure?
> >
> > I don't just want the alignment of the structure. I want different
> > fields of this structure to not share a cache line, more specifically
> > the 'high' and 'usage' fields. With this change, 'usage' will be on its
> > own cache line, the read-mostly fields will be on a separate cache
> > line, and the fields which sometimes get updated on the charge path
> > based on some condition will be on a different cache line from the
> > previous two.
>
> I do not follow. If you make an explicit alignment requirement for the
> structure, then the first field in the structure will be guaranteed to
> have that alignment, and you get the rest onto the other cache line by
> adding padding behind that.

Oh, you were talking specifically about _pad1_. Yes, we can remove it
and make the struct cacheline-aligned. I will do it in the next version.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22 15:20           ` Michal Hocko
@ 2022-08-22 16:06             ` Shakeel Butt
  -1 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22 16:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML

On Mon, Aug 22, 2022 at 8:20 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 22-08-22 07:55:58, Shakeel Butt wrote:
> > On Mon, Aug 22, 2022 at 3:18 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 22-08-22 11:55:33, Michal Hocko wrote:
> > > > On Mon 22-08-22 00:17:35, Shakeel Butt wrote:
> > > [...]
> > > > > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > > > > index eb156ff5d603..47711aa28161 100644
> > > > > --- a/mm/page_counter.c
> > > > > +++ b/mm/page_counter.c
> > > > > @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
> > > > >                                   unsigned long usage)
> > > > >  {
> > > > >     unsigned long protected, old_protected;
> > > > > -   unsigned long low, min;
> > > > >     long delta;
> > > > >
> > > > >     if (!c->parent)
> > > > >             return;
> > > > >
> > > > > -   min = READ_ONCE(c->min);
> > > > > -   if (min || atomic_long_read(&c->min_usage)) {
> > > > > -           protected = min(usage, min);
> > > > > +   protected = min(usage, READ_ONCE(c->min));
> > > > > +   old_protected = atomic_long_read(&c->min_usage);
> > > > > +   if (protected != old_protected) {
> > > >
> > > > I have to cache that code back into brain. It is really subtle thing and
> > > > it is not really obvious why this is still correct. I will think about
> > > > that some more but the changelog could help with that a lot.
> > >
> > > OK, so this patch will be most useful when the min > 0 && min <
> > > usage because then the protection doesn't really change since the last
> > > call. In other words when the usage grows above the protection and your
> > > workload benefits from this change because that happens a lot as only a
> > > part of the workload is protected. Correct?
> >
> > Yes, that is correct. I hope the experiment setup is clear now.
>
> Maybe it is just me who took a bit to grasp it, but maybe we want to
> save our future selves from going through that mental process again. So
> please just be explicit about that in the changelog. It is really the
> point that workloads exceeding the protection benefit the most that
> would help in understanding this patch.
>

I will add more detail in the commit message in the next version.

> > > Unless I have missed anything this shouldn't break the correctness but I
> > > still have to think about the proportional distribution of the
> > > protection because that adds to the complexity here.
> >
> > The patch is not changing any semantics. It is just removing an
> > unnecessary atomic xchg() for a specific scenario (min > 0 && min <
> > usage). I don't think there will be any change related to proportional
> > distribution of the protection.
>
> Yes, I suspect you are right. I just remembered previous fixes
> like 503970e42325 ("mm: memcontrol: fix memory.low proportional
> distribution") which just made me nervous that this is a tricky area.
>
> I will have another look tomorrow with a fresh brain and send an ack.

I will wait for your ack before sending the next version.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-22 15:22         ` Michal Hocko
  (?)
@ 2022-08-22 16:07           ` Shakeel Butt
  -1 siblings, 0 replies; 86+ messages in thread
From: Shakeel Butt @ 2022-08-22 16:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML

On Mon, Aug 22, 2022 at 8:22 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 22-08-22 08:09:01, Shakeel Butt wrote:
> > On Mon, Aug 22, 2022 at 3:47 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > [...]
> > >
> > > > To evaluate the impact of this optimization, on a 72-CPU machine, we
> > > > ran the following workload in a three-level cgroup hierarchy with the
> > > > top level having min and low set up appropriately. More specifically,
> > > > memory.min was set to the size of the netperf binary and memory.low to
> > > > double that.
> > >
> > > Similar feedback on the test case description as with the other patches.
> >
> > What more info should I add to the description? Why I set up min and
> > low, or something else?
>
> I do see why you wanted to keep the test consistent over those three
> patches. I would just drop the reference to the protection configuration
> because it likely doesn't make much of an impact, does it? It is the
> multi-CPU setup and false sharing that make the real difference. Or am
> I wrong in assuming that?
>

No, you are correct. I will clean up the commit message in the next version.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22  0:17   ` Shakeel Butt
  (?)
@ 2022-08-22 18:23     ` Roman Gushchin
  -1 siblings, 0 replies; 86+ messages in thread
From: Roman Gushchin @ 2022-08-22 18:23 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon, Aug 22, 2022 at 12:17:35AM +0000, Shakeel Butt wrote:
> For cgroups using low or min protections, the function
> propagate_protected_usage() was doing an atomic xchg() operation
> unconditionally. It only needs to do that operation if the new value of
> the protection is different from the old one. This patch does that.
> 
> To evaluate the impact of this optimization, on a 72-CPU machine, we
> ran the following workload in a three-level cgroup hierarchy with the
> top level having min and low set up appropriately. More specifically,
> memory.min was set to the size of the netperf binary and memory.low to
> double that.
> 
>  $ netserver -6
>  # 36 instances of netperf with following params
>  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> 
> Results (average throughput of netperf):
> Without (6.0-rc1)	10482.7 Mbps
> With patch		14542.5 Mbps (38.7% improvement)
> 
> With the patch, the throughput improved by 38.7%.

Nice savings!

> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> ---
>  mm/page_counter.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index eb156ff5d603..47711aa28161 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
>  				      unsigned long usage)
>  {
>  	unsigned long protected, old_protected;
> -	unsigned long low, min;
>  	long delta;
>  
>  	if (!c->parent)
>  		return;
>  
> -	min = READ_ONCE(c->min);
> -	if (min || atomic_long_read(&c->min_usage)) {
> -		protected = min(usage, min);
> +	protected = min(usage, READ_ONCE(c->min));
> +	old_protected = atomic_long_read(&c->min_usage);
> +	if (protected != old_protected) {
>  		old_protected = atomic_long_xchg(&c->min_usage, protected);
>  		delta = protected - old_protected;
>  		if (delta)
>  			atomic_long_add(delta, &c->parent->children_min_usage);

What if there is a concurrent update of c->min_usage? Then the patched version
can miss an update. I can't imagine a case where it would lead to bad consequences,
so it's probably ok, but it is not super obvious.
I think the way to think of it is that a missed update will be fixed by the next
one, so it's ok to run with old numbers for some time.

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

Thanks!

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-22 16:04           ` Shakeel Butt
  (?)
@ 2022-08-22 18:27             ` Roman Gushchin
  -1 siblings, 0 replies; 86+ messages in thread
From: Roman Gushchin @ 2022-08-22 18:27 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Johannes Weiner, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML

On Mon, Aug 22, 2022 at 09:04:59AM -0700, Shakeel Butt wrote:
> On Mon, Aug 22, 2022 at 8:15 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 22-08-22 08:06:14, Shakeel Butt wrote:
> > [...]
> > > > >  struct page_counter {
> > > > > +     /*
> > > > > +      * Make sure 'usage' does not share cacheline with any other field. The
> > > > > +      * memcg->memory.usage is a hot member of struct mem_cgroup.
> > > > > +      */
> > > > > +     PC_PADDING(_pad1_);
> > > >
> > > > Why don't you simply require alignment for the structure?
> > >
> > > I don't just want the alignment of the structure. I want different
> > > fields of this structure to not share a cache line, more
> > > specifically the 'high' and 'usage' fields. With this change the usage
> > > will be on its own cache line, the read-mostly fields will be on a
> > > separate cache line, and the fields which sometimes get updated on the
> > > charge path based on some condition will be on a different cache line
> > > from the previous two.
> >
> > I do not follow. If you make an explicit alignment requirement for the
> > structure, then the first field in the structure will be guaranteed to
> > have that alignment, and you can push the rest onto the other cache
> > line by adding padding behind it.
> 
> Oh, you were talking specifically about _pad1_; yes, we can remove it
> and make the struct cacheline-aligned. I will do it in the next version.

Yes, please, it caught my eye too.
With this change:
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

Also, can you please include the numbers on the additional memory overhead?
I think it's still worth it; I just think we need to include them for the record.
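As a rough illustration of the layout being discussed, here is a userspace
sketch that separates a hot counter from read-mostly fields with explicit
cache-line alignment. The grouping and field names are hypothetical and a
64-byte cache line is assumed; this is not the actual struct page_counter.

/*
 * Userspace sketch: keep a hot, frequently written counter on its own
 * cache line, away from read-mostly fields. Illustrative layout only.
 */
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

#define CACHELINE 64

struct counter_layout {
	/* written on every charge/uncharge: keep it alone */
	_Alignas(CACHELINE) atomic_long usage;

	/* read-mostly configuration starts on the next cache line */
	_Alignas(CACHELINE) unsigned long min;
	unsigned long low;
	unsigned long high;
	unsigned long max;
};

int main(void)
{
	size_t off_usage = offsetof(struct counter_layout, usage);
	size_t off_min = offsetof(struct counter_layout, min);

	printf("usage at offset %zu, min at offset %zu\n", off_usage, off_min);
	printf("same cache line: %s\n",
	       off_usage / CACHELINE == off_min / CACHELINE ? "yes" : "no");
	return 0;
}

Aligning the structure itself and padding behind the first field, as suggested
above, gives the same effect without a leading pad member.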

Thanks!

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-22  0:17   ` Shakeel Butt
  (?)
@ 2022-08-22 18:37     ` Roman Gushchin
  -1 siblings, 0 replies; 86+ messages in thread
From: Roman Gushchin @ 2022-08-22 18:37 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Michal Hocko, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon, Aug 22, 2022 at 12:17:37AM +0000, Shakeel Butt wrote:
> For several years, MEMCG_CHARGE_BATCH was kept at 32, but with bigger
> machines and network-intensive workloads requiring throughput in
> Gbps, 32 is too small and makes the memcg charging path a bottleneck.
> For now, increase it to 64 for easy acceptance into 6.0. We will need to
> revisit this in the future for the ever-increasing demand for higher performance.
> 
> Please note that the memcg charge path drains the per-cpu memcg charge
> stock, so there should not be any OOM behavior change.
> 
> To evaluate the impact of this optimization, on a 72-CPU machine, we
> ran the following workload in a three-level cgroup hierarchy with the
> top level having min and low set up appropriately. More specifically,
> memory.min was set to the size of the netperf binary and memory.low to
> double that.
> 
>  $ netserver -6
>  # 36 instances of netperf with following params
>  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> 
> Results (average throughput of netperf):
> Without (6.0-rc1)       10482.7 Mbps
> With patch              17064.7 Mbps (62.7% improvement)
> 
> With the patch, the throughput improved by 62.7%.

This is pretty significant!

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

I wonder only if we want to make it configurable (Idk a sysctl or maybe
a config option) and close the topic.
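For context, here is a simplified, single-threaded userspace sketch of the
per-CPU stock idea that MEMCG_CHARGE_BATCH feeds into. It is illustrative
only and does not mirror consume_stock()/refill_stock() in mm/memcontrol.c;
it just shows why a larger batch means fewer updates of the shared counter.

/*
 * Simplified userspace sketch of per-CPU charge stocking. Assumes each
 * request is at most CHARGE_BATCH pages. Illustrative only.
 */
#include <stdatomic.h>
#include <stdio.h>

#define CHARGE_BATCH 64		/* pages pre-charged into the local stock */

static atomic_long shared_usage;	/* stands in for memcg->memory.usage */
static atomic_long atomic_updates;	/* how often the shared counter is touched */

struct cpu_stock {
	long cached_pages;		/* pages already charged but not yet used */
};

static void charge_pages(struct cpu_stock *stock, long nr_pages)
{
	if (stock->cached_pages < nr_pages) {
		/* stock exhausted: charge a whole batch to the shared counter */
		atomic_fetch_add(&shared_usage, CHARGE_BATCH);
		atomic_fetch_add(&atomic_updates, 1);
		stock->cached_pages += CHARGE_BATCH;
	}
	stock->cached_pages -= nr_pages;
}

int main(void)
{
	struct cpu_stock stock = { 0 };

	for (int i = 0; i < 1000; i++)
		charge_pages(&stock, 1);	/* 1000 single-page charges */

	printf("shared counter updates: %ld\n", atomic_load(&atomic_updates));
	printf("charged (pages):        %ld\n", atomic_load(&shared_usage));
	return 0;
}

With a batch of 64 the 1000 charges touch the shared counter only 16 times;
halving the batch roughly doubles that, which is the accuracy/performance
trade-off being weighed here.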

Thanks!

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-22 18:37     ` Roman Gushchin
  (?)
@ 2022-08-22 19:34       ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-22 19:34 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, Johannes Weiner, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon 22-08-22 11:37:30, Roman Gushchin wrote:
[...]
> I wonder only if we want to make it configurable (Idk a sysctl or maybe
> a config option) and close the topic.

I do not think this is a good idea. We have other examples where we have
outsourced internal tuning to userspace and it has mostly proven
impractical and, long term, more problematic than useful (e.g.
lowmem_reserve_ratio, percpu_pagelist_high_fraction, swappiness, just to
name some that come to my mind). I have more often seen these used
incorrectly than usefully.

In this case, I guess we should consider either moving to per-memcg
charge batching and seeing whether the pcp overhead x memcg_count is
worth it, or some automagic tuning of the batch size depending on how
effectively the batch is used. Certainly a lot of room for
experimenting.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-22 19:34       ` Michal Hocko
  (?)
@ 2022-08-23  2:22         ` Roman Gushchin
  -1 siblings, 0 replies; 86+ messages in thread
From: Roman Gushchin @ 2022-08-23  2:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shakeel Butt, Johannes Weiner, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon, Aug 22, 2022 at 09:34:59PM +0200, Michal Hocko wrote:
> On Mon 22-08-22 11:37:30, Roman Gushchin wrote:
> [...]
> > I wonder only if we want to make it configurable (Idk a sysctl or maybe
> > a config option) and close the topic.
> 
> I do not think this is a good idea. We have other examples where we have
> outsourced internal tuning to userspace and it has mostly proven
> impractical and, long term, more problematic than useful (e.g.
> lowmem_reserve_ratio, percpu_pagelist_high_fraction, swappiness, just to
> name some that come to my mind). I have more often seen these used
> incorrectly than usefully.

I agree, not a strong opinion here. But I wonder if somebody will
complain about Shakeel's change because of the reduced accuracy.
I know some users are using memory cgroups to track the size of various
workloads (including relatively small ones), and the 32->64 pages per CPU
change can be noticeable for them. But we can wait for an actual bug report :)

> 
> In this case, I guess we should consider either moving to per-memcg
> charge batching and seeing whether the pcp overhead x memcg_count is
> worth it, or some automagic tuning of the batch size depending on how
> effectively the batch is used. Certainly a lot of room for
> experimenting.

I'm not a big believer in the automagic tuning here because it's a fundamental
trade-off of accuracy vs. performance, and various users might make a different
choice depending on their needs, not on the CPU count or something else.

Per-memcg batching sounds interesting though. For example, we can likely
batch updates on leaf cgroups and have a single atomic update instead of
multiple ones most of the time. Or do you mean something different?
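A minimal userspace sketch of that "batch at the leaf" idea, purely as an
illustration (this is not how memcontrol.c works today): charges accumulate
locally on a leaf and are propagated up the hierarchy with one atomic add per
ancestor once a batch worth of pages has built up.

/*
 * Illustrative model of leaf-side batching. Field names and the batch
 * threshold are made up for the sketch.
 */
#include <stdatomic.h>
#include <stdio.h>

#define LEAF_BATCH 64

struct cgroup {
	struct cgroup *parent;
	atomic_long usage;	/* shared, hierarchy-visible usage */
	long pending;		/* leaf-local, not yet propagated */
};

static void flush_pending(struct cgroup *cg)
{
	long delta = cg->pending;

	cg->pending = 0;
	for (struct cgroup *c = cg; c; c = c->parent)
		atomic_fetch_add(&c->usage, delta);	/* one atomic per level */
}

static void charge(struct cgroup *leaf, long nr_pages)
{
	leaf->pending += nr_pages;
	if (leaf->pending >= LEAF_BATCH)
		flush_pending(leaf);
}

int main(void)
{
	struct cgroup root = { 0 };
	struct cgroup mid  = { .parent = &root };
	struct cgroup leaf = { .parent = &mid };

	for (int i = 0; i < 1000; i++)
		charge(&leaf, 1);
	flush_pending(&leaf);		/* final flush of the remainder */

	printf("root usage: %ld\n", atomic_load(&root.usage));
	return 0;
}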

Thanks!

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-23  2:22         ` Roman Gushchin
@ 2022-08-23  4:49           ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-23  4:49 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, Johannes Weiner, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, cgroups, linux-mm, netdev, linux-kernel

On Mon 22-08-22 19:22:26, Roman Gushchin wrote:
> On Mon, Aug 22, 2022 at 09:34:59PM +0200, Michal Hocko wrote:
> > On Mon 22-08-22 11:37:30, Roman Gushchin wrote:
> > [...]
> > > I wonder only if we want to make it configurable (Idk a sysctl or maybe
> > > a config option) and close the topic.
> > 
> > I do not think this is a good idea. We have other examples where we have
> > outsourced internal tuning to userspace and it has mostly proven
> > impractical and, long term, more problematic than useful (e.g.
> > lowmem_reserve_ratio, percpu_pagelist_high_fraction, swappiness, just to
> > name some that come to my mind). I have more often seen these used
> > incorrectly than usefully.
> 
> I agree, not a strong opinion here. But I wonder if somebody will
> complain about Shakeel's change because of the reduced accuracy.
> I know some users are using memory cgroups to track the size of various
> workloads (including relatively small ones), and the 32->64 pages per CPU
> change can be noticeable for them. But we can wait for an actual bug report :)

Yes, that would be my approach. I have seen reports like that already,
but that was mostly because of heavy caching on the SLUB side on older
kernels. So there surely are workloads with small limits configured
(e.g. 20MB). On the other hand, those users were receptive to adapting
their limits as they were kinda arbitrary anyway.
 
> > In this case, I guess we should consider either moving to per-memcg
> > charge batching and seeing whether the pcp overhead x memcg_count is
> > worth it, or some automagic tuning of the batch size depending on how
> > effectively the batch is used. Certainly a lot of room for
> > experimenting.
> 
> I'm not a big believer in the automagic tuning here because it's a fundamental
> trade-off of accuracy vs. performance, and various users might make a different
> choice depending on their needs, not on the CPU count or something else.

Yes, this is not an easy thing to get right. I was mostly thinking of some
auto-scaling based on the limit size, or growing the stock if cache hits
are common and shrinking it when stocks get flushed often because multiple
memcgs compete over the same pcp stock. But to me it seems like a per-memcg
approach might lead to better results without too many heuristics
(albeit more memory hungry).
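A purely hypothetical sketch of the auto-scaling idea above, with made-up
thresholds and bounds, just to make it concrete: grow the batch while the
stock keeps getting hits, shrink it when the stock is flushed often.

/* Hypothetical heuristic; not kernel code, numbers chosen for illustration. */
#include <stdio.h>

#define BATCH_MIN	 8
#define BATCH_MAX	256

struct stock_stats {
	unsigned long hits;	/* charges served from the cached stock */
	unsigned long flushes;	/* stock dropped because another memcg took the slot */
	unsigned int  batch;	/* current batch size, in pages */
};

static void rescale_batch(struct stock_stats *s)
{
	if (s->hits > 4 * s->flushes && s->batch < BATCH_MAX)
		s->batch *= 2;			/* stock is effective: grow */
	else if (s->flushes > s->hits && s->batch > BATCH_MIN)
		s->batch /= 2;			/* heavy competition: shrink */
	s->hits = s->flushes = 0;		/* start a new sampling window */
}

int main(void)
{
	struct stock_stats s = { .batch = 32 };

	s.hits = 1000; s.flushes = 10;
	rescale_batch(&s);
	printf("after a hit-heavy window:   batch = %u\n", s.batch);	/* 64 */

	s.hits = 5; s.flushes = 50;
	rescale_batch(&s);
	printf("after a flush-heavy window: batch = %u\n", s.batch);	/* 32 */
	return 0;
}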

> Per-memcg batching sounds interesting though. For example, we can likely
> batch updates on leaf cgroups and have a single atomic update instead of
> multiple most of the times. Or do you mean something different?

No, that was exactly my thinking as well.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-22 15:20           ` Michal Hocko
  (?)
@ 2022-08-23  9:42             ` Michal Hocko
  -1 siblings, 0 replies; 86+ messages in thread
From: Michal Hocko @ 2022-08-23  9:42 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Roman Gushchin, Muchun Song, Michal Koutný,
	Eric Dumazet, Soheil Hassas Yeganeh, Feng Tang, Oliver Sang,
	Andrew Morton, lkp, Cgroups, Linux MM, netdev, LKML

On Mon 22-08-22 17:20:02, Michal Hocko wrote:
> On Mon 22-08-22 07:55:58, Shakeel Butt wrote:
> > On Mon, Aug 22, 2022 at 3:18 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > Unless I have missed anything this shouldn't break the correctness but I
> > > still have to think about the proportional distribution of the
> > > protection because that adds to the complexity here.
> > 
> > The patch is not changing any semantics. It is just removing an
> > unnecessary atomic xchg() for a specific scenario (min > 0 && min <
> > usage). I don't think there will be any change related to proportional
> > distribution of the protection.
> 
> Yes, I suspect you are right. I just remembered previous fixes
> like 503970e42325 ("mm: memcontrol: fix memory.low proportional
> distribution") which just made me nervous that this is a tricky area.
> 
> I will have another look tomorrow with a fresh brain and send an ack.

I cannot spot any problem. But I guess it would be good to have a little
comment to explain that races on the min_usage update (mentioned by Roman)
are acceptable and the savings on the atomic update are preferred.

The worst case I can imagine would be something like an uncharge of 4kB racing
with a charge of 2MB. The first reduces the protection (min_usage) while the
other one misses that update and doesn't increase it. But even then the effect
shouldn't be really large. At least I have a hard time imagining this would
throw things off too much.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 86+ messages in thread

end of thread, other threads:[~2022-08-23 12:24 UTC | newest]

Thread overview: 86+ messages
2022-08-22  0:17 [PATCH 0/3] memcg: optimizatize charge codepath Shakeel Butt
2022-08-22  0:17 ` Shakeel Butt
2022-08-22  0:17 ` Shakeel Butt
2022-08-22  0:17 ` [PATCH 1/3] mm: page_counter: remove unneeded atomic ops for low/min Shakeel Butt
2022-08-22  0:17   ` Shakeel Butt
2022-08-22  0:17   ` Shakeel Butt
2022-08-22  0:20   ` Soheil Hassas Yeganeh
2022-08-22  0:20     ` Soheil Hassas Yeganeh
2022-08-22  0:20     ` Soheil Hassas Yeganeh
2022-08-22  2:39   ` Feng Tang
2022-08-22  2:39     ` Feng Tang
2022-08-22  2:39     ` Feng Tang
2022-08-22  9:55   ` Michal Hocko
2022-08-22  9:55     ` Michal Hocko
2022-08-22 10:18     ` Michal Hocko
2022-08-22 10:18       ` Michal Hocko
2022-08-22 10:18       ` Michal Hocko
2022-08-22 14:55       ` Shakeel Butt
2022-08-22 14:55         ` Shakeel Butt
2022-08-22 15:20         ` Michal Hocko
2022-08-22 15:20           ` Michal Hocko
2022-08-22 16:06           ` Shakeel Butt
2022-08-22 16:06             ` Shakeel Butt
2022-08-22 16:06             ` Shakeel Butt
2022-08-23  9:42           ` Michal Hocko
2022-08-23  9:42             ` Michal Hocko
2022-08-23  9:42             ` Michal Hocko
2022-08-22 18:23   ` Roman Gushchin
2022-08-22 18:23     ` Roman Gushchin
2022-08-22 18:23     ` Roman Gushchin
2022-08-22  0:17 ` [PATCH 2/3] mm: page_counter: rearrange struct page_counter fields Shakeel Butt
2022-08-22  0:17   ` Shakeel Butt
2022-08-22  0:24   ` Soheil Hassas Yeganeh
2022-08-22  0:24     ` Soheil Hassas Yeganeh
2022-08-22  0:24     ` Soheil Hassas Yeganeh
2022-08-22  4:55     ` Shakeel Butt
2022-08-22  4:55       ` Shakeel Butt
2022-08-22 13:06       ` Soheil Hassas Yeganeh
2022-08-22 13:06         ` Soheil Hassas Yeganeh
2022-08-22  2:10   ` Feng Tang
2022-08-22  2:10     ` Feng Tang
2022-08-22  2:10     ` Feng Tang
2022-08-22  4:59     ` Shakeel Butt
2022-08-22  4:59       ` Shakeel Butt
2022-08-22  4:59       ` Shakeel Butt
2022-08-22 10:23   ` Michal Hocko
2022-08-22 10:23     ` Michal Hocko
2022-08-22 10:23     ` Michal Hocko
2022-08-22 15:06     ` Shakeel Butt
2022-08-22 15:06       ` Shakeel Butt
2022-08-22 15:15       ` Michal Hocko
2022-08-22 15:15         ` Michal Hocko
2022-08-22 15:15         ` Michal Hocko
2022-08-22 16:04         ` Shakeel Butt
2022-08-22 16:04           ` Shakeel Butt
2022-08-22 18:27           ` Roman Gushchin
2022-08-22 18:27             ` Roman Gushchin
2022-08-22 18:27             ` Roman Gushchin
2022-08-22  0:17 ` [PATCH 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64 Shakeel Butt
2022-08-22  0:17   ` Shakeel Butt
2022-08-22  0:24   ` Soheil Hassas Yeganeh
2022-08-22  0:24     ` Soheil Hassas Yeganeh
2022-08-22  0:24     ` Soheil Hassas Yeganeh
2022-08-22  2:30   ` Feng Tang
2022-08-22  2:30     ` Feng Tang
2022-08-22 10:47   ` Michal Hocko
2022-08-22 10:47     ` Michal Hocko
2022-08-22 10:47     ` Michal Hocko
2022-08-22 15:09     ` Shakeel Butt
2022-08-22 15:09       ` Shakeel Butt
2022-08-22 15:22       ` Michal Hocko
2022-08-22 15:22         ` Michal Hocko
2022-08-22 16:07         ` Shakeel Butt
2022-08-22 16:07           ` Shakeel Butt
2022-08-22 16:07           ` Shakeel Butt
2022-08-22 18:37   ` Roman Gushchin
2022-08-22 18:37     ` Roman Gushchin
2022-08-22 18:37     ` Roman Gushchin
2022-08-22 19:34     ` Michal Hocko
2022-08-22 19:34       ` Michal Hocko
2022-08-22 19:34       ` Michal Hocko
2022-08-23  2:22       ` Roman Gushchin
2022-08-23  2:22         ` Roman Gushchin
2022-08-23  2:22         ` Roman Gushchin
2022-08-23  4:49         ` Michal Hocko
2022-08-23  4:49           ` Michal Hocko
