oe-lkp.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/3] memcg: optimize charge codepath
@ 2022-08-25  0:05 Shakeel Butt
  2022-08-25  0:05 ` [PATCH v2 1/3] mm: page_counter: remove unneeded atomic ops for low/min Shakeel Butt
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Shakeel Butt @ 2022-08-25  0:05 UTC (permalink / raw)
  To: lkp

[-- Attachment #1: Type: text/plain, Size: 2079 bytes --]

Recently Linux networking stack has moved from a very old per socket
pre-charge caching to per-cpu caching to avoid pre-charge fragmentation
and unwarranted OOMs. One impact of this change is that for network
traffic workloads, memcg charging codepath can become a bottleneck. The
kernel test robot has also reported this regression[1]. This patch
series tries to improve the memcg charging for such workloads.

This patch series implement three optimizations:
(A) Reduce atomic ops in page counter update path.
(B) Change layout of struct page_counter to eliminate false sharing
    between usage and high.
(C) Increase the memcg charge batch to 64.

To evaluate the impact of these optimizations, on a 72 CPUs machine, we
ran the following workload in root memcg and then compared with scenario
where the workload is run in a three level of cgroup hierarchy with top
level having min and low setup appropriately.

 $ netserver -6
 # 36 instances of netperf with following params
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
1. root memcg		21694.8 Mbps
2. 6.0-rc1		10482.7 Mbps (-51.6%)
3. 6.0-rc1 + (A)	14542.5 Mbps (-32.9%)
4. 6.0-rc1 + (B)	12413.7 Mbps (-42.7%)
5. 6.0-rc1 + (C)	17063.7 Mbps (-21.3%)
6. 6.0-rc1 + (A+B+C)	20120.3 Mbps (-7.2%)

With all three optimizations, the memcg overhead of this workload has
been reduced from 51.6% to just 7.2%.

[1] https://lore.kernel.org/linux-mm/20220619150456.GB34471(a)xsang-OptiPlex-9020/

Changes since v1:
- Commit message updates
- Instead of explicit padding add align compiler option with struct

Shakeel Butt (3):
  mm: page_counter: remove unneeded atomic ops for low/min
  mm: page_counter: rearrange struct page_counter fields
  memcg: increase MEMCG_CHARGE_BATCH to 64

 include/linux/memcontrol.h   |  7 ++++---
 include/linux/page_counter.h | 34 +++++++++++++++++++++++-----------
 mm/page_counter.c            | 13 ++++++-------
 3 files changed, 33 insertions(+), 21 deletions(-)

-- 
2.37.1.595.g718a3a8f04-goog

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-25  0:05 [PATCH v2 0/3] memcg: optimize charge codepath Shakeel Butt
@ 2022-08-25  0:05 ` Shakeel Butt
  2022-08-25  6:43   ` Michal Hocko
  2022-08-25  0:05 ` [PATCH v2 2/3] mm: page_counter: rearrange struct page_counter fields Shakeel Butt
  2022-08-25  0:05 ` [PATCH v2 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64 Shakeel Butt
  2 siblings, 1 reply; 13+ messages in thread
From: Shakeel Butt @ 2022-08-25  0:05 UTC (permalink / raw)
  To: lkp

[-- Attachment #1: Type: text/plain, Size: 2993 bytes --]

For cgroups using low or min protections, the function
propagate_protected_usage() was doing an atomic xchg() operation
irrespectively. We can optimize out this atomic operation for one
specific scenario where the workload is using the protection (i.e.
min > 0) and the usage is above the protection (i.e. usage > min).

This scenario is actually very common where the users want a part of
their workload to be protected against the external reclaim. Though this
optimization does introduce a race when the usage is around the
protection and concurrent charges and uncharged trip it over or under
the protection. In such cases, we might see lower effective protection
but the subsequent charge/uncharge will correct it.

To evaluate the impact of this optimization, on a 72 CPUs machine, we
ran the following workload in a three level of cgroup hierarchy with top
level having min and low setup appropriately to see if this optimization
is effective for the mentioned case.

 $ netserver -6
 # 36 instances of netperf with following params
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1)	10482.7 Mbps
With patch		14542.5 Mbps (38.7% improvement)

With the patch, the throughput improved by 38.7%

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
---
Changes since v1:
- Commit message update with more detail on which scenario is getting
  optimized and possible race condition.

 mm/page_counter.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/page_counter.c b/mm/page_counter.c
index eb156ff5d603..47711aa28161 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
 				      unsigned long usage)
 {
 	unsigned long protected, old_protected;
-	unsigned long low, min;
 	long delta;
 
 	if (!c->parent)
 		return;
 
-	min = READ_ONCE(c->min);
-	if (min || atomic_long_read(&c->min_usage)) {
-		protected = min(usage, min);
+	protected = min(usage, READ_ONCE(c->min));
+	old_protected = atomic_long_read(&c->min_usage);
+	if (protected != old_protected) {
 		old_protected = atomic_long_xchg(&c->min_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
 			atomic_long_add(delta, &c->parent->children_min_usage);
 	}
 
-	low = READ_ONCE(c->low);
-	if (low || atomic_long_read(&c->low_usage)) {
-		protected = min(usage, low);
+	protected = min(usage, READ_ONCE(c->low));
+	old_protected = atomic_long_read(&c->low_usage);
+	if (protected != old_protected) {
 		old_protected = atomic_long_xchg(&c->low_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
-- 
2.37.1.595.g718a3a8f04-goog

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-25  0:05 [PATCH v2 0/3] memcg: optimize charge codepath Shakeel Butt
  2022-08-25  0:05 ` [PATCH v2 1/3] mm: page_counter: remove unneeded atomic ops for low/min Shakeel Butt
@ 2022-08-25  0:05 ` Shakeel Butt
  2022-08-25  0:33   ` Andrew Morton
  2022-08-25  6:47   ` Michal Hocko
  2022-08-25  0:05 ` [PATCH v2 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64 Shakeel Butt
  2 siblings, 2 replies; 13+ messages in thread
From: Shakeel Butt @ 2022-08-25  0:05 UTC (permalink / raw)
  To: lkp

[-- Attachment #1: Type: text/plain, Size: 3458 bytes --]

With memcg v2 enabled, memcg->memory.usage is a very hot member for
the workloads doing memcg charging on multiple CPUs concurrently.
Particularly the network intensive workloads. In addition, there is a
false cache sharing between memory.usage and memory.high on the charge
path. This patch moves the usage into a separate cacheline and move all
the read most fields into separate cacheline.

To evaluate the impact of this optimization, on a 72 CPUs machine, we
ran the following workload in a three level of cgroup hierarchy.

 $ netserver -6
 # 36 instances of netperf with following params
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1)	10482.7 Mbps
With patch		12413.7 Mbps (18.4% improvement)

With the patch, the throughput improved by 18.4%.

One side-effect of this patch is the increase in the size of struct
mem_cgroup. For example with this patch on 64 bit build, the size of
struct mem_cgroup increased from 4032 bytes to 4416 bytes. However for
the performance improvement, this additional size is worth it. In
addition there are opportunities to reduce the size of struct
mem_cgroup like deprecation of kmem and tcpmem page counters and
better packing.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
---
Changes since v1:
- Updated the commit message
- Make struct page_counter cache align.

 include/linux/page_counter.h | 35 +++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 679591301994..78a1c934e416 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -3,15 +3,26 @@
 #define _LINUX_PAGE_COUNTER_H
 
 #include <linux/atomic.h>
+#include <linux/cache.h>
 #include <linux/kernel.h>
 #include <asm/page.h>
 
+#if defined(CONFIG_SMP)
+struct pc_padding {
+	char x[0];
+} ____cacheline_internodealigned_in_smp;
+#define PC_PADDING(name)	struct pc_padding name
+#else
+#define PC_PADDING(name)
+#endif
+
 struct page_counter {
+	/*
+	 * Make sure 'usage' does not share cacheline with any other field. The
+	 * memcg->memory.usage is a hot member of struct mem_cgroup.
+	 */
 	atomic_long_t usage;
-	unsigned long min;
-	unsigned long low;
-	unsigned long high;
-	unsigned long max;
+	PC_PADDING(_pad1_);
 
 	/* effective memory.min and memory.min usage tracking */
 	unsigned long emin;
@@ -23,18 +34,18 @@ struct page_counter {
 	atomic_long_t low_usage;
 	atomic_long_t children_low_usage;
 
-	/* legacy */
 	unsigned long watermark;
 	unsigned long failcnt;
 
-	/*
-	 * 'parent' is placed here to be far from 'usage' to reduce
-	 * cache false sharing, as 'usage' is written mostly while
-	 * parent is frequently read for cgroup's hierarchical
-	 * counting nature.
-	 */
+	/* Keep all the read most fields in a separete cacheline. */
+	PC_PADDING(_pad2_);
+
+	unsigned long min;
+	unsigned long low;
+	unsigned long high;
+	unsigned long max;
 	struct page_counter *parent;
-};
+} ____cacheline_internodealigned_in_smp;
 
 #if BITS_PER_LONG == 32
 #define PAGE_COUNTER_MAX LONG_MAX
-- 
2.37.1.595.g718a3a8f04-goog

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-25  0:05 [PATCH v2 0/3] memcg: optimize charge codepath Shakeel Butt
  2022-08-25  0:05 ` [PATCH v2 1/3] mm: page_counter: remove unneeded atomic ops for low/min Shakeel Butt
  2022-08-25  0:05 ` [PATCH v2 2/3] mm: page_counter: rearrange struct page_counter fields Shakeel Butt
@ 2022-08-25  0:05 ` Shakeel Butt
  2022-08-25  6:49   ` Michal Hocko
  2022-08-25  8:30   ` Muchun Song
  2 siblings, 2 replies; 13+ messages in thread
From: Shakeel Butt @ 2022-08-25  0:05 UTC (permalink / raw)
  To: lkp

[-- Attachment #1: Type: text/plain, Size: 2080 bytes --]

For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger
machines and the network intensive workloads requiring througput in
Gbps, 32 is too small and makes the memcg charging path a bottleneck.
For now, increase it to 64 for easy acceptance to 6.0. We will need to
revisit this in future for ever increasing demand of higher performance.

Please note that the memcg charge path drain the per-cpu memcg charge
stock, so there should not be any oom behavior change. Though it does
have impact on rstat flushing and high limit reclaim backoff.

To evaluate the impact of this optimization, on a 72 CPUs machine, we
ran the following workload in a three level of cgroup hierarchy.

 $ netserver -6
 # 36 instances of netperf with following params
 $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Results (average throughput of netperf):
Without (6.0-rc1)       10482.7 Mbps
With patch              17064.7 Mbps (62.7% improvement)

With the patch, the throughput improved by 62.7%.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
---
Changes since v1:
- Updated commit message

 include/linux/memcontrol.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4d31ce55b1c0..70ae91188e16 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -354,10 +354,11 @@ struct mem_cgroup {
 };
 
 /*
- * size of first charge trial. "32" comes from vmscan.c's magic value.
- * TODO: maybe necessary to use big numbers in big irons.
+ * size of first charge trial.
+ * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
+ * workload.
  */
-#define MEMCG_CHARGE_BATCH 32U
+#define MEMCG_CHARGE_BATCH 64U
 
 extern struct mem_cgroup *root_mem_cgroup;
 
-- 
2.37.1.595.g718a3a8f04-goog

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-25  0:05 ` [PATCH v2 2/3] mm: page_counter: rearrange struct page_counter fields Shakeel Butt
@ 2022-08-25  0:33   ` Andrew Morton
  2022-08-25  4:41     ` Shakeel Butt
  2022-08-25  6:47   ` Michal Hocko
  1 sibling, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2022-08-25  0:33 UTC (permalink / raw)
  To: lkp

[-- Attachment #1: Type: text/plain, Size: 1467 bytes --]

On Thu, 25 Aug 2022 00:05:05 +0000 Shakeel Butt <shakeelb@google.com> wrote:

> With memcg v2 enabled, memcg->memory.usage is a very hot member for
> the workloads doing memcg charging on multiple CPUs concurrently.
> Particularly the network intensive workloads. In addition, there is a
> false cache sharing between memory.usage and memory.high on the charge
> path. This patch moves the usage into a separate cacheline and move all
> the read most fields into separate cacheline.
> 
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy.
> 
>  $ netserver -6
>  # 36 instances of netperf with following params
>  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> 
> Results (average throughput of netperf):
> Without (6.0-rc1)	10482.7 Mbps
> With patch		12413.7 Mbps (18.4% improvement)
> 
> With the patch, the throughput improved by 18.4%.
> 
> One side-effect of this patch is the increase in the size of struct
> mem_cgroup. For example with this patch on 64 bit build, the size of
> struct mem_cgroup increased from 4032 bytes to 4416 bytes. However for
> the performance improvement, this additional size is worth it. In
> addition there are opportunities to reduce the size of struct
> mem_cgroup like deprecation of kmem and tcpmem page counters and
> better packing.

Did you evaluate the effects of using a per-cpu counter of some form?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-25  0:33   ` Andrew Morton
@ 2022-08-25  4:41     ` Shakeel Butt
  2022-08-25  5:21       ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Shakeel Butt @ 2022-08-25  4:41 UTC (permalink / raw)
  To: lkp

[-- Attachment #1: Type: text/plain, Size: 1911 bytes --]

On Wed, Aug 24, 2022 at 5:33 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 25 Aug 2022 00:05:05 +0000 Shakeel Butt <shakeelb@google.com> wrote:
>
> > With memcg v2 enabled, memcg->memory.usage is a very hot member for
> > the workloads doing memcg charging on multiple CPUs concurrently.
> > Particularly the network intensive workloads. In addition, there is a
> > false cache sharing between memory.usage and memory.high on the charge
> > path. This patch moves the usage into a separate cacheline and move all
> > the read most fields into separate cacheline.
> >
> > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > ran the following workload in a three level of cgroup hierarchy.
> >
> >  $ netserver -6
> >  # 36 instances of netperf with following params
> >  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> >
> > Results (average throughput of netperf):
> > Without (6.0-rc1)     10482.7 Mbps
> > With patch            12413.7 Mbps (18.4% improvement)
> >
> > With the patch, the throughput improved by 18.4%.
> >
> > One side-effect of this patch is the increase in the size of struct
> > mem_cgroup. For example with this patch on 64 bit build, the size of
> > struct mem_cgroup increased from 4032 bytes to 4416 bytes. However for
> > the performance improvement, this additional size is worth it. In
> > addition there are opportunities to reduce the size of struct
> > mem_cgroup like deprecation of kmem and tcpmem page counters and
> > better packing.
>
> Did you evaluate the effects of using a per-cpu counter of some form?

Do you mean per-cpu counter for usage or something else? The usage
needs to be compared against the limits and accumulating per-cpu is
costly particularly on larger machines, so, no easy way to make usage
a per-cpu counter. Or maybe I misunderstood you and you meant
something else.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-25  4:41     ` Shakeel Butt
@ 2022-08-25  5:21       ` Andrew Morton
  2022-08-25 15:24         ` Shakeel Butt
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2022-08-25  5:21 UTC (permalink / raw)
  To: lkp

[-- Attachment #1: Type: text/plain, Size: 754 bytes --]

On Wed, 24 Aug 2022 21:41:42 -0700 Shakeel Butt <shakeelb@google.com> wrote:

> > Did you evaluate the effects of using a per-cpu counter of some form?
> 
> Do you mean per-cpu counter for usage or something else?

percpu_counter, perhaps.  Or some hand-rolled thing if that's more suitable.

> The usage
> needs to be compared against the limits and accumulating per-cpu is
> costly particularly on larger machines,

Well, there are tricks one can play.  For example, only run
__percpu_counter_sum() when `usage' is close to its limit.  

I'd suggest flinging together a prototype which simply uses
percpu_counter_read() all the time.  If the performance testing results
are sufficiently promising, then look into the accuracy issues.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/3] mm: page_counter: remove unneeded atomic ops for low/min
  2022-08-25  0:05 ` [PATCH v2 1/3] mm: page_counter: remove unneeded atomic ops for low/min Shakeel Butt
@ 2022-08-25  6:43   ` Michal Hocko
  0 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2022-08-25  6:43 UTC (permalink / raw)
  To: lkp

[-- Attachment #1: Type: text/plain, Size: 3316 bytes --]

On Thu 25-08-22 00:05:04, Shakeel Butt wrote:
> For cgroups using low or min protections, the function
> propagate_protected_usage() was doing an atomic xchg() operation
> irrespectively. We can optimize out this atomic operation for one
> specific scenario where the workload is using the protection (i.e.
> min > 0) and the usage is above the protection (i.e. usage > min).
> 
> This scenario is actually very common where the users want a part of
> their workload to be protected against the external reclaim. Though this
> optimization does introduce a race when the usage is around the
> protection and concurrent charges and uncharged trip it over or under
> the protection. In such cases, we might see lower effective protection
> but the subsequent charge/uncharge will correct it.

Thanks this is much more useful

> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy with top
> level having min and low setup appropriately to see if this optimization
> is effective for the mentioned case.
> 
>  $ netserver -6
>  # 36 instances of netperf with following params
>  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> 
> Results (average throughput of netperf):
> Without (6.0-rc1)	10482.7 Mbps
> With patch		14542.5 Mbps (38.7% improvement)
> 
> With the patch, the throughput improved by 38.7%
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> Reviewed-by: Feng Tang <feng.tang@intel.com>
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
> Changes since v1:
> - Commit message update with more detail on which scenario is getting
>   optimized and possible race condition.
> 
>  mm/page_counter.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index eb156ff5d603..47711aa28161 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -17,24 +17,23 @@ static void propagate_protected_usage(struct page_counter *c,
>  				      unsigned long usage)
>  {
>  	unsigned long protected, old_protected;
> -	unsigned long low, min;
>  	long delta;
>  
>  	if (!c->parent)
>  		return;
>  
> -	min = READ_ONCE(c->min);
> -	if (min || atomic_long_read(&c->min_usage)) {
> -		protected = min(usage, min);
> +	protected = min(usage, READ_ONCE(c->min));
> +	old_protected = atomic_long_read(&c->min_usage);
> +	if (protected != old_protected) {
>  		old_protected = atomic_long_xchg(&c->min_usage, protected);
>  		delta = protected - old_protected;
>  		if (delta)
>  			atomic_long_add(delta, &c->parent->children_min_usage);
>  	}
>  
> -	low = READ_ONCE(c->low);
> -	if (low || atomic_long_read(&c->low_usage)) {
> -		protected = min(usage, low);
> +	protected = min(usage, READ_ONCE(c->low));
> +	old_protected = atomic_long_read(&c->low_usage);
> +	if (protected != old_protected) {
>  		old_protected = atomic_long_xchg(&c->low_usage, protected);
>  		delta = protected - old_protected;
>  		if (delta)
> -- 
> 2.37.1.595.g718a3a8f04-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-25  0:05 ` [PATCH v2 2/3] mm: page_counter: rearrange struct page_counter fields Shakeel Butt
  2022-08-25  0:33   ` Andrew Morton
@ 2022-08-25  6:47   ` Michal Hocko
  2022-08-25 15:25     ` Shakeel Butt
  1 sibling, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2022-08-25  6:47 UTC (permalink / raw)
  To: lkp

[-- Attachment #1: Type: text/plain, Size: 3450 bytes --]

On Thu 25-08-22 00:05:05, Shakeel Butt wrote:
> With memcg v2 enabled, memcg->memory.usage is a very hot member for
> the workloads doing memcg charging on multiple CPUs concurrently.
> Particularly the network intensive workloads. In addition, there is a
> false cache sharing between memory.usage and memory.high on the charge
> path. This patch moves the usage into a separate cacheline and move all
> the read most fields into separate cacheline.
> 
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy.
> 
>  $ netserver -6
>  # 36 instances of netperf with following params
>  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> 
> Results (average throughput of netperf):
> Without (6.0-rc1)	10482.7 Mbps
> With patch		12413.7 Mbps (18.4% improvement)
> 
> With the patch, the throughput improved by 18.4%.
> 
> One side-effect of this patch is the increase in the size of struct
> mem_cgroup. For example with this patch on 64 bit build, the size of
> struct mem_cgroup increased from 4032 bytes to 4416 bytes. However for
> the performance improvement, this additional size is worth it. In
> addition there are opportunities to reduce the size of struct
> mem_cgroup like deprecation of kmem and tcpmem page counters and
> better packing.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Reviewed-by: Feng Tang <feng.tang@intel.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

Acked-by: Michal Hocko <mhocko@suse.com>

One nit below

> ---
> Changes since v1:
> - Updated the commit message
> - Make struct page_counter cache align.
> 
>  include/linux/page_counter.h | 35 +++++++++++++++++++++++------------
>  1 file changed, 23 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> index 679591301994..78a1c934e416 100644
> --- a/include/linux/page_counter.h
> +++ b/include/linux/page_counter.h
> @@ -3,15 +3,26 @@
>  #define _LINUX_PAGE_COUNTER_H
>  
>  #include <linux/atomic.h>
> +#include <linux/cache.h>
>  #include <linux/kernel.h>
>  #include <asm/page.h>
>  
> +#if defined(CONFIG_SMP)
> +struct pc_padding {
> +	char x[0];
> +} ____cacheline_internodealigned_in_smp;
> +#define PC_PADDING(name)	struct pc_padding name
> +#else
> +#define PC_PADDING(name)
> +#endif
> +
>  struct page_counter {
> +	/*
> +	 * Make sure 'usage' does not share cacheline with any other field. The
> +	 * memcg->memory.usage is a hot member of struct mem_cgroup.
> +	 */
>  	atomic_long_t usage;
> -	unsigned long min;
> -	unsigned long low;
> -	unsigned long high;
> -	unsigned long max;
> +	PC_PADDING(_pad1_);
>  
>  	/* effective memory.min and memory.min usage tracking */
>  	unsigned long emin;
> @@ -23,18 +34,18 @@ struct page_counter {
>  	atomic_long_t low_usage;
>  	atomic_long_t children_low_usage;
>  
> -	/* legacy */
>  	unsigned long watermark;
>  	unsigned long failcnt;

These two are also touched in the charging path so we could squeeze them
into the same cache line as usage.

0-day machinery was quite good at hitting noticeable regression anytime
we have changed layout so let's see what they come up with after this
patch ;)
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-25  0:05 ` [PATCH v2 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64 Shakeel Butt
@ 2022-08-25  6:49   ` Michal Hocko
  2022-08-25  8:30   ` Muchun Song
  1 sibling, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2022-08-25  6:49 UTC (permalink / raw)
  To: lkp

[-- Attachment #1: Type: text/plain, Size: 2324 bytes --]

On Thu 25-08-22 00:05:06, Shakeel Butt wrote:
> For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger
> machines and the network intensive workloads requiring througput in
> Gbps, 32 is too small and makes the memcg charging path a bottleneck.
> For now, increase it to 64 for easy acceptance to 6.0. We will need to
> revisit this in future for ever increasing demand of higher performance.
> 
> Please note that the memcg charge path drain the per-cpu memcg charge
> stock, so there should not be any oom behavior change. Though it does
> have impact on rstat flushing and high limit reclaim backoff.
> 
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy.
> 
>  $ netserver -6
>  # 36 instances of netperf with following params
>  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> 
> Results (average throughput of netperf):
> Without (6.0-rc1)       10482.7 Mbps
> With patch              17064.7 Mbps (62.7% improvement)
> 
> With the patch, the throughput improved by 62.7%.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> Reviewed-by: Feng Tang <feng.tang@intel.com>
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!

> ---
> Changes since v1:
> - Updated commit message
> 
>  include/linux/memcontrol.h | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4d31ce55b1c0..70ae91188e16 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -354,10 +354,11 @@ struct mem_cgroup {
>  };
>  
>  /*
> - * size of first charge trial. "32" comes from vmscan.c's magic value.
> - * TODO: maybe necessary to use big numbers in big irons.
> + * size of first charge trial.
> + * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
> + * workload.
>   */
> -#define MEMCG_CHARGE_BATCH 32U
> +#define MEMCG_CHARGE_BATCH 64U
>  
>  extern struct mem_cgroup *root_mem_cgroup;
>  
> -- 
> 2.37.1.595.g718a3a8f04-goog

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64
  2022-08-25  0:05 ` [PATCH v2 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64 Shakeel Butt
  2022-08-25  6:49   ` Michal Hocko
@ 2022-08-25  8:30   ` Muchun Song
  1 sibling, 0 replies; 13+ messages in thread
From: Muchun Song @ 2022-08-25  8:30 UTC (permalink / raw)
  To: lkp

[-- Attachment #1: Type: text/plain, Size: 1503 bytes --]



> On Aug 25, 2022, at 08:05, Shakeel Butt <shakeelb@google.com> wrote:
> 
> For several years, MEMCG_CHARGE_BATCH was kept at 32 but with bigger
> machines and the network intensive workloads requiring througput in
> Gbps, 32 is too small and makes the memcg charging path a bottleneck.
> For now, increase it to 64 for easy acceptance to 6.0. We will need to
> revisit this in future for ever increasing demand of higher performance.
> 
> Please note that the memcg charge path drain the per-cpu memcg charge
> stock, so there should not be any oom behavior change. Though it does
> have impact on rstat flushing and high limit reclaim backoff.
> 
> To evaluate the impact of this optimization, on a 72 CPUs machine, we
> ran the following workload in a three level of cgroup hierarchy.
> 
> $ netserver -6
> # 36 instances of netperf with following params
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> 
> Results (average throughput of netperf):
> Without (6.0-rc1)       10482.7 Mbps
> With patch              17064.7 Mbps (62.7% improvement)
> 
> With the patch, the throughput improved by 62.7%.

This is very impressive.

> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> Reviewed-by: Feng Tang <feng.tang@intel.com>
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev>

Acked-by: Muchun Song <songmuchun@bytedance.com>

Thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-25  5:21       ` Andrew Morton
@ 2022-08-25 15:24         ` Shakeel Butt
  0 siblings, 0 replies; 13+ messages in thread
From: Shakeel Butt @ 2022-08-25 15:24 UTC (permalink / raw)
  To: lkp

[-- Attachment #1: Type: text/plain, Size: 925 bytes --]

On Wed, Aug 24, 2022 at 10:21 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Wed, 24 Aug 2022 21:41:42 -0700 Shakeel Butt <shakeelb@google.com> wrote:
>
> > > Did you evaluate the effects of using a per-cpu counter of some form?
> >
> > Do you mean per-cpu counter for usage or something else?
>
> percpu_counter, perhaps.  Or some hand-rolled thing if that's more suitable.
>
> > The usage
> > needs to be compared against the limits and accumulating per-cpu is
> > costly particularly on larger machines,
>
> Well, there are tricks one can play.  For example, only run
> __percpu_counter_sum() when `usage' is close to its limit.
>
> I'd suggest flinging together a prototype which simply uses
> percpu_counter_read() all the time.  If the performance testing results
> are sufficiently promising, then look into the accuracy issues.
>

Thanks, I will take a stab at that in a week or so.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/3] mm: page_counter: rearrange struct page_counter fields
  2022-08-25  6:47   ` Michal Hocko
@ 2022-08-25 15:25     ` Shakeel Butt
  0 siblings, 0 replies; 13+ messages in thread
From: Shakeel Butt @ 2022-08-25 15:25 UTC (permalink / raw)
  To: lkp

[-- Attachment #1: Type: text/plain, Size: 3930 bytes --]

On Wed, Aug 24, 2022 at 11:47 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 25-08-22 00:05:05, Shakeel Butt wrote:
> > With memcg v2 enabled, memcg->memory.usage is a very hot member for
> > the workloads doing memcg charging on multiple CPUs concurrently.
> > Particularly the network intensive workloads. In addition, there is a
> > false cache sharing between memory.usage and memory.high on the charge
> > path. This patch moves the usage into a separate cacheline and move all
> > the read most fields into separate cacheline.
> >
> > To evaluate the impact of this optimization, on a 72 CPUs machine, we
> > ran the following workload in a three level of cgroup hierarchy.
> >
> >  $ netserver -6
> >  # 36 instances of netperf with following params
> >  $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> >
> > Results (average throughput of netperf):
> > Without (6.0-rc1)     10482.7 Mbps
> > With patch            12413.7 Mbps (18.4% improvement)
> >
> > With the patch, the throughput improved by 18.4%.
> >
> > One side-effect of this patch is the increase in the size of struct
> > mem_cgroup. For example with this patch on 64 bit build, the size of
> > struct mem_cgroup increased from 4032 bytes to 4416 bytes. However for
> > the performance improvement, this additional size is worth it. In
> > addition there are opportunities to reduce the size of struct
> > mem_cgroup like deprecation of kmem and tcpmem page counters and
> > better packing.
> >
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Reported-by: kernel test robot <oliver.sang@intel.com>
> > Reviewed-by: Feng Tang <feng.tang@intel.com>
> > Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> > Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>

Thanks.

> One nit below
>
> > ---
> > Changes since v1:
> > - Updated the commit message
> > - Make struct page_counter cache align.
> >
> >  include/linux/page_counter.h | 35 +++++++++++++++++++++++------------
> >  1 file changed, 23 insertions(+), 12 deletions(-)
> >
> > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> > index 679591301994..78a1c934e416 100644
> > --- a/include/linux/page_counter.h
> > +++ b/include/linux/page_counter.h
> > @@ -3,15 +3,26 @@
> >  #define _LINUX_PAGE_COUNTER_H
> >
> >  #include <linux/atomic.h>
> > +#include <linux/cache.h>
> >  #include <linux/kernel.h>
> >  #include <asm/page.h>
> >
> > +#if defined(CONFIG_SMP)
> > +struct pc_padding {
> > +     char x[0];
> > +} ____cacheline_internodealigned_in_smp;
> > +#define PC_PADDING(name)     struct pc_padding name
> > +#else
> > +#define PC_PADDING(name)
> > +#endif
> > +
> >  struct page_counter {
> > +     /*
> > +      * Make sure 'usage' does not share cacheline with any other field. The
> > +      * memcg->memory.usage is a hot member of struct mem_cgroup.
> > +      */
> >       atomic_long_t usage;
> > -     unsigned long min;
> > -     unsigned long low;
> > -     unsigned long high;
> > -     unsigned long max;
> > +     PC_PADDING(_pad1_);
> >
> >       /* effective memory.min and memory.min usage tracking */
> >       unsigned long emin;
> > @@ -23,18 +34,18 @@ struct page_counter {
> >       atomic_long_t low_usage;
> >       atomic_long_t children_low_usage;
> >
> > -     /* legacy */
> >       unsigned long watermark;
> >       unsigned long failcnt;
>
> These two are also touched in the charging path so we could squeeze them
> into the same cache line as usage.
>
> 0-day machinery was quite good at hitting noticeable regression anytime
> we have changed layout so let's see what they come up with after this
> patch ;)

I will try this locally first (after some cleanups) to see if there is
any positive or negative impact and report here.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-08-25 15:25 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-25  0:05 [PATCH v2 0/3] memcg: optimize charge codepath Shakeel Butt
2022-08-25  0:05 ` [PATCH v2 1/3] mm: page_counter: remove unneeded atomic ops for low/min Shakeel Butt
2022-08-25  6:43   ` Michal Hocko
2022-08-25  0:05 ` [PATCH v2 2/3] mm: page_counter: rearrange struct page_counter fields Shakeel Butt
2022-08-25  0:33   ` Andrew Morton
2022-08-25  4:41     ` Shakeel Butt
2022-08-25  5:21       ` Andrew Morton
2022-08-25 15:24         ` Shakeel Butt
2022-08-25  6:47   ` Michal Hocko
2022-08-25 15:25     ` Shakeel Butt
2022-08-25  0:05 ` [PATCH v2 3/3] memcg: increase MEMCG_CHARGE_BATCH to 64 Shakeel Butt
2022-08-25  6:49   ` Michal Hocko
2022-08-25  8:30   ` Muchun Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).